// SPDX-License-Identifier: GPL-2.0-only
/*
 * linux/fs/nfs/inode.c
 *
 * Copyright (C) 1992 Rick Sladkey
 *
 * nfs inode and superblock handling functions
 *
 * Modularised by Alan Cox <alan@lxorguk.ukuu.org.uk>, while hacking some
 * experimental NFS changes. Modularisation taken straight from SYS5 fs.
 *
 * Change to nfs_read_super() to permit NFS mounts to multi-homed hosts.
 * J.S.Peatfield@damtp.cam.ac.uk
 *
 */

#include <linux/module.h>
#include <linux/init.h>
#include <linux/sched/signal.h>
#include <linux/time.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <linux/stat.h>
#include <linux/errno.h>
#include <linux/unistd.h>
#include <linux/sunrpc/clnt.h>
#include <linux/sunrpc/stats.h>
#include <linux/sunrpc/metrics.h>
#include <linux/nfs_fs.h>
#include <linux/nfs_mount.h>
#include <linux/nfs4_mount.h>
#include <linux/lockd/bind.h>
#include <linux/seq_file.h>
#include <linux/mount.h>
#include <linux/vfs.h>
#include <linux/inet.h>
#include <linux/nfs_xdr.h>
#include <linux/slab.h>
#include <linux/compat.h>
#include <linux/freezer.h>
#include <linux/uaccess.h>
#include <linux/iversion.h>

#include "nfs4_fs.h"
#include "callback.h"
#include "delegation.h"
#include "iostat.h"
#include "internal.h"
#include "fscache.h"
#include "pnfs.h"
#include "nfs.h"
#include "netns.h"
#include "sysfs.h"

#include "nfstrace.h"

#define NFSDBG_FACILITY		NFSDBG_VFS

#define NFS_64_BIT_INODE_NUMBERS_ENABLED	1

/* Default is to see 64-bit inode numbers */
static bool enable_ino64 = NFS_64_BIT_INODE_NUMBERS_ENABLED;

static int nfs_update_inode(struct inode *, struct nfs_fattr *);

static struct kmem_cache *nfs_inode_cachep;

static inline unsigned long
nfs_fattr_to_ino_t(struct nfs_fattr *fattr)
{
	return nfs_fileid_to_ino_t(fattr->fileid);
}

int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
{
	if (unlikely(nfs_current_task_exiting()))
		return -EINTR;
	schedule();
	if (signal_pending_state(mode, current))
		return -ERESTARTSYS;
	return 0;
}
EXPORT_SYMBOL_GPL(nfs_wait_bit_killable);

/**
 * nfs_compat_user_ino64 - returns the user-visible inode number
 * @fileid: 64-bit fileid
 *
 * This function returns a 32-bit inode number if the boot parameter
 * nfs.enable_ino64 is zero.
 */
u64 nfs_compat_user_ino64(u64 fileid)
{
#ifdef CONFIG_COMPAT
	compat_ulong_t ino;
#else
	unsigned long ino;
#endif

	if (enable_ino64)
		return fileid;
	ino = fileid;
	if (sizeof(ino) < sizeof(fileid))
		ino ^= fileid >> (sizeof(fileid)-sizeof(ino)) * 8;
	return ino;
}

int nfs_drop_inode(struct inode *inode)
{
	return NFS_STALE(inode) || inode_generic_drop(inode);
}
EXPORT_SYMBOL_GPL(nfs_drop_inode);

void nfs_clear_inode(struct inode *inode)
{
	/*
	 * The following should never happen...
	 */
	WARN_ON_ONCE(nfs_have_writebacks(inode));
	WARN_ON_ONCE(!list_empty(&NFS_I(inode)->open_files));
	nfs_zap_acl_cache(inode);
	nfs_access_zap_cache(inode);
	nfs_fscache_clear_inode(inode);
}
EXPORT_SYMBOL_GPL(nfs_clear_inode);
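
/*
 * nfs_evict_inode - final teardown when an inode is evicted from the
 * inode cache: discard any remaining pagecache pages, then drop the
 * NFS-private caches attached to the inode.
 */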
void nfs_evict_inode(struct inode *inode)
{
	truncate_inode_pages_final(&inode->i_data);
	clear_inode(inode);
	nfs_clear_inode(inode);
}

int nfs_sync_inode(struct inode *inode)
{
	inode_dio_wait(inode);
	return nfs_wb_all(inode);
}
EXPORT_SYMBOL_GPL(nfs_sync_inode);

/**
 * nfs_sync_mapping - helper to flush all mmapped dirty data to disk
 * @mapping: pointer to struct address_space
 */
int nfs_sync_mapping(struct address_space *mapping)
{
	int ret = 0;

	if (mapping->nrpages != 0) {
		unmap_mapping_range(mapping, 0, 0, 0);
		ret = nfs_wb_all(mapping->host);
	}
	return ret;
}

static int nfs_attribute_timeout(struct inode *inode)
{
	struct nfs_inode *nfsi = NFS_I(inode);

	return !time_in_range_open(jiffies, nfsi->read_cache_jiffies,
			nfsi->read_cache_jiffies + nfsi->attrtimeo);
}

static bool nfs_check_cache_flags_invalid(struct inode *inode,
					  unsigned long flags)
{
	unsigned long cache_validity = READ_ONCE(NFS_I(inode)->cache_validity);

	return (cache_validity & flags) != 0;
}

bool nfs_check_cache_invalid(struct inode *inode, unsigned long flags)
{
	if (nfs_check_cache_flags_invalid(inode, flags))
		return true;
	return nfs_attribute_cache_expired(inode);
}
EXPORT_SYMBOL_GPL(nfs_check_cache_invalid);

#ifdef CONFIG_NFS_V4_2
static bool nfs_has_xattr_cache(const struct nfs_inode *nfsi)
{
	return nfsi->xattr_cache != NULL;
}
#else
static bool nfs_has_xattr_cache(const struct nfs_inode *nfsi)
{
	return false;
}
#endif

void nfs_set_cache_invalid(struct inode *inode, unsigned long flags)
{
	struct nfs_inode *nfsi = NFS_I(inode);

	if (nfs_have_delegated_attributes(inode)) {
		if (!(flags & NFS_INO_REVAL_FORCED))
			flags &= ~(NFS_INO_INVALID_MODE |
				   NFS_INO_INVALID_OTHER |
				   NFS_INO_INVALID_BTIME |
				   NFS_INO_INVALID_XATTR);
		flags &= ~(NFS_INO_INVALID_CHANGE | NFS_INO_INVALID_SIZE);
	}

	if (!nfs_has_xattr_cache(nfsi))
		flags &= ~NFS_INO_INVALID_XATTR;
	if (flags & NFS_INO_INVALID_DATA)
		nfs_fscache_invalidate(inode, 0);
	flags &= ~NFS_INO_REVAL_FORCED;

	flags |= nfsi->cache_validity;
	if (inode->i_mapping->nrpages == 0)
		flags &= ~NFS_INO_INVALID_DATA;

	/* pairs with nfs_clear_invalid_mapping()'s smp_load_acquire() */
	smp_store_release(&nfsi->cache_validity, flags);

	if (inode->i_mapping->nrpages == 0 ||
	    nfsi->cache_validity & NFS_INO_INVALID_DATA) {
		nfs_ooo_clear(nfsi);
	}
	trace_nfs_set_cache_invalid(inode, 0);
}
EXPORT_SYMBOL_GPL(nfs_set_cache_invalid);

/*
 * Invalidate the local caches
 */
static void nfs_zap_caches_locked(struct inode *inode)
{
	struct nfs_inode *nfsi = NFS_I(inode);
	int mode = inode->i_mode;

	nfs_inc_stats(inode, NFSIOS_ATTRINVALIDATE);

	nfsi->attrtimeo = NFS_MINATTRTIMEO(inode);
	nfsi->attrtimeo_timestamp = jiffies;

	if (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode))
		nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR |
					     NFS_INO_INVALID_DATA |
					     NFS_INO_INVALID_ACCESS |
					     NFS_INO_INVALID_ACL |
					     NFS_INO_INVALID_XATTR);
	else
		nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR |
					     NFS_INO_INVALID_ACCESS |
					     NFS_INO_INVALID_ACL |
					     NFS_INO_INVALID_XATTR);
	nfs_zap_label_cache_locked(nfsi);
}

void nfs_zap_caches(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	nfs_zap_caches_locked(inode);
	spin_unlock(&inode->i_lock);
}

void nfs_zap_mapping(struct inode *inode, struct address_space *mapping)
{
	if (mapping->nrpages != 0) {
		spin_lock(&inode->i_lock);
		nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
		spin_unlock(&inode->i_lock);
	}
}
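
/*
 * nfs_zap_acl_cache - ask the protocol layer to drop any cached ACLs for
 * this inode, then clear the NFS_INO_INVALID_ACL flag under i_lock.
 */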
void nfs_zap_acl_cache(struct inode *inode)
{
	void (*clear_acl_cache)(struct inode *);

	clear_acl_cache = NFS_PROTO(inode)->clear_acl_cache;
	if (clear_acl_cache != NULL)
		clear_acl_cache(inode);
	spin_lock(&inode->i_lock);
	NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_ACL;
	spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL_GPL(nfs_zap_acl_cache);

void nfs_invalidate_atime(struct inode *inode)
{
	if (nfs_have_delegated_atime(inode))
		return;
	spin_lock(&inode->i_lock);
	nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATIME);
	spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL_GPL(nfs_invalidate_atime);

/*
 * Invalidate, but do not unhash, the inode.
 * NB: must be called with inode->i_lock held!
 */
static void nfs_set_inode_stale_locked(struct inode *inode)
{
	set_bit(NFS_INO_STALE, &NFS_I(inode)->flags);
	nfs_zap_caches_locked(inode);
	trace_nfs_set_inode_stale(inode);
}

void nfs_set_inode_stale(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	nfs_set_inode_stale_locked(inode);
	spin_unlock(&inode->i_lock);
}

struct nfs_find_desc {
	struct nfs_fh		*fh;
	struct nfs_fattr	*fattr;
};

/*
 * In NFSv3 we can have 64bit inode numbers. In order to support
 * this, and re-exported directories (also seen in NFSv2)
 * we are forced to allow 2 different inodes to have the same
 * i_ino.
 */
static int
nfs_find_actor(struct inode *inode, void *opaque)
{
	struct nfs_find_desc	*desc = opaque;
	struct nfs_fh		*fh = desc->fh;
	struct nfs_fattr	*fattr = desc->fattr;

	if (NFS_FILEID(inode) != fattr->fileid)
		return 0;
	if (inode_wrong_type(inode, fattr->mode))
		return 0;
	if (nfs_compare_fh(NFS_FH(inode), fh))
		return 0;
	if (is_bad_inode(inode) || NFS_STALE(inode))
		return 0;
	return 1;
}

static int
nfs_init_locked(struct inode *inode, void *opaque)
{
	struct nfs_find_desc	*desc = opaque;
	struct nfs_fattr	*fattr = desc->fattr;

	set_nfs_fileid(inode, fattr->fileid);
	inode->i_mode = fattr->mode;
	nfs_copy_fh(NFS_FH(inode), desc->fh);
	return 0;
}

#ifdef CONFIG_NFS_V4_SECURITY_LABEL
static void nfs_clear_label_invalid(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_LABEL;
	spin_unlock(&inode->i_lock);
}

void nfs_setsecurity(struct inode *inode, struct nfs_fattr *fattr)
{
	int error;

	if (fattr->label == NULL)
		return;

	if ((fattr->valid & NFS_ATTR_FATTR_V4_SECURITY_LABEL) && inode->i_security) {
		error = security_inode_notifysecctx(inode, fattr->label->label,
				fattr->label->len);
		if (error)
			printk(KERN_ERR "%s() %s %d "
					"security_inode_notifysecctx() %d\n",
					__func__,
					(char *)fattr->label->label,
					fattr->label->len, error);
		nfs_clear_label_invalid(inode);
	}
}

struct nfs4_label *nfs4_label_alloc(struct nfs_server *server, gfp_t flags)
{
	struct nfs4_label *label;

	if (!(server->caps & NFS_CAP_SECURITY_LABEL))
		return NULL;

	label = kzalloc(sizeof(struct nfs4_label), flags);
	if (label == NULL)
		return ERR_PTR(-ENOMEM);

	label->label = kzalloc(NFS4_MAXLABELLEN, flags);
	if (label->label == NULL) {
		kfree(label);
		return ERR_PTR(-ENOMEM);
	}
	label->len = NFS4_MAXLABELLEN;

	return label;
}
EXPORT_SYMBOL_GPL(nfs4_label_alloc);
#else
void nfs_setsecurity(struct inode *inode, struct nfs_fattr *fattr)
{
}
#endif
EXPORT_SYMBOL_GPL(nfs_setsecurity);

/* Search for inode identified by fh, fileid and i_mode in inode cache.
 */
struct inode *
nfs_ilookup(struct super_block *sb, struct nfs_fattr *fattr, struct nfs_fh *fh)
{
	struct nfs_find_desc desc = {
		.fh = fh,
		.fattr = fattr,
	};
	struct inode *inode;
	unsigned long hash;

	if (!(fattr->valid & NFS_ATTR_FATTR_FILEID) ||
	    !(fattr->valid & NFS_ATTR_FATTR_TYPE))
		return NULL;

	hash = nfs_fattr_to_ino_t(fattr);
	inode = ilookup5(sb, hash, nfs_find_actor, &desc);

	dprintk("%s: returning %p\n", __func__, inode);
	return inode;
}

static void nfs_inode_init_regular(struct nfs_inode *nfsi)
{
	atomic_long_set(&nfsi->nrequests, 0);
	atomic_long_set(&nfsi->redirtied_pages, 0);
	INIT_LIST_HEAD(&nfsi->commit_info.list);
	atomic_long_set(&nfsi->commit_info.ncommit, 0);
	atomic_set(&nfsi->commit_info.rpcs_out, 0);
	mutex_init(&nfsi->commit_mutex);
}

static void nfs_inode_init_dir(struct nfs_inode *nfsi)
{
	nfsi->cache_change_attribute = 0;
	memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
	init_rwsem(&nfsi->rmdir_sem);
}

/*
 * This is our front-end to iget that looks up inodes by file handle
 * instead of inode number.
 */
struct inode *
nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
{
	struct nfs_find_desc desc = {
		.fh	= fh,
		.fattr	= fattr
	};
	struct inode *inode = ERR_PTR(-ENOENT);
	u64 fattr_supported = NFS_SB(sb)->fattr_valid;
	unsigned long hash;

	nfs_attr_check_mountpoint(sb, fattr);

	if (nfs_attr_use_mounted_on_fileid(fattr))
		fattr->fileid = fattr->mounted_on_fileid;
	else if ((fattr->valid & NFS_ATTR_FATTR_FILEID) == 0)
		goto out_no_inode;
	if ((fattr->valid & NFS_ATTR_FATTR_TYPE) == 0)
		goto out_no_inode;

	hash = nfs_fattr_to_ino_t(fattr);

	inode = iget5_locked(sb, hash, nfs_find_actor, nfs_init_locked, &desc);
	if (inode == NULL) {
		inode = ERR_PTR(-ENOMEM);
		goto out_no_inode;
	}

	if (inode_state_read_once(inode) & I_NEW) {
		struct nfs_inode *nfsi = NFS_I(inode);
		unsigned long now = jiffies;

		/* We set i_ino for the few things that still rely on it,
		 * such as stat(2) */
		inode->i_ino = hash;

		/* We can't support update_atime(), since the server will reset it */
		inode->i_flags |= S_NOATIME|S_NOCMTIME;
		inode->i_mode = fattr->mode;
		nfsi->cache_validity = 0;
		if ((fattr->valid & NFS_ATTR_FATTR_MODE) == 0
				&& (fattr_supported & NFS_ATTR_FATTR_MODE))
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_MODE);
		/* Why so? Because we want revalidate for devices/FIFOs, and
		 * that's precisely what we have in nfs_file_inode_operations.
		 */
		inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->file_inode_ops;
		if (S_ISREG(inode->i_mode)) {
			inode->i_fop = NFS_SB(sb)->nfs_client->rpc_ops->file_ops;
			inode->i_data.a_ops = &nfs_file_aops;
			nfs_inode_init_regular(nfsi);
			mapping_set_large_folios(inode->i_mapping);
		} else if (S_ISDIR(inode->i_mode)) {
			inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops;
			inode->i_fop = &nfs_dir_operations;
			inode->i_data.a_ops = &nfs_dir_aops;
			nfs_inode_init_dir(nfsi);
			/* Deal with crossing mountpoints */
			if (fattr->valid & NFS_ATTR_FATTR_MOUNTPOINT ||
					fattr->valid & NFS_ATTR_FATTR_V4_REFERRAL) {
				if (fattr->valid & NFS_ATTR_FATTR_V4_REFERRAL)
					inode->i_op = &nfs_referral_inode_operations;
				else
					inode->i_op = &nfs_mountpoint_inode_operations;
				inode->i_fop = NULL;
				inode->i_flags |= S_AUTOMOUNT;
			}
		} else if (S_ISLNK(inode->i_mode)) {
			inode->i_op = &nfs_symlink_inode_operations;
			inode_nohighmem(inode);
		} else
			init_special_inode(inode, inode->i_mode, fattr->rdev);

		inode_set_atime(inode, 0, 0);
		inode_set_mtime(inode, 0, 0);
		inode_set_ctime(inode, 0, 0);
		memset(&nfsi->btime, 0, sizeof(nfsi->btime));
		inode_set_iversion_raw(inode, 0);
		inode->i_size = 0;
		clear_nlink(inode);
		inode->i_uid = make_kuid(&init_user_ns, -2);
		inode->i_gid = make_kgid(&init_user_ns, -2);
		inode->i_blocks = 0;
		nfsi->write_io = 0;
		nfsi->read_io = 0;

		nfsi->read_cache_jiffies = fattr->time_start;
		nfsi->attr_gencount = fattr->gencount;
		if (fattr->valid & NFS_ATTR_FATTR_ATIME)
			inode_set_atime_to_ts(inode, fattr->atime);
		else if (fattr_supported & NFS_ATTR_FATTR_ATIME)
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATIME);
		if (fattr->valid & NFS_ATTR_FATTR_MTIME)
			inode_set_mtime_to_ts(inode, fattr->mtime);
		else if (fattr_supported & NFS_ATTR_FATTR_MTIME)
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_MTIME);
		if (fattr->valid & NFS_ATTR_FATTR_CTIME)
			inode_set_ctime_to_ts(inode, fattr->ctime);
		else if (fattr_supported & NFS_ATTR_FATTR_CTIME)
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_CTIME);
		if (fattr->valid & NFS_ATTR_FATTR_BTIME)
			nfsi->btime = fattr->btime;
		else if (fattr_supported & NFS_ATTR_FATTR_BTIME)
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_BTIME);
		if (fattr->valid & NFS_ATTR_FATTR_CHANGE)
			inode_set_iversion_raw(inode, fattr->change_attr);
		else
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_CHANGE);
		if (fattr->valid & NFS_ATTR_FATTR_SIZE)
			inode->i_size = nfs_size_to_loff_t(fattr->size);
		else
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_SIZE);
		if (fattr->valid & NFS_ATTR_FATTR_NLINK)
			set_nlink(inode, fattr->nlink);
		else if (fattr_supported & NFS_ATTR_FATTR_NLINK)
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_NLINK);
		else
			set_nlink(inode, 1);
		if (fattr->valid & NFS_ATTR_FATTR_OWNER)
			inode->i_uid = fattr->uid;
		else if (fattr_supported & NFS_ATTR_FATTR_OWNER)
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_OTHER);
		if (fattr->valid & NFS_ATTR_FATTR_GROUP)
			inode->i_gid = fattr->gid;
		else if (fattr_supported & NFS_ATTR_FATTR_GROUP)
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_OTHER);
		if (fattr->valid & NFS_ATTR_FATTR_BLOCKS_USED)
			inode->i_blocks = fattr->du.nfs2.blocks;
		else if (fattr_supported & NFS_ATTR_FATTR_BLOCKS_USED &&
			 fattr->size != 0)
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_BLOCKS);
		if (fattr->valid & NFS_ATTR_FATTR_SPACE_USED) {
			/*
			 * report the blocks in 512byte units
			 */
			inode->i_blocks = nfs_calc_block_size(fattr->du.nfs3.used);
		} else if (fattr_supported & NFS_ATTR_FATTR_SPACE_USED &&
			   fattr->size != 0)
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_BLOCKS);

		nfs_setsecurity(inode, fattr);

		nfsi->attrtimeo = NFS_MINATTRTIMEO(inode);
		nfsi->attrtimeo_timestamp = now;
		nfsi->access_cache = RB_ROOT;

		nfs_fscache_init_inode(inode);

		unlock_new_inode(inode);
	} else {
		int err = nfs_refresh_inode(inode, fattr);
		if (err < 0) {
			iput(inode);
			inode = ERR_PTR(err);
			goto out_no_inode;
		}
	}
	dprintk("NFS: nfs_fhget(%s/%Lu fh_crc=0x%08x ct=%d)\n",
		inode->i_sb->s_id,
		(unsigned long long)NFS_FILEID(inode),
		nfs_display_fhandle_hash(fh),
		icount_read(inode));

out:
	return inode;

out_no_inode:
	dprintk("nfs_fhget: iget failed with error %ld\n", PTR_ERR(inode));
	goto out;
}
EXPORT_SYMBOL_GPL(nfs_fhget);
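
/*
 * nfs_fattr_fixup_delegated - when this client holds a delegation, the
 * locally cached timestamps are authoritative, so strip the corresponding
 * attributes out of @fattr to stop the server's copies overwriting them.
 */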
static void nfs_fattr_fixup_delegated(struct inode *inode,
				      struct nfs_fattr *fattr)
{
	unsigned long cache_validity = NFS_I(inode)->cache_validity;

	if (nfs_have_delegated_mtime(inode)) {
		if (!(cache_validity & NFS_INO_INVALID_CTIME))
			fattr->valid &= ~(NFS_ATTR_FATTR_PRECTIME |
					  NFS_ATTR_FATTR_CTIME);

		if (!(cache_validity & NFS_INO_INVALID_MTIME))
			fattr->valid &= ~(NFS_ATTR_FATTR_PREMTIME |
					  NFS_ATTR_FATTR_MTIME);

		if (!(cache_validity & NFS_INO_INVALID_ATIME))
			fattr->valid &= ~NFS_ATTR_FATTR_ATIME;
	} else if (nfs_have_delegated_atime(inode)) {
		if (!(cache_validity & NFS_INO_INVALID_ATIME))
			fattr->valid &= ~NFS_ATTR_FATTR_ATIME;
	}
}

static void nfs_set_timestamps_to_ts(struct inode *inode, struct iattr *attr)
{
	unsigned int cache_flags = 0;

	if (attr->ia_valid & ATTR_MTIME_SET) {
		struct timespec64 ctime = inode_get_ctime(inode);
		struct timespec64 mtime = inode_get_mtime(inode);
		struct timespec64 now;
		int updated = 0;

		now = inode_set_ctime_current(inode);
		if (!timespec64_equal(&now, &ctime))
			updated |= S_CTIME;

		inode_set_mtime_to_ts(inode, attr->ia_mtime);
		if (!timespec64_equal(&now, &mtime))
			updated |= S_MTIME;

		inode_maybe_inc_iversion(inode, updated);
		cache_flags |= NFS_INO_INVALID_CTIME | NFS_INO_INVALID_MTIME;
	}
	if (attr->ia_valid & ATTR_ATIME_SET) {
		inode_set_atime_to_ts(inode, attr->ia_atime);
		cache_flags |= NFS_INO_INVALID_ATIME;
	}
	NFS_I(inode)->cache_validity &= ~cache_flags;
}

static void nfs_update_timestamps(struct inode *inode, unsigned int ia_valid)
{
	enum file_time_flags time_flags = 0;
	unsigned int cache_flags = 0;

	if (ia_valid & ATTR_MTIME) {
		time_flags |= S_MTIME | S_CTIME;
		cache_flags |= NFS_INO_INVALID_CTIME | NFS_INO_INVALID_MTIME;
	}
	if (ia_valid & ATTR_ATIME) {
		time_flags |= S_ATIME;
		cache_flags |= NFS_INO_INVALID_ATIME;
	}
	inode_update_timestamps(inode, time_flags);
	NFS_I(inode)->cache_validity &= ~cache_flags;
}

void nfs_update_delegated_atime(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	if (nfs_have_delegated_atime(inode))
		nfs_update_timestamps(inode, ATTR_ATIME);
	spin_unlock(&inode->i_lock);
}

void nfs_update_delegated_mtime_locked(struct inode *inode)
{
	if (nfs_have_delegated_mtime(inode))
		nfs_update_timestamps(inode, ATTR_MTIME);
}

void nfs_update_delegated_mtime(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	nfs_update_delegated_mtime_locked(inode);
	spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL_GPL(nfs_update_delegated_mtime);

#define NFS_VALID_ATTRS (ATTR_MODE|ATTR_UID|ATTR_GID|ATTR_SIZE|ATTR_ATIME|ATTR_ATIME_SET|ATTR_MTIME|ATTR_MTIME_SET|ATTR_FILE|ATTR_OPEN)
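
/*
 * nfs_setattr - update inode attributes on the server: handles truncation,
 * timestamp updates under delegation and the SETATTR RPC itself, then folds
 * the reply attributes back into the inode cache.
 */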
int
nfs_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
	    struct iattr *attr)
{
	struct inode *inode = d_inode(dentry);
	struct nfs_fattr *fattr;
	loff_t oldsize;
	int error = 0;
	kuid_t task_uid = current_fsuid();
	kuid_t owner_uid = inode->i_uid;

	nfs_inc_stats(inode, NFSIOS_VFSSETATTR);

	/* skip mode change if it's just for clearing setuid/setgid */
	if (attr->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID))
		attr->ia_valid &= ~ATTR_MODE;

	if (S_ISREG(inode->i_mode))
		nfs_file_block_o_direct(NFS_I(inode));

	oldsize = i_size_read(inode);
	if (attr->ia_valid & ATTR_SIZE) {
		BUG_ON(!S_ISREG(inode->i_mode));

		error = inode_newsize_ok(inode, attr->ia_size);
		if (error)
			return error;

		if (attr->ia_size == oldsize)
			attr->ia_valid &= ~ATTR_SIZE;
	}

	if (nfs_have_delegated_mtime(inode) && attr->ia_valid & ATTR_MTIME) {
		spin_lock(&inode->i_lock);
		if (attr->ia_valid & ATTR_MTIME_SET) {
			if (uid_eq(task_uid, owner_uid)) {
				nfs_set_timestamps_to_ts(inode, attr);
				attr->ia_valid &= ~(ATTR_MTIME|ATTR_MTIME_SET|
						ATTR_ATIME|ATTR_ATIME_SET);
			}
		} else {
			nfs_update_timestamps(inode, attr->ia_valid);
			attr->ia_valid &= ~(ATTR_MTIME|ATTR_ATIME);
		}
		spin_unlock(&inode->i_lock);
	} else if (nfs_have_delegated_atime(inode) &&
		   attr->ia_valid & ATTR_ATIME &&
		   !(attr->ia_valid & ATTR_MTIME)) {
		if (attr->ia_valid & ATTR_ATIME_SET) {
			if (uid_eq(task_uid, owner_uid)) {
				spin_lock(&inode->i_lock);
				nfs_set_timestamps_to_ts(inode, attr);
				spin_unlock(&inode->i_lock);
				attr->ia_valid &= ~(ATTR_ATIME|ATTR_ATIME_SET);
			}
		} else {
			nfs_update_delegated_atime(inode);
			attr->ia_valid &= ~ATTR_ATIME;
		}
	}

	/* Optimization: if the end result is no change, don't RPC */
	if (((attr->ia_valid & NFS_VALID_ATTRS) & ~(ATTR_FILE|ATTR_OPEN)) == 0)
		return 0;

	trace_nfs_setattr_enter(inode);

	/* Write all dirty data */
	if (S_ISREG(inode->i_mode))
		nfs_sync_inode(inode);

	fattr = nfs_alloc_fattr_with_label(NFS_SERVER(inode));
	if (fattr == NULL) {
		error = -ENOMEM;
		goto out;
	}

	error = NFS_PROTO(inode)->setattr(dentry, fattr, attr);
	if (error == 0) {
		if (attr->ia_valid & ATTR_SIZE)
			nfs_truncate_last_folio(inode->i_mapping, oldsize,
						attr->ia_size);
		error = nfs_refresh_inode(inode, fattr);
	}
	nfs_free_fattr(fattr);
out:
	trace_nfs_setattr_exit(inode, error);
	return error;
}
EXPORT_SYMBOL_GPL(nfs_setattr);

/**
 * nfs_vmtruncate - unmap mappings "freed" by truncate() syscall
 * @inode: inode of the file used
 * @offset: file offset to start truncating
 *
 * This is a copy of the common vmtruncate, but with the locking
 * corrected to take into account the fact that NFS requires
 * inode->i_size to be updated under the inode->i_lock.
 * Note: must be called with inode->i_lock held!
 */
static int nfs_vmtruncate(struct inode * inode, loff_t offset)
{
	int err;

	err = inode_newsize_ok(inode, offset);
	if (err)
		goto out;

	trace_nfs_size_truncate(inode, offset);
	i_size_write(inode, offset);
	/* Optimisation */
	if (offset == 0) {
		NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_DATA;
		nfs_ooo_clear(NFS_I(inode));
	}
	NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_SIZE;

	spin_unlock(&inode->i_lock);
	truncate_pagecache(inode, offset);
	nfs_update_delegated_mtime_locked(inode);
	spin_lock(&inode->i_lock);
out:
	return err;
}

/**
 * nfs_setattr_update_inode - Update inode metadata after a setattr call.
 * @inode: pointer to struct inode
 * @attr: pointer to struct iattr
 * @fattr: pointer to struct nfs_fattr
 *
 * Note: we do this in the *proc.c in order to ensure that
 *       it works for things like exclusive creates too.
 */
void nfs_setattr_update_inode(struct inode *inode, struct iattr *attr,
		struct nfs_fattr *fattr)
{
	/* Barrier: bump the attribute generation count.
	 */
	nfs_fattr_set_barrier(fattr);
	spin_lock(&inode->i_lock);
	NFS_I(inode)->attr_gencount = fattr->gencount;
	if ((attr->ia_valid & ATTR_SIZE) != 0) {
		if (!nfs_have_delegated_mtime(inode))
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_MTIME);
		nfs_set_cache_invalid(inode, NFS_INO_INVALID_BLOCKS);
		nfs_inc_stats(inode, NFSIOS_SETATTRTRUNC);
		nfs_vmtruncate(inode, attr->ia_size);
	}
	if ((attr->ia_valid & (ATTR_MODE|ATTR_UID|ATTR_GID)) != 0) {
		NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_CTIME;
		if ((attr->ia_valid & ATTR_KILL_SUID) != 0 &&
		    inode->i_mode & S_ISUID)
			inode->i_mode &= ~S_ISUID;
		if (setattr_should_drop_sgid(&nop_mnt_idmap, inode))
			inode->i_mode &= ~S_ISGID;
		if ((attr->ia_valid & ATTR_MODE) != 0) {
			int mode = attr->ia_mode & S_IALLUGO;
			mode |= inode->i_mode & ~S_IALLUGO;
			inode->i_mode = mode;
		}
		if ((attr->ia_valid & ATTR_UID) != 0)
			inode->i_uid = attr->ia_uid;
		if ((attr->ia_valid & ATTR_GID) != 0)
			inode->i_gid = attr->ia_gid;
		if (fattr->valid & NFS_ATTR_FATTR_CTIME)
			inode_set_ctime_to_ts(inode, fattr->ctime);
		else
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_CHANGE
					| NFS_INO_INVALID_CTIME);
		nfs_set_cache_invalid(inode, NFS_INO_INVALID_ACCESS
				| NFS_INO_INVALID_ACL);
	}
	if (attr->ia_valid & (ATTR_ATIME_SET|ATTR_ATIME)) {
		NFS_I(inode)->cache_validity &= ~(NFS_INO_INVALID_ATIME
				| NFS_INO_INVALID_CTIME);
		if (fattr->valid & NFS_ATTR_FATTR_ATIME)
			inode_set_atime_to_ts(inode, fattr->atime);
		else if (attr->ia_valid & ATTR_ATIME_SET)
			inode_set_atime_to_ts(inode, attr->ia_atime);
		else
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATIME);

		if (fattr->valid & NFS_ATTR_FATTR_CTIME)
			inode_set_ctime_to_ts(inode, fattr->ctime);
		else
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_CHANGE
					| NFS_INO_INVALID_CTIME);
	}
	if (attr->ia_valid & (ATTR_MTIME_SET|ATTR_MTIME)) {
		NFS_I(inode)->cache_validity &= ~(NFS_INO_INVALID_MTIME
				| NFS_INO_INVALID_CTIME);
		if (fattr->valid & NFS_ATTR_FATTR_MTIME)
			inode_set_mtime_to_ts(inode, fattr->mtime);
		else if (attr->ia_valid & ATTR_MTIME_SET)
			inode_set_mtime_to_ts(inode, attr->ia_mtime);
		else
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_MTIME);

		if (fattr->valid & NFS_ATTR_FATTR_CTIME)
			inode_set_ctime_to_ts(inode, fattr->ctime);
		else
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_CHANGE
					| NFS_INO_INVALID_CTIME);
	}
	if (fattr->valid)
		nfs_update_inode(inode, fattr);
	spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL_GPL(nfs_setattr_update_inode);

/*
 * Don't request help from readdirplus if the file is being written to,
 * or if attribute caching is turned off
 */
static bool nfs_getattr_readdirplus_enable(const struct inode *inode)
{
	return nfs_server_capable(inode, NFS_CAP_READDIRPLUS) &&
	       !nfs_have_writebacks(inode) && NFS_MAXATTRTIMEO(inode) > 5 * HZ;
}

static void nfs_readdirplus_parent_cache_miss(struct dentry *dentry)
{
	if (!IS_ROOT(dentry)) {
		struct dentry *parent = dget_parent(dentry);
		nfs_readdir_record_entry_cache_miss(d_inode(parent));
		dput(parent);
	}
}

static void nfs_readdirplus_parent_cache_hit(struct dentry *dentry)
{
	if (!IS_ROOT(dentry)) {
		struct dentry *parent = dget_parent(dentry);
		nfs_readdir_record_entry_cache_hit(d_inode(parent));
		dput(parent);
	}
}

static u32 nfs_get_valid_attrmask(struct inode *inode)
{
	u64 fattr_valid = NFS_SERVER(inode)->fattr_valid;
	unsigned long cache_validity = READ_ONCE(NFS_I(inode)->cache_validity);
	u32 reply_mask = STATX_INO | STATX_TYPE;

	if (!(cache_validity & NFS_INO_INVALID_ATIME))
		reply_mask |= STATX_ATIME;
	if (!(cache_validity & NFS_INO_INVALID_CTIME))
		reply_mask |= STATX_CTIME;
	if (!(cache_validity & NFS_INO_INVALID_MTIME))
		reply_mask |= STATX_MTIME;
	if (!(cache_validity & NFS_INO_INVALID_SIZE))
		reply_mask |= STATX_SIZE;
	if (!(cache_validity & NFS_INO_INVALID_NLINK))
		reply_mask |= STATX_NLINK;
	if (!(cache_validity & NFS_INO_INVALID_MODE))
		reply_mask |= STATX_MODE;
	if (!(cache_validity & NFS_INO_INVALID_OTHER))
		reply_mask |= STATX_UID | STATX_GID;
	if (!(cache_validity & NFS_INO_INVALID_BLOCKS))
		reply_mask |= STATX_BLOCKS;
	if (!(cache_validity & NFS_INO_INVALID_BTIME) &&
	    (fattr_valid & NFS_ATTR_FATTR_BTIME))
		reply_mask |= STATX_BTIME;
	if (!(cache_validity & NFS_INO_INVALID_CHANGE))
		reply_mask |= STATX_CHANGE_COOKIE;
	return reply_mask;
}
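
/*
 * nfs_getattr - stat()/statx() entry point: answer from the attribute
 * cache whenever possible, and only issue a GETATTR RPC when the
 * requested attributes are stale or a forced revalidation was asked for.
 */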
int nfs_getattr(struct mnt_idmap *idmap, const struct path *path,
		struct kstat *stat, u32 request_mask, unsigned int query_flags)
{
	struct inode *inode = d_inode(path->dentry);
	struct nfs_server *server = NFS_SERVER(inode);
	u64 fattr_valid = server->fattr_valid;
	unsigned long cache_validity;
	int err = 0;
	bool force_sync = query_flags & AT_STATX_FORCE_SYNC;
	bool do_update = false;
	bool readdirplus_enabled = nfs_getattr_readdirplus_enable(inode);

	trace_nfs_getattr_enter(inode);

	request_mask &= STATX_TYPE | STATX_MODE | STATX_NLINK | STATX_UID |
			STATX_GID | STATX_ATIME | STATX_MTIME | STATX_CTIME |
			STATX_INO | STATX_SIZE | STATX_BLOCKS | STATX_BTIME |
			STATX_CHANGE_COOKIE;
	if (!(fattr_valid & NFS_ATTR_FATTR_BTIME))
		request_mask &= ~STATX_BTIME;

	if ((query_flags & AT_STATX_DONT_SYNC) && !force_sync) {
		if (readdirplus_enabled)
			nfs_readdirplus_parent_cache_hit(path->dentry);
		goto out_no_revalidate;
	}

	/* Flush out writes to the server in order to update c/mtime/version. */
	if ((request_mask & (STATX_CTIME | STATX_MTIME | STATX_CHANGE_COOKIE)) &&
	    S_ISREG(inode->i_mode)) {
		if (nfs_have_delegated_mtime(inode))
			filemap_fdatawrite(inode->i_mapping);
		else
			filemap_write_and_wait(inode->i_mapping);
	}

	/*
	 * We may force a getattr if the user cares about atime.
	 *
	 * Note that we only have to check the vfsmount flags here:
	 *  - NFS always sets S_NOATIME by so checking it would give a
	 *    bogus result
	 *  - NFS never sets SB_NOATIME or SB_NODIRATIME so there is
	 *    no point in checking those.
	 */
	if ((path->mnt->mnt_flags & MNT_NOATIME) ||
	    ((path->mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)))
		request_mask &= ~STATX_ATIME;

	/* Is the user requesting attributes that might need revalidation?
	 */
	if (!(request_mask & (STATX_MODE|STATX_NLINK|STATX_ATIME|STATX_CTIME|
					STATX_MTIME|STATX_UID|STATX_GID|
					STATX_SIZE|STATX_BLOCKS|STATX_BTIME|
					STATX_CHANGE_COOKIE)))
		goto out_no_revalidate;

	/* Check whether the cached attributes are stale */
	do_update |= force_sync || nfs_attribute_cache_expired(inode);
	cache_validity = READ_ONCE(NFS_I(inode)->cache_validity);
	do_update |= cache_validity & NFS_INO_INVALID_CHANGE;
	if (request_mask & STATX_ATIME)
		do_update |= cache_validity & NFS_INO_INVALID_ATIME;
	if (request_mask & STATX_CTIME)
		do_update |= cache_validity & NFS_INO_INVALID_CTIME;
	if (request_mask & STATX_MTIME)
		do_update |= cache_validity & NFS_INO_INVALID_MTIME;
	if (request_mask & STATX_SIZE)
		do_update |= cache_validity & NFS_INO_INVALID_SIZE;
	if (request_mask & STATX_NLINK)
		do_update |= cache_validity & NFS_INO_INVALID_NLINK;
	if (request_mask & STATX_MODE)
		do_update |= cache_validity & NFS_INO_INVALID_MODE;
	if (request_mask & (STATX_UID | STATX_GID))
		do_update |= cache_validity & NFS_INO_INVALID_OTHER;
	if (request_mask & STATX_BLOCKS)
		do_update |= cache_validity & NFS_INO_INVALID_BLOCKS;
	if (request_mask & STATX_BTIME)
		do_update |= cache_validity & NFS_INO_INVALID_BTIME;

	if (do_update) {
		if (readdirplus_enabled)
			nfs_readdirplus_parent_cache_miss(path->dentry);
		err = __nfs_revalidate_inode(server, inode);
		if (err)
			goto out;
	} else if (readdirplus_enabled)
		nfs_readdirplus_parent_cache_hit(path->dentry);

out_no_revalidate:
	/* Only return attributes that were revalidated. */
	stat->result_mask = nfs_get_valid_attrmask(inode) | request_mask;

	generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
	stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
	stat->change_cookie = inode_peek_iversion_raw(inode);
	stat->attributes_mask |= STATX_ATTR_CHANGE_MONOTONIC;
	if (server->change_attr_type != NFS4_CHANGE_TYPE_IS_UNDEFINED)
		stat->attributes |= STATX_ATTR_CHANGE_MONOTONIC;
	if (S_ISDIR(inode->i_mode))
		stat->blksize = NFS_SERVER(inode)->dtsize;
	stat->btime = NFS_I(inode)->btime;

	/* Special handling for STATX_DIOALIGN and STATX_DIO_READ_ALIGN
	 * - NFS doesn't have DIO alignment constraints, avoid getting
	 *   these DIO attrs from remote and just respond with most
	 *   accommodating limits (so client will issue supported DIO).
	 * - this is unintuitive, but the most coarse-grained
	 *   dio_offset_align is the most accommodating.
	 */
	if ((request_mask & (STATX_DIOALIGN | STATX_DIO_READ_ALIGN)) &&
	    S_ISREG(inode->i_mode)) {
		stat->result_mask |= STATX_DIOALIGN | STATX_DIO_READ_ALIGN;
		stat->dio_mem_align = 4; /* 4-byte alignment */
		stat->dio_offset_align = PAGE_SIZE;
		stat->dio_read_offset_align = stat->dio_offset_align;
	}
out:
	trace_nfs_getattr_exit(inode, err);
	return err;
}
EXPORT_SYMBOL_GPL(nfs_getattr);
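
/*
 * Lock contexts: one per lock owner (struct files_struct) within an open
 * context. They are looked up under RCU and reference counted.
 */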
static void nfs_init_lock_context(struct nfs_lock_context *l_ctx)
{
	refcount_set(&l_ctx->count, 1);
	l_ctx->lockowner = current->files;
	INIT_LIST_HEAD(&l_ctx->list);
	atomic_set(&l_ctx->io_count, 0);
}

static struct nfs_lock_context *__nfs_find_lock_context(struct nfs_open_context *ctx)
{
	struct nfs_lock_context *pos;

	list_for_each_entry_rcu(pos, &ctx->lock_context.list, list) {
		if (pos->lockowner != current->files)
			continue;
		if (refcount_inc_not_zero(&pos->count))
			return pos;
	}
	return NULL;
}

struct nfs_lock_context *nfs_get_lock_context(struct nfs_open_context *ctx)
{
	struct nfs_lock_context *res, *new = NULL;
	struct inode *inode = d_inode(ctx->dentry);

	rcu_read_lock();
	res = __nfs_find_lock_context(ctx);
	rcu_read_unlock();
	if (res == NULL) {
		new = kmalloc(sizeof(*new), GFP_KERNEL_ACCOUNT);
		if (new == NULL)
			return ERR_PTR(-ENOMEM);
		nfs_init_lock_context(new);
		spin_lock(&inode->i_lock);
		res = __nfs_find_lock_context(ctx);
		if (res == NULL) {
			new->open_context = get_nfs_open_context(ctx);
			if (new->open_context) {
				list_add_tail_rcu(&new->list,
						&ctx->lock_context.list);
				res = new;
				new = NULL;
			} else
				res = ERR_PTR(-EBADF);
		}
		spin_unlock(&inode->i_lock);
		kfree(new);
	}
	return res;
}
EXPORT_SYMBOL_GPL(nfs_get_lock_context);

void nfs_put_lock_context(struct nfs_lock_context *l_ctx)
{
	struct nfs_open_context *ctx = l_ctx->open_context;
	struct inode *inode = d_inode(ctx->dentry);

	if (!refcount_dec_and_lock(&l_ctx->count, &inode->i_lock))
		return;
	list_del_rcu(&l_ctx->list);
	spin_unlock(&inode->i_lock);
	put_nfs_open_context(ctx);
	kfree_rcu(l_ctx, rcu_head);
}
EXPORT_SYMBOL_GPL(nfs_put_lock_context);

/**
 * nfs_close_context - Common close_context() routine NFSv2/v3
 * @ctx: pointer to context
 * @is_sync: is this a synchronous close
 *
 * Ensure that the attributes are up to date if we're mounted
 * with close-to-open semantics and we have cached data that will
 * need to be revalidated on open.
 */
void nfs_close_context(struct nfs_open_context *ctx, int is_sync)
{
	struct nfs_inode *nfsi;
	struct inode *inode;

	if (!(ctx->mode & FMODE_WRITE))
		return;
	if (!is_sync)
		return;
	inode = d_inode(ctx->dentry);
	if (nfs_have_read_or_write_delegation(inode))
		return;
	nfsi = NFS_I(inode);
	if (inode->i_mapping->nrpages == 0)
		return;
	if (nfsi->cache_validity & NFS_INO_INVALID_DATA)
		return;
	if (!list_empty(&nfsi->open_files))
		return;
	if (NFS_SERVER(inode)->flags & NFS_MOUNT_NOCTO)
		return;
	nfs_revalidate_inode(inode,
			     NFS_INO_INVALID_CHANGE | NFS_INO_INVALID_SIZE);
}
EXPORT_SYMBOL_GPL(nfs_close_context);
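
/*
 * alloc_nfs_open_context - allocate an open context carrying the
 * credentials and open mode that subsequent I/O on this file will use.
 */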
struct nfs_open_context *alloc_nfs_open_context(struct dentry *dentry,
						fmode_t f_mode,
						struct file *filp)
{
	struct nfs_open_context *ctx;

	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
	if (!ctx)
		return ERR_PTR(-ENOMEM);
	nfs_sb_active(dentry->d_sb);
	ctx->dentry = dget(dentry);
	if (filp)
		ctx->cred = get_cred(filp->f_cred);
	else
		ctx->cred = get_current_cred();
	rcu_assign_pointer(ctx->ll_cred, NULL);
	ctx->state = NULL;
	ctx->mode = f_mode;
	ctx->flags = 0;
	ctx->error = 0;
	ctx->flock_owner = (fl_owner_t)filp;
	nfs_init_lock_context(&ctx->lock_context);
	ctx->lock_context.open_context = ctx;
	INIT_LIST_HEAD(&ctx->list);
	ctx->mdsthreshold = NULL;
	nfs_localio_file_init(&ctx->nfl);

	return ctx;
}
EXPORT_SYMBOL_GPL(alloc_nfs_open_context);

struct nfs_open_context *get_nfs_open_context(struct nfs_open_context *ctx)
{
	if (ctx != NULL && refcount_inc_not_zero(&ctx->lock_context.count))
		return ctx;
	return NULL;
}
EXPORT_SYMBOL_GPL(get_nfs_open_context);

static void __put_nfs_open_context(struct nfs_open_context *ctx, int is_sync)
{
	struct inode *inode = d_inode(ctx->dentry);
	struct super_block *sb = ctx->dentry->d_sb;

	if (!refcount_dec_and_test(&ctx->lock_context.count))
		return;
	if (!list_empty(&ctx->list)) {
		spin_lock(&inode->i_lock);
		list_del_rcu(&ctx->list);
		spin_unlock(&inode->i_lock);
	}
	if (inode != NULL)
		NFS_PROTO(inode)->close_context(ctx, is_sync);
	put_cred(ctx->cred);
	dput(ctx->dentry);
	nfs_sb_deactive(sb);
	put_rpccred(rcu_dereference_protected(ctx->ll_cred, 1));
	kfree(ctx->mdsthreshold);
	nfs_close_local_fh(&ctx->nfl);
	kfree_rcu(ctx, rcu_head);
}

void put_nfs_open_context(struct nfs_open_context *ctx)
{
	__put_nfs_open_context(ctx, 0);
}
EXPORT_SYMBOL_GPL(put_nfs_open_context);

static void put_nfs_open_context_sync(struct nfs_open_context *ctx)
{
	__put_nfs_open_context(ctx, 1);
}

/*
 * Ensure that mmap has a recent RPC credential for use when writing out
 * shared pages
 */
void nfs_inode_attach_open_context(struct nfs_open_context *ctx)
{
	struct inode *inode = d_inode(ctx->dentry);
	struct nfs_inode *nfsi = NFS_I(inode);

	spin_lock(&inode->i_lock);
	if (list_empty(&nfsi->open_files) &&
	    nfs_ooo_test(nfsi))
		nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA |
					     NFS_INO_REVAL_FORCED);
	list_add_tail_rcu(&ctx->list, &nfsi->open_files);
	spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL_GPL(nfs_inode_attach_open_context);

void nfs_file_set_open_context(struct file *filp, struct nfs_open_context *ctx)
{
	filp->private_data = get_nfs_open_context(ctx);
	set_bit(NFS_CONTEXT_FILE_OPEN, &ctx->flags);
	if (list_empty(&ctx->list))
		nfs_inode_attach_open_context(ctx);
}
EXPORT_SYMBOL_GPL(nfs_file_set_open_context);

/*
 * Given an inode, search for an open context with the desired characteristics
 */
struct nfs_open_context *nfs_find_open_context(struct inode *inode, const struct cred *cred, fmode_t mode)
{
	struct nfs_inode *nfsi = NFS_I(inode);
	struct nfs_open_context *pos, *ctx = NULL;

	rcu_read_lock();
	list_for_each_entry_rcu(pos, &nfsi->open_files, list) {
		if (cred != NULL && cred_fscmp(pos->cred, cred) != 0)
			continue;
		if ((pos->mode & (FMODE_READ|FMODE_WRITE)) != mode)
			continue;
		if (!test_bit(NFS_CONTEXT_FILE_OPEN, &pos->flags))
			continue;
		ctx = get_nfs_open_context(pos);
		if (ctx)
			break;
	}
	rcu_read_unlock();
	return ctx;
}

void nfs_file_clear_open_context(struct file *filp)
{
	struct nfs_open_context *ctx = nfs_file_open_context(filp);

	if (ctx) {
		struct inode *inode = d_inode(ctx->dentry);

		clear_bit(NFS_CONTEXT_FILE_OPEN, &ctx->flags);
		/*
		 * We fatal error on write before. Try to writeback
		 * every page again.
		 */
		if (ctx->error < 0)
			invalidate_inode_pages2(inode->i_mapping);
		filp->private_data = NULL;
		put_nfs_open_context_sync(ctx);
	}
}

/*
 * These allocate and release file read/write context information.
 */
int nfs_open(struct inode *inode, struct file *filp)
{
	struct nfs_open_context *ctx;

	ctx = alloc_nfs_open_context(file_dentry(filp),
				     flags_to_mode(filp->f_flags), filp);
	if (IS_ERR(ctx))
		return PTR_ERR(ctx);
	nfs_file_set_open_context(filp, ctx);
	put_nfs_open_context(ctx);
	nfs_fscache_open_file(inode, filp);
	return 0;
}

/*
 * This function is called whenever some part of NFS notices that
 * the cached attributes have to be refreshed.
 */
int
__nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
{
	int status = -ESTALE;
	struct nfs_fattr *fattr = NULL;
	struct nfs_inode *nfsi = NFS_I(inode);

	dfprintk(PAGECACHE, "NFS: revalidating (%s/%Lu)\n",
		inode->i_sb->s_id, (unsigned long long)NFS_FILEID(inode));

	trace_nfs_revalidate_inode_enter(inode);

	if (is_bad_inode(inode))
		goto out;
	if (NFS_STALE(inode))
		goto out;

	/* pNFS: Attributes aren't updated until we layoutcommit */
	if (S_ISREG(inode->i_mode)) {
		status = pnfs_sync_inode(inode, false);
		if (status)
			goto out;
	} else if (nfs_have_directory_delegation(inode)) {
		status = 0;
		goto out;
	}

	status = -ENOMEM;
	fattr = nfs_alloc_fattr_with_label(NFS_SERVER(inode));
	if (fattr == NULL)
		goto out;

	nfs_inc_stats(inode, NFSIOS_INODEREVALIDATE);

	status = NFS_PROTO(inode)->getattr(server, NFS_FH(inode), fattr, inode);
	if (status != 0) {
		dfprintk(PAGECACHE, "nfs_revalidate_inode: (%s/%Lu) getattr failed, error=%d\n",
			 inode->i_sb->s_id,
			 (unsigned long long)NFS_FILEID(inode), status);
		switch (status) {
		case -ETIMEDOUT:
			/* A soft timeout occurred. Use cached information? */
			if (server->flags & NFS_MOUNT_SOFTREVAL)
				status = 0;
			break;
		case -ESTALE:
			if (!S_ISDIR(inode->i_mode))
				nfs_set_inode_stale(inode);
			else
				nfs_zap_caches(inode);
		}
		goto out;
	}

	status = nfs_refresh_inode(inode, fattr);
	if (status) {
		dfprintk(PAGECACHE, "nfs_revalidate_inode: (%s/%Lu) refresh failed, error=%d\n",
			 inode->i_sb->s_id,
			 (unsigned long long)NFS_FILEID(inode), status);
		goto out;
	}

	if (nfsi->cache_validity & NFS_INO_INVALID_ACL)
		nfs_zap_acl_cache(inode);

	nfs_setsecurity(inode, fattr);

	dfprintk(PAGECACHE, "NFS: (%s/%Lu) revalidation complete\n",
		inode->i_sb->s_id,
		(unsigned long long)NFS_FILEID(inode));

out:
	nfs_free_fattr(fattr);
	trace_nfs_revalidate_inode_exit(inode, status);
	return status;
}

int nfs_attribute_cache_expired(struct inode *inode)
{
	if (nfs_have_delegated_attributes(inode))
		return 0;
	return nfs_attribute_timeout(inode);
}

/**
 * nfs_revalidate_inode - Revalidate the inode attributes
 * @inode: pointer to inode struct
 * @flags: cache flags to check
 *
 * Updates inode attribute information by retrieving the data from the server.
 */
int nfs_revalidate_inode(struct inode *inode, unsigned long flags)
{
	if (!nfs_check_cache_invalid(inode, flags))
		return NFS_STALE(inode) ? -ESTALE : 0;
	return __nfs_revalidate_inode(NFS_SERVER(inode), inode);
}
EXPORT_SYMBOL_GPL(nfs_revalidate_inode);
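
/*
 * nfs_invalidate_mapping - flush any dirty pages, then throw away all
 * cached pages for this mapping so that reads go back to the server.
 */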
static int nfs_invalidate_mapping(struct inode *inode,
		struct address_space *mapping)
{
	int ret;

	nfs_fscache_invalidate(inode, 0);
	if (mapping->nrpages != 0) {
		if (S_ISREG(inode->i_mode)) {
			ret = nfs_sync_mapping(mapping);
			if (ret < 0)
				return ret;
		}
		ret = invalidate_inode_pages2(mapping);
		if (ret < 0)
			return ret;
	}
	nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);

	dfprintk(PAGECACHE, "NFS: (%s/%Lu) data cache invalidated\n",
			inode->i_sb->s_id,
			(unsigned long long)NFS_FILEID(inode));
	return 0;
}

/**
 * nfs_clear_invalid_mapping - Conditionally clear a mapping
 * @mapping: pointer to mapping
 *
 * If the NFS_INO_INVALID_DATA inode flag is set, clear the mapping.
 */
int nfs_clear_invalid_mapping(struct address_space *mapping)
{
	struct inode *inode = mapping->host;
	struct nfs_inode *nfsi = NFS_I(inode);
	unsigned long *bitlock = &nfsi->flags;
	int ret = 0;

	/*
	 * We must clear NFS_INO_INVALID_DATA first to ensure that
	 * invalidations that come in while we're shooting down the mappings
	 * are respected. But, that leaves a race window where one revalidator
	 * can clear the flag, and then another checks it before the mapping
	 * gets invalidated. Fix that by serializing access to this part of
	 * the function.
	 *
	 * At the same time, we need to allow other tasks to see whether we
	 * might be in the middle of invalidating the pages, so we only set
	 * the bit lock here if it looks like we're going to be doing that.
	 */
	for (;;) {
		ret = wait_on_bit_action(bitlock, NFS_INO_INVALIDATING,
					 nfs_wait_bit_killable,
					 TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
		if (ret)
			goto out;
		smp_rmb(); /* pairs with smp_wmb() below */
		if (test_bit(NFS_INO_INVALIDATING, bitlock))
			continue;
		/* pairs with nfs_set_cache_invalid()'s smp_store_release() */
		if (!(smp_load_acquire(&nfsi->cache_validity) & NFS_INO_INVALID_DATA))
			goto out;
		/* Slow-path that double-checks with spinlock held */
		spin_lock(&inode->i_lock);
		if (test_bit(NFS_INO_INVALIDATING, bitlock)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		if (nfsi->cache_validity & NFS_INO_INVALID_DATA)
			break;
		spin_unlock(&inode->i_lock);
		goto out;
	}

	set_bit(NFS_INO_INVALIDATING, bitlock);
	smp_wmb();
	nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
	nfs_ooo_clear(nfsi);
	spin_unlock(&inode->i_lock);
	trace_nfs_invalidate_mapping_enter(inode);
	ret = nfs_invalidate_mapping(inode, mapping);
	trace_nfs_invalidate_mapping_exit(inode, ret);

	clear_bit_unlock(NFS_INO_INVALIDATING, bitlock);
	smp_mb__after_atomic();
	wake_up_bit(bitlock, NFS_INO_INVALIDATING);
out:
	return ret;
}

bool nfs_mapping_need_revalidate_inode(struct inode *inode)
{
	return nfs_check_cache_invalid(inode, NFS_INO_INVALID_CHANGE) ||
		NFS_STALE(inode);
}

int nfs_revalidate_mapping_rcu(struct inode *inode)
{
	struct nfs_inode *nfsi = NFS_I(inode);
	unsigned long *bitlock = &nfsi->flags;
	int ret = 0;

	if (IS_SWAPFILE(inode))
		goto out;
	if (nfs_mapping_need_revalidate_inode(inode)) {
		ret = -ECHILD;
		goto out;
	}
	spin_lock(&inode->i_lock);
	if (test_bit(NFS_INO_INVALIDATING, bitlock) ||
	    (nfsi->cache_validity & NFS_INO_INVALID_DATA))
		ret = -ECHILD;
	spin_unlock(&inode->i_lock);
out:
	return ret;
}

/**
 * nfs_revalidate_mapping - Revalidate the pagecache
 * @inode: pointer to host inode
 * @mapping: pointer to mapping
 */
int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
{
	/* swapfiles are not supposed to be shared. */
	if (IS_SWAPFILE(inode))
		return 0;

	if (nfs_mapping_need_revalidate_inode(inode)) {
		int ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
		if (ret < 0)
			return ret;
	}

	return nfs_clear_invalid_mapping(mapping);
}

static bool nfs_file_has_writers(struct nfs_inode *nfsi)
{
	struct inode *inode = &nfsi->vfs_inode;

	if (!S_ISREG(inode->i_mode))
		return false;
	if (list_empty(&nfsi->open_files))
		return false;
	return inode_is_open_for_write(inode);
}

static bool nfs_file_has_buffered_writers(struct nfs_inode *nfsi)
{
	return nfs_file_has_writers(nfsi) && nfs_file_io_is_buffered(nfsi);
}
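
/*
 * nfs_wcc_update_inode - apply weak cache consistency data: when the
 * pre-operation attributes in @fattr match what is currently cached, the
 * post-operation values can be accepted without invalidating the cache.
 */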
static void nfs_wcc_update_inode(struct inode *inode, struct nfs_fattr *fattr)
{
	struct timespec64 ts;

	if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE)
			&& (fattr->valid & NFS_ATTR_FATTR_CHANGE)
			&& inode_eq_iversion_raw(inode, fattr->pre_change_attr)) {
		inode_set_iversion_raw(inode, fattr->change_attr);
		if (S_ISDIR(inode->i_mode))
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
		else if (nfs_server_capable(inode, NFS_CAP_XATTR))
			nfs_set_cache_invalid(inode, NFS_INO_INVALID_XATTR);
	}
	/* If we have atomic WCC data, we may update some attributes */
	ts = inode_get_ctime(inode);
	if ((fattr->valid & NFS_ATTR_FATTR_PRECTIME)
			&& (fattr->valid & NFS_ATTR_FATTR_CTIME)
			&& timespec64_equal(&ts, &fattr->pre_ctime)) {
		inode_set_ctime_to_ts(inode, fattr->ctime);
	}

	ts = inode_get_mtime(inode);
	if ((fattr->valid & NFS_ATTR_FATTR_PREMTIME)
			&& (fattr->valid & NFS_ATTR_FATTR_MTIME)
			&& timespec64_equal(&ts, &fattr->pre_mtime)) {
		inode_set_mtime_to_ts(inode, fattr->mtime);
	}
	if ((fattr->valid & NFS_ATTR_FATTR_PRESIZE)
			&& (fattr->valid & NFS_ATTR_FATTR_SIZE)
			&& i_size_read(inode) == nfs_size_to_loff_t(fattr->pre_size)
			&& !nfs_have_writebacks(inode)) {
		trace_nfs_size_wcc(inode, fattr->size);
		i_size_write(inode, nfs_size_to_loff_t(fattr->size));
	}
}

/**
 * nfs_check_inode_attributes - verify consistency of the inode attribute cache
 * @inode: pointer to inode
 * @fattr: updated attributes
 *
 * Verifies the attribute cache. If we have just changed the attributes,
 * so that fattr carries weak cache consistency data, then it may
 * also update the ctime/mtime/change_attribute.
 */
static int nfs_check_inode_attributes(struct inode *inode, struct nfs_fattr *fattr)
{
	struct nfs_inode *nfsi = NFS_I(inode);
	loff_t cur_size, new_isize;
	unsigned long invalid = 0;
	struct timespec64 ts;

	if (nfs_have_delegated_attributes(inode))
		return 0;

	if (!(fattr->valid & NFS_ATTR_FATTR_FILEID)) {
		/* Only a mounted-on-fileid? Just exit */
		if (fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID)
			return 0;
	/* Has the inode gone and changed behind our back? */
	} else if (nfsi->fileid != fattr->fileid) {
		/* Is this perhaps the mounted-on fileid?
*/ if ((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID) && nfsi->fileid == fattr->mounted_on_fileid) return 0; return -ESTALE; } if ((fattr->valid & NFS_ATTR_FATTR_TYPE) && inode_wrong_type(inode, fattr->mode)) return -ESTALE; if (!nfs_file_has_buffered_writers(nfsi)) { /* Verify a few of the more important attributes */ if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && !inode_eq_iversion_raw(inode, fattr->change_attr)) invalid |= NFS_INO_INVALID_CHANGE; ts = inode_get_mtime(inode); if ((fattr->valid & NFS_ATTR_FATTR_MTIME) && !timespec64_equal(&ts, &fattr->mtime)) invalid |= NFS_INO_INVALID_MTIME; ts = inode_get_ctime(inode); if ((fattr->valid & NFS_ATTR_FATTR_CTIME) && !timespec64_equal(&ts, &fattr->ctime)) invalid |= NFS_INO_INVALID_CTIME; if (fattr->valid & NFS_ATTR_FATTR_SIZE) { cur_size = i_size_read(inode); new_isize = nfs_size_to_loff_t(fattr->size); if (cur_size != new_isize) invalid |= NFS_INO_INVALID_SIZE; } } /* Have any file permissions changed? */ if ((fattr->valid & NFS_ATTR_FATTR_MODE) && (inode->i_mode & S_IALLUGO) != (fattr->mode & S_IALLUGO)) invalid |= NFS_INO_INVALID_MODE; if ((fattr->valid & NFS_ATTR_FATTR_OWNER) && !uid_eq(inode->i_uid, fattr->uid)) invalid |= NFS_INO_INVALID_OTHER; if ((fattr->valid & NFS_ATTR_FATTR_GROUP) && !gid_eq(inode->i_gid, fattr->gid)) invalid |= NFS_INO_INVALID_OTHER; /* Has the link count changed? */ if ((fattr->valid & NFS_ATTR_FATTR_NLINK) && inode->i_nlink != fattr->nlink) invalid |= NFS_INO_INVALID_NLINK; ts = inode_get_atime(inode); if ((fattr->valid & NFS_ATTR_FATTR_ATIME) && !timespec64_equal(&ts, &fattr->atime)) invalid |= NFS_INO_INVALID_ATIME; if (invalid != 0) nfs_set_cache_invalid(inode, invalid); nfsi->read_cache_jiffies = fattr->time_start; return 0; } static atomic_long_t nfs_attr_generation_counter; static unsigned long nfs_read_attr_generation_counter(void) { return atomic_long_read(&nfs_attr_generation_counter); } unsigned long nfs_inc_attr_generation_counter(void) { return atomic_long_inc_return(&nfs_attr_generation_counter); } EXPORT_SYMBOL_GPL(nfs_inc_attr_generation_counter); void nfs_fattr_init(struct nfs_fattr *fattr) { fattr->valid = 0; fattr->time_start = jiffies; fattr->gencount = nfs_inc_attr_generation_counter(); fattr->owner_name = NULL; fattr->group_name = NULL; fattr->mdsthreshold = NULL; } EXPORT_SYMBOL_GPL(nfs_fattr_init); /** * nfs_fattr_set_barrier * @fattr: attributes * * Used to set a barrier after an attribute was updated. This * barrier ensures that older attributes from RPC calls that may * have raced with our update cannot clobber these new values. * Note that you are still responsible for ensuring that other * operations which change the attribute on the server do not * collide. 
*/ void nfs_fattr_set_barrier(struct nfs_fattr *fattr) { fattr->gencount = nfs_inc_attr_generation_counter(); } struct nfs_fattr *nfs_alloc_fattr(void) { struct nfs_fattr *fattr; fattr = kmalloc(sizeof(*fattr), GFP_KERNEL); if (fattr != NULL) { nfs_fattr_init(fattr); fattr->label = NULL; } return fattr; } EXPORT_SYMBOL_GPL(nfs_alloc_fattr); struct nfs_fattr *nfs_alloc_fattr_with_label(struct nfs_server *server) { struct nfs_fattr *fattr = nfs_alloc_fattr(); if (!fattr) return NULL; fattr->label = nfs4_label_alloc(server, GFP_KERNEL); if (IS_ERR(fattr->label)) { kfree(fattr); return NULL; } return fattr; } EXPORT_SYMBOL_GPL(nfs_alloc_fattr_with_label); struct nfs_fh *nfs_alloc_fhandle(void) { struct nfs_fh *fh; fh = kmalloc(sizeof(struct nfs_fh), GFP_KERNEL); if (fh != NULL) fh->size = 0; return fh; } EXPORT_SYMBOL_GPL(nfs_alloc_fhandle); #ifdef NFS_DEBUG /* * _nfs_display_fhandle_hash - calculate the crc32 hash for the filehandle * in the same way that wireshark does * * @fh: file handle * * For debugging only. */ u32 _nfs_display_fhandle_hash(const struct nfs_fh *fh) { /* wireshark uses 32-bit AUTODIN crc and does a bitwise * not on the result */ return nfs_fhandle_hash(fh); } EXPORT_SYMBOL_GPL(_nfs_display_fhandle_hash); /* * _nfs_display_fhandle - display an NFS file handle on the console * * @fh: file handle to display * @caption: display caption * * For debugging only. */ void _nfs_display_fhandle(const struct nfs_fh *fh, const char *caption) { unsigned short i; if (fh == NULL || fh->size == 0) { printk(KERN_DEFAULT "%s at %p is empty\n", caption, fh); return; } printk(KERN_DEFAULT "%s at %p is %u bytes, crc: 0x%08x:\n", caption, fh, fh->size, _nfs_display_fhandle_hash(fh)); for (i = 0; i < fh->size; i += 16) { __be32 *pos = (__be32 *)&fh->data[i]; switch ((fh->size - i - 1) >> 2) { case 0: printk(KERN_DEFAULT " %08x\n", be32_to_cpup(pos)); break; case 1: printk(KERN_DEFAULT " %08x %08x\n", be32_to_cpup(pos), be32_to_cpup(pos + 1)); break; case 2: printk(KERN_DEFAULT " %08x %08x %08x\n", be32_to_cpup(pos), be32_to_cpup(pos + 1), be32_to_cpup(pos + 2)); break; default: printk(KERN_DEFAULT " %08x %08x %08x %08x\n", be32_to_cpup(pos), be32_to_cpup(pos + 1), be32_to_cpup(pos + 2), be32_to_cpup(pos + 3)); } } } EXPORT_SYMBOL_GPL(_nfs_display_fhandle); #endif /** * nfs_inode_attrs_cmp_generic - compare attributes * @fattr: attributes * @inode: pointer to inode * * Attempt to divine whether or not an RPC call reply carrying stale * attributes got scheduled after another call carrying updated ones. * Note also the check for wraparound of 'attr_gencount' * * The function returns '1' if it thinks the attributes in @fattr are * more recent than the ones cached in @inode. Otherwise it returns * the value '0'. */ static int nfs_inode_attrs_cmp_generic(const struct nfs_fattr *fattr, const struct inode *inode) { unsigned long attr_gencount = NFS_I(inode)->attr_gencount; return (long)(fattr->gencount - attr_gencount) > 0 || (long)(attr_gencount - nfs_read_attr_generation_counter()) > 0; } /** * nfs_inode_attrs_cmp_monotonic - compare attributes * @fattr: attributes * @inode: pointer to inode * * Attempt to divine whether or not an RPC call reply carrying stale * attributes got scheduled after another call carrying updated ones. * * We assume that the server observes monotonic semantics for * the change attribute, so a larger value means that the attributes in * @fattr are more recent, in which case the function returns the * value '1'. 
* A return value of '0' indicates no measurable change * A return value of '-1' means that the attributes in @inode are * more recent. */ static int nfs_inode_attrs_cmp_monotonic(const struct nfs_fattr *fattr, const struct inode *inode) { s64 diff = fattr->change_attr - inode_peek_iversion_raw(inode); if (diff > 0) return 1; return diff == 0 ? 0 : -1; } /** * nfs_inode_attrs_cmp_strict_monotonic - compare attributes * @fattr: attributes * @inode: pointer to inode * * Attempt to divine whether or not an RPC call reply carrying stale * attributes got scheduled after another call carrying updated ones. * * We assume that the server observes strictly monotonic semantics for * the change attribute, so a larger value means that the attributes in * @fattr are more recent, in which case the function returns the * value '1'. * A return value of '-1' means that the attributes in @inode are * more recent or unchanged. */ static int nfs_inode_attrs_cmp_strict_monotonic(const struct nfs_fattr *fattr, const struct inode *inode) { return nfs_inode_attrs_cmp_monotonic(fattr, inode) > 0 ? 1 : -1; } /** * nfs_inode_attrs_cmp - compare attributes * @fattr: attributes * @inode: pointer to inode * * This function returns '1' if it thinks the attributes in @fattr are * more recent than the ones cached in @inode. It returns '-1' if * the attributes in @inode are more recent than the ones in @fattr, * and it returns 0 if not sure. */ static int nfs_inode_attrs_cmp(const struct nfs_fattr *fattr, const struct inode *inode) { if (nfs_inode_attrs_cmp_generic(fattr, inode) > 0) return 1; switch (NFS_SERVER(inode)->change_attr_type) { case NFS4_CHANGE_TYPE_IS_UNDEFINED: break; case NFS4_CHANGE_TYPE_IS_TIME_METADATA: if (!(fattr->valid & NFS_ATTR_FATTR_CHANGE)) break; return nfs_inode_attrs_cmp_monotonic(fattr, inode); default: if (!(fattr->valid & NFS_ATTR_FATTR_CHANGE)) break; return nfs_inode_attrs_cmp_strict_monotonic(fattr, inode); } return 0; } /** * nfs_inode_finish_partial_attr_update - complete a previous inode update * @fattr: attributes * @inode: pointer to inode * * Returns '1' if the last attribute update left the inode cached * attributes in a partially unrevalidated state, and @fattr * matches the change attribute of that partial update. * Otherwise returns '0'. 
*/ static int nfs_inode_finish_partial_attr_update(const struct nfs_fattr *fattr, const struct inode *inode) { const unsigned long check_valid = NFS_INO_INVALID_ATIME | NFS_INO_INVALID_CTIME | NFS_INO_INVALID_MTIME | NFS_INO_INVALID_SIZE | NFS_INO_INVALID_BLOCKS | NFS_INO_INVALID_OTHER | NFS_INO_INVALID_NLINK | NFS_INO_INVALID_BTIME; unsigned long cache_validity = NFS_I(inode)->cache_validity; enum nfs4_change_attr_type ctype = NFS_SERVER(inode)->change_attr_type; if (ctype != NFS4_CHANGE_TYPE_IS_UNDEFINED && !(cache_validity & NFS_INO_INVALID_CHANGE) && (cache_validity & check_valid) != 0 && (fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && nfs_inode_attrs_cmp_monotonic(fattr, inode) == 0) return 1; return 0; } static void nfs_ooo_merge(struct nfs_inode *nfsi, u64 start, u64 end) { int i, cnt; if (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER) /* No point merging anything */ return; if (!nfsi->ooo) { nfsi->ooo = kmalloc(sizeof(*nfsi->ooo), GFP_ATOMIC); if (!nfsi->ooo) { nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER; return; } nfsi->ooo->cnt = 0; } /* add this range, merging if possible */ cnt = nfsi->ooo->cnt; for (i = 0; i < cnt; i++) { if (end == nfsi->ooo->gap[i].start) end = nfsi->ooo->gap[i].end; else if (start == nfsi->ooo->gap[i].end) start = nfsi->ooo->gap[i].start; else continue; /* Remove 'i' from table and loop to insert the new range */ cnt -= 1; nfsi->ooo->gap[i] = nfsi->ooo->gap[cnt]; i = -1; } if (start != end) { if (cnt >= ARRAY_SIZE(nfsi->ooo->gap)) { nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER; kfree(nfsi->ooo); nfsi->ooo = NULL; return; } nfsi->ooo->gap[cnt].start = start; nfsi->ooo->gap[cnt].end = end; cnt += 1; } nfsi->ooo->cnt = cnt; } static void nfs_ooo_record(struct nfs_inode *nfsi, struct nfs_fattr *fattr) { /* This reply was out-of-order, so record in the * pre/post change id, possibly cancelling * gaps created when iversion was jumpped forward. */ if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) && (fattr->valid & NFS_ATTR_FATTR_PRECHANGE)) nfs_ooo_merge(nfsi, fattr->change_attr, fattr->pre_change_attr); } static int nfs_refresh_inode_locked(struct inode *inode, struct nfs_fattr *fattr) { int attr_cmp = nfs_inode_attrs_cmp(fattr, inode); int ret = 0; trace_nfs_refresh_inode_enter(inode); if (attr_cmp > 0 || nfs_inode_finish_partial_attr_update(fattr, inode)) ret = nfs_update_inode(inode, fattr); else { nfs_ooo_record(NFS_I(inode), fattr); if (attr_cmp == 0) ret = nfs_check_inode_attributes(inode, fattr); } trace_nfs_refresh_inode_exit(inode, ret); return ret; } /** * nfs_refresh_inode - try to update the inode attribute cache * @inode: pointer to inode * @fattr: updated attributes * * Check that an RPC call that returned attributes has not overlapped with * other recent updates of the inode metadata, then decide whether it is * safe to do a full update of the inode attributes, or whether just to * call nfs_check_inode_attributes. 
*/ int nfs_refresh_inode(struct inode *inode, struct nfs_fattr *fattr) { int status; if ((fattr->valid & NFS_ATTR_FATTR) == 0) return 0; spin_lock(&inode->i_lock); status = nfs_refresh_inode_locked(inode, fattr); spin_unlock(&inode->i_lock); return status; } EXPORT_SYMBOL_GPL(nfs_refresh_inode); static int nfs_post_op_update_inode_locked(struct inode *inode, struct nfs_fattr *fattr, unsigned int invalid) { if (S_ISDIR(inode->i_mode)) invalid |= NFS_INO_INVALID_DATA; nfs_set_cache_invalid(inode, invalid); if ((fattr->valid & NFS_ATTR_FATTR) == 0) return 0; return nfs_refresh_inode_locked(inode, fattr); } /** * nfs_post_op_update_inode - try to update the inode attribute cache * @inode: pointer to inode * @fattr: updated attributes * * After an operation that has changed the inode metadata, mark the * attribute cache as being invalid, then try to update it. * * NB: if the server didn't return any post op attributes, this * function will force the retrieval of attributes before the next * NFS request. Thus it should be used only for operations that * are expected to change one or more attributes, to avoid * unnecessary NFS requests and trips through nfs_update_inode(). */ int nfs_post_op_update_inode(struct inode *inode, struct nfs_fattr *fattr) { int status; spin_lock(&inode->i_lock); nfs_fattr_set_barrier(fattr); status = nfs_post_op_update_inode_locked(inode, fattr, NFS_INO_INVALID_CHANGE | NFS_INO_INVALID_CTIME | NFS_INO_REVAL_FORCED); spin_unlock(&inode->i_lock); return status; } EXPORT_SYMBOL_GPL(nfs_post_op_update_inode); /** * nfs_post_op_update_inode_force_wcc_locked - update the inode attribute cache * @inode: pointer to inode * @fattr: updated attributes * * After an operation that has changed the inode metadata, mark the * attribute cache as being invalid, then try to update it. Fake up * weak cache consistency data, if none exist. * * This function is mainly designed to be used by the ->write_done() functions. 
*/ int nfs_post_op_update_inode_force_wcc_locked(struct inode *inode, struct nfs_fattr *fattr) { int attr_cmp = nfs_inode_attrs_cmp(fattr, inode); int status; /* Don't do a WCC update if these attributes are already stale */ if (attr_cmp < 0) return 0; if ((fattr->valid & NFS_ATTR_FATTR) == 0 || !attr_cmp) { /* Record the pre/post change info before clearing PRECHANGE */ nfs_ooo_record(NFS_I(inode), fattr); fattr->valid &= ~(NFS_ATTR_FATTR_PRECHANGE | NFS_ATTR_FATTR_PRESIZE | NFS_ATTR_FATTR_PREMTIME | NFS_ATTR_FATTR_PRECTIME); goto out_noforce; } if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && (fattr->valid & NFS_ATTR_FATTR_PRECHANGE) == 0) { fattr->pre_change_attr = inode_peek_iversion_raw(inode); fattr->valid |= NFS_ATTR_FATTR_PRECHANGE; } if ((fattr->valid & NFS_ATTR_FATTR_CTIME) != 0 && (fattr->valid & NFS_ATTR_FATTR_PRECTIME) == 0) { fattr->pre_ctime = inode_get_ctime(inode); fattr->valid |= NFS_ATTR_FATTR_PRECTIME; } if ((fattr->valid & NFS_ATTR_FATTR_MTIME) != 0 && (fattr->valid & NFS_ATTR_FATTR_PREMTIME) == 0) { fattr->pre_mtime = inode_get_mtime(inode); fattr->valid |= NFS_ATTR_FATTR_PREMTIME; } if ((fattr->valid & NFS_ATTR_FATTR_SIZE) != 0 && (fattr->valid & NFS_ATTR_FATTR_PRESIZE) == 0) { fattr->pre_size = i_size_read(inode); fattr->valid |= NFS_ATTR_FATTR_PRESIZE; } out_noforce: status = nfs_post_op_update_inode_locked(inode, fattr, NFS_INO_INVALID_CHANGE | NFS_INO_INVALID_CTIME | NFS_INO_INVALID_MTIME | NFS_INO_INVALID_BLOCKS); return status; } /** * nfs_post_op_update_inode_force_wcc - try to update the inode attribute cache * @inode: pointer to inode * @fattr: updated attributes * * After an operation that has changed the inode metadata, mark the * attribute cache as being invalid, then try to update it. Fake up * weak cache consistency data, if none exist. * * This function is mainly designed to be used by the ->write_done() functions. */ int nfs_post_op_update_inode_force_wcc(struct inode *inode, struct nfs_fattr *fattr) { int status; spin_lock(&inode->i_lock); nfs_fattr_set_barrier(fattr); status = nfs_post_op_update_inode_force_wcc_locked(inode, fattr); spin_unlock(&inode->i_lock); return status; } EXPORT_SYMBOL_GPL(nfs_post_op_update_inode_force_wcc); /* * Many nfs protocol calls return the new file attributes after * an operation. Here we update the inode to reflect the state * of the server's inode. * * This is a bit tricky because we have to make sure all dirty pages * have been sent off to the server before calling invalidate_inode_pages. * To make sure no other process adds more write requests while we try * our best to flush them, we make them sleep during the attribute refresh. * * A very similar scenario holds for the dir cache. */ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr) { struct nfs_server *server = NFS_SERVER(inode); struct nfs_inode *nfsi = NFS_I(inode); loff_t cur_isize, new_isize; u64 fattr_supported = server->fattr_valid; unsigned long invalid = 0; unsigned long now = jiffies; unsigned long save_cache_validity; bool have_writers = nfs_file_has_buffered_writers(nfsi); bool cache_revalidated = true; bool attr_changed = false; bool have_delegation; dfprintk(VFS, "NFS: %s(%s/%lu fh_crc=0x%08x ct=%d info=0x%llx)\n", __func__, inode->i_sb->s_id, inode->i_ino, nfs_display_fhandle_hash(NFS_FH(inode)), icount_read(inode), fattr->valid); if (!(fattr->valid & NFS_ATTR_FATTR_FILEID)) { /* Only a mounted-on-fileid? 
Just exit */ if (fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID) return 0; /* Has the inode gone and changed behind our back? */ } else if (nfsi->fileid != fattr->fileid) { /* Is this perhaps the mounted-on fileid? */ if ((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID) && nfsi->fileid == fattr->mounted_on_fileid) return 0; printk(KERN_ERR "NFS: server %s error: fileid changed\n" "fsid %s: expected fileid 0x%Lx, got 0x%Lx\n", NFS_SERVER(inode)->nfs_client->cl_hostname, inode->i_sb->s_id, (long long)nfsi->fileid, (long long)fattr->fileid); goto out_err; } /* * Make sure the inode's type hasn't changed. */ if ((fattr->valid & NFS_ATTR_FATTR_TYPE) && inode_wrong_type(inode, fattr->mode)) { /* * Big trouble! The inode has become a different object. */ printk(KERN_DEBUG "NFS: %s: inode %lu mode changed, %07o to %07o\n", __func__, inode->i_ino, inode->i_mode, fattr->mode); goto out_err; } /* Update the fsid? */ if (S_ISDIR(inode->i_mode) && (fattr->valid & NFS_ATTR_FATTR_FSID) && !nfs_fsid_equal(&server->fsid, &fattr->fsid) && !IS_AUTOMOUNT(inode)) server->fsid = fattr->fsid; /* Save the delegation state before clearing cache_validity */ have_delegation = nfs_have_delegated_attributes(inode); /* * Update the read time so we don't revalidate too often. */ nfsi->read_cache_jiffies = fattr->time_start; /* Fix up any delegated attributes in the struct nfs_fattr */ nfs_fattr_fixup_delegated(inode, fattr); save_cache_validity = nfsi->cache_validity; nfsi->cache_validity &= ~(NFS_INO_INVALID_ATTR | NFS_INO_INVALID_ATIME | NFS_INO_REVAL_FORCED | NFS_INO_INVALID_BLOCKS); /* Do atomic weak cache consistency updates */ nfs_wcc_update_inode(inode, fattr); if (pnfs_layoutcommit_outstanding(inode)) { nfsi->cache_validity |= save_cache_validity & (NFS_INO_INVALID_CHANGE | NFS_INO_INVALID_CTIME | NFS_INO_INVALID_MTIME | NFS_INO_INVALID_SIZE | NFS_INO_INVALID_BLOCKS); cache_revalidated = false; } /* More cache consistency checks */ if (fattr->valid & NFS_ATTR_FATTR_CHANGE) { if (!have_writers && nfsi->ooo && nfsi->ooo->cnt == 1 && nfsi->ooo->gap[0].end == inode_peek_iversion_raw(inode)) { /* There is one remaining gap that hasn't been * merged into iversion - do that now. */ inode_set_iversion_raw(inode, nfsi->ooo->gap[0].start); kfree(nfsi->ooo); nfsi->ooo = NULL; } if (!inode_eq_iversion_raw(inode, fattr->change_attr)) { /* Could it be a race with writeback? 
*/ if (!(have_writers || have_delegation)) { invalid |= NFS_INO_INVALID_DATA | NFS_INO_INVALID_ACCESS | NFS_INO_INVALID_ACL | NFS_INO_INVALID_XATTR; /* Force revalidate of all attributes */ save_cache_validity |= NFS_INO_INVALID_CTIME | NFS_INO_INVALID_MTIME | NFS_INO_INVALID_SIZE | NFS_INO_INVALID_BLOCKS | NFS_INO_INVALID_NLINK | NFS_INO_INVALID_MODE | NFS_INO_INVALID_OTHER | NFS_INO_INVALID_BTIME; if (S_ISDIR(inode->i_mode)) nfs_force_lookup_revalidate(inode); attr_changed = true; dprintk("NFS: change_attr change on server for file %s/%ld\n", inode->i_sb->s_id, inode->i_ino); } else if (!have_delegation) { nfs_ooo_record(nfsi, fattr); nfs_ooo_merge(nfsi, inode_peek_iversion_raw(inode), fattr->change_attr); } inode_set_iversion_raw(inode, fattr->change_attr); } } else { nfsi->cache_validity |= save_cache_validity & NFS_INO_INVALID_CHANGE; if (!have_delegation || (nfsi->cache_validity & NFS_INO_INVALID_CHANGE) != 0) cache_revalidated = false; } if (fattr->valid & NFS_ATTR_FATTR_MTIME) inode_set_mtime_to_ts(inode, fattr->mtime); else if (fattr_supported & NFS_ATTR_FATTR_MTIME) nfsi->cache_validity |= save_cache_validity & NFS_INO_INVALID_MTIME; if (fattr->valid & NFS_ATTR_FATTR_CTIME) inode_set_ctime_to_ts(inode, fattr->ctime); else if (fattr_supported & NFS_ATTR_FATTR_CTIME) nfsi->cache_validity |= save_cache_validity & NFS_INO_INVALID_CTIME; if (fattr->valid & NFS_ATTR_FATTR_BTIME) nfsi->btime = fattr->btime; else if (fattr_supported & NFS_ATTR_FATTR_BTIME) nfsi->cache_validity |= save_cache_validity & NFS_INO_INVALID_BTIME; /* Check if our cached file size is stale */ if (fattr->valid & NFS_ATTR_FATTR_SIZE) { new_isize = nfs_size_to_loff_t(fattr->size); cur_isize = i_size_read(inode); if (new_isize != cur_isize && !have_delegation) { /* Do we perhaps have any outstanding writes, or has * the file grown beyond our last write? 
*/ if (!nfs_have_writebacks(inode) || new_isize > cur_isize) { trace_nfs_size_update(inode, new_isize); i_size_write(inode, new_isize); if (!have_writers) invalid |= NFS_INO_INVALID_DATA; } } if (new_isize == 0 && !(fattr->valid & (NFS_ATTR_FATTR_SPACE_USED | NFS_ATTR_FATTR_BLOCKS_USED))) { fattr->du.nfs3.used = 0; fattr->valid |= NFS_ATTR_FATTR_SPACE_USED; } } else nfsi->cache_validity |= save_cache_validity & NFS_INO_INVALID_SIZE; if (fattr->valid & NFS_ATTR_FATTR_ATIME) inode_set_atime_to_ts(inode, fattr->atime); else if (fattr_supported & NFS_ATTR_FATTR_ATIME) nfsi->cache_validity |= save_cache_validity & NFS_INO_INVALID_ATIME; if (fattr->valid & NFS_ATTR_FATTR_MODE) { if ((inode->i_mode & S_IALLUGO) != (fattr->mode & S_IALLUGO)) { umode_t newmode = inode->i_mode & S_IFMT; newmode |= fattr->mode & S_IALLUGO; inode->i_mode = newmode; invalid |= NFS_INO_INVALID_ACCESS | NFS_INO_INVALID_ACL; } } else if (fattr_supported & NFS_ATTR_FATTR_MODE) nfsi->cache_validity |= save_cache_validity & NFS_INO_INVALID_MODE; if (fattr->valid & NFS_ATTR_FATTR_OWNER) { if (!uid_eq(inode->i_uid, fattr->uid)) { invalid |= NFS_INO_INVALID_ACCESS | NFS_INO_INVALID_ACL; inode->i_uid = fattr->uid; } } else if (fattr_supported & NFS_ATTR_FATTR_OWNER) nfsi->cache_validity |= save_cache_validity & NFS_INO_INVALID_OTHER; if (fattr->valid & NFS_ATTR_FATTR_GROUP) { if (!gid_eq(inode->i_gid, fattr->gid)) { invalid |= NFS_INO_INVALID_ACCESS | NFS_INO_INVALID_ACL; inode->i_gid = fattr->gid; } } else if (fattr_supported & NFS_ATTR_FATTR_GROUP) nfsi->cache_validity |= save_cache_validity & NFS_INO_INVALID_OTHER; if (fattr->valid & NFS_ATTR_FATTR_NLINK) { if (inode->i_nlink != fattr->nlink) set_nlink(inode, fattr->nlink); } else if (fattr_supported & NFS_ATTR_FATTR_NLINK) nfsi->cache_validity |= save_cache_validity & NFS_INO_INVALID_NLINK; if (fattr->valid & NFS_ATTR_FATTR_SPACE_USED) { /* * report the blocks in 512byte units */ inode->i_blocks = nfs_calc_block_size(fattr->du.nfs3.used); } else if (fattr_supported & NFS_ATTR_FATTR_SPACE_USED) nfsi->cache_validity |= save_cache_validity & NFS_INO_INVALID_BLOCKS; if (fattr->valid & NFS_ATTR_FATTR_BLOCKS_USED) inode->i_blocks = fattr->du.nfs2.blocks; else if (fattr_supported & NFS_ATTR_FATTR_BLOCKS_USED) nfsi->cache_validity |= save_cache_validity & NFS_INO_INVALID_BLOCKS; /* Update attrtimeo value if we're out of the unstable period */ if (attr_changed) { nfs_inc_stats(inode, NFSIOS_ATTRINVALIDATE); nfsi->attrtimeo = NFS_MINATTRTIMEO(inode); nfsi->attrtimeo_timestamp = now; /* Set barrier to be more recent than all outstanding updates */ nfsi->attr_gencount = nfs_inc_attr_generation_counter(); } else { if (cache_revalidated) { if (!time_in_range_open(now, nfsi->attrtimeo_timestamp, nfsi->attrtimeo_timestamp + nfsi->attrtimeo)) { nfsi->attrtimeo <<= 1; if (nfsi->attrtimeo > NFS_MAXATTRTIMEO(inode)) nfsi->attrtimeo = NFS_MAXATTRTIMEO(inode); } nfsi->attrtimeo_timestamp = now; } /* Set the barrier to be more recent than this fattr */ if ((long)(fattr->gencount - nfsi->attr_gencount) > 0) nfsi->attr_gencount = fattr->gencount; } /* Don't invalidate the data if we were to blame */ if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode))) invalid &= ~NFS_INO_INVALID_DATA; nfs_set_cache_invalid(inode, invalid); return 0; out_err: /* * No need to worry about unhashing the dentry, as the * lookup validation will know that the inode is bad. * (But we fall through to invalidate the caches.) 
*/ nfs_set_inode_stale_locked(inode); return -ESTALE; } struct inode *nfs_alloc_inode(struct super_block *sb) { struct nfs_inode *nfsi; nfsi = alloc_inode_sb(sb, nfs_inode_cachep, GFP_KERNEL); if (!nfsi) return NULL; nfsi->flags = 0UL; nfsi->cache_validity = 0UL; nfsi->ooo = NULL; #if IS_ENABLED(CONFIG_NFS_V4) nfsi->nfs4_acl = NULL; #endif /* CONFIG_NFS_V4 */ #ifdef CONFIG_NFS_V4_2 nfsi->xattr_cache = NULL; #endif nfs_netfs_inode_init(nfsi); return &nfsi->vfs_inode; } EXPORT_SYMBOL_GPL(nfs_alloc_inode); void nfs_free_inode(struct inode *inode) { kfree(NFS_I(inode)->ooo); kmem_cache_free(nfs_inode_cachep, NFS_I(inode)); } EXPORT_SYMBOL_GPL(nfs_free_inode); static inline void nfs4_init_once(struct nfs_inode *nfsi) { #if IS_ENABLED(CONFIG_NFS_V4) INIT_LIST_HEAD(&nfsi->open_states); nfsi->delegation = NULL; init_rwsem(&nfsi->rwsem); nfsi->layout = NULL; #endif } static void init_once(void *foo) { struct nfs_inode *nfsi = foo; inode_init_once(&nfsi->vfs_inode); INIT_LIST_HEAD(&nfsi->open_files); INIT_LIST_HEAD(&nfsi->access_cache_entry_lru); INIT_LIST_HEAD(&nfsi->access_cache_inode_lru); nfs4_init_once(nfsi); } static int __init nfs_init_inodecache(void) { nfs_inode_cachep = kmem_cache_create("nfs_inode_cache", sizeof(struct nfs_inode), 0, (SLAB_RECLAIM_ACCOUNT| SLAB_ACCOUNT), init_once); if (nfs_inode_cachep == NULL) return -ENOMEM; return 0; } static void nfs_destroy_inodecache(void) { /* * Make sure all delayed rcu free inodes are flushed before we * destroy cache. */ rcu_barrier(); kmem_cache_destroy(nfs_inode_cachep); } struct workqueue_struct *nfslocaliod_workqueue; struct workqueue_struct *nfsiod_workqueue; EXPORT_SYMBOL_GPL(nfsiod_workqueue); /* * Destroy the nfsiod workqueues */ static void nfsiod_stop(void) { struct workqueue_struct *wq; wq = nfsiod_workqueue; if (wq != NULL) { nfsiod_workqueue = NULL; destroy_workqueue(wq); } #if IS_ENABLED(CONFIG_NFS_LOCALIO) wq = nfslocaliod_workqueue; if (wq != NULL) { nfslocaliod_workqueue = NULL; destroy_workqueue(wq); } #endif /* CONFIG_NFS_LOCALIO */ } /* * Start the nfsiod workqueues */ static int nfsiod_start(void) { dprintk("RPC: creating workqueue nfsiod\n"); nfsiod_workqueue = alloc_workqueue("nfsiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0); if (nfsiod_workqueue == NULL) return -ENOMEM; #if IS_ENABLED(CONFIG_NFS_LOCALIO) /* * localio writes need to use a normal (non-memreclaim) workqueue. * When we start getting low on space, XFS goes and calls flush_work() on * a non-memreclaim work queue, which causes a priority inversion problem. 
*/ dprintk("RPC: creating workqueue nfslocaliod\n"); nfslocaliod_workqueue = alloc_workqueue("nfslocaliod", WQ_UNBOUND, 0); if (unlikely(nfslocaliod_workqueue == NULL)) { nfsiod_stop(); return -ENOMEM; } #endif /* CONFIG_NFS_LOCALIO */ return 0; } unsigned int nfs_net_id; EXPORT_SYMBOL_GPL(nfs_net_id); static int nfs_net_init(struct net *net) { struct nfs_net *nn = net_generic(net, nfs_net_id); int err; nfs_clients_init(net); if (!rpc_proc_register(net, &nn->rpcstats)) { err = -ENOMEM; goto err_proc_rpc; } err = nfs_fs_proc_net_init(net); if (err) goto err_proc_nfs; return 0; err_proc_nfs: rpc_proc_unregister(net, "nfs"); err_proc_rpc: nfs_clients_exit(net); return err; } static void nfs_net_exit(struct net *net) { rpc_proc_unregister(net, "nfs"); nfs_fs_proc_net_exit(net); nfs_clients_exit(net); } static struct pernet_operations nfs_net_ops = { .init = nfs_net_init, .exit = nfs_net_exit, .id = &nfs_net_id, .size = sizeof(struct nfs_net), }; #ifdef CONFIG_KEYS static struct key *nfs_keyring; static int __init nfs_init_keyring(void) { nfs_keyring = keyring_alloc(".nfs", GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, current_cred(), (KEY_POS_ALL & ~KEY_POS_SETATTR) | (KEY_USR_ALL & ~KEY_USR_SETATTR), KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL); return PTR_ERR_OR_ZERO(nfs_keyring); } static void nfs_exit_keyring(void) { key_put(nfs_keyring); } #else static inline int nfs_init_keyring(void) { return 0; } static inline void nfs_exit_keyring(void) { } #endif /* CONFIG_KEYS */ /* * Initialize NFS */ static int __init init_nfs_fs(void) { int err; err = nfs_init_keyring(); if (err) return err; err = nfs_sysfs_init(); if (err < 0) goto out10; err = register_pernet_subsys(&nfs_net_ops); if (err < 0) goto out9; err = nfsiod_start(); if (err) goto out7; err = nfs_fs_proc_init(); if (err) goto out6; err = nfs_init_nfspagecache(); if (err) goto out5; err = nfs_init_inodecache(); if (err) goto out4; err = nfs_init_readpagecache(); if (err) goto out3; err = nfs_init_writepagecache(); if (err) goto out2; err = nfs_init_directcache(); if (err) goto out1; err = register_nfs_fs(); if (err) goto out0; return 0; out0: nfs_destroy_directcache(); out1: nfs_destroy_writepagecache(); out2: nfs_destroy_readpagecache(); out3: nfs_destroy_inodecache(); out4: nfs_destroy_nfspagecache(); out5: nfs_fs_proc_exit(); out6: nfsiod_stop(); out7: unregister_pernet_subsys(&nfs_net_ops); out9: nfs_sysfs_exit(); out10: nfs_exit_keyring(); return err; } static void __exit exit_nfs_fs(void) { nfs_destroy_directcache(); nfs_destroy_writepagecache(); nfs_destroy_readpagecache(); nfs_destroy_inodecache(); nfs_destroy_nfspagecache(); unregister_pernet_subsys(&nfs_net_ops); unregister_nfs_fs(); nfs_fs_proc_exit(); nfsiod_stop(); nfs_sysfs_exit(); nfs_exit_keyring(); } /* Not quite true; I just maintain it */ MODULE_AUTHOR("Olaf Kirch <okir@monad.swb.de>"); MODULE_DESCRIPTION("NFS client support"); MODULE_LICENSE("GPL"); module_param(enable_ino64, bool, 0644); module_init(init_nfs_fs) module_exit(exit_nfs_fs) |
/* SPDX-License-Identifier: GPL-2.0 */ #ifndef INT_BLK_MQ_H #define INT_BLK_MQ_H #include <linux/blk-mq.h> #include "blk-stat.h" struct blk_mq_tag_set; struct elevator_tags; struct blk_mq_ctxs { struct kobject kobj; struct blk_mq_ctx __percpu *queue_ctx; }; /** * struct blk_mq_ctx - State for a software queue facing the submitting CPUs */ struct blk_mq_ctx { struct { spinlock_t lock; struct list_head rq_lists[HCTX_MAX_TYPES]; } ____cacheline_aligned_in_smp; unsigned int cpu; unsigned short index_hw[HCTX_MAX_TYPES]; struct blk_mq_hw_ctx *hctxs[HCTX_MAX_TYPES]; struct request_queue *queue; struct blk_mq_ctxs *ctxs; struct kobject kobj; } ____cacheline_aligned_in_smp; enum { BLK_MQ_NO_TAG = -1U, BLK_MQ_TAG_MIN = 1, BLK_MQ_TAG_MAX = BLK_MQ_NO_TAG - 1, }; #define BLK_MQ_CPU_WORK_BATCH (8) typedef unsigned int __bitwise blk_insert_t; #define BLK_MQ_INSERT_AT_HEAD ((__force blk_insert_t)0x01) void blk_mq_submit_bio(struct bio *bio); int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, struct io_comp_batch *iob, unsigned int flags); void blk_mq_exit_queue(struct request_queue *q); struct elevator_tags *blk_mq_update_nr_requests(struct request_queue *q, struct elevator_tags *tags, unsigned int nr); void blk_mq_wake_waiters(struct request_queue *q); bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *, bool); void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list); struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *start); void blk_mq_put_rq_ref(struct request *rq); /* * Internal helpers for allocating/freeing the request map */ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, unsigned int hctx_idx);
void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags); struct blk_mq_tags *blk_mq_alloc_map_and_rqs(struct blk_mq_tag_set *set, unsigned int hctx_idx, unsigned int depth); void blk_mq_free_map_and_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, unsigned int hctx_idx); /* * CPU -> queue mappings */ extern int blk_mq_hw_queue_to_node(struct blk_mq_queue_map *qmap, unsigned int); /* * blk_mq_map_queue_type() - map (hctx_type,cpu) to hardware queue * @q: request queue * @type: the hctx type index * @cpu: CPU */ static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q, enum hctx_type type, unsigned int cpu) { return queue_hctx((q), (q->tag_set->map[type].mq_map[cpu])); } static inline enum hctx_type blk_mq_get_hctx_type(blk_opf_t opf) { enum hctx_type type = HCTX_TYPE_DEFAULT; /* * The caller ensure that if REQ_POLLED, poll must be enabled. */ if (opf & REQ_POLLED) type = HCTX_TYPE_POLL; else if ((opf & REQ_OP_MASK) == REQ_OP_READ) type = HCTX_TYPE_READ; return type; } /* * blk_mq_map_queue() - map (cmd_flags,type) to hardware queue * @opf: operation type (REQ_OP_*) and flags (e.g. REQ_POLLED). * @ctx: software queue cpu ctx */ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(blk_opf_t opf, struct blk_mq_ctx *ctx) { return ctx->hctxs[blk_mq_get_hctx_type(opf)]; } /* * Default to double of smaller one between hw queue_depth and * 128, since we don't split into sync/async like the old code * did. Additionally, this is a per-hw queue depth. */ static inline unsigned int blk_mq_default_nr_requests( struct blk_mq_tag_set *set) { return 2 * min_t(unsigned int, set->queue_depth, BLKDEV_DEFAULT_RQ); } /* * sysfs helpers */ extern void blk_mq_sysfs_init(struct request_queue *q); extern void blk_mq_sysfs_deinit(struct request_queue *q); int blk_mq_sysfs_register(struct gendisk *disk); void blk_mq_sysfs_unregister(struct gendisk *disk); int blk_mq_sysfs_register_hctxs(struct request_queue *q); void blk_mq_sysfs_unregister_hctxs(struct request_queue *q); extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx); void blk_mq_free_plug_rqs(struct blk_plug *plug); void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule); void blk_mq_cancel_work_sync(struct request_queue *q); void blk_mq_release(struct request_queue *q); static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q, unsigned int cpu) { return per_cpu_ptr(q->queue_ctx, cpu); } /* * This assumes per-cpu software queueing queues. They could be per-node * as well, for instance. For now this is hardcoded as-is. Note that we don't * care about preemption, since we know the ctx's are persistent. This does * mean that we can't rely on ctx always matching the currently running CPU. 
*/ static inline struct blk_mq_ctx *blk_mq_get_ctx(struct request_queue *q) { return __blk_mq_get_ctx(q, raw_smp_processor_id()); } struct blk_mq_alloc_data { /* input parameter */ struct request_queue *q; blk_mq_req_flags_t flags; unsigned int shallow_depth; blk_opf_t cmd_flags; req_flags_t rq_flags; /* allocate multiple requests/tags in one go */ unsigned int nr_tags; struct rq_list *cached_rqs; /* input & output parameter */ struct blk_mq_ctx *ctx; struct blk_mq_hw_ctx *hctx; }; struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, unsigned int flags, int node); void blk_mq_free_tags(struct blk_mq_tag_set *set, struct blk_mq_tags *tags); unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data); unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags, unsigned int *offset); void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx, unsigned int tag); void blk_mq_put_tags(struct blk_mq_tags *tags, int *tag_array, int nr_tags); void blk_mq_tag_resize_shared_tags(struct blk_mq_tag_set *set, unsigned int size); void blk_mq_tag_update_sched_shared_tags(struct request_queue *q, unsigned int nr); void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool); void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn, void *priv); void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn, void *priv); static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt, struct blk_mq_hw_ctx *hctx) { if (!hctx) return &bt->ws[0]; return sbq_wait_ptr(bt, &hctx->wait_index); } void __blk_mq_tag_busy(struct blk_mq_hw_ctx *); void __blk_mq_tag_idle(struct blk_mq_hw_ctx *); static inline void blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx) { if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) __blk_mq_tag_busy(hctx); } static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx) { if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) __blk_mq_tag_idle(hctx); } static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags, unsigned int tag) { return tag < tags->nr_reserved_tags; } static inline bool blk_mq_is_shared_tags(unsigned int flags) { return flags & BLK_MQ_F_TAG_HCTX_SHARED; } static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data) { if (data->rq_flags & RQF_SCHED_TAGS) return data->hctx->sched_tags; return data->hctx->tags; } static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx) { /* Fast path: hardware queue is not stopped most of the time. */ if (likely(!test_bit(BLK_MQ_S_STOPPED, &hctx->state))) return false; /* * This barrier is used to order adding of dispatch list before and * the test of BLK_MQ_S_STOPPED below. Pairs with the memory barrier * in blk_mq_start_stopped_hw_queue() so that dispatch code could * either see BLK_MQ_S_STOPPED is cleared or dispatch list is not * empty to avoid missing dispatching requests. 
*/ smp_mb(); return test_bit(BLK_MQ_S_STOPPED, &hctx->state); } static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx) { return hctx->nr_ctx && hctx->tags; } void blk_mq_in_driver_rw(struct block_device *part, unsigned int inflight[2]); static inline void blk_mq_put_dispatch_budget(struct request_queue *q, int budget_token) { if (q->mq_ops->put_budget) q->mq_ops->put_budget(q, budget_token); } static inline int blk_mq_get_dispatch_budget(struct request_queue *q) { if (q->mq_ops->get_budget) return q->mq_ops->get_budget(q); return 0; } static inline void blk_mq_set_rq_budget_token(struct request *rq, int token) { if (token < 0) return; if (rq->q->mq_ops->set_rq_budget_token) rq->q->mq_ops->set_rq_budget_token(rq, token); } static inline int blk_mq_get_rq_budget_token(struct request *rq) { if (rq->q->mq_ops->get_rq_budget_token) return rq->q->mq_ops->get_rq_budget_token(rq); return -1; } static inline void __blk_mq_add_active_requests(struct blk_mq_hw_ctx *hctx, int val) { if (blk_mq_is_shared_tags(hctx->flags)) atomic_add(val, &hctx->queue->nr_active_requests_shared_tags); else atomic_add(val, &hctx->nr_active); } static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx) { __blk_mq_add_active_requests(hctx, 1); } static inline void __blk_mq_sub_active_requests(struct blk_mq_hw_ctx *hctx, int val) { if (blk_mq_is_shared_tags(hctx->flags)) atomic_sub(val, &hctx->queue->nr_active_requests_shared_tags); else atomic_sub(val, &hctx->nr_active); } static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx) { __blk_mq_sub_active_requests(hctx, 1); } static inline void blk_mq_add_active_requests(struct blk_mq_hw_ctx *hctx, int val) { if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) __blk_mq_add_active_requests(hctx, val); } static inline void blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx) { if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) __blk_mq_inc_active_requests(hctx); } static inline void blk_mq_sub_active_requests(struct blk_mq_hw_ctx *hctx, int val) { if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) __blk_mq_sub_active_requests(hctx, val); } static inline void blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx) { if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) __blk_mq_dec_active_requests(hctx); } static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx) { if (blk_mq_is_shared_tags(hctx->flags)) return atomic_read(&hctx->queue->nr_active_requests_shared_tags); return atomic_read(&hctx->nr_active); } static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq) { blk_mq_dec_active_requests(hctx); blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag); rq->tag = BLK_MQ_NO_TAG; } static inline void blk_mq_put_driver_tag(struct request *rq) { if (rq->tag == BLK_MQ_NO_TAG || rq->internal_tag == BLK_MQ_NO_TAG) return; __blk_mq_put_driver_tag(rq->mq_hctx, rq); } bool __blk_mq_alloc_driver_tag(struct request *rq); static inline bool blk_mq_get_driver_tag(struct request *rq) { if (rq->tag == BLK_MQ_NO_TAG && !__blk_mq_alloc_driver_tag(rq)) return false; return true; } static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap) { int cpu; for_each_possible_cpu(cpu) qmap->mq_map[cpu] = 0; } /* Free all requests on the list */ static inline void blk_mq_free_requests(struct list_head *list) { while (!list_empty(list)) { struct request *rq = list_entry_rq(list->next); list_del_init(&rq->queuelist); blk_mq_free_request(rq); } } /* * For shared tag users, we track the number of currently active users * and attempt to 
provide a fair share of the tag depth for each of them. */ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt) { unsigned int depth, users; if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) return true; /* * Don't try dividing an ant */ if (bt->sb.depth == 1) return true; if (blk_mq_is_shared_tags(hctx->flags)) { struct request_queue *q = hctx->queue; if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) return true; } else { if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state)) return true; } users = READ_ONCE(hctx->tags->active_queues); if (!users) return true; /* * Allow at least some tags */ depth = max((bt->sb.depth + users - 1) / users, 4U); return __blk_mq_active_requests(hctx) < depth; } /* run the code block in @dispatch_ops with rcu/srcu read lock held */ #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \ do { \ if ((q)->tag_set->flags & BLK_MQ_F_BLOCKING) { \ struct blk_mq_tag_set *__tag_set = (q)->tag_set; \ int srcu_idx; \ \ might_sleep_if(check_sleep); \ srcu_idx = srcu_read_lock(__tag_set->srcu); \ (dispatch_ops); \ srcu_read_unlock(__tag_set->srcu, srcu_idx); \ } else { \ rcu_read_lock(); \ (dispatch_ops); \ rcu_read_unlock(); \ } \ } while (0) #define blk_mq_run_dispatch_ops(q, dispatch_ops) \ __blk_mq_run_dispatch_ops(q, true, dispatch_ops) \ static inline bool blk_mq_can_poll(struct request_queue *q) { return (q->limits.features & BLK_FEAT_POLL) && q->tag_set->map[HCTX_TYPE_POLL].nr_queues; } #endif |
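hctx_may_queue() above throttles each active queue to roughly its share of a shared tag set: a ceiling division of the bitmap depth by the number of active users, never dropping below four tags. The standalone sketch below is illustrative only; fair_share_depth() and the sample numbers are invented for the example, and only the ceiling-division-with-floor calculation is taken from the code above.

/*
 * Illustrative sketch only: the fair-share depth computed inside
 * hctx_may_queue() above, pulled out as a standalone helper.
 * fair_share_depth() and the values in main() are invented for
 * this example.
 */
#include <stdio.h>

static unsigned int fair_share_depth(unsigned int total_depth,
				     unsigned int active_users)
{
	unsigned int depth;

	if (!active_users)
		return total_depth;

	/* Ceiling division so the shares cover the whole depth,
	 * clamped so every user can still queue a few requests. */
	depth = (total_depth + active_users - 1) / active_users;
	return depth < 4 ? 4 : depth;
}

int main(void)
{
	/* 64 shared tags across 5 active queues -> 13 tags each. */
	printf("%u\n", fair_share_depth(64, 5));
	/* 8 shared tags across 6 queues -> 2, clamped to 4. */
	printf("%u\n", fair_share_depth(8, 6));
	return 0;
}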
// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2002 Andi Kleen, SuSE Labs. * Thanks to Ben LaHaise for precious feedback.
*/ #include <linux/highmem.h> #include <linux/memblock.h> #include <linux/sched.h> #include <linux/mm.h> #include <linux/interrupt.h> #include <linux/seq_file.h> #include <linux/proc_fs.h> #include <linux/debugfs.h> #include <linux/pfn.h> #include <linux/percpu.h> #include <linux/gfp.h> #include <linux/pci.h> #include <linux/vmalloc.h> #include <linux/libnvdimm.h> #include <linux/vmstat.h> #include <linux/kernel.h> #include <linux/cc_platform.h> #include <linux/set_memory.h> #include <linux/memregion.h> #include <asm/e820/api.h> #include <asm/processor.h> #include <asm/tlbflush.h> #include <asm/sections.h> #include <asm/setup.h> #include <linux/uaccess.h> #include <asm/pgalloc.h> #include <asm/proto.h> #include <asm/memtype.h> #include "../mm_internal.h" /* * The current flushing context - we pass it instead of 5 arguments: */ struct cpa_data { unsigned long *vaddr; pgd_t *pgd; pgprot_t mask_set; pgprot_t mask_clr; unsigned long numpages; unsigned long curpage; unsigned long pfn; unsigned int flags; unsigned int force_split : 1, force_static_prot : 1, force_flush_all : 1; struct page **pages; }; enum cpa_warn { CPA_CONFLICT, CPA_PROTECT, CPA_DETECT, }; static const int cpa_warn_level = CPA_PROTECT; /* * Serialize cpa() (for !DEBUG_PAGEALLOC which uses large identity mappings) * using cpa_lock. So that we don't allow any other cpu, with stale large tlb * entries change the page attribute in parallel to some other cpu * splitting a large page entry along with changing the attribute. */ static DEFINE_SPINLOCK(cpa_lock); #define CPA_FLUSHTLB 1 #define CPA_ARRAY 2 #define CPA_PAGES_ARRAY 4 #define CPA_NO_CHECK_ALIAS 8 /* Do not search for aliases */ #define CPA_COLLAPSE 16 /* try to collapse large pages */ static inline pgprot_t cachemode2pgprot(enum page_cache_mode pcm) { return __pgprot(cachemode2protval(pcm)); } #ifdef CONFIG_PROC_FS static unsigned long direct_pages_count[PG_LEVEL_NUM]; void update_page_count(int level, unsigned long pages) { /* Protect against CPA */ spin_lock(&pgd_lock); direct_pages_count[level] += pages; spin_unlock(&pgd_lock); } static void split_page_count(int level) { if (direct_pages_count[level] == 0) return; direct_pages_count[level]--; if (system_state == SYSTEM_RUNNING) { if (level == PG_LEVEL_2M) count_vm_event(DIRECT_MAP_LEVEL2_SPLIT); else if (level == PG_LEVEL_1G) count_vm_event(DIRECT_MAP_LEVEL3_SPLIT); } direct_pages_count[level - 1] += PTRS_PER_PTE; } static void collapse_page_count(int level) { direct_pages_count[level]++; if (system_state == SYSTEM_RUNNING) { if (level == PG_LEVEL_2M) count_vm_event(DIRECT_MAP_LEVEL2_COLLAPSE); else if (level == PG_LEVEL_1G) count_vm_event(DIRECT_MAP_LEVEL3_COLLAPSE); } direct_pages_count[level - 1] -= PTRS_PER_PTE; } void arch_report_meminfo(struct seq_file *m) { seq_printf(m, "DirectMap4k: %8lu kB\n", direct_pages_count[PG_LEVEL_4K] << 2); #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) seq_printf(m, "DirectMap2M: %8lu kB\n", direct_pages_count[PG_LEVEL_2M] << 11); #else seq_printf(m, "DirectMap4M: %8lu kB\n", direct_pages_count[PG_LEVEL_2M] << 12); #endif if (direct_gbpages) seq_printf(m, "DirectMap1G: %8lu kB\n", direct_pages_count[PG_LEVEL_1G] << 20); } #else static inline void split_page_count(int level) { } static inline void collapse_page_count(int level) { } #endif #ifdef CONFIG_X86_CPA_STATISTICS static unsigned long cpa_1g_checked; static unsigned long cpa_1g_sameprot; static unsigned long cpa_1g_preserved; static unsigned long cpa_2m_checked; static unsigned long cpa_2m_sameprot; static unsigned long 
cpa_2m_preserved; static unsigned long cpa_4k_install; static inline void cpa_inc_1g_checked(void) { cpa_1g_checked++; } static inline void cpa_inc_2m_checked(void) { cpa_2m_checked++; } static inline void cpa_inc_4k_install(void) { data_race(cpa_4k_install++); } static inline void cpa_inc_lp_sameprot(int level) { if (level == PG_LEVEL_1G) cpa_1g_sameprot++; else cpa_2m_sameprot++; } static inline void cpa_inc_lp_preserved(int level) { if (level == PG_LEVEL_1G) cpa_1g_preserved++; else cpa_2m_preserved++; } static int cpastats_show(struct seq_file *m, void *p) { seq_printf(m, "1G pages checked: %16lu\n", cpa_1g_checked); seq_printf(m, "1G pages sameprot: %16lu\n", cpa_1g_sameprot); seq_printf(m, "1G pages preserved: %16lu\n", cpa_1g_preserved); seq_printf(m, "2M pages checked: %16lu\n", cpa_2m_checked); seq_printf(m, "2M pages sameprot: %16lu\n", cpa_2m_sameprot); seq_printf(m, "2M pages preserved: %16lu\n", cpa_2m_preserved); seq_printf(m, "4K pages set-checked: %16lu\n", cpa_4k_install); return 0; } static int cpastats_open(struct inode *inode, struct file *file) { return single_open(file, cpastats_show, NULL); } static const struct file_operations cpastats_fops = { .open = cpastats_open, .read = seq_read, .llseek = seq_lseek, .release = single_release, }; static int __init cpa_stats_init(void) { debugfs_create_file("cpa_stats", S_IRUSR, arch_debugfs_dir, NULL, &cpastats_fops); return 0; } late_initcall(cpa_stats_init); #else static inline void cpa_inc_1g_checked(void) { } static inline void cpa_inc_2m_checked(void) { } static inline void cpa_inc_4k_install(void) { } static inline void cpa_inc_lp_sameprot(int level) { } static inline void cpa_inc_lp_preserved(int level) { } #endif static inline int within(unsigned long addr, unsigned long start, unsigned long end) { return addr >= start && addr < end; } #ifdef CONFIG_X86_64 static inline int within_inclusive(unsigned long addr, unsigned long start, unsigned long end) { return addr >= start && addr <= end; } /* * The kernel image is mapped into two places in the virtual address space * (addresses without KASLR, of course): * * 1. The kernel direct map (0xffff880000000000) * 2. The "high kernel map" (0xffffffff81000000) * * We actually execute out of #2. If we get the address of a kernel symbol, it * points to #2, but almost all physical-to-virtual translations point to #1. * * This is so that we can have both a directmap of all physical memory *and* * take full advantage of the limited (s32) immediate addressing range (2G) * of x86_64. * * See Documentation/arch/x86/x86_64/mm.rst for more detail. */ static inline unsigned long highmap_start_pfn(void) { return __pa_symbol(_text) >> PAGE_SHIFT; } static inline unsigned long highmap_end_pfn(void) { /* Do not reference physical address outside the kernel. */ return __pa_symbol(roundup(_brk_end, PMD_SIZE) - 1) >> PAGE_SHIFT; } static bool __cpa_pfn_in_highmap(unsigned long pfn) { /* * Kernel text has an alias mapping at a high address, known * here as "highmap". */ return within_inclusive(pfn, highmap_start_pfn(), highmap_end_pfn()); } #else static bool __cpa_pfn_in_highmap(unsigned long pfn) { /* There is no highmap on 32-bit */ return false; } #endif /* * See set_mce_nospec(). * * Machine check recovery code needs to change cache mode of poisoned pages to * UC to avoid speculative access logging another error. But passing the * address of the 1:1 mapping to set_memory_uc() is a fine way to encourage a * speculative access. So we cheat and flip the top bit of the address. 
This * works fine for the code that updates the page tables. But at the end of the * process we need to flush the TLB and cache and the non-canonical address * causes a #GP fault when used by the INVLPG and CLFLUSH instructions. * * But in the common case we already have a canonical address. This code * will fix the top bit if needed and is a no-op otherwise. */ static inline unsigned long fix_addr(unsigned long addr) { #ifdef CONFIG_X86_64 return (long)(addr << 1) >> 1; #else return addr; #endif } static unsigned long __cpa_addr(struct cpa_data *cpa, unsigned long idx) { if (cpa->flags & CPA_PAGES_ARRAY) { struct page *page = cpa->pages[idx]; if (unlikely(PageHighMem(page))) return 0; return (unsigned long)page_address(page); } if (cpa->flags & CPA_ARRAY) return cpa->vaddr[idx]; return *cpa->vaddr + idx * PAGE_SIZE; } /* * Flushing functions */ static void clflush_cache_range_opt(void *vaddr, unsigned int size) { const unsigned long clflush_size = boot_cpu_data.x86_clflush_size; void *p = (void *)((unsigned long)vaddr & ~(clflush_size - 1)); void *vend = vaddr + size; if (p >= vend) return; for (; p < vend; p += clflush_size) clflushopt(p); } /** * clflush_cache_range - flush a cache range with clflush * @vaddr: virtual start address * @size: number of bytes to flush * * CLFLUSHOPT is an unordered instruction which needs fencing with MFENCE or * SFENCE to avoid ordering issues. */ void clflush_cache_range(void *vaddr, unsigned int size) { mb(); clflush_cache_range_opt(vaddr, size); mb(); } EXPORT_SYMBOL_GPL(clflush_cache_range); #ifdef CONFIG_ARCH_HAS_PMEM_API void arch_invalidate_pmem(void *addr, size_t size) { clflush_cache_range(addr, size); } EXPORT_SYMBOL_GPL(arch_invalidate_pmem); #endif #ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION bool cpu_cache_has_invalidate_memregion(void) { return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR); } EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, "DEVMEM"); int cpu_cache_invalidate_memregion(phys_addr_t start, size_t len) { if (WARN_ON_ONCE(!cpu_cache_has_invalidate_memregion())) return -ENXIO; wbinvd_on_all_cpus(); return 0; } EXPORT_SYMBOL_NS_GPL(cpu_cache_invalidate_memregion, "DEVMEM"); #endif static void __cpa_flush_all(void *arg) { unsigned long cache = (unsigned long)arg; /* * Flush all to work around Errata in early athlons regarding * large page flushing. 
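 *
 * (Below: __flush_tlb_all() drops the whole TLB on the CPU this runs on,
 *  and, when @cache is set, WBINVD flushes the data caches as well; WBINVD
 *  only exists from the 486 (family 4) onwards, hence the family check.
 *  cpa_flush_all() fans this out to every CPU via on_each_cpu().)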
*/ __flush_tlb_all(); if (cache && boot_cpu_data.x86 >= 4) wbinvd(); } static void cpa_flush_all(unsigned long cache) { BUG_ON(irqs_disabled() && !early_boot_irqs_disabled); on_each_cpu(__cpa_flush_all, (void *) cache, 1); } static int collapse_large_pages(unsigned long addr, struct list_head *pgtables); static void cpa_collapse_large_pages(struct cpa_data *cpa) { unsigned long start, addr, end; struct ptdesc *ptdesc, *tmp; LIST_HEAD(pgtables); int collapsed = 0; int i; if (cpa->flags & (CPA_PAGES_ARRAY | CPA_ARRAY)) { for (i = 0; i < cpa->numpages; i++) collapsed += collapse_large_pages(__cpa_addr(cpa, i), &pgtables); } else { addr = __cpa_addr(cpa, 0); start = addr & PMD_MASK; end = addr + PAGE_SIZE * cpa->numpages; for (addr = start; within(addr, start, end); addr += PMD_SIZE) collapsed += collapse_large_pages(addr, &pgtables); } if (!collapsed) return; flush_tlb_all(); list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { list_del(&ptdesc->pt_list); pagetable_free(ptdesc); } } static void cpa_flush(struct cpa_data *cpa, int cache) { unsigned long start, end; unsigned int i; BUG_ON(irqs_disabled() && !early_boot_irqs_disabled); if (cache && !static_cpu_has(X86_FEATURE_CLFLUSH)) { cpa_flush_all(cache); goto collapse_large_pages; } start = fix_addr(__cpa_addr(cpa, 0)); end = start + cpa->numpages * PAGE_SIZE; if (cpa->force_flush_all) end = TLB_FLUSH_ALL; flush_tlb_kernel_range(start, end); if (!cache) goto collapse_large_pages; mb(); for (i = 0; i < cpa->numpages; i++) { unsigned long addr = __cpa_addr(cpa, i); unsigned int level; pte_t *pte = lookup_address(addr, &level); /* * Only flush present addresses: */ if (pte && (pte_val(*pte) & _PAGE_PRESENT)) clflush_cache_range_opt((void *)fix_addr(addr), PAGE_SIZE); } mb(); collapse_large_pages: if (cpa->flags & CPA_COLLAPSE) cpa_collapse_large_pages(cpa); } static bool overlaps(unsigned long r1_start, unsigned long r1_end, unsigned long r2_start, unsigned long r2_end) { return (r1_start <= r2_end && r1_end >= r2_start) || (r2_start <= r1_end && r2_end >= r1_start); } #ifdef CONFIG_PCI_BIOS /* * The BIOS area between 640k and 1Mb needs to be executable for PCI BIOS * based config access (CONFIG_PCI_GOBIOS) support. */ #define BIOS_PFN PFN_DOWN(BIOS_BEGIN) #define BIOS_PFN_END PFN_DOWN(BIOS_END - 1) static pgprotval_t protect_pci_bios(unsigned long spfn, unsigned long epfn) { if (pcibios_enabled && overlaps(spfn, epfn, BIOS_PFN, BIOS_PFN_END)) return _PAGE_NX; return 0; } #else static pgprotval_t protect_pci_bios(unsigned long spfn, unsigned long epfn) { return 0; } #endif /* * The .rodata section needs to be read-only. Using the pfn catches all * aliases. This also includes __ro_after_init, so do not enforce until * kernel_set_to_readonly is true. */ static pgprotval_t protect_rodata(unsigned long spfn, unsigned long epfn) { unsigned long epfn_ro, spfn_ro = PFN_DOWN(__pa_symbol(__start_rodata)); /* * Note: __end_rodata is at page aligned and not inclusive, so * subtract 1 to get the last enforced PFN in the rodata area. */ epfn_ro = PFN_DOWN(__pa_symbol(__end_rodata)) - 1; if (kernel_set_to_readonly && overlaps(spfn, epfn, spfn_ro, epfn_ro)) return _PAGE_RW; return 0; } /* * Protect kernel text against becoming non executable by forbidding * _PAGE_NX. This protects only the high kernel mapping (_text -> _etext) * out of which the kernel actually executes. Do not protect the low * mapping. * * This does not cover __inittext since that is gone after boot. 
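 *
 * For illustration only (not a call made anywhere in this file, and assuming
 * NX is supported), a request such as
 *
 *	set_memory_nx((unsigned long)_text, 1);
 *
 * overlaps [_text, _etext) and therefore has _PAGE_NX reported as forbidden
 * here; static_protections() masks the bit back out again (with a "Text NX"
 * warning), so the high kernel text mapping stays executable.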
*/ static pgprotval_t protect_kernel_text(unsigned long start, unsigned long end) { unsigned long t_end = (unsigned long)_etext - 1; unsigned long t_start = (unsigned long)_text; if (overlaps(start, end, t_start, t_end)) return _PAGE_NX; return 0; } #if defined(CONFIG_X86_64) /* * Once the kernel maps the text as RO (kernel_set_to_readonly is set), * kernel text mappings for the large page aligned text, rodata sections * will be always read-only. For the kernel identity mappings covering the * holes caused by this alignment can be anything that user asks. * * This will preserve the large page mappings for kernel text/data at no * extra cost. */ static pgprotval_t protect_kernel_text_ro(unsigned long start, unsigned long end) { unsigned long t_end = (unsigned long)__end_rodata_hpage_align - 1; unsigned long t_start = (unsigned long)_text; unsigned int level; if (!kernel_set_to_readonly || !overlaps(start, end, t_start, t_end)) return 0; /* * Don't enforce the !RW mapping for the kernel text mapping, if * the current mapping is already using small page mapping. No * need to work hard to preserve large page mappings in this case. * * This also fixes the Linux Xen paravirt guest boot failure caused * by unexpected read-only mappings for kernel identity * mappings. In this paravirt guest case, the kernel text mapping * and the kernel identity mapping share the same page-table pages, * so the protections for kernel text and identity mappings have to * be the same. */ if (lookup_address(start, &level) && (level != PG_LEVEL_4K)) return _PAGE_RW; return 0; } #else static pgprotval_t protect_kernel_text_ro(unsigned long start, unsigned long end) { return 0; } #endif static inline bool conflicts(pgprot_t prot, pgprotval_t val) { return (pgprot_val(prot) & ~val) != pgprot_val(prot); } static inline void check_conflict(int warnlvl, pgprot_t prot, pgprotval_t val, unsigned long start, unsigned long end, unsigned long pfn, const char *txt) { static const char *lvltxt[] = { [CPA_CONFLICT] = "conflict", [CPA_PROTECT] = "protect", [CPA_DETECT] = "detect", }; if (warnlvl > cpa_warn_level || !conflicts(prot, val)) return; pr_warn("CPA %8s %10s: 0x%016lx - 0x%016lx PFN %lx req %016llx prevent %016llx\n", lvltxt[warnlvl], txt, start, end, pfn, (unsigned long long)pgprot_val(prot), (unsigned long long)val); } /* * Certain areas of memory on x86 require very specific protection flags, * for example the BIOS area or kernel text. Callers don't always get this * right (again, ioremap() on BIOS memory is not uncommon) so this function * checks and fixes these known static required protection bits. */ static inline pgprot_t static_protections(pgprot_t prot, unsigned long start, unsigned long pfn, unsigned long npg, unsigned long lpsize, int warnlvl) { pgprotval_t forbidden, res; unsigned long end; /* * There is no point in checking RW/NX conflicts when the requested * mapping is setting the page !PRESENT. */ if (!(pgprot_val(prot) & _PAGE_PRESENT)) return prot; /* Operate on the virtual address */ end = start + npg * PAGE_SIZE - 1; res = protect_kernel_text(start, end); check_conflict(warnlvl, prot, res, start, end, pfn, "Text NX"); forbidden = res; /* * Special case to preserve a large page. If the change spawns the * full large page mapping then there is no point to split it * up. Happens with ftrace and is going to be removed once ftrace * switched to text_poke(). 
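 *
 * (Callers that want the RO check applied unconditionally pass lpsize == 0,
 *  see the "Hand in lpsize = 0 to enforce the protection mechanism" call
 *  sites below; 0 can never equal npg * PAGE_SIZE, so the condition is then
 *  always true.)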
*/ if (lpsize != (npg * PAGE_SIZE) || (start & (lpsize - 1))) { res = protect_kernel_text_ro(start, end); check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO"); forbidden |= res; } /* Check the PFN directly */ res = protect_pci_bios(pfn, pfn + npg - 1); check_conflict(warnlvl, prot, res, start, end, pfn, "PCIBIOS NX"); forbidden |= res; res = protect_rodata(pfn, pfn + npg - 1); check_conflict(warnlvl, prot, res, start, end, pfn, "Rodata RO"); forbidden |= res; return __pgprot(pgprot_val(prot) & ~forbidden); } /* * Validate strict W^X semantics. */ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long start, unsigned long pfn, unsigned long npg, bool nx, bool rw) { unsigned long end; /* * 32-bit has some unfixable W+X issues, like EFI code * and writeable data being in the same page. Disable * detection and enforcement there. */ if (IS_ENABLED(CONFIG_X86_32)) return new; /* Only verify when NX is supported: */ if (!(__supported_pte_mask & _PAGE_NX)) return new; if (!((pgprot_val(old) ^ pgprot_val(new)) & (_PAGE_RW | _PAGE_NX))) return new; if ((pgprot_val(new) & (_PAGE_RW | _PAGE_NX)) != _PAGE_RW) return new; /* Non-leaf translation entries can disable writing or execution. */ if (!rw || nx) return new; end = start + npg * PAGE_SIZE - 1; WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: 0x%016lx - 0x%016lx PFN %lx\n", (unsigned long long)pgprot_val(old), (unsigned long long)pgprot_val(new), start, end, pfn); /* * For now, allow all permission change attempts by returning the * attempted permissions. This can 'return old' to actively * refuse the permission change at a later time. */ return new; } /* * Lookup the page table entry for a virtual address in a specific pgd. * Return a pointer to the entry (or NULL if the entry does not exist), * the level of the entry, and the effective NX and RW bits of all * page table levels. */ pte_t *lookup_address_in_pgd_attr(pgd_t *pgd, unsigned long address, unsigned int *level, bool *nx, bool *rw) { p4d_t *p4d; pud_t *pud; pmd_t *pmd; *level = PG_LEVEL_256T; *nx = false; *rw = true; if (pgd_none(*pgd)) return NULL; *level = PG_LEVEL_512G; *nx |= pgd_flags(*pgd) & _PAGE_NX; *rw &= pgd_flags(*pgd) & _PAGE_RW; p4d = p4d_offset(pgd, address); if (p4d_none(*p4d)) return NULL; if (p4d_leaf(*p4d) || !p4d_present(*p4d)) return (pte_t *)p4d; *level = PG_LEVEL_1G; *nx |= p4d_flags(*p4d) & _PAGE_NX; *rw &= p4d_flags(*p4d) & _PAGE_RW; pud = pud_offset(p4d, address); if (pud_none(*pud)) return NULL; if (pud_leaf(*pud) || !pud_present(*pud)) return (pte_t *)pud; *level = PG_LEVEL_2M; *nx |= pud_flags(*pud) & _PAGE_NX; *rw &= pud_flags(*pud) & _PAGE_RW; pmd = pmd_offset(pud, address); if (pmd_none(*pmd)) return NULL; if (pmd_leaf(*pmd) || !pmd_present(*pmd)) return (pte_t *)pmd; *level = PG_LEVEL_4K; *nx |= pmd_flags(*pmd) & _PAGE_NX; *rw &= pmd_flags(*pmd) & _PAGE_RW; return pte_offset_kernel(pmd, address); } /* * Lookup the page table entry for a virtual address in a specific pgd. * Return a pointer to the entry and the level of the mapping. */ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address, unsigned int *level) { bool nx, rw; return lookup_address_in_pgd_attr(pgd, address, level, &nx, &rw); } /* * Lookup the page table entry for a virtual address. Return a pointer * to the entry and the level of the mapping. * * Note: the function returns p4d, pud or pmd either when the entry is marked * large or when the present bit is not set. Otherwise it returns NULL. 
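 *
 * Minimal usage sketch (illustrative only; 'addr' stands for any kernel
 * virtual address the caller knows stays mapped while the result is used):
 *
 *	unsigned int level;
 *	pte_t *pte = lookup_address(addr, &level);
 *
 *	if (pte && (pte_val(*pte) & _PAGE_PRESENT))
 *		pr_info("%#lx mapped at level %u\n", addr, level);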
*/ pte_t *lookup_address(unsigned long address, unsigned int *level) { return lookup_address_in_pgd(pgd_offset_k(address), address, level); } EXPORT_SYMBOL_GPL(lookup_address); static pte_t *_lookup_address_cpa(struct cpa_data *cpa, unsigned long address, unsigned int *level, bool *nx, bool *rw) { pgd_t *pgd; if (!cpa->pgd) pgd = pgd_offset_k(address); else pgd = cpa->pgd + pgd_index(address); return lookup_address_in_pgd_attr(pgd, address, level, nx, rw); } /* * Lookup the PMD entry for a virtual address. Return a pointer to the entry * or NULL if not present. */ pmd_t *lookup_pmd_address(unsigned long address) { pgd_t *pgd; p4d_t *p4d; pud_t *pud; pgd = pgd_offset_k(address); if (pgd_none(*pgd)) return NULL; p4d = p4d_offset(pgd, address); if (p4d_none(*p4d) || p4d_leaf(*p4d) || !p4d_present(*p4d)) return NULL; pud = pud_offset(p4d, address); if (pud_none(*pud) || pud_leaf(*pud) || !pud_present(*pud)) return NULL; return pmd_offset(pud, address); } /* * This is necessary because __pa() does not work on some * kinds of memory, like vmalloc() or the alloc_remap() * areas on 32-bit NUMA systems. The percpu areas can * end up in this kind of memory, for instance. * * Note that as long as the PTEs are well-formed with correct PFNs, this * works without checking the PRESENT bit in the leaf PTE. This is unlike * the similar vmalloc_to_page() and derivatives. Callers may depend on * this behavior. * * This could be optimized, but it is only used in paths that are not perf * sensitive, and keeping it unoptimized should increase the testing coverage * for the more obscure platforms. */ phys_addr_t slow_virt_to_phys(void *__virt_addr) { unsigned long virt_addr = (unsigned long)__virt_addr; phys_addr_t phys_addr; unsigned long offset; enum pg_level level; pte_t *pte; pte = lookup_address(virt_addr, &level); BUG_ON(!pte); /* * pXX_pfn() returns unsigned long, which must be cast to phys_addr_t * before being left-shifted PAGE_SHIFT bits -- this trick is to * make 32-PAE kernel work correctly. */ switch (level) { case PG_LEVEL_1G: phys_addr = (phys_addr_t)pud_pfn(*(pud_t *)pte) << PAGE_SHIFT; offset = virt_addr & ~PUD_MASK; break; case PG_LEVEL_2M: phys_addr = (phys_addr_t)pmd_pfn(*(pmd_t *)pte) << PAGE_SHIFT; offset = virt_addr & ~PMD_MASK; break; default: phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT; offset = virt_addr & ~PAGE_MASK; } return (phys_addr_t)(phys_addr | offset); } EXPORT_SYMBOL_GPL(slow_virt_to_phys); /* * Set the new pmd in all the pgds we know about: */ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte) { /* change init_mm */ set_pte_atomic(kpte, pte); #ifdef CONFIG_X86_32 { struct page *page; list_for_each_entry(page, &pgd_list, lru) { pgd_t *pgd; p4d_t *p4d; pud_t *pud; pmd_t *pmd; pgd = (pgd_t *)page_address(page) + pgd_index(address); p4d = p4d_offset(pgd, address); pud = pud_offset(p4d, address); pmd = pmd_offset(pud, address); set_pte_atomic((pte_t *)pmd, pte); } } #endif } static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot) { /* * _PAGE_GLOBAL means "global page" for present PTEs. * But, it is also used to indicate _PAGE_PROTNONE * for non-present PTEs. * * This ensures that a _PAGE_GLOBAL PTE going from * present to non-present is not confused as * _PAGE_PROTNONE. 
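 *
 * (On x86 _PAGE_BIT_PROTNONE is the same bit as _PAGE_BIT_GLOBAL, and
 *  pte_present() tests _PAGE_PRESENT | _PAGE_PROTNONE, so a not-present
 *  PTE that kept the global bit would wrongly look like a PROT_NONE
 *  mapping.)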
*/ if (!(pgprot_val(prot) & _PAGE_PRESENT)) pgprot_val(prot) &= ~_PAGE_GLOBAL; return prot; } static int __should_split_large_page(pte_t *kpte, unsigned long address, struct cpa_data *cpa) { unsigned long numpages, pmask, psize, lpaddr, pfn, old_pfn; pgprot_t old_prot, new_prot, req_prot, chk_prot; pte_t new_pte, *tmp; enum pg_level level; bool nx, rw; /* * Check for races, another CPU might have split this page * up already: */ tmp = _lookup_address_cpa(cpa, address, &level, &nx, &rw); if (tmp != kpte) return 1; switch (level) { case PG_LEVEL_2M: old_prot = pmd_pgprot(*(pmd_t *)kpte); old_pfn = pmd_pfn(*(pmd_t *)kpte); cpa_inc_2m_checked(); break; case PG_LEVEL_1G: old_prot = pud_pgprot(*(pud_t *)kpte); old_pfn = pud_pfn(*(pud_t *)kpte); cpa_inc_1g_checked(); break; default: return -EINVAL; } psize = page_level_size(level); pmask = page_level_mask(level); /* * Calculate the number of pages, which fit into this large * page starting at address: */ lpaddr = (address + psize) & pmask; numpages = (lpaddr - address) >> PAGE_SHIFT; if (numpages < cpa->numpages) cpa->numpages = numpages; /* * We are safe now. Check whether the new pgprot is the same: * Convert protection attributes to 4k-format, as cpa->mask* are set * up accordingly. */ /* Clear PSE (aka _PAGE_PAT) and move PAT bit to correct position */ req_prot = pgprot_large_2_4k(old_prot); pgprot_val(req_prot) &= ~pgprot_val(cpa->mask_clr); pgprot_val(req_prot) |= pgprot_val(cpa->mask_set); /* * req_prot is in format of 4k pages. It must be converted to large * page format: the caching mode includes the PAT bit located at * different bit positions in the two formats. */ req_prot = pgprot_4k_2_large(req_prot); req_prot = pgprot_clear_protnone_bits(req_prot); if (pgprot_val(req_prot) & _PAGE_PRESENT) pgprot_val(req_prot) |= _PAGE_PSE; /* * old_pfn points to the large page base pfn. So we need to add the * offset of the virtual address: */ pfn = old_pfn + ((address & (psize - 1)) >> PAGE_SHIFT); cpa->pfn = pfn; /* * Calculate the large page base address and the number of 4K pages * in the large page */ lpaddr = address & pmask; numpages = psize >> PAGE_SHIFT; /* * Sanity check that the existing mapping is correct versus the static * protections. static_protections() guards against !PRESENT, so no * extra conditional required here. */ chk_prot = static_protections(old_prot, lpaddr, old_pfn, numpages, psize, CPA_CONFLICT); if (WARN_ON_ONCE(pgprot_val(chk_prot) != pgprot_val(old_prot))) { /* * Split the large page and tell the split code to * enforce static protections. */ cpa->force_static_prot = 1; return 1; } /* * Optimization: If the requested pgprot is the same as the current * pgprot, then the large page can be preserved and no updates are * required independent of alignment and length of the requested * range. The above already established that the current pgprot is * correct, which in consequence makes the requested pgprot correct * as well if it is the same. The static protection scan below will * not come to a different conclusion. */ if (pgprot_val(req_prot) == pgprot_val(old_prot)) { cpa_inc_lp_sameprot(level); return 0; } /* * If the requested range does not cover the full page, split it up */ if (address != lpaddr || cpa->numpages != numpages) return 1; /* * Check whether the requested pgprot is conflicting with a static * protection requirement in the large page. 
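 *
 * (This pass uses CPA_DETECT, which is above the default cpa_warn_level of
 *  CPA_PROTECT, so check_conflict() stays silent here; a detected conflict
 *  simply makes this function return 1 below and the large page is split.)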
*/ new_prot = static_protections(req_prot, lpaddr, old_pfn, numpages, psize, CPA_DETECT); new_prot = verify_rwx(old_prot, new_prot, lpaddr, old_pfn, numpages, nx, rw); /* * If there is a conflict, split the large page. * * There used to be a 4k wise evaluation trying really hard to * preserve the large pages, but experimentation has shown, that this * does not help at all. There might be corner cases which would * preserve one large page occasionally, but it's really not worth the * extra code and cycles for the common case. */ if (pgprot_val(req_prot) != pgprot_val(new_prot)) return 1; /* All checks passed. Update the large page mapping. */ new_pte = pfn_pte(old_pfn, new_prot); __set_pmd_pte(kpte, address, new_pte); cpa->flags |= CPA_FLUSHTLB; cpa_inc_lp_preserved(level); return 0; } static int should_split_large_page(pte_t *kpte, unsigned long address, struct cpa_data *cpa) { int do_split; if (cpa->force_split) return 1; spin_lock(&pgd_lock); do_split = __should_split_large_page(kpte, address, cpa); spin_unlock(&pgd_lock); return do_split; } static void split_set_pte(struct cpa_data *cpa, pte_t *pte, unsigned long pfn, pgprot_t ref_prot, unsigned long address, unsigned long size) { unsigned int npg = PFN_DOWN(size); pgprot_t prot; /* * If should_split_large_page() discovered an inconsistent mapping, * remove the invalid protection in the split mapping. */ if (!cpa->force_static_prot) goto set; /* Hand in lpsize = 0 to enforce the protection mechanism */ prot = static_protections(ref_prot, address, pfn, npg, 0, CPA_PROTECT); if (pgprot_val(prot) == pgprot_val(ref_prot)) goto set; /* * If this is splitting a PMD, fix it up. PUD splits cannot be * fixed trivially as that would require to rescan the newly * installed PMD mappings after returning from split_large_page() * so an eventual further split can allocate the necessary PTE * pages. Warn for now and revisit it in case this actually * happens. */ if (size == PAGE_SIZE) ref_prot = prot; else pr_warn_once("CPA: Cannot fixup static protections for PUD split\n"); set: set_pte(pte, pfn_pte(pfn, ref_prot)); } static int __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address, struct page *base) { unsigned long lpaddr, lpinc, ref_pfn, pfn, pfninc = 1; pte_t *pbase = (pte_t *)page_address(base); unsigned int i, level; pgprot_t ref_prot; bool nx, rw; pte_t *tmp; spin_lock(&pgd_lock); /* * Check for races, another CPU might have split this page * up for us already: */ tmp = _lookup_address_cpa(cpa, address, &level, &nx, &rw); if (tmp != kpte) { spin_unlock(&pgd_lock); return 1; } paravirt_alloc_pte(&init_mm, page_to_pfn(base)); switch (level) { case PG_LEVEL_2M: ref_prot = pmd_pgprot(*(pmd_t *)kpte); /* * Clear PSE (aka _PAGE_PAT) and move * PAT bit to correct position. */ ref_prot = pgprot_large_2_4k(ref_prot); ref_pfn = pmd_pfn(*(pmd_t *)kpte); lpaddr = address & PMD_MASK; lpinc = PAGE_SIZE; break; case PG_LEVEL_1G: ref_prot = pud_pgprot(*(pud_t *)kpte); ref_pfn = pud_pfn(*(pud_t *)kpte); pfninc = PMD_SIZE >> PAGE_SHIFT; lpaddr = address & PUD_MASK; lpinc = PMD_SIZE; /* * Clear the PSE flags if the PRESENT flag is not set * otherwise pmd_present() will return true even on a non * present pmd. 
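 *
 * (pmd_present() on x86 returns true if any of _PAGE_PRESENT,
 *  _PAGE_PROTNONE or _PAGE_PSE is set, so a stale PSE bit would make the
 *  not-present entries created by this split look present.)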
*/ if (!(pgprot_val(ref_prot) & _PAGE_PRESENT)) pgprot_val(ref_prot) &= ~_PAGE_PSE; break; default: spin_unlock(&pgd_lock); return 1; } ref_prot = pgprot_clear_protnone_bits(ref_prot); /* * Get the target pfn from the original entry: */ pfn = ref_pfn; for (i = 0; i < PTRS_PER_PTE; i++, pfn += pfninc, lpaddr += lpinc) split_set_pte(cpa, pbase + i, pfn, ref_prot, lpaddr, lpinc); if (virt_addr_valid(address)) { unsigned long pfn = PFN_DOWN(__pa(address)); if (pfn_range_is_mapped(pfn, pfn + 1)) split_page_count(level); } /* * Install the new, split up pagetable. * * We use the standard kernel pagetable protections for the new * pagetable protections, the actual ptes set above control the * primary protection behavior: */ __set_pmd_pte(kpte, address, mk_pte(base, __pgprot(_KERNPG_TABLE))); /* * Do a global flush tlb after splitting the large page * and before we do the actual change page attribute in the PTE. * * Without this, we violate the TLB application note, that says: * "The TLBs may contain both ordinary and large-page * translations for a 4-KByte range of linear addresses. This * may occur if software modifies the paging structures so that * the page size used for the address range changes. If the two * translations differ with respect to page frame or attributes * (e.g., permissions), processor behavior is undefined and may * be implementation-specific." * * We do this global tlb flush inside the cpa_lock, so that we * don't allow any other cpu, with stale tlb entries change the * page attribute in parallel, that also falls into the * just split large page entry. */ flush_tlb_all(); spin_unlock(&pgd_lock); return 0; } static int split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address) { struct page *base; if (!debug_pagealloc_enabled()) spin_unlock(&cpa_lock); base = alloc_pages(GFP_KERNEL, 0); if (!debug_pagealloc_enabled()) spin_lock(&cpa_lock); if (!base) return -ENOMEM; if (__split_large_page(cpa, kpte, address, base)) __free_page(base); return 0; } static int collapse_pmd_page(pmd_t *pmd, unsigned long addr, struct list_head *pgtables) { pmd_t _pmd, old_pmd; pte_t *pte, first; unsigned long pfn; pgprot_t pgprot; int i = 0; if (!cpu_feature_enabled(X86_FEATURE_PSE)) return 0; addr &= PMD_MASK; pte = pte_offset_kernel(pmd, addr); first = *pte; pfn = pte_pfn(first); /* Make sure alignment is suitable */ if (PFN_PHYS(pfn) & ~PMD_MASK) return 0; /* The page is 4k intentionally */ if (pte_flags(first) & _PAGE_KERNEL_4K) return 0; /* Check that the rest of PTEs are compatible with the first one */ for (i = 1, pte++; i < PTRS_PER_PTE; i++, pte++) { pte_t entry = *pte; if (!pte_present(entry)) return 0; if (pte_flags(entry) != pte_flags(first)) return 0; if (pte_pfn(entry) != pte_pfn(first) + i) return 0; } old_pmd = *pmd; /* Success: set up a large page */ pgprot = pgprot_4k_2_large(pte_pgprot(first)); pgprot_val(pgprot) |= _PAGE_PSE; _pmd = pfn_pmd(pfn, pgprot); set_pmd(pmd, _pmd); /* Queue the page table to be freed after TLB flush */ list_add(&page_ptdesc(pmd_page(old_pmd))->pt_list, pgtables); if (IS_ENABLED(CONFIG_X86_32)) { struct page *page; /* Update all PGD tables to use the same large page */ list_for_each_entry(page, &pgd_list, lru) { pgd_t *pgd = (pgd_t *)page_address(page) + pgd_index(addr); p4d_t *p4d = p4d_offset(pgd, addr); pud_t *pud = pud_offset(p4d, addr); pmd_t *pmd = pmd_offset(pud, addr); /* Something is wrong if entries doesn't match */ if (WARN_ON(pmd_val(old_pmd) != pmd_val(*pmd))) continue; set_pmd(pmd, _pmd); } } if (virt_addr_valid(addr) && 
pfn_range_is_mapped(pfn, pfn + 1)) collapse_page_count(PG_LEVEL_2M); return 1; } static int collapse_pud_page(pud_t *pud, unsigned long addr, struct list_head *pgtables) { unsigned long pfn; pmd_t *pmd, first; int i; if (!direct_gbpages) return 0; addr &= PUD_MASK; pmd = pmd_offset(pud, addr); first = *pmd; /* * To restore PUD page all PMD entries must be large and * have suitable alignment */ pfn = pmd_pfn(first); if (!pmd_leaf(first) || (PFN_PHYS(pfn) & ~PUD_MASK)) return 0; /* * To restore PUD page, all following PMDs must be compatible with the * first one. */ for (i = 1, pmd++; i < PTRS_PER_PMD; i++, pmd++) { pmd_t entry = *pmd; if (!pmd_present(entry) || !pmd_leaf(entry)) return 0; if (pmd_flags(entry) != pmd_flags(first)) return 0; if (pmd_pfn(entry) != pmd_pfn(first) + i * PTRS_PER_PTE) return 0; } /* Restore PUD page and queue page table to be freed after TLB flush */ list_add(&page_ptdesc(pud_page(*pud))->pt_list, pgtables); set_pud(pud, pfn_pud(pfn, pmd_pgprot(first))); if (virt_addr_valid(addr) && pfn_range_is_mapped(pfn, pfn + 1)) collapse_page_count(PG_LEVEL_1G); return 1; } /* * Collapse PMD and PUD pages in the kernel mapping around the address where * possible. * * Caller must flush TLB and free page tables queued on the list before * touching the new entries. CPU must not see TLB entries of different size * with different attributes. */ static int collapse_large_pages(unsigned long addr, struct list_head *pgtables) { int collapsed = 0; pgd_t *pgd; p4d_t *p4d; pud_t *pud; pmd_t *pmd; addr &= PMD_MASK; spin_lock(&pgd_lock); pgd = pgd_offset_k(addr); if (pgd_none(*pgd)) goto out; p4d = p4d_offset(pgd, addr); if (p4d_none(*p4d)) goto out; pud = pud_offset(p4d, addr); if (!pud_present(*pud) || pud_leaf(*pud)) goto out; pmd = pmd_offset(pud, addr); if (!pmd_present(*pmd) || pmd_leaf(*pmd)) goto out; collapsed = collapse_pmd_page(pmd, addr, pgtables); if (collapsed) collapsed += collapse_pud_page(pud, addr, pgtables); out: spin_unlock(&pgd_lock); return collapsed; } static bool try_to_free_pte_page(pte_t *pte) { int i; for (i = 0; i < PTRS_PER_PTE; i++) if (!pte_none(pte[i])) return false; free_page((unsigned long)pte); return true; } static bool try_to_free_pmd_page(pmd_t *pmd) { int i; for (i = 0; i < PTRS_PER_PMD; i++) if (!pmd_none(pmd[i])) return false; free_page((unsigned long)pmd); return true; } static bool unmap_pte_range(pmd_t *pmd, unsigned long start, unsigned long end) { pte_t *pte = pte_offset_kernel(pmd, start); while (start < end) { set_pte(pte, __pte(0)); start += PAGE_SIZE; pte++; } if (try_to_free_pte_page((pte_t *)pmd_page_vaddr(*pmd))) { pmd_clear(pmd); return true; } return false; } static void __unmap_pmd_range(pud_t *pud, pmd_t *pmd, unsigned long start, unsigned long end) { if (unmap_pte_range(pmd, start, end)) if (try_to_free_pmd_page(pud_pgtable(*pud))) pud_clear(pud); } static void unmap_pmd_range(pud_t *pud, unsigned long start, unsigned long end) { pmd_t *pmd = pmd_offset(pud, start); /* * Not on a 2MB page boundary? */ if (start & (PMD_SIZE - 1)) { unsigned long next_page = (start + PMD_SIZE) & PMD_MASK; unsigned long pre_end = min_t(unsigned long, end, next_page); __unmap_pmd_range(pud, pmd, start, pre_end); start = pre_end; pmd++; } /* * Try to unmap in 2M chunks. */ while (end - start >= PMD_SIZE) { if (pmd_leaf(*pmd)) pmd_clear(pmd); else __unmap_pmd_range(pud, pmd, start, start + PMD_SIZE); start += PMD_SIZE; pmd++; } /* * 4K leftovers? 
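 * (i.e. a sub-2M tail; the unaligned head and all full 2M chunks of the
 *  range were handled above)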
*/ if (start < end) return __unmap_pmd_range(pud, pmd, start, end); /* * Try again to free the PMD page if haven't succeeded above. */ if (!pud_none(*pud)) if (try_to_free_pmd_page(pud_pgtable(*pud))) pud_clear(pud); } static void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end) { pud_t *pud = pud_offset(p4d, start); /* * Not on a GB page boundary? */ if (start & (PUD_SIZE - 1)) { unsigned long next_page = (start + PUD_SIZE) & PUD_MASK; unsigned long pre_end = min_t(unsigned long, end, next_page); unmap_pmd_range(pud, start, pre_end); start = pre_end; pud++; } /* * Try to unmap in 1G chunks? */ while (end - start >= PUD_SIZE) { if (pud_leaf(*pud)) pud_clear(pud); else unmap_pmd_range(pud, start, start + PUD_SIZE); start += PUD_SIZE; pud++; } /* * 2M leftovers? */ if (start < end) unmap_pmd_range(pud, start, end); /* * No need to try to free the PUD page because we'll free it in * populate_pgd's error path */ } static int alloc_pte_page(pmd_t *pmd) { pte_t *pte = (pte_t *)get_zeroed_page(GFP_KERNEL); if (!pte) return -1; set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE)); return 0; } static int alloc_pmd_page(pud_t *pud) { pmd_t *pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL); if (!pmd) return -1; set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE)); return 0; } static void populate_pte(struct cpa_data *cpa, unsigned long start, unsigned long end, unsigned num_pages, pmd_t *pmd, pgprot_t pgprot) { pte_t *pte; pte = pte_offset_kernel(pmd, start); pgprot = pgprot_clear_protnone_bits(pgprot); while (num_pages-- && start < end) { set_pte(pte, pfn_pte(cpa->pfn, pgprot)); start += PAGE_SIZE; cpa->pfn++; pte++; } } static long populate_pmd(struct cpa_data *cpa, unsigned long start, unsigned long end, unsigned num_pages, pud_t *pud, pgprot_t pgprot) { long cur_pages = 0; pmd_t *pmd; pgprot_t pmd_pgprot; /* * Not on a 2M boundary? */ if (start & (PMD_SIZE - 1)) { unsigned long pre_end = start + (num_pages << PAGE_SHIFT); unsigned long next_page = (start + PMD_SIZE) & PMD_MASK; pre_end = min_t(unsigned long, pre_end, next_page); cur_pages = (pre_end - start) >> PAGE_SHIFT; cur_pages = min_t(unsigned int, num_pages, cur_pages); /* * Need a PTE page? */ pmd = pmd_offset(pud, start); if (pmd_none(*pmd)) if (alloc_pte_page(pmd)) return -1; populate_pte(cpa, start, pre_end, cur_pages, pmd, pgprot); start = pre_end; } /* * We mapped them all? */ if (num_pages == cur_pages) return cur_pages; pmd_pgprot = pgprot_4k_2_large(pgprot); while (end - start >= PMD_SIZE) { /* * We cannot use a 1G page so allocate a PMD page if needed. */ if (pud_none(*pud)) if (alloc_pmd_page(pud)) return -1; pmd = pmd_offset(pud, start); set_pmd(pmd, pmd_mkhuge(pfn_pmd(cpa->pfn, canon_pgprot(pmd_pgprot)))); start += PMD_SIZE; cpa->pfn += PMD_SIZE >> PAGE_SHIFT; cur_pages += PMD_SIZE >> PAGE_SHIFT; } /* * Map trailing 4K pages. */ if (start < end) { pmd = pmd_offset(pud, start); if (pmd_none(*pmd)) if (alloc_pte_page(pmd)) return -1; populate_pte(cpa, start, end, num_pages - cur_pages, pmd, pgprot); } return num_pages; } static int populate_pud(struct cpa_data *cpa, unsigned long start, p4d_t *p4d, pgprot_t pgprot) { pud_t *pud; unsigned long end; long cur_pages = 0; pgprot_t pud_pgprot; end = start + (cpa->numpages << PAGE_SHIFT); /* * Not on a Gb page boundary? => map everything up to it with * smaller pages. 
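 *
 * For example (illustrative numbers): a request of 1 GiB + 4 MiB starting
 * 2 MiB below a 1 GiB boundary is populated as a 2 MiB head via
 * populate_pmd(), one 1 GiB page for the aligned middle (when
 * X86_FEATURE_GBPAGES is available), and a 2 MiB tail, again via
 * populate_pmd().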
*/ if (start & (PUD_SIZE - 1)) { unsigned long pre_end; unsigned long next_page = (start + PUD_SIZE) & PUD_MASK; pre_end = min_t(unsigned long, end, next_page); cur_pages = (pre_end - start) >> PAGE_SHIFT; cur_pages = min_t(int, (int)cpa->numpages, cur_pages); pud = pud_offset(p4d, start); /* * Need a PMD page? */ if (pud_none(*pud)) if (alloc_pmd_page(pud)) return -1; cur_pages = populate_pmd(cpa, start, pre_end, cur_pages, pud, pgprot); if (cur_pages < 0) return cur_pages; start = pre_end; } /* We mapped them all? */ if (cpa->numpages == cur_pages) return cur_pages; pud = pud_offset(p4d, start); pud_pgprot = pgprot_4k_2_large(pgprot); /* * Map everything starting from the Gb boundary, possibly with 1G pages */ while (boot_cpu_has(X86_FEATURE_GBPAGES) && end - start >= PUD_SIZE) { set_pud(pud, pud_mkhuge(pfn_pud(cpa->pfn, canon_pgprot(pud_pgprot)))); start += PUD_SIZE; cpa->pfn += PUD_SIZE >> PAGE_SHIFT; cur_pages += PUD_SIZE >> PAGE_SHIFT; pud++; } /* Map trailing leftover */ if (start < end) { long tmp; pud = pud_offset(p4d, start); if (pud_none(*pud)) if (alloc_pmd_page(pud)) return -1; tmp = populate_pmd(cpa, start, end, cpa->numpages - cur_pages, pud, pgprot); if (tmp < 0) return cur_pages; cur_pages += tmp; } return cur_pages; } /* * Restrictions for kernel page table do not necessarily apply when mapping in * an alternate PGD. */ static int populate_pgd(struct cpa_data *cpa, unsigned long addr) { pgprot_t pgprot = __pgprot(_KERNPG_TABLE); pud_t *pud = NULL; /* shut up gcc */ p4d_t *p4d; pgd_t *pgd_entry; long ret; pgd_entry = cpa->pgd + pgd_index(addr); if (pgd_none(*pgd_entry)) { p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL); if (!p4d) return -1; set_pgd(pgd_entry, __pgd(__pa(p4d) | _KERNPG_TABLE)); } /* * Allocate a PUD page and hand it down for mapping. */ p4d = p4d_offset(pgd_entry, addr); if (p4d_none(*p4d)) { pud = (pud_t *)get_zeroed_page(GFP_KERNEL); if (!pud) return -1; set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE)); } pgprot_val(pgprot) &= ~pgprot_val(cpa->mask_clr); pgprot_val(pgprot) |= pgprot_val(cpa->mask_set); ret = populate_pud(cpa, addr, p4d, pgprot); if (ret < 0) { /* * Leave the PUD page in place in case some other CPU or thread * already found it, but remove any useless entries we just * added to it. */ unmap_pud_range(p4d, addr, addr + (cpa->numpages << PAGE_SHIFT)); return ret; } cpa->numpages = ret; return 0; } static int __cpa_process_fault(struct cpa_data *cpa, unsigned long vaddr, int primary) { if (cpa->pgd) { /* * Right now, we only execute this code path when mapping * the EFI virtual memory map regions, no other users * provide a ->pgd value. This may change in the future. */ return populate_pgd(cpa, vaddr); } /* * Ignore all non primary paths. */ if (!primary) { cpa->numpages = 1; return 0; } /* * Ignore the NULL PTE for kernel identity mapping, as it is expected * to have holes. * Also set numpages to '1' indicating that we processed cpa req for * one virtual address page and its pfn. TBD: numpages can be set based * on the initial value and the level returned by lookup_address(). */ if (within(vaddr, PAGE_OFFSET, PAGE_OFFSET + (max_pfn_mapped << PAGE_SHIFT))) { cpa->numpages = 1; cpa->pfn = __pa(vaddr) >> PAGE_SHIFT; return 0; } else if (__cpa_pfn_in_highmap(cpa->pfn)) { /* Faults in the highmap are OK, so do not warn: */ return -EFAULT; } else { WARN(1, KERN_WARNING "CPA: called for zero pte. 
" "vaddr = %lx cpa->vaddr = %lx\n", vaddr, *cpa->vaddr); return -EFAULT; } } static int __change_page_attr(struct cpa_data *cpa, int primary) { unsigned long address; int do_split, err; unsigned int level; pte_t *kpte, old_pte; bool nx, rw; address = __cpa_addr(cpa, cpa->curpage); repeat: kpte = _lookup_address_cpa(cpa, address, &level, &nx, &rw); if (!kpte) return __cpa_process_fault(cpa, address, primary); old_pte = *kpte; if (pte_none(old_pte)) return __cpa_process_fault(cpa, address, primary); if (level == PG_LEVEL_4K) { pte_t new_pte; pgprot_t old_prot = pte_pgprot(old_pte); pgprot_t new_prot = pte_pgprot(old_pte); unsigned long pfn = pte_pfn(old_pte); pgprot_val(new_prot) &= ~pgprot_val(cpa->mask_clr); pgprot_val(new_prot) |= pgprot_val(cpa->mask_set); cpa_inc_4k_install(); /* Hand in lpsize = 0 to enforce the protection mechanism */ new_prot = static_protections(new_prot, address, pfn, 1, 0, CPA_PROTECT); new_prot = verify_rwx(old_prot, new_prot, address, pfn, 1, nx, rw); new_prot = pgprot_clear_protnone_bits(new_prot); /* * We need to keep the pfn from the existing PTE, * after all we're only going to change its attributes * not the memory it points to */ new_pte = pfn_pte(pfn, new_prot); cpa->pfn = pfn; /* * Do we really change anything ? */ if (pte_val(old_pte) != pte_val(new_pte)) { set_pte_atomic(kpte, new_pte); cpa->flags |= CPA_FLUSHTLB; } cpa->numpages = 1; return 0; } /* * Check, whether we can keep the large page intact * and just change the pte: */ do_split = should_split_large_page(kpte, address, cpa); /* * When the range fits into the existing large page, * return. cp->numpages and cpa->tlbflush have been updated in * try_large_page: */ if (do_split <= 0) return do_split; /* * We have to split the large page: */ err = split_large_page(cpa, kpte, address); if (!err) goto repeat; return err; } static int __change_page_attr_set_clr(struct cpa_data *cpa, int primary); /* * Check the directmap and "high kernel map" 'aliases'. */ static int cpa_process_alias(struct cpa_data *cpa) { struct cpa_data alias_cpa; unsigned long laddr = (unsigned long)__va(cpa->pfn << PAGE_SHIFT); unsigned long vaddr; int ret; if (!pfn_range_is_mapped(cpa->pfn, cpa->pfn + 1)) return 0; /* * No need to redo, when the primary call touched the direct * mapping already: */ vaddr = __cpa_addr(cpa, cpa->curpage); if (!(within(vaddr, PAGE_OFFSET, PAGE_OFFSET + (max_pfn_mapped << PAGE_SHIFT)))) { alias_cpa = *cpa; alias_cpa.vaddr = &laddr; alias_cpa.flags &= ~(CPA_PAGES_ARRAY | CPA_ARRAY); alias_cpa.curpage = 0; /* Directmap always has NX set, do not modify. */ if (__supported_pte_mask & _PAGE_NX) { alias_cpa.mask_clr.pgprot &= ~_PAGE_NX; alias_cpa.mask_set.pgprot &= ~_PAGE_NX; } cpa->force_flush_all = 1; ret = __change_page_attr_set_clr(&alias_cpa, 0); if (ret) return ret; } #ifdef CONFIG_X86_64 /* * If the primary call didn't touch the high mapping already * and the physical address is inside the kernel map, we need * to touch the high mapped kernel as well: */ if (!within(vaddr, (unsigned long)_text, _brk_end) && __cpa_pfn_in_highmap(cpa->pfn)) { unsigned long temp_cpa_vaddr = (cpa->pfn << PAGE_SHIFT) + __START_KERNEL_map - phys_base; alias_cpa = *cpa; alias_cpa.vaddr = &temp_cpa_vaddr; alias_cpa.flags &= ~(CPA_PAGES_ARRAY | CPA_ARRAY); alias_cpa.curpage = 0; /* * [_text, _brk_end) also covers data, do not modify NX except * in cases where the highmap is the primary target. 
*/ if (__supported_pte_mask & _PAGE_NX) { alias_cpa.mask_clr.pgprot &= ~_PAGE_NX; alias_cpa.mask_set.pgprot &= ~_PAGE_NX; } cpa->force_flush_all = 1; /* * The high mapping range is imprecise, so ignore the * return value. */ __change_page_attr_set_clr(&alias_cpa, 0); } #endif return 0; } static int __change_page_attr_set_clr(struct cpa_data *cpa, int primary) { unsigned long numpages = cpa->numpages; unsigned long rempages = numpages; int ret = 0; /* * No changes, easy! */ if (!(pgprot_val(cpa->mask_set) | pgprot_val(cpa->mask_clr)) && !cpa->force_split) return ret; while (rempages) { /* * Store the remaining nr of pages for the large page * preservation check. */ cpa->numpages = rempages; /* for array changes, we can't use large page */ if (cpa->flags & (CPA_ARRAY | CPA_PAGES_ARRAY)) cpa->numpages = 1; if (!debug_pagealloc_enabled()) spin_lock(&cpa_lock); ret = __change_page_attr(cpa, primary); if (!debug_pagealloc_enabled()) spin_unlock(&cpa_lock); if (ret) goto out; if (primary && !(cpa->flags & CPA_NO_CHECK_ALIAS)) { ret = cpa_process_alias(cpa); if (ret) goto out; } /* * Adjust the number of pages with the result of the * CPA operation. Either a large page has been * preserved or a single page update happened. */ BUG_ON(cpa->numpages > rempages || !cpa->numpages); rempages -= cpa->numpages; cpa->curpage += cpa->numpages; } out: /* Restore the original numpages */ cpa->numpages = numpages; return ret; } static int change_page_attr_set_clr(unsigned long *addr, int numpages, pgprot_t mask_set, pgprot_t mask_clr, int force_split, int in_flag, struct page **pages) { struct cpa_data cpa; int ret, cache; memset(&cpa, 0, sizeof(cpa)); /* * Check, if we are requested to set a not supported * feature. Clearing non-supported features is OK. */ mask_set = canon_pgprot(mask_set); if (!pgprot_val(mask_set) && !pgprot_val(mask_clr) && !force_split) return 0; /* Ensure we are PAGE_SIZE aligned */ if (in_flag & CPA_ARRAY) { int i; for (i = 0; i < numpages; i++) { if (addr[i] & ~PAGE_MASK) { addr[i] &= PAGE_MASK; WARN_ON_ONCE(1); } } } else if (!(in_flag & CPA_PAGES_ARRAY)) { /* * in_flag of CPA_PAGES_ARRAY implies it is aligned. * No need to check in that case */ if (*addr & ~PAGE_MASK) { *addr &= PAGE_MASK; /* * People should not be passing in unaligned addresses: */ WARN_ON_ONCE(1); } } /* Must avoid aliasing mappings in the highmem code */ kmap_flush_unused(); vm_unmap_aliases(); cpa.vaddr = addr; cpa.pages = pages; cpa.numpages = numpages; cpa.mask_set = mask_set; cpa.mask_clr = mask_clr; cpa.flags = in_flag; cpa.curpage = 0; cpa.force_split = force_split; ret = __change_page_attr_set_clr(&cpa, 1); /* * Check whether we really changed something: */ if (!(cpa.flags & CPA_FLUSHTLB)) goto out; /* * No need to flush, when we did not set any of the caching * attributes: */ cache = !!pgprot2cachemode(mask_set); /* * On error; flush everything to be sure. */ if (ret) { cpa_flush_all(cache); goto out; } cpa_flush(&cpa, cache); out: return ret; } static inline int change_page_attr_set(unsigned long *addr, int numpages, pgprot_t mask, int array) { return change_page_attr_set_clr(addr, numpages, mask, __pgprot(0), 0, (array ? CPA_ARRAY : 0), NULL); } static inline int change_page_attr_clear(unsigned long *addr, int numpages, pgprot_t mask, int array) { return change_page_attr_set_clr(addr, numpages, __pgprot(0), mask, 0, (array ? 
CPA_ARRAY : 0), NULL); } static inline int cpa_set_pages_array(struct page **pages, int numpages, pgprot_t mask) { return change_page_attr_set_clr(NULL, numpages, mask, __pgprot(0), 0, CPA_PAGES_ARRAY, pages); } static inline int cpa_clear_pages_array(struct page **pages, int numpages, pgprot_t mask) { return change_page_attr_set_clr(NULL, numpages, __pgprot(0), mask, 0, CPA_PAGES_ARRAY, pages); } /* * __set_memory_prot is an internal helper for callers that have been passed * a pgprot_t value from upper layers and a reservation has already been taken. * If you want to set the pgprot to a specific page protocol, use the * set_memory_xx() functions. */ int __set_memory_prot(unsigned long addr, int numpages, pgprot_t prot) { return change_page_attr_set_clr(&addr, numpages, prot, __pgprot(~pgprot_val(prot)), 0, 0, NULL); } int _set_memory_uc(unsigned long addr, int numpages) { /* * for now UC MINUS. see comments in ioremap() * If you really need strong UC use ioremap_uc(), but note * that you cannot override IO areas with set_memory_*() as * these helpers cannot work with IO memory. */ return change_page_attr_set(&addr, numpages, cachemode2pgprot(_PAGE_CACHE_MODE_UC_MINUS), 0); } int set_memory_uc(unsigned long addr, int numpages) { int ret; /* * for now UC MINUS. see comments in ioremap() */ ret = memtype_reserve(__pa(addr), __pa(addr) + numpages * PAGE_SIZE, _PAGE_CACHE_MODE_UC_MINUS, NULL); if (ret) goto out_err; ret = _set_memory_uc(addr, numpages); if (ret) goto out_free; return 0; out_free: memtype_free(__pa(addr), __pa(addr) + numpages * PAGE_SIZE); out_err: return ret; } EXPORT_SYMBOL(set_memory_uc); int _set_memory_wc(unsigned long addr, int numpages) { int ret; ret = change_page_attr_set(&addr, numpages, cachemode2pgprot(_PAGE_CACHE_MODE_UC_MINUS), 0); if (!ret) { ret = change_page_attr_set_clr(&addr, numpages, cachemode2pgprot(_PAGE_CACHE_MODE_WC), __pgprot(_PAGE_CACHE_MASK), 0, 0, NULL); } return ret; } int set_memory_wc(unsigned long addr, int numpages) { int ret; ret = memtype_reserve(__pa(addr), __pa(addr) + numpages * PAGE_SIZE, _PAGE_CACHE_MODE_WC, NULL); if (ret) return ret; ret = _set_memory_wc(addr, numpages); if (ret) memtype_free(__pa(addr), __pa(addr) + numpages * PAGE_SIZE); return ret; } EXPORT_SYMBOL(set_memory_wc); int _set_memory_wt(unsigned long addr, int numpages) { return change_page_attr_set(&addr, numpages, cachemode2pgprot(_PAGE_CACHE_MODE_WT), 0); } int _set_memory_wb(unsigned long addr, int numpages) { /* WB cache mode is hard wired to all cache attribute bits being 0 */ return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_CACHE_MASK), 0); } int set_memory_wb(unsigned long addr, int numpages) { int ret; ret = _set_memory_wb(addr, numpages); if (ret) return ret; memtype_free(__pa(addr), __pa(addr) + numpages * PAGE_SIZE); return 0; } EXPORT_SYMBOL(set_memory_wb); /* Prevent speculative access to a page by marking it not-present */ #ifdef CONFIG_X86_64 int set_mce_nospec(unsigned long pfn) { unsigned long decoy_addr; int rc; /* SGX pages are not in the 1:1 map */ if (arch_is_platform_page(pfn << PAGE_SHIFT)) return 0; /* * We would like to just call: * set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1); * but doing that would radically increase the odds of a * speculative access to the poison page because we'd have * the virtual address of the kernel 1:1 mapping sitting * around in registers. * Instead we get tricky. We create a non-canonical address * that looks just like the one we want, but has bit 63 flipped. 
* This relies on set_memory_XX() properly sanitizing any __pa() * results with __PHYSICAL_MASK or PTE_PFN_MASK. */ decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63)); rc = set_memory_np(decoy_addr, 1); if (rc) pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn); return rc; } EXPORT_SYMBOL_GPL(set_mce_nospec); /* Restore full speculative operation to the pfn. */ int clear_mce_nospec(unsigned long pfn) { unsigned long addr = (unsigned long) pfn_to_kaddr(pfn); return set_memory_p(addr, 1); } EXPORT_SYMBOL_GPL(clear_mce_nospec); #endif /* CONFIG_X86_64 */ int set_memory_x(unsigned long addr, int numpages) { if (!(__supported_pte_mask & _PAGE_NX)) return 0; return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_NX), 0); } int set_memory_nx(unsigned long addr, int numpages) { if (!(__supported_pte_mask & _PAGE_NX)) return 0; return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_NX), 0); } int set_memory_ro(unsigned long addr, int numpages) { return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0); } int set_memory_rox(unsigned long addr, int numpages) { pgprot_t clr = __pgprot(_PAGE_RW | _PAGE_DIRTY); if (__supported_pte_mask & _PAGE_NX) clr.pgprot |= _PAGE_NX; return change_page_attr_set_clr(&addr, numpages, __pgprot(0), clr, 0, CPA_COLLAPSE, NULL); } int set_memory_rw(unsigned long addr, int numpages) { return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_RW), 0); } int set_memory_np(unsigned long addr, int numpages) { return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_PRESENT), 0); } int set_memory_np_noalias(unsigned long addr, int numpages) { return change_page_attr_set_clr(&addr, numpages, __pgprot(0), __pgprot(_PAGE_PRESENT), 0, CPA_NO_CHECK_ALIAS, NULL); } int set_memory_p(unsigned long addr, int numpages) { return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_PRESENT), 0); } int set_memory_4k(unsigned long addr, int numpages) { return change_page_attr_set_clr(&addr, numpages, __pgprot(_PAGE_KERNEL_4K), __pgprot(0), 1, 0, NULL); } int set_memory_nonglobal(unsigned long addr, int numpages) { return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_GLOBAL), 0); } int set_memory_global(unsigned long addr, int numpages) { return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_GLOBAL), 0); } /* * __set_memory_enc_pgtable() is used for the hypervisors that get * informed about "encryption" status via page tables. */ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc) { pgprot_t empty = __pgprot(0); struct cpa_data cpa; int ret; /* Should not be working on unaligned addresses */ if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr)) addr &= PAGE_MASK; memset(&cpa, 0, sizeof(cpa)); cpa.vaddr = &addr; cpa.numpages = numpages; cpa.mask_set = enc ? pgprot_encrypted(empty) : pgprot_decrypted(empty); cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty); cpa.pgd = init_mm.pgd; /* Must avoid aliasing mappings in the highmem code */ kmap_flush_unused(); vm_unmap_aliases(); /* Flush the caches as needed before changing the encryption attribute. */ if (x86_platform.guest.enc_tlb_flush_required(enc)) cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required()); /* Notify hypervisor that we are about to set/clr encryption attribute. 
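 *
 * (Sequence: enc_status_change_prepare(), then the PTE update via
 *  __change_page_attr_set_clr(), a TLB flush via cpa_flush(), and finally
 *  enc_status_change_finish(); a failure of either hypervisor callback ends
 *  up in the "CPA VMM failure" warning below.)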
*/ ret = x86_platform.guest.enc_status_change_prepare(addr, numpages, enc); if (ret) goto vmm_fail; ret = __change_page_attr_set_clr(&cpa, 1); /* * After changing the encryption attribute, we need to flush TLBs again * in case any speculative TLB caching occurred (but no need to flush * caches again). We could just use cpa_flush_all(), but in case TLB * flushing gets optimized in the cpa_flush() path use the same logic * as above. */ cpa_flush(&cpa, 0); if (ret) return ret; /* Notify hypervisor that we have successfully set/clr encryption attribute. */ ret = x86_platform.guest.enc_status_change_finish(addr, numpages, enc); if (ret) goto vmm_fail; return 0; vmm_fail: WARN_ONCE(1, "CPA VMM failure to convert memory (addr=%p, numpages=%d) to %s: %d\n", (void *)addr, numpages, enc ? "private" : "shared", ret); return ret; } /* * The lock serializes conversions between private and shared memory. * * It is taken for read on conversion. A write lock guarantees that no * concurrent conversions are in progress. */ static DECLARE_RWSEM(mem_enc_lock); /* * Stop new private<->shared conversions. * * Taking the exclusive mem_enc_lock waits for in-flight conversions to complete. * The lock is not released to prevent new conversions from being started. */ bool set_memory_enc_stop_conversion(void) { /* * In a crash scenario, sleep is not allowed. Try to take the lock. * Failure indicates that there is a race with the conversion. */ if (oops_in_progress) return down_write_trylock(&mem_enc_lock); down_write(&mem_enc_lock); return true; } static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc) { int ret = 0; if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) { if (!down_read_trylock(&mem_enc_lock)) return -EBUSY; ret = __set_memory_enc_pgtable(addr, numpages, enc); up_read(&mem_enc_lock); } return ret; } int set_memory_encrypted(unsigned long addr, int numpages) { return __set_memory_enc_dec(addr, numpages, true); } EXPORT_SYMBOL_GPL(set_memory_encrypted); int set_memory_decrypted(unsigned long addr, int numpages) { return __set_memory_enc_dec(addr, numpages, false); } EXPORT_SYMBOL_GPL(set_memory_decrypted); int set_pages_uc(struct page *page, int numpages) { unsigned long addr = (unsigned long)page_address(page); return set_memory_uc(addr, numpages); } EXPORT_SYMBOL(set_pages_uc); static int _set_pages_array(struct page **pages, int numpages, enum page_cache_mode new_type) { unsigned long start; unsigned long end; enum page_cache_mode set_type; int i; int free_idx; int ret; for (i = 0; i < numpages; i++) { if (PageHighMem(pages[i])) continue; start = page_to_pfn(pages[i]) << PAGE_SHIFT; end = start + PAGE_SIZE; if (memtype_reserve(start, end, new_type, NULL)) goto err_out; } /* If WC, set to UC- first and then WC */ set_type = (new_type == _PAGE_CACHE_MODE_WC) ? 
_PAGE_CACHE_MODE_UC_MINUS : new_type; ret = cpa_set_pages_array(pages, numpages, cachemode2pgprot(set_type)); if (!ret && new_type == _PAGE_CACHE_MODE_WC) ret = change_page_attr_set_clr(NULL, numpages, cachemode2pgprot( _PAGE_CACHE_MODE_WC), __pgprot(_PAGE_CACHE_MASK), 0, CPA_PAGES_ARRAY, pages); if (ret) goto err_out; return 0; /* Success */ err_out: free_idx = i; for (i = 0; i < free_idx; i++) { if (PageHighMem(pages[i])) continue; start = page_to_pfn(pages[i]) << PAGE_SHIFT; end = start + PAGE_SIZE; memtype_free(start, end); } return -EINVAL; } int set_pages_array_uc(struct page **pages, int numpages) { return _set_pages_array(pages, numpages, _PAGE_CACHE_MODE_UC_MINUS); } EXPORT_SYMBOL(set_pages_array_uc); int set_pages_array_wc(struct page **pages, int numpages) { return _set_pages_array(pages, numpages, _PAGE_CACHE_MODE_WC); } EXPORT_SYMBOL(set_pages_array_wc); int set_pages_wb(struct page *page, int numpages) { unsigned long addr = (unsigned long)page_address(page); return set_memory_wb(addr, numpages); } EXPORT_SYMBOL(set_pages_wb); int set_pages_array_wb(struct page **pages, int numpages) { int retval; unsigned long start; unsigned long end; int i; /* WB cache mode is hard wired to all cache attribute bits being 0 */ retval = cpa_clear_pages_array(pages, numpages, __pgprot(_PAGE_CACHE_MASK)); if (retval) return retval; for (i = 0; i < numpages; i++) { if (PageHighMem(pages[i])) continue; start = page_to_pfn(pages[i]) << PAGE_SHIFT; end = start + PAGE_SIZE; memtype_free(start, end); } return 0; } EXPORT_SYMBOL(set_pages_array_wb); int set_pages_ro(struct page *page, int numpages) { unsigned long addr = (unsigned long)page_address(page); return set_memory_ro(addr, numpages); } int set_pages_rw(struct page *page, int numpages) { unsigned long addr = (unsigned long)page_address(page); return set_memory_rw(addr, numpages); } static int __set_pages_p(struct page *page, int numpages) { unsigned long tempaddr = (unsigned long) page_address(page); struct cpa_data cpa = { .vaddr = &tempaddr, .pgd = NULL, .numpages = numpages, .mask_set = __pgprot(_PAGE_PRESENT | _PAGE_RW), .mask_clr = __pgprot(0), .flags = CPA_NO_CHECK_ALIAS }; /* * No alias checking needed for setting present flag. otherwise, * we may need to break large pages for 64-bit kernel text * mappings (this adds to complexity if we want to do this from * atomic context especially). Let's keep it simple! */ return __change_page_attr_set_clr(&cpa, 1); } static int __set_pages_np(struct page *page, int numpages) { unsigned long tempaddr = (unsigned long) page_address(page); struct cpa_data cpa = { .vaddr = &tempaddr, .pgd = NULL, .numpages = numpages, .mask_set = __pgprot(0), .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY), .flags = CPA_NO_CHECK_ALIAS }; /* * No alias checking needed for setting not present flag. otherwise, * we may need to break large pages for 64-bit kernel text * mappings (this adds to complexity if we want to do this from * atomic context especially). Let's keep it simple! 
*/ return __change_page_attr_set_clr(&cpa, 1); } int set_direct_map_invalid_noflush(struct page *page) { return __set_pages_np(page, 1); } int set_direct_map_default_noflush(struct page *page) { return __set_pages_p(page, 1); } int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid) { if (valid) return __set_pages_p(page, nr); return __set_pages_np(page, nr); } #ifdef CONFIG_DEBUG_PAGEALLOC void __kernel_map_pages(struct page *page, int numpages, int enable) { if (PageHighMem(page)) return; if (!enable) { debug_check_no_locks_freed(page_address(page), numpages * PAGE_SIZE); } /* * The return value is ignored as the calls cannot fail. * Large pages for identity mappings are not used at boot time * and hence no memory allocations during large page split. */ if (enable) __set_pages_p(page, numpages); else __set_pages_np(page, numpages); /* * We should perform an IPI and flush all tlbs, * but that can deadlock->flush only current cpu. * Preemption needs to be disabled around __flush_tlb_all() due to * CR3 reload in __native_flush_tlb(). */ preempt_disable(); __flush_tlb_all(); preempt_enable(); arch_flush_lazy_mmu_mode(); } #endif /* CONFIG_DEBUG_PAGEALLOC */ bool kernel_page_present(struct page *page) { unsigned int level; pte_t *pte; if (PageHighMem(page)) return false; pte = lookup_address((unsigned long)page_address(page), &level); return (pte_val(*pte) & _PAGE_PRESENT); } int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address, unsigned numpages, unsigned long page_flags) { int retval = -EINVAL; struct cpa_data cpa = { .vaddr = &address, .pfn = pfn, .pgd = pgd, .numpages = numpages, .mask_set = __pgprot(0), .mask_clr = __pgprot(~page_flags & (_PAGE_NX|_PAGE_RW|_PAGE_DIRTY)), .flags = CPA_NO_CHECK_ALIAS, }; WARN_ONCE(num_online_cpus() > 1, "Don't call after initializing SMP"); if (!(__supported_pte_mask & _PAGE_NX)) goto out; if (!(page_flags & _PAGE_ENC)) cpa.mask_clr = pgprot_encrypted(cpa.mask_clr); cpa.mask_set = __pgprot(_PAGE_PRESENT | page_flags); retval = __change_page_attr_set_clr(&cpa, 1); __flush_tlb_all(); out: return retval; } /* * __flush_tlb_all() flushes mappings only on current CPU and hence this * function shouldn't be used in an SMP environment. Presently, it's used only * during boot (way before smp_init()) by EFI subsystem and hence is ok. */ int __init kernel_unmap_pages_in_pgd(pgd_t *pgd, unsigned long address, unsigned long numpages) { int retval; /* * The typical sequence for unmapping is to find a pte through * lookup_address_in_pgd() (ideally, it should never return NULL because * the address is already mapped) and change its protections. As pfn is * the *target* of a mapping, it's not useful while unmapping. */ struct cpa_data cpa = { .vaddr = &address, .pfn = 0, .pgd = pgd, .numpages = numpages, .mask_set = __pgprot(0), .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY), .flags = CPA_NO_CHECK_ALIAS, }; WARN_ONCE(num_online_cpus() > 1, "Don't call after initializing SMP"); retval = __change_page_attr_set_clr(&cpa, 1); __flush_tlb_all(); return retval; } /* * The testcases use internal knowledge of the implementation that shouldn't * be exposed to the rest of the kernel. Include these directly here. */ #ifdef CONFIG_CPA_DEBUG #include "cpa-test.c" #endif |
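/*
 * A minimal usage sketch: how a hypothetical caller could pair set_memory_uc()
 * with set_memory_wb() on a direct-mapped buffer. The example_* names are
 * hypothetical and not part of the file above; the point shown is the contract
 * above: set_memory_uc() reserves the memtype and rewrites the PTEs, and
 * set_memory_wb() must undo both before the pages are freed.
 */
#include <linux/gfp.h>
#include <asm/set_memory.h>

static unsigned long example_buf;	/* hypothetical buffer */

static int example_map_uncached(void)
{
	int ret;

	/* Two direct-mapped pages that will be accessed uncached */
	example_buf = __get_free_pages(GFP_KERNEL, 1);
	if (!example_buf)
		return -ENOMEM;

	ret = set_memory_uc(example_buf, 2);
	if (ret) {
		free_pages(example_buf, 1);
		return ret;
	}
	return 0;
}

static void example_unmap_uncached(void)
{
	/* Restore WB (and release the memtype reservation) before freeing */
	set_memory_wb(example_buf, 2);
	free_pages(example_buf, 1);
}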
// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2008 Red Hat, Inc., Eric Paris <eparis@redhat.com> */ /* * Basic idea behind the notification queue: An fsnotify group (like inotify) * sends the userspace notification about events asynchronously some time after * the event happened. When inotify gets an event it will need to add that * event to the group notify queue. Since a single event might need to be on * multiple group's notification queues we can't add the event directly to each * queue and instead add a small "event_holder" to each queue. This event_holder * has a pointer back to the original event. Since the majority of events are * going to end up on one, and only one, notification queue we embed one * event_holder into each event. This means we have a single allocation instead * of always needing two. If the embedded event_holder is already in use by * another group a new event_holder (from fsnotify_event_holder_cachep) will be * allocated and used. */ #include <linux/fs.h> #include <linux/init.h> #include <linux/kernel.h> #include <linux/list.h> #include <linux/module.h> #include <linux/mount.h> #include <linux/mutex.h> #include <linux/namei.h> #include <linux/path.h> #include <linux/slab.h> #include <linux/spinlock.h> #include <linux/atomic.h> #include <linux/fsnotify_backend.h> #include "fsnotify.h" static atomic_t fsnotify_sync_cookie = ATOMIC_INIT(0); /** * fsnotify_get_cookie - return a unique cookie for use in synchronizing events. * Called from fsnotify_move, which is inlined into filesystem modules. */ u32 fsnotify_get_cookie(void) { return atomic_inc_return(&fsnotify_sync_cookie); } EXPORT_SYMBOL_GPL(fsnotify_get_cookie); void fsnotify_destroy_event(struct fsnotify_group *group, struct fsnotify_event *event) { /* Overflow events are per-group and we don't want to free them */ if (!event || event == group->overflow_event) return; /* * If the event is still queued, we have a problem... Do an unreliable * lockless check first to avoid locking in the common case. The * locking may be necessary for permission events which got removed * from the list by a different CPU than the one freeing the event. */ if (!list_empty(&event->list)) { spin_lock(&group->notification_lock); WARN_ON(!list_empty(&event->list)); spin_unlock(&group->notification_lock); } group->ops->free_event(group, event); } /* * Try to add an event to the notification queue. * The group can later pull this event off the queue to deal with. * The group can use the @merge hook to merge the event with a queued event. * The group can use the @insert hook to insert the event into hash table.
* The function returns: * 0 if the event was added to a queue * 1 if the event was merged with some other queued event * 2 if the event was not queued - either the queue of events has overflown * or the group is shutting down. */ int fsnotify_insert_event(struct fsnotify_group *group, struct fsnotify_event *event, int (*merge)(struct fsnotify_group *, struct fsnotify_event *), void (*insert)(struct fsnotify_group *, struct fsnotify_event *)) { int ret = 0; struct list_head *list = &group->notification_list; pr_debug("%s: group=%p event=%p\n", __func__, group, event); spin_lock(&group->notification_lock); if (group->shutdown) { spin_unlock(&group->notification_lock); return 2; } if (event == group->overflow_event || group->q_len >= group->max_events) { ret = 2; /* Queue overflow event only if it isn't already queued */ if (!list_empty(&group->overflow_event->list)) { spin_unlock(&group->notification_lock); return ret; } event = group->overflow_event; goto queue; } if (!list_empty(list) && merge) { ret = merge(group, event); if (ret) { spin_unlock(&group->notification_lock); return ret; } } queue: group->q_len++; list_add_tail(&event->list, list); if (insert) insert(group, event); spin_unlock(&group->notification_lock); wake_up(&group->notification_waitq); kill_fasync(&group->fsn_fa, SIGIO, POLL_IN); return ret; } void fsnotify_remove_queued_event(struct fsnotify_group *group, struct fsnotify_event *event) { assert_spin_locked(&group->notification_lock); /* * We need to init list head for the case of overflow event so that * check in fsnotify_add_event() works */ list_del_init(&event->list); group->q_len--; } /* * Return the first event on the notification list without removing it. * Returns NULL if the list is empty. */ struct fsnotify_event *fsnotify_peek_first_event(struct fsnotify_group *group) { assert_spin_locked(&group->notification_lock); if (fsnotify_notify_queue_is_empty(group)) return NULL; return list_first_entry(&group->notification_list, struct fsnotify_event, list); } /* * Remove and return the first event from the notification list. It is the * responsibility of the caller to destroy the obtained event */ struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group) { struct fsnotify_event *event = fsnotify_peek_first_event(group); if (!event) return NULL; pr_debug("%s: group=%p event=%p\n", __func__, group, event); fsnotify_remove_queued_event(group, event); return event; } /* * Called when a group is being torn down to clean up any outstanding * event notifications. */ void fsnotify_flush_notify(struct fsnotify_group *group) { struct fsnotify_event *event; spin_lock(&group->notification_lock); while (!fsnotify_notify_queue_is_empty(group)) { event = fsnotify_remove_first_event(group); spin_unlock(&group->notification_lock); fsnotify_destroy_event(group, event); spin_lock(&group->notification_lock); } spin_unlock(&group->notification_lock); } |
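/*
 * A minimal sketch of the caller side: how a hypothetical backend could queue
 * an event it has allocated, following the return-value contract of
 * fsnotify_insert_event() documented above (0 = queued, 1 = merged,
 * 2 = dropped because of queue overflow or group shutdown). The example_*
 * name is hypothetical; real backends such as inotify also pass merge and
 * insert callbacks instead of NULL.
 */
#include <linux/fsnotify_backend.h>

static int example_queue_event(struct fsnotify_group *group,
			       struct fsnotify_event *event)
{
	int ret;

	ret = fsnotify_insert_event(group, event, NULL, NULL);
	if (ret) {
		/* Merged or dropped: @event was not queued, so free it here */
		fsnotify_destroy_event(group, event);
	}
	return ret;
}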
// SPDX-License-Identifier: GPL-2.0-only /* * This only handles 32bit MTRR on 32bit hosts. This is strictly wrong * because MTRRs can span up to 40 bits (36bits on most modern x86) */ #include <linux/export.h> #include <linux/init.h> #include <linux/io.h> #include <linux/mm.h> #include <linux/cc_platform.h> #include <linux/string_choices.h> #include <asm/processor-flags.h> #include <asm/cacheinfo.h> #include <asm/cpufeature.h> #include <asm/cpu_device_id.h> #include <asm/hypervisor.h> #include <asm/mshyperv.h> #include <asm/tlbflush.h> #include <asm/mtrr.h> #include <asm/msr.h> #include <asm/memtype.h> #include "mtrr.h" struct fixed_range_block { int base_msr; /* start address of an MTRR block */ int ranges; /* number of MTRRs in this block */ }; static struct fixed_range_block fixed_range_blocks[] = { { MSR_MTRRfix64K_00000, 1 }, /* one 64k MTRR */ { MSR_MTRRfix16K_80000, 2 }, /* two 16k MTRRs */ { MSR_MTRRfix4K_C0000, 8 }, /* eight 4k MTRRs */ {} }; struct cache_map { u64 start; u64 end; u64 flags; u64 type:8; u64 fixed:1; }; bool mtrr_debug; static int __init mtrr_param_setup(char *str) { int rc = 0; if (!str) return -EINVAL; if (!strcmp(str, "debug")) mtrr_debug = true; else rc = -EINVAL; return rc; } early_param("mtrr", mtrr_param_setup); /* * CACHE_MAP_MAX is the maximum number of memory ranges in cache_map, where * no 2 adjacent ranges have the same cache mode (those would be merged). * The number is based on the worst case: * - no two adjacent fixed MTRRs share the same cache mode * - one variable MTRR is spanning a huge area with mode WB * - 255 variable MTRRs with mode UC all overlap with the WB MTRR, creating 2 * additional ranges each (result like "ababababa...aba" with a = WB, b = UC), * accounting for MTRR_MAX_VAR_RANGES * 2 - 1 range entries * - a TOP_MEM2 area (even with overlapping an UC MTRR can't add 2 range entries * to the possible maximum, as it always starts at 4GB, thus it can't be in * the middle of that MTRR, unless that MTRR starts at 0, which would remove * the initial "a" from the "abababa" pattern above) * The map won't contain ranges with no matching MTRR (those fall back to the * default cache mode). */ #define CACHE_MAP_MAX (MTRR_NUM_FIXED_RANGES + MTRR_MAX_VAR_RANGES * 2) static struct cache_map init_cache_map[CACHE_MAP_MAX] __initdata; static struct cache_map *cache_map __refdata = init_cache_map; static unsigned int cache_map_size = CACHE_MAP_MAX; static unsigned int cache_map_n; static unsigned int cache_map_fixed; static unsigned long smp_changes_mask; static int mtrr_state_set; u64 mtrr_tom2; struct mtrr_state_type mtrr_state; /* Reserved bits in the high portion of the MTRRphysBaseN MSR.
*/ u32 phys_hi_rsvd; /* * BIOS is expected to clear MtrrFixDramModEn bit, see for example * "BIOS and Kernel Developer's Guide for the AMD Athlon 64 and AMD * Opteron Processors" (26094 Rev. 3.30 February 2006), section * "13.2.1.2 SYSCFG Register": "The MtrrFixDramModEn bit should be set * to 1 during BIOS initialization of the fixed MTRRs, then cleared to * 0 for operation." */ static inline void k8_check_syscfg_dram_mod_en(void) { u32 lo, hi; if (!((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) && (boot_cpu_data.x86 >= 0x0f))) return; if (cc_platform_has(CC_ATTR_HOST_SEV_SNP)) return; rdmsr(MSR_AMD64_SYSCFG, lo, hi); if (lo & K8_MTRRFIXRANGE_DRAM_MODIFY) { pr_err(FW_WARN "MTRR: CPU %u: SYSCFG[MtrrFixDramModEn]" " not cleared by BIOS, clearing this bit\n", smp_processor_id()); lo &= ~K8_MTRRFIXRANGE_DRAM_MODIFY; mtrr_wrmsr(MSR_AMD64_SYSCFG, lo, hi); } } /* Get the size of contiguous MTRR range */ static u64 get_mtrr_size(u64 mask) { u64 size; mask |= (u64)phys_hi_rsvd << 32; size = -mask; return size; } static u8 get_var_mtrr_state(unsigned int reg, u64 *start, u64 *size) { struct mtrr_var_range *mtrr = mtrr_state.var_ranges + reg; if (!(mtrr->mask_lo & MTRR_PHYSMASK_V)) return MTRR_TYPE_INVALID; *start = (((u64)mtrr->base_hi) << 32) + (mtrr->base_lo & PAGE_MASK); *size = get_mtrr_size((((u64)mtrr->mask_hi) << 32) + (mtrr->mask_lo & PAGE_MASK)); return mtrr->base_lo & MTRR_PHYSBASE_TYPE; } static u8 get_effective_type(u8 type1, u8 type2) { if (type1 == MTRR_TYPE_UNCACHABLE || type2 == MTRR_TYPE_UNCACHABLE) return MTRR_TYPE_UNCACHABLE; if ((type1 == MTRR_TYPE_WRBACK && type2 == MTRR_TYPE_WRTHROUGH) || (type1 == MTRR_TYPE_WRTHROUGH && type2 == MTRR_TYPE_WRBACK)) return MTRR_TYPE_WRTHROUGH; if (type1 != type2) return MTRR_TYPE_UNCACHABLE; return type1; } static void rm_map_entry_at(int idx) { cache_map_n--; if (cache_map_n > idx) { memmove(cache_map + idx, cache_map + idx + 1, sizeof(*cache_map) * (cache_map_n - idx)); } } /* * Add an entry into cache_map at a specific index. Merges adjacent entries if * appropriate. Return the number of merges for correcting the scan index * (this is needed as merging will reduce the number of entries, which will * result in skipping entries in future iterations if the scan index isn't * corrected). * Note that the corrected index can never go below -1 (resulting in being 0 in * the next scan iteration), as "2" is returned only if the current index is * larger than zero. */ static int add_map_entry_at(u64 start, u64 end, u8 type, int idx) { bool merge_prev = false, merge_next = false; if (start >= end) return 0; if (idx > 0) { struct cache_map *prev = cache_map + idx - 1; if (!prev->fixed && start == prev->end && type == prev->type) merge_prev = true; } if (idx < cache_map_n) { struct cache_map *next = cache_map + idx; if (!next->fixed && end == next->start && type == next->type) merge_next = true; } if (merge_prev && merge_next) { cache_map[idx - 1].end = cache_map[idx].end; rm_map_entry_at(idx); return 2; } if (merge_prev) { cache_map[idx - 1].end = end; return 1; } if (merge_next) { cache_map[idx].start = start; return 1; } /* Sanity check: the array should NEVER be too small! 
*/ if (cache_map_n == cache_map_size) { WARN(1, "MTRR cache mode memory map exhausted!\n"); cache_map_n = cache_map_fixed; return 0; } if (cache_map_n > idx) { memmove(cache_map + idx + 1, cache_map + idx, sizeof(*cache_map) * (cache_map_n - idx)); } cache_map[idx].start = start; cache_map[idx].end = end; cache_map[idx].type = type; cache_map[idx].fixed = 0; cache_map_n++; return 0; } /* Clear a part of an entry. Return 1 if start of entry is still valid. */ static int clr_map_range_at(u64 start, u64 end, int idx) { int ret = start != cache_map[idx].start; u64 tmp; if (start == cache_map[idx].start && end == cache_map[idx].end) { rm_map_entry_at(idx); } else if (start == cache_map[idx].start) { cache_map[idx].start = end; } else if (end == cache_map[idx].end) { cache_map[idx].end = start; } else { tmp = cache_map[idx].end; cache_map[idx].end = start; add_map_entry_at(end, tmp, cache_map[idx].type, idx + 1); } return ret; } /* * Add MTRR to the map. The current map is scanned and each part of the MTRR * either overlapping with an existing entry or with a hole in the map is * handled separately. */ static void add_map_entry(u64 start, u64 end, u8 type) { u8 new_type, old_type; u64 tmp; int i; for (i = 0; i < cache_map_n && start < end; i++) { if (start >= cache_map[i].end) continue; if (start < cache_map[i].start) { /* Region start has no overlap. */ tmp = min(end, cache_map[i].start); i -= add_map_entry_at(start, tmp, type, i); start = tmp; continue; } new_type = get_effective_type(type, cache_map[i].type); old_type = cache_map[i].type; if (cache_map[i].fixed || new_type == old_type) { /* Cut off start of new entry. */ start = cache_map[i].end; continue; } /* Handle only overlapping part of region. */ tmp = min(end, cache_map[i].end); i += clr_map_range_at(start, tmp, i); i -= add_map_entry_at(start, tmp, new_type, i); start = tmp; } /* Add rest of region after last map entry (rest might be empty). */ add_map_entry_at(start, end, type, i); } /* Add variable MTRRs to cache map. */ static void map_add_var(void) { u64 start, size; unsigned int i; u8 type; /* * Add AMD TOP_MEM2 area. Can't be added in mtrr_build_map(), as it * needs to be added again when rebuilding the map due to potentially * having moved as a result of variable MTRRs for memory below 4GB. */ if (mtrr_tom2) { add_map_entry(BIT_ULL(32), mtrr_tom2, MTRR_TYPE_WRBACK); cache_map[cache_map_n - 1].fixed = 1; } for (i = 0; i < num_var_ranges; i++) { type = get_var_mtrr_state(i, &start, &size); if (type != MTRR_TYPE_INVALID) add_map_entry(start, start + size, type); } } /* * Rebuild map by replacing variable entries. Needs to be called when MTRR * registers are being changed after boot, as such changes could include * removals of registers, which are complicated to handle without rebuild of * the map. */ void generic_rebuild_map(void) { if (mtrr_if != &generic_mtrr_ops) return; cache_map_n = cache_map_fixed; map_add_var(); } static unsigned int __init get_cache_map_size(void) { return cache_map_fixed + 2 * num_var_ranges + (mtrr_tom2 != 0); } /* Build the cache_map containing the cache modes per memory range. */ void __init mtrr_build_map(void) { u64 start, end, size; unsigned int i; u8 type; /* Add fixed MTRRs, optimize for adjacent entries with same type. */ if (mtrr_state.enabled & MTRR_STATE_MTRR_FIXED_ENABLED) { /* * Start with 64k size fixed entries, preset 1st one (hence the * loop below is starting with index 1). 
*/ start = 0; end = size = 0x10000; type = mtrr_state.fixed_ranges[0]; for (i = 1; i < MTRR_NUM_FIXED_RANGES; i++) { /* 8 64k entries, then 16 16k ones, rest 4k. */ if (i == 8 || i == 24) size >>= 2; if (mtrr_state.fixed_ranges[i] != type) { add_map_entry(start, end, type); start = end; type = mtrr_state.fixed_ranges[i]; } end += size; } add_map_entry(start, end, type); } /* Mark fixed, they take precedence. */ for (i = 0; i < cache_map_n; i++) cache_map[i].fixed = 1; cache_map_fixed = cache_map_n; map_add_var(); pr_info("MTRR map: %u entries (%u fixed + %u variable; max %u), built from %u variable MTRRs\n", cache_map_n, cache_map_fixed, cache_map_n - cache_map_fixed, get_cache_map_size(), num_var_ranges + (mtrr_tom2 != 0)); if (mtrr_debug) { for (i = 0; i < cache_map_n; i++) { pr_info("%3u: %016llx-%016llx %s\n", i, cache_map[i].start, cache_map[i].end - 1, mtrr_attrib_to_str(cache_map[i].type)); } } } /* Copy the cache_map from __initdata memory to dynamically allocated one. */ void __init mtrr_copy_map(void) { unsigned int new_size = get_cache_map_size(); if (!mtrr_state.enabled || !new_size) { cache_map = NULL; return; } mutex_lock(&mtrr_mutex); cache_map = kcalloc(new_size, sizeof(*cache_map), GFP_KERNEL); if (cache_map) { memmove(cache_map, init_cache_map, cache_map_n * sizeof(*cache_map)); cache_map_size = new_size; } else { mtrr_state.enabled = 0; pr_err("MTRRs disabled due to allocation failure for lookup map.\n"); } mutex_unlock(&mtrr_mutex); } /** * guest_force_mtrr_state - set static MTRR state for a guest * * Used to set MTRR state via different means (e.g. with data obtained from * a hypervisor). * Is allowed only for special cases when running virtualized. Must be called * from the x86_init.hyper.init_platform() hook. It can be called only once. * The MTRR state can't be changed afterwards. To ensure that, X86_FEATURE_MTRR * is cleared. * * @var: MTRR variable range array to use * @num_var: length of the @var array * @def_type: default caching type */ void guest_force_mtrr_state(struct mtrr_var_range *var, unsigned int num_var, mtrr_type def_type) { unsigned int i; /* Only allowed to be called once before mtrr_bp_init(). */ if (WARN_ON_ONCE(mtrr_state_set)) return; /* Only allowed when running virtualized. */ if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR)) return; /* * Only allowed for special virtualization cases: * - when running as Hyper-V, SEV-SNP guest using vTOM * - when running as Xen PV guest * - when running as SEV-SNP or TDX guest to avoid unnecessary * VMM communication/Virtualization exceptions (#VC, #VE) */ if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && !hv_is_isolation_supported() && !cpu_feature_enabled(X86_FEATURE_XENPV) && !cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) return; /* Disable MTRR in order to disable MTRR modifications. 
*/ setup_clear_cpu_cap(X86_FEATURE_MTRR); if (var) { if (num_var > MTRR_MAX_VAR_RANGES) { pr_warn("Trying to overwrite MTRR state with %u variable entries\n", num_var); num_var = MTRR_MAX_VAR_RANGES; } for (i = 0; i < num_var; i++) mtrr_state.var_ranges[i] = var[i]; num_var_ranges = num_var; } mtrr_state.def_type = def_type; mtrr_state.enabled |= MTRR_STATE_MTRR_ENABLED; mtrr_state_set = 1; } static u8 type_merge(u8 type, u8 new_type, u8 *uniform) { u8 effective_type; if (type == MTRR_TYPE_INVALID) return new_type; effective_type = get_effective_type(type, new_type); if (type != effective_type) *uniform = 0; return effective_type; } /** * mtrr_type_lookup - look up memory type in MTRR * * @start: Begin of the physical address range * @end: End of the physical address range * @uniform: output argument: * - 1: the returned MTRR type is valid for the whole region * - 0: otherwise * * Return Values: * MTRR_TYPE_(type) - The effective MTRR type for the region * MTRR_TYPE_INVALID - MTRR is disabled */ u8 mtrr_type_lookup(u64 start, u64 end, u8 *uniform) { u8 type = MTRR_TYPE_INVALID; unsigned int i; if (!mtrr_state_set) { /* Uniformity is unknown. */ *uniform = 0; return MTRR_TYPE_UNCACHABLE; } *uniform = 1; if (!(mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED)) return MTRR_TYPE_UNCACHABLE; for (i = 0; i < cache_map_n && start < end; i++) { /* Region after current map entry? -> continue with next one. */ if (start >= cache_map[i].end) continue; /* Start of region not covered by current map entry? */ if (start < cache_map[i].start) { /* At least some part of region has default type. */ type = type_merge(type, mtrr_state.def_type, uniform); /* End of region not covered, too? -> lookup done. */ if (end <= cache_map[i].start) return type; } /* At least part of region covered by map entry. */ type = type_merge(type, cache_map[i].type, uniform); start = cache_map[i].end; } /* End of region past last entry in map? -> use default type. 
*/ if (start < end) type = type_merge(type, mtrr_state.def_type, uniform); return type; } /* Get the MSR pair relating to a var range */ static void get_mtrr_var_range(unsigned int index, struct mtrr_var_range *vr) { rdmsr(MTRRphysBase_MSR(index), vr->base_lo, vr->base_hi); rdmsr(MTRRphysMask_MSR(index), vr->mask_lo, vr->mask_hi); } /* Fill the MSR pair relating to a var range */ void fill_mtrr_var_range(unsigned int index, u32 base_lo, u32 base_hi, u32 mask_lo, u32 mask_hi) { struct mtrr_var_range *vr; vr = mtrr_state.var_ranges; vr[index].base_lo = base_lo; vr[index].base_hi = base_hi; vr[index].mask_lo = mask_lo; vr[index].mask_hi = mask_hi; } static void get_fixed_ranges(mtrr_type *frs) { unsigned int *p = (unsigned int *)frs; int i; k8_check_syscfg_dram_mod_en(); rdmsr(MSR_MTRRfix64K_00000, p[0], p[1]); for (i = 0; i < 2; i++) rdmsr(MSR_MTRRfix16K_80000 + i, p[2 + i * 2], p[3 + i * 2]); for (i = 0; i < 8; i++) rdmsr(MSR_MTRRfix4K_C0000 + i, p[6 + i * 2], p[7 + i * 2]); } void mtrr_save_fixed_ranges(void *info) { if (mtrr_state.have_fixed) get_fixed_ranges(mtrr_state.fixed_ranges); } static unsigned __initdata last_fixed_start; static unsigned __initdata last_fixed_end; static mtrr_type __initdata last_fixed_type; static void __init print_fixed_last(void) { if (!last_fixed_end) return; pr_info(" %05X-%05X %s\n", last_fixed_start, last_fixed_end - 1, mtrr_attrib_to_str(last_fixed_type)); last_fixed_end = 0; } static void __init update_fixed_last(unsigned base, unsigned end, mtrr_type type) { last_fixed_start = base; last_fixed_end = end; last_fixed_type = type; } static void __init print_fixed(unsigned base, unsigned step, const mtrr_type *types) { unsigned i; for (i = 0; i < 8; ++i, ++types, base += step) { if (last_fixed_end == 0) { update_fixed_last(base, base + step, *types); continue; } if (last_fixed_end == base && last_fixed_type == *types) { last_fixed_end = base + step; continue; } /* new segments: gap or different type */ print_fixed_last(); update_fixed_last(base, base + step, *types); } } static void __init print_mtrr_state(void) { unsigned int i; int high_width; pr_info("MTRR default type: %s\n", mtrr_attrib_to_str(mtrr_state.def_type)); if (mtrr_state.have_fixed) { pr_info("MTRR fixed ranges %s:\n", str_enabled_disabled( (mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED) && (mtrr_state.enabled & MTRR_STATE_MTRR_FIXED_ENABLED))); print_fixed(0x00000, 0x10000, mtrr_state.fixed_ranges + 0); for (i = 0; i < 2; ++i) print_fixed(0x80000 + i * 0x20000, 0x04000, mtrr_state.fixed_ranges + (i + 1) * 8); for (i = 0; i < 8; ++i) print_fixed(0xC0000 + i * 0x08000, 0x01000, mtrr_state.fixed_ranges + (i + 3) * 8); /* tail */ print_fixed_last(); } pr_info("MTRR variable ranges %s:\n", str_enabled_disabled(mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED)); high_width = (boot_cpu_data.x86_phys_bits - (32 - PAGE_SHIFT) + 3) / 4; for (i = 0; i < num_var_ranges; ++i) { if (mtrr_state.var_ranges[i].mask_lo & MTRR_PHYSMASK_V) pr_info(" %u base %0*X%05X000 mask %0*X%05X000 %s\n", i, high_width, mtrr_state.var_ranges[i].base_hi, mtrr_state.var_ranges[i].base_lo >> 12, high_width, mtrr_state.var_ranges[i].mask_hi, mtrr_state.var_ranges[i].mask_lo >> 12, mtrr_attrib_to_str(mtrr_state.var_ranges[i].base_lo & MTRR_PHYSBASE_TYPE)); else pr_info(" %u disabled\n", i); } if (mtrr_tom2) pr_info("TOM2: %016llx aka %lldM\n", mtrr_tom2, mtrr_tom2>>20); } /* Grab all of the MTRR state for this CPU into *state */ bool __init get_mtrr_state(void) { struct mtrr_var_range *vrs; unsigned lo, dummy; unsigned int i; vrs = 
mtrr_state.var_ranges; rdmsr(MSR_MTRRcap, lo, dummy); mtrr_state.have_fixed = lo & MTRR_CAP_FIX; for (i = 0; i < num_var_ranges; i++) get_mtrr_var_range(i, &vrs[i]); if (mtrr_state.have_fixed) get_fixed_ranges(mtrr_state.fixed_ranges); rdmsr(MSR_MTRRdefType, lo, dummy); mtrr_state.def_type = lo & MTRR_DEF_TYPE_TYPE; mtrr_state.enabled = (lo & MTRR_DEF_TYPE_ENABLE) >> MTRR_STATE_SHIFT; if (amd_special_default_mtrr()) { unsigned low, high; /* TOP_MEM2 */ rdmsr(MSR_K8_TOP_MEM2, low, high); mtrr_tom2 = high; mtrr_tom2 <<= 32; mtrr_tom2 |= low; mtrr_tom2 &= 0xffffff800000ULL; } if (mtrr_debug) print_mtrr_state(); mtrr_state_set = 1; return !!(mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED); } /* Some BIOS's are messed up and don't set all MTRRs the same! */ void __init mtrr_state_warn(void) { unsigned long mask = smp_changes_mask; if (!mask) return; if (mask & MTRR_CHANGE_MASK_FIXED) pr_warn("mtrr: your CPUs had inconsistent fixed MTRR settings\n"); if (mask & MTRR_CHANGE_MASK_VARIABLE) pr_warn("mtrr: your CPUs had inconsistent variable MTRR settings\n"); if (mask & MTRR_CHANGE_MASK_DEFTYPE) pr_warn("mtrr: your CPUs had inconsistent MTRRdefType settings\n"); pr_info("mtrr: probably your BIOS does not setup all CPUs.\n"); pr_info("mtrr: corrected configuration.\n"); } /* * Doesn't attempt to pass an error out to MTRR users * because it's quite complicated in some cases and probably not * worth it because the best error handling is to ignore it. */ void mtrr_wrmsr(unsigned msr, unsigned a, unsigned b) { if (wrmsr_safe(msr, a, b) < 0) { pr_err("MTRR: CPU %u: Writing MSR %x to %x:%x failed\n", smp_processor_id(), msr, a, b); } } /** * set_fixed_range - checks & updates a fixed-range MTRR if it * differs from the value it should have * @msr: MSR address of the MTTR which should be checked and updated * @changed: pointer which indicates whether the MTRR needed to be changed * @msrwords: pointer to the MSR values which the MSR should have */ static void set_fixed_range(int msr, bool *changed, unsigned int *msrwords) { unsigned lo, hi; rdmsr(msr, lo, hi); if (lo != msrwords[0] || hi != msrwords[1]) { mtrr_wrmsr(msr, msrwords[0], msrwords[1]); *changed = true; } } /** * generic_get_free_region - Get a free MTRR. * @base: The starting (base) address of the region. * @size: The size (in bytes) of the region. * @replace_reg: mtrr index to be replaced; set to invalid value if none. * * Returns: The index of the region on success, else negative on error. */ int generic_get_free_region(unsigned long base, unsigned long size, int replace_reg) { unsigned long lbase, lsize; mtrr_type ltype; int i, max; max = num_var_ranges; if (replace_reg >= 0 && replace_reg < max) return replace_reg; for (i = 0; i < max; ++i) { mtrr_if->get(i, &lbase, &lsize, <ype); if (lsize == 0) return i; } return -ENOSPC; } static void generic_get_mtrr(unsigned int reg, unsigned long *base, unsigned long *size, mtrr_type *type) { u32 mask_lo, mask_hi, base_lo, base_hi; unsigned int hi; u64 tmp, mask; /* * get_mtrr doesn't need to update mtrr_state, also it could be called * from any cpu, so try to print it out directly. */ get_cpu(); rdmsr(MTRRphysMask_MSR(reg), mask_lo, mask_hi); if (!(mask_lo & MTRR_PHYSMASK_V)) { /* Invalid (i.e. 
free) range */ *base = 0; *size = 0; *type = 0; goto out_put_cpu; } rdmsr(MTRRphysBase_MSR(reg), base_lo, base_hi); /* Work out the shifted address mask: */ tmp = (u64)mask_hi << 32 | (mask_lo & PAGE_MASK); mask = (u64)phys_hi_rsvd << 32 | tmp; /* Expand tmp with high bits to all 1s: */ hi = fls64(tmp); if (hi > 0) { tmp |= ~((1ULL<<(hi - 1)) - 1); if (tmp != mask) { pr_warn("mtrr: your BIOS has configured an incorrect mask, fixing it.\n"); add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK); mask = tmp; } } /* * This works correctly if size is a power of two, i.e. a * contiguous range: */ *size = -mask >> PAGE_SHIFT; *base = (u64)base_hi << (32 - PAGE_SHIFT) | base_lo >> PAGE_SHIFT; *type = base_lo & MTRR_PHYSBASE_TYPE; out_put_cpu: put_cpu(); } /** * set_fixed_ranges - checks & updates the fixed-range MTRRs if they * differ from the saved set * @frs: pointer to fixed-range MTRR values, saved by get_fixed_ranges() */ static int set_fixed_ranges(mtrr_type *frs) { unsigned long long *saved = (unsigned long long *)frs; bool changed = false; int block = -1, range; k8_check_syscfg_dram_mod_en(); while (fixed_range_blocks[++block].ranges) { for (range = 0; range < fixed_range_blocks[block].ranges; range++) set_fixed_range(fixed_range_blocks[block].base_msr + range, &changed, (unsigned int *)saved++); } return changed; } /* * Set the MSR pair relating to a var range. * Returns true if changes are made. */ static bool set_mtrr_var_ranges(unsigned int index, struct mtrr_var_range *vr) { unsigned int lo, hi; bool changed = false; rdmsr(MTRRphysBase_MSR(index), lo, hi); if ((vr->base_lo & ~MTRR_PHYSBASE_RSVD) != (lo & ~MTRR_PHYSBASE_RSVD) || (vr->base_hi & ~phys_hi_rsvd) != (hi & ~phys_hi_rsvd)) { mtrr_wrmsr(MTRRphysBase_MSR(index), vr->base_lo, vr->base_hi); changed = true; } rdmsr(MTRRphysMask_MSR(index), lo, hi); if ((vr->mask_lo & ~MTRR_PHYSMASK_RSVD) != (lo & ~MTRR_PHYSMASK_RSVD) || (vr->mask_hi & ~phys_hi_rsvd) != (hi & ~phys_hi_rsvd)) { mtrr_wrmsr(MTRRphysMask_MSR(index), vr->mask_lo, vr->mask_hi); changed = true; } return changed; } static u32 deftype_lo, deftype_hi; /** * set_mtrr_state - Set the MTRR state for this CPU. * * NOTE: The CPU must already be in a safe state for MTRR changes, including * measures that only a single CPU can be active in set_mtrr_state() in * order to not be subject to races for usage of deftype_lo. This is * accomplished by taking cache_disable_lock. * RETURNS: 0 if no changes made, else a mask indicating what was changed. 
*/ static unsigned long set_mtrr_state(void) { unsigned long change_mask = 0; unsigned int i; for (i = 0; i < num_var_ranges; i++) { if (set_mtrr_var_ranges(i, &mtrr_state.var_ranges[i])) change_mask |= MTRR_CHANGE_MASK_VARIABLE; } if (mtrr_state.have_fixed && set_fixed_ranges(mtrr_state.fixed_ranges)) change_mask |= MTRR_CHANGE_MASK_FIXED; /* * Set_mtrr_restore restores the old value of MTRRdefType, * so to set it we fiddle with the saved value: */ if ((deftype_lo & MTRR_DEF_TYPE_TYPE) != mtrr_state.def_type || ((deftype_lo & MTRR_DEF_TYPE_ENABLE) >> MTRR_STATE_SHIFT) != mtrr_state.enabled) { deftype_lo = (deftype_lo & MTRR_DEF_TYPE_DISABLE) | mtrr_state.def_type | (mtrr_state.enabled << MTRR_STATE_SHIFT); change_mask |= MTRR_CHANGE_MASK_DEFTYPE; } return change_mask; } void mtrr_disable(void) { /* Save MTRR state */ rdmsr(MSR_MTRRdefType, deftype_lo, deftype_hi); /* Disable MTRRs, and set the default type to uncached */ mtrr_wrmsr(MSR_MTRRdefType, deftype_lo & MTRR_DEF_TYPE_DISABLE, deftype_hi); } void mtrr_enable(void) { /* Intel (P6) standard MTRRs */ mtrr_wrmsr(MSR_MTRRdefType, deftype_lo, deftype_hi); } void mtrr_generic_set_state(void) { unsigned long mask, count; /* Actually set the state */ mask = set_mtrr_state(); /* Use the atomic bitops to update the global mask */ for (count = 0; count < sizeof(mask) * 8; ++count) { if (mask & 0x01) set_bit(count, &smp_changes_mask); mask >>= 1; } } /** * generic_set_mtrr - set variable MTRR register on the local CPU. * * @reg: The register to set. * @base: The base address of the region. * @size: The size of the region. If this is 0 the region is disabled. * @type: The type of the region. * * Returns nothing. */ static void generic_set_mtrr(unsigned int reg, unsigned long base, unsigned long size, mtrr_type type) { unsigned long flags; struct mtrr_var_range *vr; vr = &mtrr_state.var_ranges[reg]; local_irq_save(flags); cache_disable(); if (size == 0) { /* * The invalid bit is kept in the mask, so we simply * clear the relevant mask register to disable a range. 
*/ mtrr_wrmsr(MTRRphysMask_MSR(reg), 0, 0); memset(vr, 0, sizeof(struct mtrr_var_range)); } else { vr->base_lo = base << PAGE_SHIFT | type; vr->base_hi = (base >> (32 - PAGE_SHIFT)) & ~phys_hi_rsvd; vr->mask_lo = -size << PAGE_SHIFT | MTRR_PHYSMASK_V; vr->mask_hi = (-size >> (32 - PAGE_SHIFT)) & ~phys_hi_rsvd; mtrr_wrmsr(MTRRphysBase_MSR(reg), vr->base_lo, vr->base_hi); mtrr_wrmsr(MTRRphysMask_MSR(reg), vr->mask_lo, vr->mask_hi); } cache_enable(); local_irq_restore(flags); } int generic_validate_add_page(unsigned long base, unsigned long size, unsigned int type) { unsigned long lbase, last; /* * For Intel PPro stepping <= 7 * must be 4 MiB aligned and not touch 0x70000000 -> 0x7003FFFF */ if (mtrr_if == &generic_mtrr_ops && boot_cpu_data.x86_vfm == INTEL_PENTIUM_PRO && boot_cpu_data.x86_stepping <= 7) { if (base & ((1 << (22 - PAGE_SHIFT)) - 1)) { pr_warn("mtrr: base(0x%lx000) is not 4 MiB aligned\n", base); return -EINVAL; } if (!(base + size < 0x70000 || base > 0x7003F) && (type == MTRR_TYPE_WRCOMB || type == MTRR_TYPE_WRBACK)) { pr_warn("mtrr: writable mtrr between 0x70000000 and 0x7003FFFF may hang the CPU.\n"); return -EINVAL; } } /* * Check upper bits of base and last are equal and lower bits are 0 * for base and 1 for last */ last = base + size - 1; for (lbase = base; !(lbase & 1) && (last & 1); lbase = lbase >> 1, last = last >> 1) ; if (lbase != last) { pr_warn("mtrr: base(0x%lx000) is not aligned on a size(0x%lx000) boundary\n", base, size); return -EINVAL; } return 0; } static int generic_have_wrcomb(void) { unsigned long config, dummy; rdmsr(MSR_MTRRcap, config, dummy); return config & MTRR_CAP_WC; } int positive_have_wrcomb(void) { return 1; } /* * Generic structure... */ const struct mtrr_ops generic_mtrr_ops = { .get = generic_get_mtrr, .get_free_region = generic_get_free_region, .set = generic_set_mtrr, .validate_add_page = generic_validate_add_page, .have_wrcomb = generic_have_wrcomb, }; |
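/*
 * A minimal sketch of a lookup on the cache map built above: how a caller
 * (for example the PAT code) might query the effective MTRR type of a
 * physical range with mtrr_type_lookup(). The example_* name is hypothetical;
 * @uniform reports whether the returned type covers the whole range, as
 * documented at mtrr_type_lookup().
 */
#include <linux/types.h>
#include <asm/mtrr.h>

static bool example_range_is_uniform_wb(u64 start, u64 end)
{
	u8 uniform;
	u8 type = mtrr_type_lookup(start, end, &uniform);

	return uniform && type == MTRR_TYPE_WRBACK;
}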
// SPDX-License-Identifier: GPL-2.0-or-later /* * * Copyright (c) International Business Machines Corp., 2002,2008 * Author(s): Steve French (sfrench@us.ibm.com) * * Error mapping routines from Samba libsmb/errormap.c * Copyright (C) Andrew Tridgell 2001 */ #include <linux/net.h> #include <linux/string.h> #include <linux/in.h> #include <linux/ctype.h> #include <linux/fs.h> #include <asm/div64.h> #include <asm/byteorder.h> #include <linux/inet.h> #include "cifsfs.h" #include "cifspdu.h" #include "cifsglob.h" #include "cifsproto.h" #include "smberr.h" #include "cifs_debug.h" #include "nterr.h" struct smb_to_posix_error { __u16 smb_err; int posix_code; }; static const struct smb_to_posix_error mapping_table_ERRDOS[] = { {ERRbadfunc, -EINVAL}, {ERRbadfile, -ENOENT}, {ERRbadpath, -ENOTDIR}, {ERRnofids, -EMFILE}, {ERRnoaccess, -EACCES}, {ERRbadfid, -EBADF}, {ERRbadmcb, -EIO}, {ERRnomem, -EREMOTEIO}, {ERRbadmem, -EFAULT}, {ERRbadenv, -EFAULT}, {ERRbadformat, -EINVAL}, {ERRbadaccess, -EACCES}, {ERRbaddata, -EIO}, {ERRbaddrive, -ENXIO}, {ERRremcd, -EACCES}, {ERRdiffdevice, -EXDEV}, {ERRnofiles, -ENOENT}, {ERRwriteprot, -EROFS}, {ERRbadshare, -EBUSY}, {ERRlock, -EACCES}, {ERRunsup, -EINVAL}, {ERRnosuchshare, -ENXIO}, {ERRfilexists, -EEXIST}, {ERRinvparm, -EINVAL}, {ERRdiskfull, -ENOSPC}, {ERRinvname, -ENOENT}, {ERRinvlevel, -EOPNOTSUPP}, {ERRdirnotempty, -ENOTEMPTY}, {ERRnotlocked, -ENOLCK}, {ERRcancelviolation, -ENOLCK}, {ERRalreadyexists, -EEXIST}, {ERRmoredata, -EOVERFLOW}, {ERReasnotsupported, -EOPNOTSUPP}, {ErrQuota, -EDQUOT}, {ErrNotALink, -ENOLINK}, {ERRnetlogonNotStarted, -ENOPROTOOPT}, {ERRsymlink, -EOPNOTSUPP}, {ErrTooManyLinks, -EMLINK}, {0, 0} }; static const struct smb_to_posix_error mapping_table_ERRSRV[] = { {ERRerror, -EIO}, {ERRbadpw, -EACCES}, /* was EPERM */ {ERRbadtype, -EREMOTE}, {ERRaccess, -EACCES}, {ERRinvtid, -ENXIO}, {ERRinvnetname, -ENXIO}, {ERRinvdevice, -ENXIO}, {ERRqfull, -ENOSPC}, {ERRqtoobig, -ENOSPC}, {ERRqeof, -EIO}, {ERRinvpfid, -EBADF}, {ERRsmbcmd, -EBADRQC}, {ERRsrverror, -EIO}, {ERRbadBID, -EIO}, {ERRfilespecs, -EINVAL}, {ERRbadLink, -EIO}, {ERRbadpermits, -EINVAL}, {ERRbadPID, -ESRCH}, {ERRsetattrmode, -EINVAL}, {ERRpaused, -EHOSTDOWN}, {ERRmsgoff, -EHOSTDOWN}, {ERRnoroom, -ENOSPC}, {ERRrmuns, -EUSERS}, {ERRtimeout, -ETIME}, {ERRnoresource, -EREMOTEIO}, {ERRtoomanyuids, -EUSERS}, {ERRbaduid, -EACCES}, {ERRusempx, -EIO}, {ERRusestd, -EIO}, {ERR_NOTIFY_ENUM_DIR, -ENOBUFS}, {ERRnoSuchUser, -EACCES}, /* {ERRaccountexpired, -EACCES}, {ERRbadclient, -EACCES}, {ERRbadLogonTime, -EACCES}, {ERRpasswordExpired, -EACCES},*/ {ERRaccountexpired, -EKEYEXPIRED}, {ERRbadclient, -EACCES}, {ERRbadLogonTime, -EACCES}, {ERRpasswordExpired, -EKEYEXPIRED}, {ERRnosupport, -EINVAL}, {0, 0} }; /* * Convert a string containing text IPv4 or IPv6 address to binary form. * * Returns 0 on failure.
*/ static int cifs_inet_pton(const int address_family, const char *cp, int len, void *dst) { int ret = 0; /* calculate length by finding first slash or NULL */ if (address_family == AF_INET) ret = in4_pton(cp, len, dst, '\\', NULL); else if (address_family == AF_INET6) ret = in6_pton(cp, len, dst , '\\', NULL); cifs_dbg(NOISY, "address conversion returned %d for %*.*s\n", ret, len, len, cp); if (ret > 0) ret = 1; return ret; } /* * Try to convert a string to an IPv4 address and then attempt to convert * it to an IPv6 address if that fails. Set the family field if either * succeeds. If it's an IPv6 address and it has a '%' sign in it, try to * treat the part following it as a numeric sin6_scope_id. * * Returns 0 on failure. */ int cifs_convert_address(struct sockaddr *dst, const char *src, int len) { int rc, alen, slen; const char *pct; char scope_id[13]; struct sockaddr_in *s4 = (struct sockaddr_in *) dst; struct sockaddr_in6 *s6 = (struct sockaddr_in6 *) dst; /* IPv4 address */ if (cifs_inet_pton(AF_INET, src, len, &s4->sin_addr.s_addr)) { s4->sin_family = AF_INET; return 1; } /* attempt to exclude the scope ID from the address part */ pct = memchr(src, '%', len); alen = pct ? pct - src : len; rc = cifs_inet_pton(AF_INET6, src, alen, &s6->sin6_addr.s6_addr); if (!rc) return rc; s6->sin6_family = AF_INET6; if (pct) { /* grab the scope ID */ slen = len - (alen + 1); if (slen <= 0 || slen > 12) return 0; memcpy(scope_id, pct + 1, slen); scope_id[slen] = '\0'; rc = kstrtouint(scope_id, 0, &s6->sin6_scope_id); rc = (rc == 0) ? 1 : 0; } return rc; } void cifs_set_port(struct sockaddr *addr, const unsigned short int port) { switch (addr->sa_family) { case AF_INET: ((struct sockaddr_in *)addr)->sin_port = htons(port); break; case AF_INET6: ((struct sockaddr_in6 *)addr)->sin6_port = htons(port); break; } } /***************************************************************************** *convert a NT status code to a dos class/code *****************************************************************************/ /* NT status -> dos error map */ static const struct { __u8 dos_class; __u16 dos_code; __u32 ntstatus; } ntstatus_to_dos_map[] = { { ERRDOS, ERRgeneral, NT_STATUS_UNSUCCESSFUL}, { ERRDOS, ERRbadfunc, NT_STATUS_NOT_IMPLEMENTED}, { ERRDOS, ERRinvlevel, NT_STATUS_INVALID_INFO_CLASS}, { ERRDOS, 24, NT_STATUS_INFO_LENGTH_MISMATCH}, { ERRHRD, ERRgeneral, NT_STATUS_ACCESS_VIOLATION}, { ERRHRD, ERRgeneral, NT_STATUS_IN_PAGE_ERROR}, { ERRHRD, ERRgeneral, NT_STATUS_PAGEFILE_QUOTA}, { ERRDOS, ERRbadfid, NT_STATUS_INVALID_HANDLE}, { ERRHRD, ERRgeneral, NT_STATUS_BAD_INITIAL_STACK}, { ERRDOS, 193, NT_STATUS_BAD_INITIAL_PC}, { ERRDOS, 87, NT_STATUS_INVALID_CID}, { ERRHRD, ERRgeneral, NT_STATUS_TIMER_NOT_CANCELED}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER}, { ERRDOS, ERRbadfile, NT_STATUS_NO_SUCH_DEVICE}, { ERRDOS, ERRbadfile, NT_STATUS_NO_SUCH_FILE}, { ERRDOS, ERRbadfunc, NT_STATUS_INVALID_DEVICE_REQUEST}, { ERRDOS, 38, NT_STATUS_END_OF_FILE}, { ERRDOS, 34, NT_STATUS_WRONG_VOLUME}, { ERRDOS, 21, NT_STATUS_NO_MEDIA_IN_DEVICE}, { ERRHRD, ERRgeneral, NT_STATUS_UNRECOGNIZED_MEDIA}, { ERRDOS, 27, NT_STATUS_NONEXISTENT_SECTOR}, /* { This NT error code was 'sqashed' from NT_STATUS_MORE_PROCESSING_REQUIRED to NT_STATUS_OK during the session setup } */ { ERRDOS, ERRnomem, NT_STATUS_NO_MEMORY}, { ERRDOS, 487, NT_STATUS_CONFLICTING_ADDRESSES}, { ERRDOS, 487, NT_STATUS_NOT_MAPPED_VIEW}, { ERRDOS, 87, NT_STATUS_UNABLE_TO_FREE_VM}, { ERRDOS, 87, NT_STATUS_UNABLE_TO_DELETE_SECTION}, { ERRDOS, 2142, 
NT_STATUS_INVALID_SYSTEM_SERVICE}, { ERRHRD, ERRgeneral, NT_STATUS_ILLEGAL_INSTRUCTION}, { ERRDOS, ERRnoaccess, NT_STATUS_INVALID_LOCK_SEQUENCE}, { ERRDOS, ERRnoaccess, NT_STATUS_INVALID_VIEW_SIZE}, { ERRDOS, 193, NT_STATUS_INVALID_FILE_FOR_SECTION}, { ERRDOS, ERRnoaccess, NT_STATUS_ALREADY_COMMITTED}, /* { This NT error code was 'sqashed' from NT_STATUS_ACCESS_DENIED to NT_STATUS_TRUSTED_RELATIONSHIP_FAILURE during the session setup } */ { ERRDOS, ERRnoaccess, NT_STATUS_ACCESS_DENIED}, { ERRDOS, 111, NT_STATUS_BUFFER_TOO_SMALL}, { ERRDOS, ERRbadfid, NT_STATUS_OBJECT_TYPE_MISMATCH}, { ERRHRD, ERRgeneral, NT_STATUS_NONCONTINUABLE_EXCEPTION}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_DISPOSITION}, { ERRHRD, ERRgeneral, NT_STATUS_UNWIND}, { ERRHRD, ERRgeneral, NT_STATUS_BAD_STACK}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_UNWIND_TARGET}, { ERRDOS, 158, NT_STATUS_NOT_LOCKED}, { ERRHRD, ERRgeneral, NT_STATUS_PARITY_ERROR}, { ERRDOS, 487, NT_STATUS_UNABLE_TO_DECOMMIT_VM}, { ERRDOS, 487, NT_STATUS_NOT_COMMITTED}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_PORT_ATTRIBUTES}, { ERRHRD, ERRgeneral, NT_STATUS_PORT_MESSAGE_TOO_LONG}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER_MIX}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_QUOTA_LOWER}, { ERRHRD, ERRgeneral, NT_STATUS_DISK_CORRUPT_ERROR}, { /* mapping changed since shell does lookup on * expects FileNotFound */ ERRDOS, ERRbadfile, NT_STATUS_OBJECT_NAME_INVALID}, { ERRDOS, ERRbadfile, NT_STATUS_OBJECT_NAME_NOT_FOUND}, { ERRDOS, ERRalreadyexists, NT_STATUS_OBJECT_NAME_COLLISION}, { ERRHRD, ERRgeneral, NT_STATUS_HANDLE_NOT_WAITABLE}, { ERRDOS, ERRbadfid, NT_STATUS_PORT_DISCONNECTED}, { ERRHRD, ERRgeneral, NT_STATUS_DEVICE_ALREADY_ATTACHED}, { ERRDOS, 161, NT_STATUS_OBJECT_PATH_INVALID}, { ERRDOS, ERRbadpath, NT_STATUS_OBJECT_PATH_NOT_FOUND}, { ERRDOS, 161, NT_STATUS_OBJECT_PATH_SYNTAX_BAD}, { ERRHRD, ERRgeneral, NT_STATUS_DATA_OVERRUN}, { ERRHRD, ERRgeneral, NT_STATUS_DATA_LATE_ERROR}, { ERRDOS, 23, NT_STATUS_DATA_ERROR}, { ERRDOS, 23, NT_STATUS_CRC_ERROR}, { ERRDOS, ERRnomem, NT_STATUS_SECTION_TOO_BIG}, { ERRDOS, ERRnoaccess, NT_STATUS_PORT_CONNECTION_REFUSED}, { ERRDOS, ERRbadfid, NT_STATUS_INVALID_PORT_HANDLE}, { ERRDOS, ERRbadshare, NT_STATUS_SHARING_VIOLATION}, { ERRHRD, ERRgeneral, NT_STATUS_QUOTA_EXCEEDED}, { ERRDOS, 87, NT_STATUS_INVALID_PAGE_PROTECTION}, { ERRDOS, 288, NT_STATUS_MUTANT_NOT_OWNED}, { ERRDOS, 298, NT_STATUS_SEMAPHORE_LIMIT_EXCEEDED}, { ERRDOS, 87, NT_STATUS_PORT_ALREADY_SET}, { ERRDOS, 87, NT_STATUS_SECTION_NOT_IMAGE}, { ERRDOS, 156, NT_STATUS_SUSPEND_COUNT_EXCEEDED}, { ERRDOS, ERRnoaccess, NT_STATUS_THREAD_IS_TERMINATING}, { ERRDOS, 87, NT_STATUS_BAD_WORKING_SET_LIMIT}, { ERRDOS, 87, NT_STATUS_INCOMPATIBLE_FILE_MAP}, { ERRDOS, 87, NT_STATUS_SECTION_PROTECTION}, { ERRDOS, ERReasnotsupported, NT_STATUS_EAS_NOT_SUPPORTED}, { ERRDOS, 255, NT_STATUS_EA_TOO_LARGE}, { ERRHRD, ERRgeneral, NT_STATUS_NONEXISTENT_EA_ENTRY}, { ERRHRD, ERRgeneral, NT_STATUS_NO_EAS_ON_FILE}, { ERRHRD, ERRgeneral, NT_STATUS_EA_CORRUPT_ERROR}, { ERRDOS, ERRlock, NT_STATUS_FILE_LOCK_CONFLICT}, { ERRDOS, ERRlock, NT_STATUS_LOCK_NOT_GRANTED}, { ERRDOS, ERRbadfile, NT_STATUS_DELETE_PENDING}, { ERRDOS, ERRunsup, NT_STATUS_CTL_FILE_NOT_SUPPORTED}, { ERRHRD, ERRgeneral, NT_STATUS_UNKNOWN_REVISION}, { ERRHRD, ERRgeneral, NT_STATUS_REVISION_MISMATCH}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_OWNER}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_PRIMARY_GROUP}, { ERRHRD, ERRgeneral, NT_STATUS_NO_IMPERSONATION_TOKEN}, { ERRHRD, ERRgeneral, NT_STATUS_CANT_DISABLE_MANDATORY}, { ERRDOS, 2215, 
NT_STATUS_NO_LOGON_SERVERS}, { ERRHRD, ERRgeneral, NT_STATUS_NO_SUCH_LOGON_SESSION}, { ERRHRD, ERRgeneral, NT_STATUS_NO_SUCH_PRIVILEGE}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_ACCOUNT_NAME}, { ERRHRD, ERRgeneral, NT_STATUS_USER_EXISTS}, /* { This NT error code was 'sqashed' from NT_STATUS_NO_SUCH_USER to NT_STATUS_LOGON_FAILURE during the session setup } */ { ERRDOS, ERRnoaccess, NT_STATUS_NO_SUCH_USER}, { /* could map to 2238 */ ERRHRD, ERRgeneral, NT_STATUS_GROUP_EXISTS}, { ERRHRD, ERRgeneral, NT_STATUS_NO_SUCH_GROUP}, { ERRHRD, ERRgeneral, NT_STATUS_MEMBER_IN_GROUP}, { ERRHRD, ERRgeneral, NT_STATUS_MEMBER_NOT_IN_GROUP}, { ERRHRD, ERRgeneral, NT_STATUS_LAST_ADMIN}, /* { This NT error code was 'sqashed' from NT_STATUS_WRONG_PASSWORD to NT_STATUS_LOGON_FAILURE during the session setup } */ { ERRSRV, ERRbadpw, NT_STATUS_WRONG_PASSWORD}, { ERRHRD, ERRgeneral, NT_STATUS_ILL_FORMED_PASSWORD}, { ERRHRD, ERRgeneral, NT_STATUS_PASSWORD_RESTRICTION}, { ERRDOS, ERRnoaccess, NT_STATUS_LOGON_FAILURE}, { ERRHRD, ERRgeneral, NT_STATUS_ACCOUNT_RESTRICTION}, { ERRSRV, ERRbadLogonTime, NT_STATUS_INVALID_LOGON_HOURS}, { ERRSRV, ERRbadclient, NT_STATUS_INVALID_WORKSTATION}, { ERRSRV, ERRpasswordExpired, NT_STATUS_PASSWORD_EXPIRED}, { ERRSRV, ERRaccountexpired, NT_STATUS_ACCOUNT_DISABLED}, { ERRHRD, ERRgeneral, NT_STATUS_NONE_MAPPED}, { ERRHRD, ERRgeneral, NT_STATUS_TOO_MANY_LUIDS_REQUESTED}, { ERRHRD, ERRgeneral, NT_STATUS_LUIDS_EXHAUSTED}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_SUB_AUTHORITY}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_ACL}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_SID}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_SECURITY_DESCR}, { ERRDOS, 127, NT_STATUS_PROCEDURE_NOT_FOUND}, { ERRDOS, 193, NT_STATUS_INVALID_IMAGE_FORMAT}, { ERRHRD, ERRgeneral, NT_STATUS_NO_TOKEN}, { ERRHRD, ERRgeneral, NT_STATUS_BAD_INHERITANCE_ACL}, { ERRDOS, 158, NT_STATUS_RANGE_NOT_LOCKED}, { ERRDOS, 112, NT_STATUS_DISK_FULL}, { ERRHRD, ERRgeneral, NT_STATUS_SERVER_DISABLED}, { ERRHRD, ERRgeneral, NT_STATUS_SERVER_NOT_DISABLED}, { ERRDOS, 68, NT_STATUS_TOO_MANY_GUIDS_REQUESTED}, { ERRDOS, 259, NT_STATUS_GUIDS_EXHAUSTED}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_ID_AUTHORITY}, { ERRDOS, 259, NT_STATUS_AGENTS_EXHAUSTED}, { ERRDOS, 154, NT_STATUS_INVALID_VOLUME_LABEL}, { ERRDOS, 14, NT_STATUS_SECTION_NOT_EXTENDED}, { ERRDOS, 487, NT_STATUS_NOT_MAPPED_DATA}, { ERRHRD, ERRgeneral, NT_STATUS_RESOURCE_DATA_NOT_FOUND}, { ERRHRD, ERRgeneral, NT_STATUS_RESOURCE_TYPE_NOT_FOUND}, { ERRHRD, ERRgeneral, NT_STATUS_RESOURCE_NAME_NOT_FOUND}, { ERRHRD, ERRgeneral, NT_STATUS_ARRAY_BOUNDS_EXCEEDED}, { ERRHRD, ERRgeneral, NT_STATUS_FLOAT_DENORMAL_OPERAND}, { ERRHRD, ERRgeneral, NT_STATUS_FLOAT_DIVIDE_BY_ZERO}, { ERRHRD, ERRgeneral, NT_STATUS_FLOAT_INEXACT_RESULT}, { ERRHRD, ERRgeneral, NT_STATUS_FLOAT_INVALID_OPERATION}, { ERRHRD, ERRgeneral, NT_STATUS_FLOAT_OVERFLOW}, { ERRHRD, ERRgeneral, NT_STATUS_FLOAT_STACK_CHECK}, { ERRHRD, ERRgeneral, NT_STATUS_FLOAT_UNDERFLOW}, { ERRHRD, ERRgeneral, NT_STATUS_INTEGER_DIVIDE_BY_ZERO}, { ERRDOS, 534, NT_STATUS_INTEGER_OVERFLOW}, { ERRHRD, ERRgeneral, NT_STATUS_PRIVILEGED_INSTRUCTION}, { ERRDOS, ERRnomem, NT_STATUS_TOO_MANY_PAGING_FILES}, { ERRHRD, ERRgeneral, NT_STATUS_FILE_INVALID}, { ERRHRD, ERRgeneral, NT_STATUS_ALLOTTED_SPACE_EXCEEDED}, /* { This NT error code was 'sqashed' from NT_STATUS_INSUFFICIENT_RESOURCES to NT_STATUS_INSUFF_SERVER_RESOURCES during the session setup } */ { ERRDOS, ERRnoresource, NT_STATUS_INSUFFICIENT_RESOURCES}, { ERRDOS, ERRbadpath, NT_STATUS_DFS_EXIT_PATH_FOUND}, { ERRDOS, 23, 
NT_STATUS_DEVICE_DATA_ERROR}, { ERRHRD, ERRgeneral, NT_STATUS_DEVICE_NOT_CONNECTED}, { ERRDOS, 21, NT_STATUS_DEVICE_POWER_FAILURE}, { ERRDOS, 487, NT_STATUS_FREE_VM_NOT_AT_BASE}, { ERRDOS, 487, NT_STATUS_MEMORY_NOT_ALLOCATED}, { ERRHRD, ERRgeneral, NT_STATUS_WORKING_SET_QUOTA}, { ERRDOS, 19, NT_STATUS_MEDIA_WRITE_PROTECTED}, { ERRDOS, 21, NT_STATUS_DEVICE_NOT_READY}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_GROUP_ATTRIBUTES}, { ERRHRD, ERRgeneral, NT_STATUS_BAD_IMPERSONATION_LEVEL}, { ERRHRD, ERRgeneral, NT_STATUS_CANT_OPEN_ANONYMOUS}, { ERRHRD, ERRgeneral, NT_STATUS_BAD_VALIDATION_CLASS}, { ERRHRD, ERRgeneral, NT_STATUS_BAD_TOKEN_TYPE}, { ERRDOS, 87, NT_STATUS_BAD_MASTER_BOOT_RECORD}, { ERRHRD, ERRgeneral, NT_STATUS_INSTRUCTION_MISALIGNMENT}, { ERRDOS, ERRpipebusy, NT_STATUS_INSTANCE_NOT_AVAILABLE}, { ERRDOS, ERRpipebusy, NT_STATUS_PIPE_NOT_AVAILABLE}, { ERRDOS, ERRbadpipe, NT_STATUS_INVALID_PIPE_STATE}, { ERRDOS, ERRpipebusy, NT_STATUS_PIPE_BUSY}, { ERRDOS, ERRbadfunc, NT_STATUS_ILLEGAL_FUNCTION}, { ERRDOS, ERRnotconnected, NT_STATUS_PIPE_DISCONNECTED}, { ERRDOS, ERRpipeclosing, NT_STATUS_PIPE_CLOSING}, { ERRHRD, ERRgeneral, NT_STATUS_PIPE_CONNECTED}, { ERRHRD, ERRgeneral, NT_STATUS_PIPE_LISTENING}, { ERRDOS, ERRbadpipe, NT_STATUS_INVALID_READ_MODE}, { ERRDOS, 121, NT_STATUS_IO_TIMEOUT}, { ERRDOS, 38, NT_STATUS_FILE_FORCED_CLOSED}, { ERRHRD, ERRgeneral, NT_STATUS_PROFILING_NOT_STARTED}, { ERRHRD, ERRgeneral, NT_STATUS_PROFILING_NOT_STOPPED}, { ERRHRD, ERRgeneral, NT_STATUS_COULD_NOT_INTERPRET}, { ERRDOS, ERRnoaccess, NT_STATUS_FILE_IS_A_DIRECTORY}, { ERRDOS, ERRunsup, NT_STATUS_NOT_SUPPORTED}, { ERRDOS, 51, NT_STATUS_REMOTE_NOT_LISTENING}, { ERRDOS, 52, NT_STATUS_DUPLICATE_NAME}, { ERRDOS, 53, NT_STATUS_BAD_NETWORK_PATH}, { ERRDOS, 54, NT_STATUS_NETWORK_BUSY}, { ERRDOS, 55, NT_STATUS_DEVICE_DOES_NOT_EXIST}, { ERRDOS, 56, NT_STATUS_TOO_MANY_COMMANDS}, { ERRDOS, 57, NT_STATUS_ADAPTER_HARDWARE_ERROR}, { ERRDOS, 58, NT_STATUS_INVALID_NETWORK_RESPONSE}, { ERRDOS, 59, NT_STATUS_UNEXPECTED_NETWORK_ERROR}, { ERRDOS, 60, NT_STATUS_BAD_REMOTE_ADAPTER}, { ERRDOS, 61, NT_STATUS_PRINT_QUEUE_FULL}, { ERRDOS, 62, NT_STATUS_NO_SPOOL_SPACE}, { ERRDOS, 63, NT_STATUS_PRINT_CANCELLED}, { ERRDOS, 64, NT_STATUS_NETWORK_NAME_DELETED}, { ERRDOS, 65, NT_STATUS_NETWORK_ACCESS_DENIED}, { ERRDOS, 66, NT_STATUS_BAD_DEVICE_TYPE}, { ERRDOS, ERRnosuchshare, NT_STATUS_BAD_NETWORK_NAME}, { ERRDOS, 68, NT_STATUS_TOO_MANY_NAMES}, { ERRDOS, 69, NT_STATUS_TOO_MANY_SESSIONS}, { ERRDOS, 70, NT_STATUS_SHARING_PAUSED}, { ERRDOS, 71, NT_STATUS_REQUEST_NOT_ACCEPTED}, { ERRDOS, 72, NT_STATUS_REDIRECTOR_PAUSED}, { ERRDOS, 88, NT_STATUS_NET_WRITE_FAULT}, { ERRHRD, ERRgeneral, NT_STATUS_PROFILING_AT_LIMIT}, { ERRDOS, ERRdiffdevice, NT_STATUS_NOT_SAME_DEVICE}, { ERRDOS, ERRnoaccess, NT_STATUS_FILE_RENAMED}, { ERRDOS, 240, NT_STATUS_VIRTUAL_CIRCUIT_CLOSED}, { ERRHRD, ERRgeneral, NT_STATUS_NO_SECURITY_ON_OBJECT}, { ERRHRD, ERRgeneral, NT_STATUS_CANT_WAIT}, { ERRDOS, ERRpipeclosing, NT_STATUS_PIPE_EMPTY}, { ERRHRD, ERRgeneral, NT_STATUS_CANT_ACCESS_DOMAIN_INFO}, { ERRHRD, ERRgeneral, NT_STATUS_CANT_TERMINATE_SELF}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_SERVER_STATE}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_DOMAIN_STATE}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_DOMAIN_ROLE}, { ERRHRD, ERRgeneral, NT_STATUS_NO_SUCH_DOMAIN}, { ERRHRD, ERRgeneral, NT_STATUS_DOMAIN_EXISTS}, { ERRHRD, ERRgeneral, NT_STATUS_DOMAIN_LIMIT_EXCEEDED}, { ERRDOS, 300, NT_STATUS_OPLOCK_NOT_GRANTED}, { ERRDOS, 301, NT_STATUS_INVALID_OPLOCK_PROTOCOL}, { ERRHRD, 
ERRgeneral, NT_STATUS_INTERNAL_DB_CORRUPTION}, { ERRHRD, ERRgeneral, NT_STATUS_INTERNAL_ERROR}, { ERRHRD, ERRgeneral, NT_STATUS_GENERIC_NOT_MAPPED}, { ERRHRD, ERRgeneral, NT_STATUS_BAD_DESCRIPTOR_FORMAT}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_USER_BUFFER}, { ERRHRD, ERRgeneral, NT_STATUS_UNEXPECTED_IO_ERROR}, { ERRHRD, ERRgeneral, NT_STATUS_UNEXPECTED_MM_CREATE_ERR}, { ERRHRD, ERRgeneral, NT_STATUS_UNEXPECTED_MM_MAP_ERROR}, { ERRHRD, ERRgeneral, NT_STATUS_UNEXPECTED_MM_EXTEND_ERR}, { ERRHRD, ERRgeneral, NT_STATUS_NOT_LOGON_PROCESS}, { ERRHRD, ERRgeneral, NT_STATUS_LOGON_SESSION_EXISTS}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER_1}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER_2}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER_3}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER_4}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER_5}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER_6}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER_7}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER_8}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER_9}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER_10}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER_11}, { ERRDOS, 87, NT_STATUS_INVALID_PARAMETER_12}, { ERRDOS, ERRbadpath, NT_STATUS_REDIRECTOR_NOT_STARTED}, { ERRHRD, ERRgeneral, NT_STATUS_REDIRECTOR_STARTED}, { ERRHRD, ERRgeneral, NT_STATUS_STACK_OVERFLOW}, { ERRHRD, ERRgeneral, NT_STATUS_NO_SUCH_PACKAGE}, { ERRHRD, ERRgeneral, NT_STATUS_BAD_FUNCTION_TABLE}, { ERRDOS, 203, 0xc0000100}, { ERRDOS, 145, NT_STATUS_DIRECTORY_NOT_EMPTY}, { ERRHRD, ERRgeneral, NT_STATUS_FILE_CORRUPT_ERROR}, { ERRDOS, 267, NT_STATUS_NOT_A_DIRECTORY}, { ERRHRD, ERRgeneral, NT_STATUS_BAD_LOGON_SESSION_STATE}, { ERRHRD, ERRgeneral, NT_STATUS_LOGON_SESSION_COLLISION}, { ERRDOS, 206, NT_STATUS_NAME_TOO_LONG}, { ERRDOS, 2401, NT_STATUS_FILES_OPEN}, { ERRDOS, 2404, NT_STATUS_CONNECTION_IN_USE}, { ERRHRD, ERRgeneral, NT_STATUS_MESSAGE_NOT_FOUND}, { ERRDOS, ERRnoaccess, NT_STATUS_PROCESS_IS_TERMINATING}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_LOGON_TYPE}, { ERRHRD, ERRgeneral, NT_STATUS_NO_GUID_TRANSLATION}, { ERRHRD, ERRgeneral, NT_STATUS_CANNOT_IMPERSONATE}, { ERRHRD, ERRgeneral, NT_STATUS_IMAGE_ALREADY_LOADED}, { ERRHRD, ERRgeneral, NT_STATUS_ABIOS_NOT_PRESENT}, { ERRHRD, ERRgeneral, NT_STATUS_ABIOS_LID_NOT_EXIST}, { ERRHRD, ERRgeneral, NT_STATUS_ABIOS_LID_ALREADY_OWNED}, { ERRHRD, ERRgeneral, NT_STATUS_ABIOS_NOT_LID_OWNER}, { ERRHRD, ERRgeneral, NT_STATUS_ABIOS_INVALID_COMMAND}, { ERRHRD, ERRgeneral, NT_STATUS_ABIOS_INVALID_LID}, { ERRHRD, ERRgeneral, NT_STATUS_ABIOS_SELECTOR_NOT_AVAILABLE}, { ERRHRD, ERRgeneral, NT_STATUS_ABIOS_INVALID_SELECTOR}, { ERRHRD, ERRgeneral, NT_STATUS_NO_LDT}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_LDT_SIZE}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_LDT_OFFSET}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_LDT_DESCRIPTOR}, { ERRDOS, 193, NT_STATUS_INVALID_IMAGE_NE_FORMAT}, { ERRHRD, ERRgeneral, NT_STATUS_RXACT_INVALID_STATE}, { ERRHRD, ERRgeneral, NT_STATUS_RXACT_COMMIT_FAILURE}, { ERRHRD, ERRgeneral, NT_STATUS_MAPPED_FILE_SIZE_ZERO}, { ERRDOS, ERRnofids, NT_STATUS_TOO_MANY_OPENED_FILES}, { ERRHRD, ERRgeneral, NT_STATUS_CANCELLED}, { ERRDOS, ERRnoaccess, NT_STATUS_CANNOT_DELETE}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_COMPUTER_NAME}, { ERRDOS, ERRnoaccess, NT_STATUS_FILE_DELETED}, { ERRHRD, ERRgeneral, NT_STATUS_SPECIAL_ACCOUNT}, { ERRHRD, ERRgeneral, NT_STATUS_SPECIAL_GROUP}, { ERRHRD, ERRgeneral, NT_STATUS_SPECIAL_USER}, { ERRHRD, ERRgeneral, NT_STATUS_MEMBERS_PRIMARY_GROUP}, { ERRDOS, ERRbadfid, NT_STATUS_FILE_CLOSED}, { ERRHRD, ERRgeneral, 
NT_STATUS_TOO_MANY_THREADS}, { ERRHRD, ERRgeneral, NT_STATUS_THREAD_NOT_IN_PROCESS}, { ERRHRD, ERRgeneral, NT_STATUS_TOKEN_ALREADY_IN_USE}, { ERRHRD, ERRgeneral, NT_STATUS_PAGEFILE_QUOTA_EXCEEDED}, { ERRHRD, ERRgeneral, NT_STATUS_COMMITMENT_LIMIT}, { ERRDOS, 193, NT_STATUS_INVALID_IMAGE_LE_FORMAT}, { ERRDOS, 193, NT_STATUS_INVALID_IMAGE_NOT_MZ}, { ERRDOS, 193, NT_STATUS_INVALID_IMAGE_PROTECT}, { ERRDOS, 193, NT_STATUS_INVALID_IMAGE_WIN_16}, { ERRHRD, ERRgeneral, NT_STATUS_LOGON_SERVER_CONFLICT}, { ERRHRD, ERRgeneral, NT_STATUS_TIME_DIFFERENCE_AT_DC}, { ERRHRD, ERRgeneral, NT_STATUS_SYNCHRONIZATION_REQUIRED}, { ERRDOS, 126, NT_STATUS_DLL_NOT_FOUND}, { ERRHRD, ERRgeneral, NT_STATUS_OPEN_FAILED}, { ERRHRD, ERRgeneral, NT_STATUS_IO_PRIVILEGE_FAILED}, { ERRDOS, 182, NT_STATUS_ORDINAL_NOT_FOUND}, { ERRDOS, 127, NT_STATUS_ENTRYPOINT_NOT_FOUND}, { ERRHRD, ERRgeneral, NT_STATUS_CONTROL_C_EXIT}, { ERRDOS, 64, NT_STATUS_LOCAL_DISCONNECT}, { ERRDOS, 64, NT_STATUS_REMOTE_DISCONNECT}, { ERRDOS, 51, NT_STATUS_REMOTE_RESOURCES}, { ERRDOS, 59, NT_STATUS_LINK_FAILED}, { ERRDOS, 59, NT_STATUS_LINK_TIMEOUT}, { ERRDOS, 59, NT_STATUS_INVALID_CONNECTION}, { ERRDOS, 59, NT_STATUS_INVALID_ADDRESS}, { ERRHRD, ERRgeneral, NT_STATUS_DLL_INIT_FAILED}, { ERRHRD, ERRgeneral, NT_STATUS_MISSING_SYSTEMFILE}, { ERRHRD, ERRgeneral, NT_STATUS_UNHANDLED_EXCEPTION}, { ERRHRD, ERRgeneral, NT_STATUS_APP_INIT_FAILURE}, { ERRHRD, ERRgeneral, NT_STATUS_PAGEFILE_CREATE_FAILED}, { ERRHRD, ERRgeneral, NT_STATUS_NO_PAGEFILE}, { ERRDOS, 124, NT_STATUS_INVALID_LEVEL}, { ERRDOS, 86, NT_STATUS_WRONG_PASSWORD_CORE}, { ERRHRD, ERRgeneral, NT_STATUS_ILLEGAL_FLOAT_CONTEXT}, { ERRDOS, 109, NT_STATUS_PIPE_BROKEN}, { ERRHRD, ERRgeneral, NT_STATUS_REGISTRY_CORRUPT}, { ERRHRD, ERRgeneral, NT_STATUS_REGISTRY_IO_FAILED}, { ERRHRD, ERRgeneral, NT_STATUS_NO_EVENT_PAIR}, { ERRHRD, ERRgeneral, NT_STATUS_UNRECOGNIZED_VOLUME}, { ERRHRD, ERRgeneral, NT_STATUS_SERIAL_NO_DEVICE_INITED}, { ERRHRD, ERRgeneral, NT_STATUS_NO_SUCH_ALIAS}, { ERRHRD, ERRgeneral, NT_STATUS_MEMBER_NOT_IN_ALIAS}, { ERRHRD, ERRgeneral, NT_STATUS_MEMBER_IN_ALIAS}, { ERRHRD, ERRgeneral, NT_STATUS_ALIAS_EXISTS}, { ERRHRD, ERRgeneral, NT_STATUS_LOGON_NOT_GRANTED}, { ERRHRD, ERRgeneral, NT_STATUS_TOO_MANY_SECRETS}, { ERRHRD, ERRgeneral, NT_STATUS_SECRET_TOO_LONG}, { ERRHRD, ERRgeneral, NT_STATUS_INTERNAL_DB_ERROR}, { ERRHRD, ERRgeneral, NT_STATUS_FULLSCREEN_MODE}, { ERRHRD, ERRgeneral, NT_STATUS_TOO_MANY_CONTEXT_IDS}, { ERRDOS, ERRnoaccess, NT_STATUS_LOGON_TYPE_NOT_GRANTED}, { ERRHRD, ERRgeneral, NT_STATUS_NOT_REGISTRY_FILE}, { ERRHRD, ERRgeneral, NT_STATUS_NT_CROSS_ENCRYPTION_REQUIRED}, { ERRHRD, ERRgeneral, NT_STATUS_DOMAIN_CTRLR_CONFIG_ERROR}, { ERRHRD, ERRgeneral, NT_STATUS_FT_MISSING_MEMBER}, { ERRHRD, ERRgeneral, NT_STATUS_ILL_FORMED_SERVICE_ENTRY}, { ERRHRD, ERRgeneral, NT_STATUS_ILLEGAL_CHARACTER}, { ERRHRD, ERRgeneral, NT_STATUS_UNMAPPABLE_CHARACTER}, { ERRHRD, ERRgeneral, NT_STATUS_UNDEFINED_CHARACTER}, { ERRHRD, ERRgeneral, NT_STATUS_FLOPPY_VOLUME}, { ERRHRD, ERRgeneral, NT_STATUS_FLOPPY_ID_MARK_NOT_FOUND}, { ERRHRD, ERRgeneral, NT_STATUS_FLOPPY_WRONG_CYLINDER}, { ERRHRD, ERRgeneral, NT_STATUS_FLOPPY_UNKNOWN_ERROR}, { ERRHRD, ERRgeneral, NT_STATUS_FLOPPY_BAD_REGISTERS}, { ERRHRD, ERRgeneral, NT_STATUS_DISK_RECALIBRATE_FAILED}, { ERRHRD, ERRgeneral, NT_STATUS_DISK_OPERATION_FAILED}, { ERRHRD, ERRgeneral, NT_STATUS_DISK_RESET_FAILED}, { ERRHRD, ERRgeneral, NT_STATUS_SHARED_IRQ_BUSY}, { ERRHRD, ERRgeneral, NT_STATUS_FT_ORPHANING}, { ERRHRD, ERRgeneral, 0xc000016e}, { ERRHRD, 
ERRgeneral, 0xc000016f}, { ERRHRD, ERRgeneral, 0xc0000170}, { ERRHRD, ERRgeneral, 0xc0000171}, { ERRHRD, ERRgeneral, NT_STATUS_PARTITION_FAILURE}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_BLOCK_LENGTH}, { ERRHRD, ERRgeneral, NT_STATUS_DEVICE_NOT_PARTITIONED}, { ERRHRD, ERRgeneral, NT_STATUS_UNABLE_TO_LOCK_MEDIA}, { ERRHRD, ERRgeneral, NT_STATUS_UNABLE_TO_UNLOAD_MEDIA}, { ERRHRD, ERRgeneral, NT_STATUS_EOM_OVERFLOW}, { ERRHRD, ERRgeneral, NT_STATUS_NO_MEDIA}, { ERRHRD, ERRgeneral, 0xc0000179}, { ERRHRD, ERRgeneral, NT_STATUS_NO_SUCH_MEMBER}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_MEMBER}, { ERRHRD, ERRgeneral, NT_STATUS_KEY_DELETED}, { ERRHRD, ERRgeneral, NT_STATUS_NO_LOG_SPACE}, { ERRHRD, ERRgeneral, NT_STATUS_TOO_MANY_SIDS}, { ERRHRD, ERRgeneral, NT_STATUS_LM_CROSS_ENCRYPTION_REQUIRED}, { ERRHRD, ERRgeneral, NT_STATUS_KEY_HAS_CHILDREN}, { ERRHRD, ERRgeneral, NT_STATUS_CHILD_MUST_BE_VOLATILE}, { ERRDOS, 87, NT_STATUS_DEVICE_CONFIGURATION_ERROR}, { ERRHRD, ERRgeneral, NT_STATUS_DRIVER_INTERNAL_ERROR}, { ERRDOS, 22, NT_STATUS_INVALID_DEVICE_STATE}, { ERRHRD, ERRgeneral, NT_STATUS_IO_DEVICE_ERROR}, { ERRHRD, ERRgeneral, NT_STATUS_DEVICE_PROTOCOL_ERROR}, { ERRHRD, ERRgeneral, NT_STATUS_BACKUP_CONTROLLER}, { ERRHRD, ERRgeneral, NT_STATUS_LOG_FILE_FULL}, { ERRDOS, 19, NT_STATUS_TOO_LATE}, { ERRDOS, ERRnoaccess, NT_STATUS_NO_TRUST_LSA_SECRET}, /* { This NT error code was 'sqashed' from NT_STATUS_NO_TRUST_SAM_ACCOUNT to NT_STATUS_TRUSTED_RELATIONSHIP_FAILURE during the session setup } */ { ERRDOS, ERRnoaccess, NT_STATUS_NO_TRUST_SAM_ACCOUNT}, { ERRDOS, ERRnoaccess, NT_STATUS_TRUSTED_DOMAIN_FAILURE}, { ERRDOS, ERRnoaccess, NT_STATUS_TRUSTED_RELATIONSHIP_FAILURE}, { ERRHRD, ERRgeneral, NT_STATUS_EVENTLOG_FILE_CORRUPT}, { ERRHRD, ERRgeneral, NT_STATUS_EVENTLOG_CANT_START}, { ERRDOS, ERRnoaccess, NT_STATUS_TRUST_FAILURE}, { ERRHRD, ERRgeneral, NT_STATUS_MUTANT_LIMIT_EXCEEDED}, { ERRDOS, ERRnetlogonNotStarted, NT_STATUS_NETLOGON_NOT_STARTED}, { ERRSRV, ERRaccountexpired, NT_STATUS_ACCOUNT_EXPIRED}, { ERRHRD, ERRgeneral, NT_STATUS_POSSIBLE_DEADLOCK}, { ERRHRD, ERRgeneral, NT_STATUS_NETWORK_CREDENTIAL_CONFLICT}, { ERRHRD, ERRgeneral, NT_STATUS_REMOTE_SESSION_LIMIT}, { ERRHRD, ERRgeneral, NT_STATUS_EVENTLOG_FILE_CHANGED}, { ERRDOS, ERRnoaccess, NT_STATUS_NOLOGON_INTERDOMAIN_TRUST_ACCOUNT}, { ERRDOS, ERRnoaccess, NT_STATUS_NOLOGON_WORKSTATION_TRUST_ACCOUNT}, { ERRDOS, ERRnoaccess, NT_STATUS_NOLOGON_SERVER_TRUST_ACCOUNT}, /* { This NT error code was 'sqashed' from NT_STATUS_DOMAIN_TRUST_INCONSISTENT to NT_STATUS_LOGON_FAILURE during the session setup } */ { ERRDOS, ERRnoaccess, NT_STATUS_DOMAIN_TRUST_INCONSISTENT}, { ERRHRD, ERRgeneral, NT_STATUS_FS_DRIVER_REQUIRED}, { ERRHRD, ERRgeneral, NT_STATUS_NO_USER_SESSION_KEY}, { ERRDOS, 59, NT_STATUS_USER_SESSION_DELETED}, { ERRHRD, ERRgeneral, NT_STATUS_RESOURCE_LANG_NOT_FOUND}, { ERRDOS, ERRnoresource, NT_STATUS_INSUFF_SERVER_RESOURCES}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_BUFFER_SIZE}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_ADDRESS_COMPONENT}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_ADDRESS_WILDCARD}, { ERRDOS, 68, NT_STATUS_TOO_MANY_ADDRESSES}, { ERRDOS, 52, NT_STATUS_ADDRESS_ALREADY_EXISTS}, { ERRDOS, 64, NT_STATUS_ADDRESS_CLOSED}, { ERRDOS, 64, NT_STATUS_CONNECTION_DISCONNECTED}, { ERRDOS, 64, NT_STATUS_CONNECTION_RESET}, { ERRDOS, 68, NT_STATUS_TOO_MANY_NODES}, { ERRDOS, 59, NT_STATUS_TRANSACTION_ABORTED}, { ERRDOS, 59, NT_STATUS_TRANSACTION_TIMED_OUT}, { ERRDOS, 59, NT_STATUS_TRANSACTION_NO_RELEASE}, { ERRDOS, 59, NT_STATUS_TRANSACTION_NO_MATCH}, { 
ERRDOS, 59, NT_STATUS_TRANSACTION_RESPONDED}, { ERRDOS, 59, NT_STATUS_TRANSACTION_INVALID_ID}, { ERRDOS, 59, NT_STATUS_TRANSACTION_INVALID_TYPE}, { ERRDOS, ERRunsup, NT_STATUS_NOT_SERVER_SESSION}, { ERRDOS, ERRunsup, NT_STATUS_NOT_CLIENT_SESSION}, { ERRHRD, ERRgeneral, NT_STATUS_CANNOT_LOAD_REGISTRY_FILE}, { ERRHRD, ERRgeneral, NT_STATUS_DEBUG_ATTACH_FAILED}, { ERRHRD, ERRgeneral, NT_STATUS_SYSTEM_PROCESS_TERMINATED}, { ERRHRD, ERRgeneral, NT_STATUS_DATA_NOT_ACCEPTED}, { ERRHRD, ERRgeneral, NT_STATUS_NO_BROWSER_SERVERS_FOUND}, { ERRHRD, ERRgeneral, NT_STATUS_VDM_HARD_ERROR}, { ERRHRD, ERRgeneral, NT_STATUS_DRIVER_CANCEL_TIMEOUT}, { ERRHRD, ERRgeneral, NT_STATUS_REPLY_MESSAGE_MISMATCH}, { ERRHRD, ERRgeneral, NT_STATUS_MAPPED_ALIGNMENT}, { ERRDOS, 193, NT_STATUS_IMAGE_CHECKSUM_MISMATCH}, { ERRHRD, ERRgeneral, NT_STATUS_LOST_WRITEBEHIND_DATA}, { ERRHRD, ERRgeneral, NT_STATUS_CLIENT_SERVER_PARAMETERS_INVALID}, { ERRSRV, ERRpasswordExpired, NT_STATUS_PASSWORD_MUST_CHANGE}, { ERRHRD, ERRgeneral, NT_STATUS_NOT_FOUND}, { ERRHRD, ERRgeneral, NT_STATUS_NOT_TINY_STREAM}, { ERRHRD, ERRgeneral, NT_STATUS_RECOVERY_FAILURE}, { ERRHRD, ERRgeneral, NT_STATUS_STACK_OVERFLOW_READ}, { ERRHRD, ERRgeneral, NT_STATUS_FAIL_CHECK}, { ERRHRD, ERRgeneral, NT_STATUS_DUPLICATE_OBJECTID}, { ERRHRD, ERRgeneral, NT_STATUS_OBJECTID_EXISTS}, { ERRHRD, ERRgeneral, NT_STATUS_CONVERT_TO_LARGE}, { ERRHRD, ERRgeneral, NT_STATUS_RETRY}, { ERRHRD, ERRgeneral, NT_STATUS_FOUND_OUT_OF_SCOPE}, { ERRHRD, ERRgeneral, NT_STATUS_ALLOCATE_BUCKET}, { ERRHRD, ERRgeneral, NT_STATUS_PROPSET_NOT_FOUND}, { ERRHRD, ERRgeneral, NT_STATUS_MARSHALL_OVERFLOW}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_VARIANT}, { ERRHRD, ERRgeneral, NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND}, { ERRDOS, ERRnoaccess, NT_STATUS_ACCOUNT_LOCKED_OUT}, { ERRDOS, ERRbadfid, NT_STATUS_HANDLE_NOT_CLOSABLE}, { ERRHRD, ERRgeneral, NT_STATUS_CONNECTION_REFUSED}, { ERRHRD, ERRgeneral, NT_STATUS_GRACEFUL_DISCONNECT}, { ERRHRD, ERRgeneral, NT_STATUS_ADDRESS_ALREADY_ASSOCIATED}, { ERRHRD, ERRgeneral, NT_STATUS_ADDRESS_NOT_ASSOCIATED}, { ERRHRD, ERRgeneral, NT_STATUS_CONNECTION_INVALID}, { ERRHRD, ERRgeneral, NT_STATUS_CONNECTION_ACTIVE}, { ERRHRD, ERRgeneral, NT_STATUS_NETWORK_UNREACHABLE}, { ERRHRD, ERRgeneral, NT_STATUS_HOST_UNREACHABLE}, { ERRHRD, ERRgeneral, NT_STATUS_PROTOCOL_UNREACHABLE}, { ERRHRD, ERRgeneral, NT_STATUS_PORT_UNREACHABLE}, { ERRHRD, ERRgeneral, NT_STATUS_REQUEST_ABORTED}, { ERRHRD, ERRgeneral, NT_STATUS_CONNECTION_ABORTED}, { ERRHRD, ERRgeneral, NT_STATUS_BAD_COMPRESSION_BUFFER}, { ERRHRD, ERRgeneral, NT_STATUS_USER_MAPPED_FILE}, { ERRHRD, ERRgeneral, NT_STATUS_AUDIT_FAILED}, { ERRHRD, ERRgeneral, NT_STATUS_TIMER_RESOLUTION_NOT_SET}, { ERRHRD, ERRgeneral, NT_STATUS_CONNECTION_COUNT_LIMIT}, { ERRHRD, ERRgeneral, NT_STATUS_LOGIN_TIME_RESTRICTION}, { ERRHRD, ERRgeneral, NT_STATUS_LOGIN_WKSTA_RESTRICTION}, { ERRDOS, 193, NT_STATUS_IMAGE_MP_UP_MISMATCH}, { ERRHRD, ERRgeneral, 0xc000024a}, { ERRHRD, ERRgeneral, 0xc000024b}, { ERRHRD, ERRgeneral, 0xc000024c}, { ERRHRD, ERRgeneral, 0xc000024d}, { ERRHRD, ERRgeneral, 0xc000024e}, { ERRHRD, ERRgeneral, 0xc000024f}, { ERRHRD, ERRgeneral, NT_STATUS_INSUFFICIENT_LOGON_INFO}, { ERRHRD, ERRgeneral, NT_STATUS_BAD_DLL_ENTRYPOINT}, { ERRHRD, ERRgeneral, NT_STATUS_BAD_SERVICE_ENTRYPOINT}, { ERRHRD, ERRgeneral, NT_STATUS_LPC_REPLY_LOST}, { ERRHRD, ERRgeneral, NT_STATUS_IP_ADDRESS_CONFLICT1}, { ERRHRD, ERRgeneral, NT_STATUS_IP_ADDRESS_CONFLICT2}, { ERRHRD, ERRgeneral, NT_STATUS_REGISTRY_QUOTA_LIMIT}, { ERRSRV, 3, 
NT_STATUS_PATH_NOT_COVERED}, { ERRHRD, ERRgeneral, NT_STATUS_NO_CALLBACK_ACTIVE}, { ERRHRD, ERRgeneral, NT_STATUS_LICENSE_QUOTA_EXCEEDED}, { ERRHRD, ERRgeneral, NT_STATUS_PWD_TOO_SHORT}, { ERRHRD, ERRgeneral, NT_STATUS_PWD_TOO_RECENT}, { ERRHRD, ERRgeneral, NT_STATUS_PWD_HISTORY_CONFLICT}, { ERRHRD, ERRgeneral, 0xc000025d}, { ERRHRD, ERRgeneral, NT_STATUS_PLUGPLAY_NO_DEVICE}, { ERRHRD, ERRgeneral, NT_STATUS_UNSUPPORTED_COMPRESSION}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_HW_PROFILE}, { ERRHRD, ERRgeneral, NT_STATUS_INVALID_PLUGPLAY_DEVICE_PATH}, { ERRDOS, 182, NT_STATUS_DRIVER_ORDINAL_NOT_FOUND}, { ERRDOS, 127, NT_STATUS_DRIVER_ENTRYPOINT_NOT_FOUND}, { ERRDOS, 288, NT_STATUS_RESOURCE_NOT_OWNED}, { ERRDOS, ErrTooManyLinks, NT_STATUS_TOO_MANY_LINKS}, { ERRHRD, ERRgeneral, NT_STATUS_QUOTA_LIST_INCONSISTENT}, { ERRHRD, ERRgeneral, NT_STATUS_FILE_IS_OFFLINE}, { ERRDOS, 21, 0xc000026e}, { ERRDOS, 161, 0xc0000281}, { ERRDOS, ERRnoaccess, 0xc000028a}, { ERRDOS, ERRnoaccess, 0xc000028b}, { ERRHRD, ERRgeneral, 0xc000028c}, { ERRDOS, ERRnoaccess, 0xc000028d}, { ERRDOS, ERRnoaccess, 0xc000028e}, { ERRDOS, ERRnoaccess, 0xc000028f}, { ERRDOS, ERRnoaccess, 0xc0000290}, { ERRDOS, ERRbadfunc, 0xc000029c}, { ERRDOS, ERRsymlink, NT_STATUS_STOPPED_ON_SYMLINK}, { ERRDOS, ERRinvlevel, 0x007c0001}, { 0, 0, 0 } }; /***************************************************************************** Print an error message from the status code *****************************************************************************/ static void cifs_print_status(__u32 status_code) { int idx = 0; while (nt_errs[idx].nt_errstr != NULL) { if (nt_errs[idx].nt_errcode == status_code) { pr_notice("Status code returned 0x%08x %s\n", status_code, nt_errs[idx].nt_errstr); return; } idx++; } return; } static void ntstatus_to_dos(__u32 ntstatus, __u8 *eclass, __u16 *ecode) { int i; if (ntstatus == 0) { *eclass = 0; *ecode = 0; return; } for (i = 0; ntstatus_to_dos_map[i].ntstatus; i++) { if (ntstatus == ntstatus_to_dos_map[i].ntstatus) { *eclass = ntstatus_to_dos_map[i].dos_class; *ecode = ntstatus_to_dos_map[i].dos_code; return; } } *eclass = ERRHRD; *ecode = ERRgeneral; } int map_smb_to_linux_error(char *buf, bool logErr) { struct smb_hdr *smb = (struct smb_hdr *)buf; unsigned int i; int rc = -EIO; /* if transport error smb error may not be set */ __u8 smberrclass; __u16 smberrcode; /* BB if NT Status codes - map NT BB */ /* old style smb error codes */ if (smb->Status.CifsError == 0) return 0; if (smb->Flags2 & SMBFLG2_ERR_STATUS) { /* translate the newer STATUS codes to old style SMB errors * and then to POSIX errors */ __u32 err = le32_to_cpu(smb->Status.CifsError); if (logErr && (err != (NT_STATUS_MORE_PROCESSING_REQUIRED))) cifs_print_status(err); else if (cifsFYI & CIFS_RC) cifs_print_status(err); ntstatus_to_dos(err, &smberrclass, &smberrcode); } else { smberrclass = smb->Status.DosError.ErrorClass; smberrcode = le16_to_cpu(smb->Status.DosError.Error); } /* old style errors */ /* DOS class smb error codes - map DOS */ if (smberrclass == ERRDOS) { /* 1 byte field no need to byte reverse */ for (i = 0; i < sizeof(mapping_table_ERRDOS) / sizeof(struct smb_to_posix_error); i++) { if (mapping_table_ERRDOS[i].smb_err == 0) break; else if (mapping_table_ERRDOS[i].smb_err == smberrcode) { rc = mapping_table_ERRDOS[i].posix_code; break; } /* else try next error mapping one to see if match */ } } else if (smberrclass == ERRSRV) { /* server class of error codes */ for (i = 0; i < sizeof(mapping_table_ERRSRV) / sizeof(struct smb_to_posix_error); 
i++) { if (mapping_table_ERRSRV[i].smb_err == 0) break; else if (mapping_table_ERRSRV[i].smb_err == smberrcode) { rc = mapping_table_ERRSRV[i].posix_code; break; } /* else try next error mapping to see if match */ } } /* else ERRHRD class errors or junk - return EIO */ /* special cases for NT status codes which cannot be translated to DOS codes */ if (smb->Flags2 & SMBFLG2_ERR_STATUS) { __u32 err = le32_to_cpu(smb->Status.CifsError); if (err == (NT_STATUS_NOT_A_REPARSE_POINT)) rc = -ENODATA; else if (err == (NT_STATUS_PRIVILEGE_NOT_HELD)) rc = -EPERM; } cifs_dbg(FYI, "Mapping smb error code 0x%x to POSIX err %d\n", le32_to_cpu(smb->Status.CifsError), rc); /* generic corrective action e.g. reconnect SMB session on * ERRbaduid could be added */ if (rc == -EIO) smb_EIO2(smb_eio_trace_smb1_received_error, le32_to_cpu(smb->Status.CifsError), le16_to_cpu(smb->Flags2)); return rc; } int map_and_check_smb_error(struct TCP_Server_Info *server, struct mid_q_entry *mid, bool logErr) { int rc; struct smb_hdr *smb = (struct smb_hdr *)mid->resp_buf; rc = map_smb_to_linux_error((char *)smb, logErr); if (rc == -EACCES && !(smb->Flags2 & SMBFLG2_ERR_STATUS)) { /* possible ERRBaduid */ __u8 class = smb->Status.DosError.ErrorClass; __u16 code = le16_to_cpu(smb->Status.DosError.Error); /* switch can be used to handle different errors */ if (class == ERRSRV && code == ERRbaduid) { cifs_dbg(FYI, "Server returned 0x%x, reconnecting session...\n", code); cifs_signal_cifsd_for_reconnect(server, false); } } return rc; } /* * calculate the size of the SMB message based on the fixed header * portion, the number of word parameters and the data portion of the message */ unsigned int smbCalcSize(void *buf) { struct smb_hdr *ptr = buf; return (sizeof(struct smb_hdr) + (2 * ptr->WordCount) + 2 /* size of the bcc field */ + get_bcc(ptr)); } /* The following are taken from fs/ntfs/util.c */ #define NTFS_TIME_OFFSET ((u64)(369*365 + 89) * 24 * 3600 * 10000000) /* * Convert the NT UTC (based 1601-01-01, in hundred nanosecond units) * into Unix UTC (based 1970-01-01, in seconds). */ struct timespec64 cifs_NTtimeToUnix(__le64 ntutc) { struct timespec64 ts; /* BB what about the timezone? BB */ /* Subtract the NTFS time offset, then convert to 1s intervals. */ s64 t = le64_to_cpu(ntutc) - NTFS_TIME_OFFSET; u64 abs_t; /* * Unfortunately can not use normal 64 bit division on 32 bit arch, but * the alternative, do_div, does not work with negative numbers so have * to special case them */ if (t < 0) { abs_t = -t; ts.tv_nsec = (time64_t)(do_div(abs_t, 10000000) * 100); ts.tv_nsec = -ts.tv_nsec; ts.tv_sec = -abs_t; } else { abs_t = t; ts.tv_nsec = (time64_t)do_div(abs_t, 10000000) * 100; ts.tv_sec = abs_t; } return ts; } /* Convert the Unix UTC into NT UTC. */ u64 cifs_UnixTimeToNT(struct timespec64 t) { /* Convert to 100ns intervals and then add the NTFS time offset. 
 */
	return (u64) t.tv_sec * 10000000 + t.tv_nsec / 100 + NTFS_TIME_OFFSET;
}

static const int total_days_of_prev_months[] = {
	0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334
};

struct timespec64 cnvrtDosUnixTm(__le16 le_date, __le16 le_time, int offset)
{
	struct timespec64 ts;
	time64_t sec, days;
	int min, day, month, year;
	u16 date = le16_to_cpu(le_date);
	u16 time = le16_to_cpu(le_time);
	SMB_TIME *st = (SMB_TIME *)&time;
	SMB_DATE *sd = (SMB_DATE *)&date;

	cifs_dbg(FYI, "date %d time %d\n", date, time);

	sec = 2 * st->TwoSeconds;
	min = st->Minutes;
	if ((sec > 59) || (min > 59))
		cifs_dbg(VFS, "Invalid time min %d sec %lld\n", min, sec);
	sec += (min * 60);
	sec += 60 * 60 * st->Hours;
	if (st->Hours > 24)
		cifs_dbg(VFS, "Invalid hours %d\n", st->Hours);
	day = sd->Day;
	month = sd->Month;
	if (day < 1 || day > 31 || month < 1 || month > 12) {
		cifs_dbg(VFS, "Invalid date, month %d day: %d\n", month, day);
		day = clamp(day, 1, 31);
		month = clamp(month, 1, 12);
	}
	month -= 1;
	days = day + total_days_of_prev_months[month];

	/* account for difference in days between 1980 and 1970 */
	days += 3652;

	year = sd->Year;
	days += year * 365;
	days += (year / 4); /* leap year */

	/*
	 * Generalized leap year calculation is more complex, i.e. no leap year
	 * for years divisible by 100 except those divisible by 400, but since
	 * the maximum number for a DOS year is 2**7, the last year is
	 * 1980+127, which means we need only consider 2 special case years,
	 * i.e. the years 2000 and 2100, and only adjust for the lack of a
	 * leap year in 2100, as 2000 was a leap year (divisible by 400).
	 */
	if (year >= 120) /* the year 2100 */
		days = days - 1; /* do not count leap year for the year 2100 */

	/* adjust for leap year where we are still before leap day */
	if (year != 120)
		days -= ((year & 0x03) == 0) && (month < 2 ? 1 : 0);
	sec += 24 * 60 * 60 * days;

	ts.tv_sec = sec + offset;

	/* cifs_dbg(FYI, "sec after cnvrt dos to unix time %d\n",sec); */

	ts.tv_nsec = 0;
	return ts;
}
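/*
 * Illustrative sketch only (not part of the driver): the epoch arithmetic
 * that cifs_NTtimeToUnix()/cifs_UnixTimeToNT() perform above, rewritten as a
 * self-contained userspace program using plain 64-bit math (no do_div(), no
 * endian handling). The helper names and the NT_UNIX_EPOCH_DELTA macro are
 * hypothetical and exist only for this example; the offset value matches the
 * NTFS_TIME_OFFSET definition above.
 */
#include <stdint.h>
#include <stdio.h>

/* 100ns ticks between 1601-01-01 (NT epoch) and 1970-01-01 (Unix epoch). */
#define NT_UNIX_EPOCH_DELTA ((uint64_t)(369 * 365 + 89) * 24 * 3600 * 10000000)

static int64_t nt_to_unix_seconds(uint64_t nt_ticks)
{
	/* Shift to the Unix epoch, then drop the sub-second part. */
	return ((int64_t)nt_ticks - (int64_t)NT_UNIX_EPOCH_DELTA) / 10000000;
}

static uint64_t unix_to_nt_ticks(int64_t unix_seconds)
{
	/* Scale to 100ns ticks, then shift back to the NT epoch. */
	return (uint64_t)unix_seconds * 10000000 + NT_UNIX_EPOCH_DELTA;
}

int main(void)
{
	/* The Unix epoch lies 116444736000000000 ticks after the NT epoch. */
	printf("epoch delta ticks = %llu\n",
	       (unsigned long long)NT_UNIX_EPOCH_DELTA);
	printf("nt(delta) -> unix %lld (expect 0)\n",
	       (long long)nt_to_unix_seconds(NT_UNIX_EPOCH_DELTA));
	printf("unix(0) -> nt %llu (expect the delta)\n",
	       (unsigned long long)unix_to_nt_ticks(0));
	return 0;
}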
// SPDX-License-Identifier: GPL-2.0
/*
 *
 * Copyright (C) 2019-2021 Paragon Software GmbH, All rights reserved.
 *
 */

#include <linux/blkdev.h>
#include <linux/buffer_head.h>
#include <linux/fs.h>
#include <linux/kernel.h>

#include "debug.h"
#include "ntfs.h"
#include "ntfs_fs.h"

static const struct INDEX_NAMES {
	const __le16 *name;
	u8 name_len;
} s_index_names[INDEX_MUTEX_TOTAL] = {
	{ I30_NAME, ARRAY_SIZE(I30_NAME) },
	{ SII_NAME, ARRAY_SIZE(SII_NAME) },
	{ SDH_NAME, ARRAY_SIZE(SDH_NAME) },
	{ SO_NAME, ARRAY_SIZE(SO_NAME) },
	{ SQ_NAME, ARRAY_SIZE(SQ_NAME) },
	{ SR_NAME, ARRAY_SIZE(SR_NAME) },
};

/*
 * cmp_fnames - Compare two names in index.
 *
 * if l1 != 0
 *   Both names are little endian on-disk ATTR_FILE_NAME structs.
* else * key1 - cpu_str, key2 - ATTR_FILE_NAME */ static int cmp_fnames(const void *key1, size_t l1, const void *key2, size_t l2, const void *data) { const struct ATTR_FILE_NAME *f2 = key2; const struct ntfs_sb_info *sbi = data; const struct ATTR_FILE_NAME *f1; u16 fsize2; bool both_case; if (l2 <= offsetof(struct ATTR_FILE_NAME, name)) return -1; fsize2 = fname_full_size(f2); if (l2 < fsize2) return -1; both_case = f2->type != FILE_NAME_DOS && !sbi->options->nocase; if (!l1) { const struct le_str *s2 = (struct le_str *)&f2->name_len; /* * If names are equal (case insensitive) * try to compare it case sensitive. */ return ntfs_cmp_names_cpu(key1, s2, sbi->upcase, both_case); } f1 = key1; return ntfs_cmp_names(f1->name, f1->name_len, f2->name, f2->name_len, sbi->upcase, both_case); } /* * cmp_uint - $SII of $Secure and $Q of Quota */ static int cmp_uint(const void *key1, size_t l1, const void *key2, size_t l2, const void *data) { const u32 *k1 = key1; const u32 *k2 = key2; if (l2 < sizeof(u32)) return -1; if (*k1 < *k2) return -1; if (*k1 > *k2) return 1; return 0; } /* * cmp_sdh - $SDH of $Secure */ static int cmp_sdh(const void *key1, size_t l1, const void *key2, size_t l2, const void *data) { const struct SECURITY_KEY *k1 = key1; const struct SECURITY_KEY *k2 = key2; u32 t1, t2; if (l2 < sizeof(struct SECURITY_KEY)) return -1; t1 = le32_to_cpu(k1->hash); t2 = le32_to_cpu(k2->hash); /* First value is a hash value itself. */ if (t1 < t2) return -1; if (t1 > t2) return 1; /* Second value is security Id. */ if (data) { t1 = le32_to_cpu(k1->sec_id); t2 = le32_to_cpu(k2->sec_id); if (t1 < t2) return -1; if (t1 > t2) return 1; } return 0; } /* * cmp_uints - $O of ObjId and "$R" for Reparse. */ static int cmp_uints(const void *key1, size_t l1, const void *key2, size_t l2, const void *data) { const __le32 *k1 = key1; const __le32 *k2 = key2; size_t count; if ((size_t)data == 1) { /* * ni_delete_all -> ntfs_remove_reparse -> * delete all with this reference. 
* k1, k2 - pointers to REPARSE_KEY */ k1 += 1; // Skip REPARSE_KEY.ReparseTag k2 += 1; // Skip REPARSE_KEY.ReparseTag if (l2 <= sizeof(int)) return -1; l2 -= sizeof(int); if (l1 <= sizeof(int)) return 1; l1 -= sizeof(int); } if (l2 < sizeof(int)) return -1; for (count = min(l1, l2) >> 2; count > 0; --count, ++k1, ++k2) { u32 t1 = le32_to_cpu(*k1); u32 t2 = le32_to_cpu(*k2); if (t1 > t2) return 1; if (t1 < t2) return -1; } if (l1 > l2) return 1; if (l1 < l2) return -1; return 0; } static inline NTFS_CMP_FUNC get_cmp_func(const struct INDEX_ROOT *root) { switch (root->type) { case ATTR_NAME: if (root->rule == NTFS_COLLATION_TYPE_FILENAME) return &cmp_fnames; break; case ATTR_ZERO: switch (root->rule) { case NTFS_COLLATION_TYPE_UINT: return &cmp_uint; case NTFS_COLLATION_TYPE_SECURITY_HASH: return &cmp_sdh; case NTFS_COLLATION_TYPE_UINTS: return &cmp_uints; default: break; } break; default: break; } return NULL; } struct bmp_buf { struct ATTRIB *b; struct mft_inode *mi; struct buffer_head *bh; ulong *buf; size_t bit; u32 nbits; u64 new_valid; }; static int bmp_buf_get(struct ntfs_index *indx, struct ntfs_inode *ni, size_t bit, struct bmp_buf *bbuf) { struct ATTRIB *b; size_t data_size, valid_size, vbo, off = bit >> 3; struct ntfs_sb_info *sbi = ni->mi.sbi; CLST vcn = off >> sbi->cluster_bits; struct ATTR_LIST_ENTRY *le = NULL; struct buffer_head *bh; struct super_block *sb; u32 blocksize; const struct INDEX_NAMES *in = &s_index_names[indx->type]; bbuf->bh = NULL; b = ni_find_attr(ni, NULL, &le, ATTR_BITMAP, in->name, in->name_len, &vcn, &bbuf->mi); bbuf->b = b; if (!b) return -EINVAL; if (!b->non_res) { data_size = le32_to_cpu(b->res.data_size); if (off >= data_size) return -EINVAL; bbuf->buf = (ulong *)resident_data(b); bbuf->bit = 0; bbuf->nbits = data_size * 8; return 0; } data_size = le64_to_cpu(b->nres.data_size); if (WARN_ON(off >= data_size)) { /* Looks like filesystem error. */ return -EINVAL; } valid_size = le64_to_cpu(b->nres.valid_size); bh = ntfs_bread_run(sbi, &indx->bitmap_run, off); if (!bh) return -EIO; if (IS_ERR(bh)) return PTR_ERR(bh); bbuf->bh = bh; if (buffer_locked(bh)) __wait_on_buffer(bh); lock_buffer(bh); sb = sbi->sb; blocksize = sb->s_blocksize; vbo = off & ~(size_t)sbi->block_mask; bbuf->new_valid = vbo + blocksize; if (bbuf->new_valid <= valid_size) bbuf->new_valid = 0; else if (bbuf->new_valid > data_size) bbuf->new_valid = data_size; if (vbo >= valid_size) { memset(bh->b_data, 0, blocksize); } else if (vbo + blocksize > valid_size) { u32 voff = valid_size & sbi->block_mask; memset(bh->b_data + voff, 0, blocksize - voff); } bbuf->buf = (ulong *)bh->b_data; bbuf->bit = 8 * (off & ~(size_t)sbi->block_mask); bbuf->nbits = 8 * blocksize; return 0; } static void bmp_buf_put(struct bmp_buf *bbuf, bool dirty) { struct buffer_head *bh = bbuf->bh; struct ATTRIB *b = bbuf->b; if (!bh) { if (b && !b->non_res && dirty) bbuf->mi->dirty = true; return; } if (!dirty) goto out; if (bbuf->new_valid) { b->nres.valid_size = cpu_to_le64(bbuf->new_valid); bbuf->mi->dirty = true; } set_buffer_uptodate(bh); mark_buffer_dirty(bh); out: unlock_buffer(bh); put_bh(bh); } /* * indx_mark_used - Mark the bit @bit as used. */ static int indx_mark_used(struct ntfs_index *indx, struct ntfs_inode *ni, size_t bit) { int err; struct bmp_buf bbuf; err = bmp_buf_get(indx, ni, bit, &bbuf); if (err) return err; __set_bit_le(bit - bbuf.bit, bbuf.buf); bmp_buf_put(&bbuf, true); return 0; } /* * indx_mark_free - Mark the bit @bit as free. 
*/ static int indx_mark_free(struct ntfs_index *indx, struct ntfs_inode *ni, size_t bit) { int err; struct bmp_buf bbuf; err = bmp_buf_get(indx, ni, bit, &bbuf); if (err) return err; __clear_bit_le(bit - bbuf.bit, bbuf.buf); bmp_buf_put(&bbuf, true); return 0; } /* * scan_nres_bitmap * * If ntfs_readdir calls this function (indx_used_bit -> scan_nres_bitmap), * inode is shared locked and no ni_lock. * Use rw_semaphore for read/write access to bitmap_run. */ static int scan_nres_bitmap(struct ntfs_inode *ni, struct ATTRIB *bitmap, struct ntfs_index *indx, size_t from, bool (*fn)(const ulong *buf, u32 bit, u32 bits, size_t *ret), size_t *ret) { struct ntfs_sb_info *sbi = ni->mi.sbi; struct super_block *sb = sbi->sb; struct runs_tree *run = &indx->bitmap_run; struct rw_semaphore *lock = &indx->run_lock; u32 nbits = sb->s_blocksize * 8; u32 blocksize = sb->s_blocksize; u64 valid_size = le64_to_cpu(bitmap->nres.valid_size); u64 data_size = le64_to_cpu(bitmap->nres.data_size); sector_t eblock = bytes_to_block(sb, data_size); size_t vbo = from >> 3; sector_t blk = (vbo & sbi->cluster_mask) >> sb->s_blocksize_bits; sector_t vblock = vbo >> sb->s_blocksize_bits; sector_t blen, block; CLST lcn, clen, vcn, vcn_next; size_t idx; struct buffer_head *bh; bool ok; *ret = MINUS_ONE_T; if (vblock >= eblock) return 0; from &= nbits - 1; vcn = vbo >> sbi->cluster_bits; down_read(lock); ok = run_lookup_entry(run, vcn, &lcn, &clen, &idx); up_read(lock); next_run: if (!ok) { int err; const struct INDEX_NAMES *name = &s_index_names[indx->type]; down_write(lock); err = attr_load_runs_vcn(ni, ATTR_BITMAP, name->name, name->name_len, run, vcn); up_write(lock); if (err) return err; down_read(lock); ok = run_lookup_entry(run, vcn, &lcn, &clen, &idx); up_read(lock); if (!ok) return -EINVAL; } blen = (sector_t)clen * sbi->blocks_per_cluster; block = (sector_t)lcn * sbi->blocks_per_cluster; for (; blk < blen; blk++, from = 0) { bh = ntfs_bread(sb, block + blk); if (!bh) return -EIO; vbo = (u64)vblock << sb->s_blocksize_bits; if (vbo >= valid_size) { memset(bh->b_data, 0, blocksize); } else if (vbo + blocksize > valid_size) { u32 voff = valid_size & sbi->block_mask; memset(bh->b_data + voff, 0, blocksize - voff); } if (vbo + blocksize > data_size) nbits = 8 * (data_size - vbo); ok = nbits > from ? (*fn)((ulong *)bh->b_data, from, nbits, ret) : false; put_bh(bh); if (ok) { *ret += 8 * vbo; return 0; } if (++vblock >= eblock) { *ret = MINUS_ONE_T; return 0; } } blk = 0; vcn_next = vcn + clen; down_read(lock); ok = run_get_entry(run, ++idx, &vcn, &lcn, &clen) && vcn == vcn_next; if (!ok) vcn = vcn_next; up_read(lock); goto next_run; } static bool scan_for_free(const ulong *buf, u32 bit, u32 bits, size_t *ret) { size_t pos = find_next_zero_bit_le(buf, bits, bit); if (pos >= bits) return false; *ret = pos; return true; } /* * indx_find_free - Look for free bit. * * Return: -1 if no free bits. 
*/ static int indx_find_free(struct ntfs_index *indx, struct ntfs_inode *ni, size_t *bit, struct ATTRIB **bitmap) { struct ATTRIB *b; struct ATTR_LIST_ENTRY *le = NULL; const struct INDEX_NAMES *in = &s_index_names[indx->type]; int err; b = ni_find_attr(ni, NULL, &le, ATTR_BITMAP, in->name, in->name_len, NULL, NULL); if (!b) return -ENOENT; *bitmap = b; *bit = MINUS_ONE_T; if (!b->non_res) { u32 nbits = 8 * le32_to_cpu(b->res.data_size); size_t pos = find_next_zero_bit_le(resident_data(b), nbits, 0); if (pos < nbits) *bit = pos; } else { err = scan_nres_bitmap(ni, b, indx, 0, &scan_for_free, bit); if (err) return err; } return 0; } static bool scan_for_used(const ulong *buf, u32 bit, u32 bits, size_t *ret) { size_t pos = find_next_bit_le(buf, bits, bit); if (pos >= bits) return false; *ret = pos; return true; } /* * indx_used_bit - Look for used bit. * * Return: MINUS_ONE_T if no used bits. */ int indx_used_bit(struct ntfs_index *indx, struct ntfs_inode *ni, size_t *bit) { struct ATTRIB *b; struct ATTR_LIST_ENTRY *le = NULL; size_t from = *bit; const struct INDEX_NAMES *in = &s_index_names[indx->type]; int err; b = ni_find_attr(ni, NULL, &le, ATTR_BITMAP, in->name, in->name_len, NULL, NULL); if (!b) return -ENOENT; *bit = MINUS_ONE_T; if (!b->non_res) { u32 nbits = le32_to_cpu(b->res.data_size) * 8; size_t pos = find_next_bit_le(resident_data(b), nbits, from); if (pos < nbits) *bit = pos; } else { err = scan_nres_bitmap(ni, b, indx, from, &scan_for_used, bit); if (err) return err; } return 0; } /* * hdr_find_split * * Find a point at which the index allocation buffer would like to be split. * NOTE: This function should never return 'END' entry NULL returns on error. */ static const struct NTFS_DE *hdr_find_split(const struct INDEX_HDR *hdr) { size_t o; const struct NTFS_DE *e = hdr_first_de(hdr); u32 used_2 = le32_to_cpu(hdr->used) >> 1; u16 esize; if (!e || de_is_last(e)) return NULL; esize = le16_to_cpu(e->size); for (o = le32_to_cpu(hdr->de_off) + esize; o < used_2; o += esize) { const struct NTFS_DE *p = e; e = Add2Ptr(hdr, o); /* We must not return END entry. */ if (de_is_last(e)) return p; esize = le16_to_cpu(e->size); } return e; } /* * hdr_insert_head - Insert some entries at the beginning of the buffer. * * It is used to insert entries into a newly-created buffer. */ static const struct NTFS_DE *hdr_insert_head(struct INDEX_HDR *hdr, const void *ins, u32 ins_bytes) { u32 to_move; struct NTFS_DE *e = hdr_first_de(hdr); u32 used = le32_to_cpu(hdr->used); if (!e) return NULL; /* Now we just make room for the inserted entries and jam it in. */ to_move = used - le32_to_cpu(hdr->de_off); memmove(Add2Ptr(e, ins_bytes), e, to_move); memcpy(e, ins, ins_bytes); hdr->used = cpu_to_le32(used + ins_bytes); return e; } /* * index_hdr_check * * return true if INDEX_HDR is valid */ static bool index_hdr_check(const struct INDEX_HDR *hdr, u32 bytes) { u32 end = le32_to_cpu(hdr->used); u32 tot = le32_to_cpu(hdr->total); u32 off = le32_to_cpu(hdr->de_off); if (!IS_ALIGNED(off, 8) || tot > bytes || end > tot || size_add(off, sizeof(struct NTFS_DE)) > end) { /* incorrect index buffer. 
*/ return false; } return true; } /* * index_buf_check * * return true if INDEX_BUFFER seems is valid */ static bool index_buf_check(const struct INDEX_BUFFER *ib, u32 bytes, const CLST *vbn) { const struct NTFS_RECORD_HEADER *rhdr = &ib->rhdr; u16 fo = le16_to_cpu(rhdr->fix_off); u16 fn = le16_to_cpu(rhdr->fix_num); if (bytes <= offsetof(struct INDEX_BUFFER, ihdr) || rhdr->sign != NTFS_INDX_SIGNATURE || fo < sizeof(struct INDEX_BUFFER) /* Check index buffer vbn. */ || (vbn && *vbn != le64_to_cpu(ib->vbn)) || (fo % sizeof(short)) || fo + fn * sizeof(short) >= bytes || fn != ((bytes >> SECTOR_SHIFT) + 1)) { /* incorrect index buffer. */ return false; } return index_hdr_check(&ib->ihdr, bytes - offsetof(struct INDEX_BUFFER, ihdr)); } void fnd_clear(struct ntfs_fnd *fnd) { int i; for (i = fnd->level - 1; i >= 0; i--) { struct indx_node *n = fnd->nodes[i]; if (!n) continue; put_indx_node(n); fnd->nodes[i] = NULL; } fnd->level = 0; fnd->root_de = NULL; } static int fnd_push(struct ntfs_fnd *fnd, struct indx_node *n, struct NTFS_DE *e) { int i = fnd->level; if (i < 0 || i >= ARRAY_SIZE(fnd->nodes)) return -EINVAL; fnd->nodes[i] = n; fnd->de[i] = e; fnd->level += 1; return 0; } static struct indx_node *fnd_pop(struct ntfs_fnd *fnd) { struct indx_node *n; int i = fnd->level; i -= 1; n = fnd->nodes[i]; fnd->nodes[i] = NULL; fnd->level = i; return n; } static bool fnd_is_empty(struct ntfs_fnd *fnd) { if (!fnd->level) return !fnd->root_de; return !fnd->de[fnd->level - 1]; } /* * hdr_find_e - Locate an entry the index buffer. * * If no matching entry is found, it returns the first entry which is greater * than the desired entry If the search key is greater than all the entries the * buffer, it returns the 'end' entry. This function does a binary search of the * current index buffer, for the first entry that is <= to the search value. * * Return: NULL if error. */ static struct NTFS_DE *hdr_find_e(const struct ntfs_index *indx, const struct INDEX_HDR *hdr, const void *key, size_t key_len, const void *ctx, int *diff) { struct NTFS_DE *e, *found = NULL; NTFS_CMP_FUNC cmp = indx->cmp; int min_idx = 0, mid_idx, max_idx = 0; int diff2; int table_size = 8; u32 e_size, e_key_len; u32 end = le32_to_cpu(hdr->used); u32 off = le32_to_cpu(hdr->de_off); u32 total = le32_to_cpu(hdr->total); u16 offs[128]; if (unlikely(!cmp)) return NULL; fill_table: if (end > total) return NULL; if (size_add(off, sizeof(struct NTFS_DE)) > end) return NULL; e = Add2Ptr(hdr, off); e_size = le16_to_cpu(e->size); if (e_size < sizeof(struct NTFS_DE) || off + e_size > end) return NULL; if (!de_is_last(e)) { offs[max_idx] = off; off += e_size; max_idx++; if (max_idx < table_size) goto fill_table; max_idx--; } binary_search: e_key_len = le16_to_cpu(e->key_size); diff2 = (*cmp)(key, key_len, e + 1, e_key_len, ctx); if (diff2 > 0) { if (found) { min_idx = mid_idx + 1; } else { if (de_is_last(e)) return NULL; max_idx = 0; table_size = min(table_size * 2, (int)ARRAY_SIZE(offs)); goto fill_table; } } else if (diff2 < 0) { if (found) max_idx = mid_idx - 1; else max_idx--; found = e; } else { *diff = 0; return e; } if (min_idx > max_idx) { *diff = -1; return found; } mid_idx = (min_idx + max_idx) >> 1; e = Add2Ptr(hdr, offs[mid_idx]); goto binary_search; } /* * hdr_insert_de - Insert an index entry into the buffer. * * 'before' should be a pointer previously returned from hdr_find_e. 
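 *
 * Hedged usage sketch (assumed caller pattern, not copied from this file):
 *
 *	e = hdr_find_e(indx, hdr, key, key_len, ctx, &diff);
 *	if (e && diff)				// no exact match yet
 *		hdr_insert_de(indx, hdr, new_de, e, ctx);
 *
 * Passing before == NULL is also allowed; in that case hdr_insert_de()
 * repeats the hdr_find_e() lookup internally before making room.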
*/ static struct NTFS_DE *hdr_insert_de(const struct ntfs_index *indx, struct INDEX_HDR *hdr, const struct NTFS_DE *de, struct NTFS_DE *before, const void *ctx) { int diff; size_t off = PtrOffset(hdr, before); u32 used = le32_to_cpu(hdr->used); u32 total = le32_to_cpu(hdr->total); u16 de_size = le16_to_cpu(de->size); /* First, check to see if there's enough room. */ if (used + de_size > total) return NULL; /* We know there's enough space, so we know we'll succeed. */ if (before) { /* Check that before is inside Index. */ if (off >= used || off < le32_to_cpu(hdr->de_off) || off + le16_to_cpu(before->size) > total) { return NULL; } goto ok; } /* No insert point is applied. Get it manually. */ before = hdr_find_e(indx, hdr, de + 1, le16_to_cpu(de->key_size), ctx, &diff); if (!before) return NULL; off = PtrOffset(hdr, before); ok: /* Now we just make room for the entry and jam it in. */ memmove(Add2Ptr(before, de_size), before, used - off); hdr->used = cpu_to_le32(used + de_size); memcpy(before, de, de_size); return before; } /* * hdr_delete_de - Remove an entry from the index buffer. */ static inline struct NTFS_DE *hdr_delete_de(struct INDEX_HDR *hdr, struct NTFS_DE *re) { u32 used = le32_to_cpu(hdr->used); u16 esize = le16_to_cpu(re->size); u32 off = PtrOffset(hdr, re); int bytes = used - (off + esize); /* check INDEX_HDR valid before using INDEX_HDR */ if (!check_index_header(hdr, le32_to_cpu(hdr->total))) return NULL; if (off >= used || esize < sizeof(struct NTFS_DE) || bytes < sizeof(struct NTFS_DE)) return NULL; hdr->used = cpu_to_le32(used - esize); memmove(re, Add2Ptr(re, esize), bytes); return re; } void indx_clear(struct ntfs_index *indx) { run_close(&indx->alloc_run); run_close(&indx->bitmap_run); } int indx_init(struct ntfs_index *indx, struct ntfs_sb_info *sbi, const struct ATTRIB *attr, enum index_mutex_classed type) { u32 t32; const struct INDEX_ROOT *root = resident_data(attr); t32 = le32_to_cpu(attr->res.data_size); if (t32 <= offsetof(struct INDEX_ROOT, ihdr) || !index_hdr_check(&root->ihdr, t32 - offsetof(struct INDEX_ROOT, ihdr))) { goto out; } /* Check root fields. */ if (!root->index_block_clst) goto out; indx->type = type; indx->idx2vbn_bits = __ffs(root->index_block_clst); t32 = le32_to_cpu(root->index_block_size); indx->index_bits = blksize_bits(t32); /* Check index record size. */ if (t32 < sbi->cluster_size) { /* Index record is smaller than a cluster, use 512 blocks. */ if (t32 != root->index_block_clst * SECTOR_SIZE) goto out; /* Check alignment to a cluster. */ if ((sbi->cluster_size >> SECTOR_SHIFT) & (root->index_block_clst - 1)) { goto out; } indx->vbn2vbo_bits = SECTOR_SHIFT; } else { /* Index record must be a multiple of cluster size. 
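 *
 * Hedged numeric example of the two geometries handled here: with 4K
 * clusters and a 4K index record, index_block_clst is 1, the check below
 * sees 4096 == 1 << cluster_bits and vbn2vbo_bits becomes cluster_bits.
 * With 64K clusters and a 4K index record the branch above is taken
 * instead: index_block_clst counts 512-byte sectors (8 in that case) and
 * vbn2vbo_bits is SECTOR_SHIFT.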
*/ if (t32 != root->index_block_clst << sbi->cluster_bits) goto out; indx->vbn2vbo_bits = sbi->cluster_bits; } init_rwsem(&indx->run_lock); indx->cmp = get_cmp_func(root); if (!indx->cmp) goto out; return 0; out: ntfs_set_state(sbi, NTFS_DIRTY_DIRTY); return -EINVAL; } static struct indx_node *indx_new(struct ntfs_index *indx, struct ntfs_inode *ni, CLST vbn, const __le64 *sub_vbn) { int err; struct NTFS_DE *e; struct indx_node *r; struct INDEX_HDR *hdr; struct INDEX_BUFFER *index; u64 vbo = (u64)vbn << indx->vbn2vbo_bits; u32 bytes = 1u << indx->index_bits; u16 fn; u32 eo; r = kzalloc(sizeof(struct indx_node), GFP_NOFS); if (!r) return ERR_PTR(-ENOMEM); index = kzalloc(bytes, GFP_NOFS); if (!index) { kfree(r); return ERR_PTR(-ENOMEM); } err = ntfs_get_bh(ni->mi.sbi, &indx->alloc_run, vbo, bytes, &r->nb); if (err) { kfree(index); kfree(r); return ERR_PTR(err); } /* Create header. */ index->rhdr.sign = NTFS_INDX_SIGNATURE; index->rhdr.fix_off = cpu_to_le16(sizeof(struct INDEX_BUFFER)); // 0x28 fn = (bytes >> SECTOR_SHIFT) + 1; // 9 index->rhdr.fix_num = cpu_to_le16(fn); index->vbn = cpu_to_le64(vbn); hdr = &index->ihdr; eo = ALIGN(sizeof(struct INDEX_BUFFER) + fn * sizeof(short), 8); hdr->de_off = cpu_to_le32(eo); e = Add2Ptr(hdr, eo); if (sub_vbn) { e->flags = NTFS_IE_LAST | NTFS_IE_HAS_SUBNODES; e->size = cpu_to_le16(sizeof(struct NTFS_DE) + sizeof(u64)); hdr->used = cpu_to_le32(eo + sizeof(struct NTFS_DE) + sizeof(u64)); de_set_vbn_le(e, *sub_vbn); hdr->flags = NTFS_INDEX_HDR_HAS_SUBNODES; } else { e->size = cpu_to_le16(sizeof(struct NTFS_DE)); hdr->used = cpu_to_le32(eo + sizeof(struct NTFS_DE)); e->flags = NTFS_IE_LAST; } hdr->total = cpu_to_le32(bytes - offsetof(struct INDEX_BUFFER, ihdr)); r->index = index; return r; } struct INDEX_ROOT *indx_get_root(struct ntfs_index *indx, struct ntfs_inode *ni, struct ATTRIB **attr, struct mft_inode **mi) { struct ATTR_LIST_ENTRY *le = NULL; struct ATTRIB *a; const struct INDEX_NAMES *in = &s_index_names[indx->type]; struct INDEX_ROOT *root; a = ni_find_attr(ni, NULL, &le, ATTR_ROOT, in->name, in->name_len, NULL, mi); if (!a) return NULL; if (attr) *attr = a; root = resident_data_ex(a, sizeof(struct INDEX_ROOT)); /* length check */ if (root && offsetof(struct INDEX_ROOT, ihdr) + le32_to_cpu(root->ihdr.used) > le32_to_cpu(a->res.data_size)) { return NULL; } return root; } static int indx_write(struct ntfs_index *indx, struct ntfs_inode *ni, struct indx_node *node, int sync) { struct INDEX_BUFFER *ib = node->index; return ntfs_write_bh(ni->mi.sbi, &ib->rhdr, &node->nb, sync); } /* * indx_read * * If ntfs_readdir calls this function * inode is shared locked and no ni_lock. * Use rw_semaphore for read/write access to alloc_run. 
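 *
 * Hedged sketch of the locking pattern used below (readers may race with
 * ntfs_readdir, writers only extend the cached run):
 *
 *	down_read(lock);			// try the cached alloc_run
 *	err = ntfs_read_bh(...);
 *	up_read(lock);
 *	if (err == -ENOENT) {
 *		down_write(lock);		// load the missing run range
 *		err = attr_load_runs_range(...);
 *		up_write(lock);
 *		// then retry the read under the read lock
 *	}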
*/ int indx_read(struct ntfs_index *indx, struct ntfs_inode *ni, CLST vbn, struct indx_node **node) { int err; struct INDEX_BUFFER *ib; struct runs_tree *run = &indx->alloc_run; struct rw_semaphore *lock = &indx->run_lock; u64 vbo = (u64)vbn << indx->vbn2vbo_bits; u32 bytes = 1u << indx->index_bits; struct indx_node *in = *node; const struct INDEX_NAMES *name; if (!in) { in = kzalloc(sizeof(struct indx_node), GFP_NOFS); if (!in) return -ENOMEM; } else { nb_put(&in->nb); } ib = in->index; if (!ib) { ib = kmalloc(bytes, GFP_NOFS); if (!ib) { err = -ENOMEM; goto out; } } down_read(lock); err = ntfs_read_bh(ni->mi.sbi, run, vbo, &ib->rhdr, bytes, &in->nb); up_read(lock); if (!err) goto ok; if (err == -E_NTFS_FIXUP) goto ok; if (err != -ENOENT) goto out; name = &s_index_names[indx->type]; down_write(lock); err = attr_load_runs_range(ni, ATTR_ALLOC, name->name, name->name_len, run, vbo, vbo + bytes); up_write(lock); if (err) goto out; down_read(lock); err = ntfs_read_bh(ni->mi.sbi, run, vbo, &ib->rhdr, bytes, &in->nb); up_read(lock); if (err == -E_NTFS_FIXUP) goto ok; if (err) goto out; ok: if (!index_buf_check(ib, bytes, &vbn)) { _ntfs_bad_inode(&ni->vfs_inode); err = -EINVAL; goto out; } if (err == -E_NTFS_FIXUP) { ntfs_write_bh(ni->mi.sbi, &ib->rhdr, &in->nb, 0); err = 0; } /* check for index header length */ if (offsetof(struct INDEX_BUFFER, ihdr) + le32_to_cpu(ib->ihdr.used) > bytes) { err = -EINVAL; goto out; } in->index = ib; *node = in; out: if (err == -E_NTFS_CORRUPT) { _ntfs_bad_inode(&ni->vfs_inode); err = -EINVAL; } if (ib != in->index) kfree(ib); if (*node != in) { nb_put(&in->nb); kfree(in); } return err; } /* * indx_find - Scan NTFS directory for given entry. */ int indx_find(struct ntfs_index *indx, struct ntfs_inode *ni, const struct INDEX_ROOT *root, const void *key, size_t key_len, const void *ctx, int *diff, struct NTFS_DE **entry, struct ntfs_fnd *fnd) { int err; struct NTFS_DE *e; struct indx_node *node; if (!root) root = indx_get_root(&ni->dir, ni, NULL, NULL); if (!root) { /* Should not happen. */ return -EINVAL; } /* Check cache. */ e = fnd->level ? fnd->de[fnd->level - 1] : fnd->root_de; if (e && !de_is_last(e) && !(*indx->cmp)(key, key_len, e + 1, le16_to_cpu(e->key_size), ctx)) { *entry = e; *diff = 0; return 0; } /* Soft finder reset. */ fnd_clear(fnd); /* Lookup entry that is <= to the search value. */ e = hdr_find_e(indx, &root->ihdr, key, key_len, ctx, diff); if (!e) return -EINVAL; fnd->root_de = e; for (;;) { node = NULL; if (*diff >= 0 || !de_has_vcn_ex(e)) break; /* Read next level. */ err = indx_read(indx, ni, de_get_vbn(e), &node); if (err) { /* io error? */ return err; } /* Lookup entry that is <= to the search value. */ e = hdr_find_e(indx, &node->index->ihdr, key, key_len, ctx, diff); if (!e) { put_indx_node(node); return -EINVAL; } fnd_push(fnd, node, e); } *entry = e; return 0; } int indx_find_sort(struct ntfs_index *indx, struct ntfs_inode *ni, const struct INDEX_ROOT *root, struct NTFS_DE **entry, struct ntfs_fnd *fnd) { int err; struct indx_node *n = NULL; struct NTFS_DE *e; size_t iter = 0; int level = fnd->level; if (!*entry) { /* Start find. 
*/ e = hdr_first_de(&root->ihdr); if (!e) return 0; fnd_clear(fnd); fnd->root_de = e; } else if (!level) { if (de_is_last(fnd->root_de)) { *entry = NULL; return 0; } e = hdr_next_de(&root->ihdr, fnd->root_de); if (!e) return -EINVAL; fnd->root_de = e; } else { n = fnd->nodes[level - 1]; e = fnd->de[level - 1]; if (de_is_last(e)) goto pop_level; e = hdr_next_de(&n->index->ihdr, e); if (!e) return -EINVAL; fnd->de[level - 1] = e; } /* Just to avoid tree cycle. */ next_iter: if (iter++ >= 1000) return -EINVAL; while (de_has_vcn_ex(e)) { if (le16_to_cpu(e->size) < sizeof(struct NTFS_DE) + sizeof(u64)) { if (n) { fnd_pop(fnd); kfree(n); } return -EINVAL; } /* Read next level. */ err = indx_read(indx, ni, de_get_vbn(e), &n); if (err) return err; /* Try next level. */ e = hdr_first_de(&n->index->ihdr); if (!e) { kfree(n); return -EINVAL; } fnd_push(fnd, n, e); } if (le16_to_cpu(e->size) > sizeof(struct NTFS_DE)) { *entry = e; return 0; } pop_level: for (;;) { if (!de_is_last(e)) goto next_iter; /* Pop one level. */ if (n) { fnd_pop(fnd); kfree(n); } level = fnd->level; if (level) { n = fnd->nodes[level - 1]; e = fnd->de[level - 1]; } else if (fnd->root_de) { n = NULL; e = fnd->root_de; fnd->root_de = NULL; } else { *entry = NULL; return 0; } if (le16_to_cpu(e->size) > sizeof(struct NTFS_DE)) { *entry = e; if (!fnd->root_de) fnd->root_de = e; return 0; } } } int indx_find_raw(struct ntfs_index *indx, struct ntfs_inode *ni, const struct INDEX_ROOT *root, struct NTFS_DE **entry, size_t *off, struct ntfs_fnd *fnd) { int err; struct indx_node *n = NULL; struct NTFS_DE *e = NULL; struct NTFS_DE *e2; size_t bit; CLST next_used_vbn; CLST next_vbn; u32 record_size = ni->mi.sbi->record_size; /* Use non sorted algorithm. */ if (!*entry) { /* This is the first call. */ e = hdr_first_de(&root->ihdr); if (!e) return 0; fnd_clear(fnd); fnd->root_de = e; /* The first call with setup of initial element. */ if (*off >= record_size) { next_vbn = (((*off - record_size) >> indx->index_bits)) << indx->idx2vbn_bits; /* Jump inside cycle 'for'. */ goto next; } /* Start enumeration from root. */ *off = 0; } else if (!fnd->root_de) return -EINVAL; for (;;) { /* Check if current entry can be used. */ if (e && le16_to_cpu(e->size) > sizeof(struct NTFS_DE)) goto ok; if (!fnd->level) { /* Continue to enumerate root. */ if (!de_is_last(fnd->root_de)) { e = hdr_next_de(&root->ihdr, fnd->root_de); if (!e) return -EINVAL; fnd->root_de = e; continue; } /* Start to enumerate indexes from 0. */ next_vbn = 0; } else { /* Continue to enumerate indexes. */ e2 = fnd->de[fnd->level - 1]; n = fnd->nodes[fnd->level - 1]; if (!de_is_last(e2)) { e = hdr_next_de(&n->index->ihdr, e2); if (!e) return -EINVAL; fnd->de[fnd->level - 1] = e; continue; } /* Continue with next index. */ next_vbn = le64_to_cpu(n->index->vbn) + root->index_block_clst; } next: /* Release current index. */ if (n) { fnd_pop(fnd); put_indx_node(n); n = NULL; } /* Skip all free indexes. */ bit = next_vbn >> indx->idx2vbn_bits; err = indx_used_bit(indx, ni, &bit); if (err == -ENOENT || bit == MINUS_ONE_T) { /* No used indexes. */ *entry = NULL; return 0; } next_used_vbn = bit << indx->idx2vbn_bits; /* Read buffer into memory. */ err = indx_read(indx, ni, next_used_vbn, &n); if (err) return err; e = hdr_first_de(&n->index->ihdr); fnd_push(fnd, n, e); if (!e) return -EINVAL; } ok: /* Return offset to restore enumerator if necessary. 
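 *
 * Hedged summary of the encoding below: offsets smaller than record_size
 * address the index root, larger ones address an index buffer:
 *
 *	root entry:	*off = PtrOffset(&root->ihdr, e);
 *	buffer entry:	*off = (vbn << indx->vbn2vbo_bits) + record_size
 *				+ PtrOffset(&n->index->ihdr, e);
 *
 * which is what the "*off >= record_size" test at the top of this function
 * undoes when the enumeration is resumed.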
*/ if (!n) { /* 'e' points in root, */ *off = PtrOffset(&root->ihdr, e); } else { /* 'e' points in index, */ *off = (le64_to_cpu(n->index->vbn) << indx->vbn2vbo_bits) + record_size + PtrOffset(&n->index->ihdr, e); } *entry = e; return 0; } /* * indx_create_allocate - Create "Allocation + Bitmap" attributes. */ static int indx_create_allocate(struct ntfs_index *indx, struct ntfs_inode *ni, CLST *vbn) { int err; struct ntfs_sb_info *sbi = ni->mi.sbi; struct ATTRIB *bitmap; struct ATTRIB *alloc; u32 data_size = 1u << indx->index_bits; u32 alloc_size = ntfs_up_cluster(sbi, data_size); CLST len = alloc_size >> sbi->cluster_bits; const struct INDEX_NAMES *in = &s_index_names[indx->type]; CLST alen; struct runs_tree run; run_init(&run); err = attr_allocate_clusters(sbi, &run, 0, 0, len, NULL, ALLOCATE_DEF, &alen, 0, NULL, NULL); if (err) goto out; err = ni_insert_nonresident(ni, ATTR_ALLOC, in->name, in->name_len, &run, 0, len, 0, &alloc, NULL, NULL); if (err) goto out1; alloc->nres.valid_size = alloc->nres.data_size = cpu_to_le64(data_size); err = ni_insert_resident(ni, ntfs3_bitmap_size(1), ATTR_BITMAP, in->name, in->name_len, &bitmap, NULL, NULL); if (err) goto out2; if (in->name == I30_NAME) { i_size_write(&ni->vfs_inode, data_size); inode_set_bytes(&ni->vfs_inode, alloc_size); } memcpy(&indx->alloc_run, &run, sizeof(run)); *vbn = 0; return 0; out2: mi_remove_attr(NULL, &ni->mi, alloc); out1: run_deallocate(sbi, &run, false); out: return err; } /* * indx_add_allocate - Add clusters to index. */ static int indx_add_allocate(struct ntfs_index *indx, struct ntfs_inode *ni, CLST *vbn) { int err; size_t bit; u64 data_size; u64 bmp_size, bmp_size_v; struct ATTRIB *bmp, *alloc; struct mft_inode *mi; const struct INDEX_NAMES *in = &s_index_names[indx->type]; err = indx_find_free(indx, ni, &bit, &bmp); if (err) goto out1; if (bit != MINUS_ONE_T) { bmp = NULL; } else { if (bmp->non_res) { bmp_size = le64_to_cpu(bmp->nres.data_size); bmp_size_v = le64_to_cpu(bmp->nres.valid_size); } else { bmp_size = bmp_size_v = le32_to_cpu(bmp->res.data_size); } /* * Index blocks exist, but $BITMAP has zero valid bits. * This implies an on-disk corruption and must be rejected. */ if (in->name == I30_NAME && unlikely(bmp_size_v == 0 && indx->alloc_run.count)) { err = -EINVAL; goto out1; } bit = bmp_size << 3; } data_size = (u64)(bit + 1) << indx->index_bits; if (bmp) { /* Increase bitmap. */ err = attr_set_size(ni, ATTR_BITMAP, in->name, in->name_len, &indx->bitmap_run, ntfs3_bitmap_size(bit + 1), NULL, true, NULL); if (err) goto out1; } alloc = ni_find_attr(ni, NULL, NULL, ATTR_ALLOC, in->name, in->name_len, NULL, &mi); if (!alloc) { err = -EINVAL; if (bmp) goto out2; goto out1; } if (data_size <= le64_to_cpu(alloc->nres.data_size)) { /* Reuse index. */ goto out; } /* Increase allocation. */ err = attr_set_size(ni, ATTR_ALLOC, in->name, in->name_len, &indx->alloc_run, data_size, &data_size, true, NULL); if (err) { if (bmp) goto out2; goto out1; } if (in->name == I30_NAME) i_size_write(&ni->vfs_inode, data_size); out: *vbn = bit << indx->idx2vbn_bits; return 0; out2: /* Ops. No space? */ attr_set_size(ni, ATTR_BITMAP, in->name, in->name_len, &indx->bitmap_run, bmp_size, &bmp_size_v, false, NULL); out1: return err; } /* * indx_insert_into_root - Attempt to insert an entry into the index root. * * @undo - True if we undoing previous remove. * If necessary, it will twiddle the index b-tree. 
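 *
 * Rough outline of the path below (hedged summary; see the code for the
 * exact conditions):
 *   1. try to grow the resident root with mi_resize_attr() and insert the
 *      entry in place;
 *   2. otherwise move all non-end root entries into a freshly allocated
 *      index buffer (indx_create_allocate()/indx_add_allocate() + indx_new())
 *      and shrink the root to a single entry pointing at that buffer;
 *   3. insert the new entry into the new buffer, or re-run
 *      indx_insert_entry() if the new buffer cannot take it.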
*/ static int indx_insert_into_root(struct ntfs_index *indx, struct ntfs_inode *ni, const struct NTFS_DE *new_de, struct NTFS_DE *root_de, const void *ctx, struct ntfs_fnd *fnd, bool undo) { int err = 0; struct NTFS_DE *e, *e0, *re; struct mft_inode *mi; struct ATTRIB *attr; struct INDEX_HDR *hdr; struct indx_node *n; CLST new_vbn; __le64 *sub_vbn, t_vbn; u16 new_de_size; u32 hdr_used, hdr_total, asize, to_move; u32 root_size, new_root_size; struct ntfs_sb_info *sbi; int ds_root; struct INDEX_ROOT *root, *a_root; /* Get the record this root placed in. */ root = indx_get_root(indx, ni, &attr, &mi); if (!root) return -EINVAL; /* * Try easy case: * hdr_insert_de will succeed if there's * room the root for the new entry. */ hdr = &root->ihdr; sbi = ni->mi.sbi; new_de_size = le16_to_cpu(new_de->size); hdr_used = le32_to_cpu(hdr->used); hdr_total = le32_to_cpu(hdr->total); asize = le32_to_cpu(attr->size); root_size = le32_to_cpu(attr->res.data_size); ds_root = new_de_size + hdr_used - hdr_total; /* If 'undo' is set then reduce requirements. */ if ((undo || asize + ds_root < sbi->max_bytes_per_attr) && mi_resize_attr(mi, attr, ds_root)) { hdr->total = cpu_to_le32(hdr_total + ds_root); e = hdr_insert_de(indx, hdr, new_de, root_de, ctx); WARN_ON(!e); fnd_clear(fnd); fnd->root_de = e; return 0; } /* Make a copy of root attribute to restore if error. */ a_root = kmemdup(attr, asize, GFP_NOFS); if (!a_root) return -ENOMEM; /* * Copy all the non-end entries from * the index root to the new buffer. */ to_move = 0; e0 = hdr_first_de(hdr); /* Calculate the size to copy. */ for (e = e0;; e = hdr_next_de(hdr, e)) { if (!e) { err = -EINVAL; goto out_free_root; } if (de_is_last(e)) break; to_move += le16_to_cpu(e->size); } if (!to_move) { re = NULL; } else { re = kmemdup(e0, to_move, GFP_NOFS); if (!re) { err = -ENOMEM; goto out_free_root; } } sub_vbn = NULL; if (de_has_vcn(e)) { t_vbn = de_get_vbn_le(e); sub_vbn = &t_vbn; } new_root_size = sizeof(struct INDEX_ROOT) + sizeof(struct NTFS_DE) + sizeof(u64); ds_root = new_root_size - root_size; if (ds_root > 0 && asize + ds_root > sbi->max_bytes_per_attr) { /* Make root external. */ err = -EOPNOTSUPP; goto out_free_re; } if (ds_root) mi_resize_attr(mi, attr, ds_root); /* Fill first entry (vcn will be set later). */ e = (struct NTFS_DE *)(root + 1); memset(e, 0, sizeof(struct NTFS_DE)); e->size = cpu_to_le16(sizeof(struct NTFS_DE) + sizeof(u64)); e->flags = NTFS_IE_HAS_SUBNODES | NTFS_IE_LAST; hdr->flags = NTFS_INDEX_HDR_HAS_SUBNODES; hdr->used = hdr->total = cpu_to_le32(new_root_size - offsetof(struct INDEX_ROOT, ihdr)); fnd->root_de = hdr_first_de(hdr); mi->dirty = true; /* Create alloc and bitmap attributes (if not). */ err = run_is_empty(&indx->alloc_run) ? indx_create_allocate(indx, ni, &new_vbn) : indx_add_allocate(indx, ni, &new_vbn); /* Layout of record may be changed, so rescan root. */ root = indx_get_root(indx, ni, &attr, &mi); if (!root) { /* Bug? */ ntfs_set_state(sbi, NTFS_DIRTY_ERROR); err = -EINVAL; goto out_free_re; } if (err) { /* Restore root. */ if (mi_resize_attr(mi, attr, -ds_root)) { memcpy(attr, a_root, asize); } else { /* Bug? */ ntfs_set_state(sbi, NTFS_DIRTY_ERROR); } goto out_free_re; } e = (struct NTFS_DE *)(root + 1); *(__le64 *)(e + 1) = cpu_to_le64(new_vbn); mi->dirty = true; /* Now we can create/format the new buffer and copy the entries into. 
*/ n = indx_new(indx, ni, new_vbn, sub_vbn); if (IS_ERR(n)) { err = PTR_ERR(n); goto out_free_re; } hdr = &n->index->ihdr; hdr_used = le32_to_cpu(hdr->used); hdr_total = le32_to_cpu(hdr->total); /* Copy root entries into new buffer. */ hdr_insert_head(hdr, re, to_move); /* Update bitmap attribute. */ indx_mark_used(indx, ni, new_vbn >> indx->idx2vbn_bits); /* Check if we can insert new entry new index buffer. */ if (hdr_used + new_de_size > hdr_total) { /* * This occurs if MFT record is the same or bigger than index * buffer. Move all root new index and have no space to add * new entry classic case when MFT record is 1K and index * buffer 4K the problem should not occurs. */ kfree(re); indx_write(indx, ni, n, 0); put_indx_node(n); fnd_clear(fnd); err = indx_insert_entry(indx, ni, new_de, ctx, fnd, undo); goto out_free_root; } /* * Now root is a parent for new index buffer. * Insert NewEntry a new buffer. */ e = hdr_insert_de(indx, hdr, new_de, NULL, ctx); if (!e) { err = -EINVAL; goto out_put_n; } fnd_push(fnd, n, e); /* Just write updates index into disk. */ indx_write(indx, ni, n, 0); n = NULL; out_put_n: put_indx_node(n); out_free_re: kfree(re); out_free_root: kfree(a_root); return err; } /* * indx_insert_into_buffer * * Attempt to insert an entry into an Index Allocation Buffer. * If necessary, it will split the buffer. */ static int indx_insert_into_buffer(struct ntfs_index *indx, struct ntfs_inode *ni, struct INDEX_ROOT *root, const struct NTFS_DE *new_de, const void *ctx, int level, struct ntfs_fnd *fnd) { int err; const struct NTFS_DE *sp; struct NTFS_DE *e, *de_t, *up_e; struct indx_node *n2; struct indx_node *n1 = fnd->nodes[level]; struct INDEX_HDR *hdr1 = &n1->index->ihdr; struct INDEX_HDR *hdr2; u32 to_copy, used, used1; CLST new_vbn; __le64 t_vbn, *sub_vbn; u16 sp_size; void *hdr1_saved = NULL; /* Try the most easy case. */ e = fnd->level - 1 == level ? fnd->de[level] : NULL; e = hdr_insert_de(indx, hdr1, new_de, e, ctx); fnd->de[level] = e; if (e) { /* Just write updated index into disk. */ indx_write(indx, ni, n1, 0); return 0; } /* * No space to insert into buffer. Split it. * To split we: * - Save split point ('cause index buffers will be changed) * - Allocate NewBuffer and copy all entries <= sp into new buffer * - Remove all entries (sp including) from TargetBuffer * - Insert NewEntry into left or right buffer (depending on sp <=> * NewEntry) * - Insert sp into parent buffer (or root) * - Make sp a parent for new buffer */ sp = hdr_find_split(hdr1); if (!sp) return -EINVAL; sp_size = le16_to_cpu(sp->size); up_e = kmalloc(sp_size + sizeof(u64), GFP_NOFS); if (!up_e) return -ENOMEM; memcpy(up_e, sp, sp_size); used1 = le32_to_cpu(hdr1->used); hdr1_saved = kmemdup(hdr1, used1, GFP_NOFS); if (!hdr1_saved) { err = -ENOMEM; goto out; } if (!hdr1->flags) { up_e->flags |= NTFS_IE_HAS_SUBNODES; up_e->size = cpu_to_le16(sp_size + sizeof(u64)); sub_vbn = NULL; } else { t_vbn = de_get_vbn_le(up_e); sub_vbn = &t_vbn; } /* Allocate on disk a new index allocation buffer. */ err = indx_add_allocate(indx, ni, &new_vbn); if (err) goto out; /* Allocate and format memory a new index buffer. */ n2 = indx_new(indx, ni, new_vbn, sub_vbn); if (IS_ERR(n2)) { err = PTR_ERR(n2); goto out; } hdr2 = &n2->index->ihdr; /* Make sp a parent for new buffer. */ de_set_vbn(up_e, new_vbn); /* Copy all the entries <= sp into the new buffer. */ de_t = hdr_first_de(hdr1); to_copy = PtrOffset(de_t, sp); hdr_insert_head(hdr2, de_t, to_copy); /* Remove all entries (sp including) from hdr1. 
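 *
 * Hedged sketch of the arithmetic that follows: to_copy bytes of entries
 * sit before the split point and sp_size bytes are the split entry itself,
 * so after the copy
 *
 *	used = used1 - to_copy - sp_size;
 *
 * and the memmove() slides everything that followed sp down to where the
 * first entry used to be (used - de_off bytes).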
*/ used = used1 - to_copy - sp_size; memmove(de_t, Add2Ptr(sp, sp_size), used - le32_to_cpu(hdr1->de_off)); hdr1->used = cpu_to_le32(used); /* * Insert new entry into left or right buffer * (depending on sp <=> new_de). */ hdr_insert_de(indx, (*indx->cmp)(new_de + 1, le16_to_cpu(new_de->key_size), up_e + 1, le16_to_cpu(up_e->key_size), ctx) < 0 ? hdr2 : hdr1, new_de, NULL, ctx); indx_mark_used(indx, ni, new_vbn >> indx->idx2vbn_bits); indx_write(indx, ni, n1, 0); indx_write(indx, ni, n2, 0); put_indx_node(n2); /* * We've finished splitting everybody, so we are ready to * insert the promoted entry into the parent. */ if (!level) { /* Insert in root. */ err = indx_insert_into_root(indx, ni, up_e, NULL, ctx, fnd, 0); } else { /* * The target buffer's parent is another index buffer. * TODO: Remove recursion. */ err = indx_insert_into_buffer(indx, ni, root, up_e, ctx, level - 1, fnd); } if (err) { /* * Undo critical operations. */ indx_mark_free(indx, ni, new_vbn >> indx->idx2vbn_bits); unsafe_memcpy(hdr1, hdr1_saved, used1, "There are entries after the structure"); indx_write(indx, ni, n1, 0); } out: kfree(up_e); kfree(hdr1_saved); return err; } /* * indx_insert_entry - Insert new entry into index. * * @undo - True if we undoing previous remove. */ int indx_insert_entry(struct ntfs_index *indx, struct ntfs_inode *ni, const struct NTFS_DE *new_de, const void *ctx, struct ntfs_fnd *fnd, bool undo) { int err; int diff; struct NTFS_DE *e; struct ntfs_fnd *fnd_a = NULL; struct INDEX_ROOT *root; if (!fnd) { fnd_a = fnd_get(); if (!fnd_a) { err = -ENOMEM; goto out1; } fnd = fnd_a; } root = indx_get_root(indx, ni, NULL, NULL); if (!root) { err = -EINVAL; goto out; } if (fnd_is_empty(fnd)) { /* * Find the spot the tree where we want to * insert the new entry. */ err = indx_find(indx, ni, root, new_de + 1, le16_to_cpu(new_de->key_size), ctx, &diff, &e, fnd); if (err) goto out; if (!diff) { err = -EEXIST; goto out; } } if (!fnd->level) { /* * The root is also a leaf, so we'll insert the * new entry into it. */ err = indx_insert_into_root(indx, ni, new_de, fnd->root_de, ctx, fnd, undo); } else { /* * Found a leaf buffer, so we'll insert the new entry into it. */ err = indx_insert_into_buffer(indx, ni, root, new_de, ctx, fnd->level - 1, fnd); } out: fnd_put(fnd_a); out1: return err; } /* * indx_find_buffer - Locate a buffer from the tree. */ static struct indx_node *indx_find_buffer(struct ntfs_index *indx, struct ntfs_inode *ni, const struct INDEX_ROOT *root, __le64 vbn, struct indx_node *n) { int err; const struct NTFS_DE *e; struct indx_node *r; const struct INDEX_HDR *hdr = n ? &n->index->ihdr : &root->ihdr; /* Step 1: Scan one level. */ for (e = hdr_first_de(hdr);; e = hdr_next_de(hdr, e)) { if (!e) return ERR_PTR(-EINVAL); if (de_has_vcn(e) && vbn == de_get_vbn_le(e)) return n; if (de_is_last(e)) break; } /* Step2: Do recursion. */ e = Add2Ptr(hdr, le32_to_cpu(hdr->de_off)); for (;;) { if (de_has_vcn_ex(e)) { err = indx_read(indx, ni, de_get_vbn(e), &n); if (err) return ERR_PTR(err); r = indx_find_buffer(indx, ni, root, vbn, n); if (r) return r; } if (de_is_last(e)) break; e = Add2Ptr(e, le16_to_cpu(e->size)); } return NULL; } /* * indx_shrink - Deallocate unused tail indexes. 
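 *
 * Hedged summary: if no index block at or after 'bit' is still marked used
 * in $BITMAP, the allocation attribute is truncated to
 *
 *	new_data = (u64)bit << indx->index_bits;
 *
 * and the bitmap itself is shrunk to ntfs3_bitmap_size(bit) when that
 * actually changes its size.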
*/ static int indx_shrink(struct ntfs_index *indx, struct ntfs_inode *ni, size_t bit) { int err = 0; u64 bpb, new_data; size_t nbits; struct ATTRIB *b; struct ATTR_LIST_ENTRY *le = NULL; const struct INDEX_NAMES *in = &s_index_names[indx->type]; b = ni_find_attr(ni, NULL, &le, ATTR_BITMAP, in->name, in->name_len, NULL, NULL); if (!b) return -ENOENT; if (!b->non_res) { unsigned long pos; const unsigned long *bm = resident_data(b); nbits = (size_t)le32_to_cpu(b->res.data_size) * 8; if (bit >= nbits) return 0; pos = find_next_bit_le(bm, nbits, bit); if (pos < nbits) return 0; } else { size_t used = MINUS_ONE_T; nbits = le64_to_cpu(b->nres.data_size) * 8; if (bit >= nbits) return 0; err = scan_nres_bitmap(ni, b, indx, bit, &scan_for_used, &used); if (err) return err; if (used != MINUS_ONE_T) return 0; } new_data = (u64)bit << indx->index_bits; err = attr_set_size(ni, ATTR_ALLOC, in->name, in->name_len, &indx->alloc_run, new_data, &new_data, false, NULL); if (err) return err; if (in->name == I30_NAME) i_size_write(&ni->vfs_inode, new_data); bpb = ntfs3_bitmap_size(bit); if (bpb * 8 == nbits) return 0; err = attr_set_size(ni, ATTR_BITMAP, in->name, in->name_len, &indx->bitmap_run, bpb, &bpb, false, NULL); return err; } static int indx_free_children(struct ntfs_index *indx, struct ntfs_inode *ni, const struct NTFS_DE *e, bool trim) { int err; struct indx_node *n = NULL; struct INDEX_HDR *hdr; CLST vbn = de_get_vbn(e); size_t i; err = indx_read(indx, ni, vbn, &n); if (err) return err; hdr = &n->index->ihdr; /* First, recurse into the children, if any. */ if (hdr_has_subnode(hdr)) { for (e = hdr_first_de(hdr); e; e = hdr_next_de(hdr, e)) { indx_free_children(indx, ni, e, false); if (de_is_last(e)) break; } } put_indx_node(n); i = vbn >> indx->idx2vbn_bits; /* * We've gotten rid of the children; add this buffer to the free list. */ indx_mark_free(indx, ni, i); if (!trim) return 0; /* * If there are no used indexes after current free index * then we can truncate allocation and bitmap. * Use bitmap to estimate the case. */ indx_shrink(indx, ni, i + 1); return 0; } /* * indx_get_entry_to_replace * * Find a replacement entry for a deleted entry. * Always returns a node entry: * NTFS_IE_HAS_SUBNODES is set the flags and the size includes the sub_vcn. */ static int indx_get_entry_to_replace(struct ntfs_index *indx, struct ntfs_inode *ni, const struct NTFS_DE *de_next, struct NTFS_DE **de_to_replace, struct ntfs_fnd *fnd) { int err; int level = -1; CLST vbn; struct NTFS_DE *e, *te, *re; struct indx_node *n; struct INDEX_BUFFER *ib; *de_to_replace = NULL; /* Find first leaf entry down from de_next. */ vbn = de_get_vbn(de_next); for (;;) { n = NULL; err = indx_read(indx, ni, vbn, &n); if (err) goto out; e = hdr_first_de(&n->index->ihdr); fnd_push(fnd, n, e); if (!e) { err = -EINVAL; goto out; } if (!de_is_last(e)) { /* * This buffer is non-empty, so its first entry * could be used as the replacement entry. */ level = fnd->level - 1; } if (!de_has_vcn(e)) break; /* This buffer is a node. Continue to go down. */ vbn = de_get_vbn(e); } if (level == -1) goto out; n = fnd->nodes[level]; te = hdr_first_de(&n->index->ihdr); if (!te) { err = -EINVAL; goto out; } /* Copy the candidate entry into the replacement entry buffer. */ re = kmalloc(le16_to_cpu(te->size) + sizeof(u64), GFP_NOFS); if (!re) { err = -ENOMEM; goto out; } *de_to_replace = re; memcpy(re, te, le16_to_cpu(te->size)); if (!de_has_vcn(re)) { /* * The replacement entry we found doesn't have a sub_vcn. * increase its size to hold one. 
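 * The copy grows by sizeof(u64) and gains NTFS_IE_HAS_SUBNODES so that the
 * caller (indx_delete_entry) can later store the deleted entry's VBN in it.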
*/ le16_add_cpu(&re->size, sizeof(u64)); re->flags |= NTFS_IE_HAS_SUBNODES; } else { /* * The replacement entry we found was a node entry, which * means that all its child buffers are empty. Return them * to the free pool. */ indx_free_children(indx, ni, te, true); } /* * Expunge the replacement entry from its former location, * and then write that buffer. */ ib = n->index; e = hdr_delete_de(&ib->ihdr, te); fnd->de[level] = e; indx_write(indx, ni, n, 0); if (ib_is_leaf(ib) && ib_is_empty(ib)) { /* An empty leaf. */ return 0; } out: fnd_clear(fnd); return err; } /* * indx_delete_entry - Delete an entry from the index. */ int indx_delete_entry(struct ntfs_index *indx, struct ntfs_inode *ni, const void *key, u32 key_len, const void *ctx) { int err, diff; struct INDEX_ROOT *root; struct INDEX_HDR *hdr; struct ntfs_fnd *fnd, *fnd2; struct INDEX_BUFFER *ib; struct NTFS_DE *e, *re, *next, *prev, *me; struct indx_node *n, *n2d = NULL; __le64 sub_vbn; int level, level2; struct ATTRIB *attr; struct mft_inode *mi; u32 e_size, root_size, new_root_size; size_t trim_bit; const struct INDEX_NAMES *in; fnd = fnd_get(); if (!fnd) { err = -ENOMEM; goto out2; } fnd2 = fnd_get(); if (!fnd2) { err = -ENOMEM; goto out1; } root = indx_get_root(indx, ni, &attr, &mi); if (!root) { err = -EINVAL; goto out; } /* Locate the entry to remove. */ err = indx_find(indx, ni, root, key, key_len, ctx, &diff, &e, fnd); if (err) goto out; if (!e || diff) { err = -ENOENT; goto out; } level = fnd->level; if (level) { n = fnd->nodes[level - 1]; e = fnd->de[level - 1]; ib = n->index; hdr = &ib->ihdr; } else { hdr = &root->ihdr; e = fnd->root_de; n = NULL; } e_size = le16_to_cpu(e->size); if (!de_has_vcn_ex(e)) { /* The entry to delete is a leaf, so we can just rip it out. */ hdr_delete_de(hdr, e); if (!level) { hdr->total = hdr->used; /* Shrink resident root attribute. */ mi_resize_attr(mi, attr, 0 - e_size); goto out; } indx_write(indx, ni, n, 0); /* * Check to see if removing that entry made * the leaf empty. */ if (ib_is_leaf(ib) && ib_is_empty(ib)) { fnd_pop(fnd); fnd_push(fnd2, n, e); } } else { /* * The entry we wish to delete is a node buffer, so we * have to find a replacement for it. */ next = de_get_next(e); err = indx_get_entry_to_replace(indx, ni, next, &re, fnd2); if (err) goto out; if (re) { de_set_vbn_le(re, de_get_vbn_le(e)); hdr_delete_de(hdr, e); err = level ? indx_insert_into_buffer(indx, ni, root, re, ctx, fnd->level - 1, fnd) : indx_insert_into_root(indx, ni, re, e, ctx, fnd, 0); kfree(re); if (err) goto out; } else { /* * There is no replacement for the current entry. * This means that the subtree rooted at its node * is empty, and can be deleted, which turn means * that the node can just inherit the deleted * entry sub_vcn. */ indx_free_children(indx, ni, next, true); de_set_vbn_le(next, de_get_vbn_le(e)); hdr_delete_de(hdr, e); if (level) { indx_write(indx, ni, n, 0); } else { hdr->total = hdr->used; /* Shrink resident root attribute. */ mi_resize_attr(mi, attr, 0 - e_size); } } } /* Delete a branch of tree. */ if (!fnd2 || !fnd2->level) goto out; /* Reinit root 'cause it can be changed. */ root = indx_get_root(indx, ni, &attr, &mi); if (!root) { err = -EINVAL; goto out; } n2d = NULL; sub_vbn = fnd2->nodes[0]->index->vbn; level2 = 0; level = fnd->level; hdr = level ? &fnd->nodes[level - 1]->index->ihdr : &root->ihdr; /* Scan current level. 
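 * (looking for the entry in this header that points at the emptied child
 * sub_vbn; if it is not found here, fall back to a slow indx_find_buffer()
 * walk from the root)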
*/ for (e = hdr_first_de(hdr);; e = hdr_next_de(hdr, e)) { if (!e) { err = -EINVAL; goto out; } if (de_has_vcn(e) && sub_vbn == de_get_vbn_le(e)) break; if (de_is_last(e)) { e = NULL; break; } } if (!e) { /* Do slow search from root. */ struct indx_node *in; fnd_clear(fnd); in = indx_find_buffer(indx, ni, root, sub_vbn, NULL); if (IS_ERR(in)) { err = PTR_ERR(in); goto out; } if (in) fnd_push(fnd, in, NULL); } /* Merge fnd2 -> fnd. */ for (level = 0; level < fnd2->level; level++) { fnd_push(fnd, fnd2->nodes[level], fnd2->de[level]); fnd2->nodes[level] = NULL; } fnd2->level = 0; hdr = NULL; for (level = fnd->level; level; level--) { struct indx_node *in = fnd->nodes[level - 1]; ib = in->index; if (ib_is_empty(ib)) { sub_vbn = ib->vbn; } else { hdr = &ib->ihdr; n2d = in; level2 = level; break; } } if (!hdr) hdr = &root->ihdr; e = hdr_first_de(hdr); if (!e) { err = -EINVAL; goto out; } if (hdr != &root->ihdr || !de_is_last(e)) { prev = NULL; while (!de_is_last(e)) { if (de_has_vcn(e) && sub_vbn == de_get_vbn_le(e)) break; prev = e; e = hdr_next_de(hdr, e); if (!e) { err = -EINVAL; goto out; } } if (sub_vbn != de_get_vbn_le(e)) { /* * Didn't find the parent entry, although this buffer * is the parent trail. Something is corrupt. */ err = -EINVAL; goto out; } if (de_is_last(e)) { /* * Since we can't remove the end entry, we'll remove * its predecessor instead. This means we have to * transfer the predecessor's sub_vcn to the end entry. * Note: This index block is not empty, so the * predecessor must exist. */ if (!prev) { err = -EINVAL; goto out; } if (de_has_vcn(prev)) { de_set_vbn_le(e, de_get_vbn_le(prev)); } else if (de_has_vcn(e)) { le16_sub_cpu(&e->size, sizeof(u64)); e->flags &= ~NTFS_IE_HAS_SUBNODES; le32_sub_cpu(&hdr->used, sizeof(u64)); } e = prev; } /* * Copy the current entry into a temporary buffer (stripping * off its down-pointer, if any) and delete it from the current * buffer or root, as appropriate. */ e_size = le16_to_cpu(e->size); me = kmemdup(e, e_size, GFP_NOFS); if (!me) { err = -ENOMEM; goto out; } if (de_has_vcn(me)) { me->flags &= ~NTFS_IE_HAS_SUBNODES; le16_sub_cpu(&me->size, sizeof(u64)); } hdr_delete_de(hdr, e); if (hdr == &root->ihdr) { level = 0; hdr->total = hdr->used; /* Shrink resident root attribute. */ mi_resize_attr(mi, attr, 0 - e_size); } else { indx_write(indx, ni, n2d, 0); level = level2; } /* Mark unused buffers as free. */ trim_bit = -1; for (; level < fnd->level; level++) { ib = fnd->nodes[level]->index; if (ib_is_empty(ib)) { size_t k = le64_to_cpu(ib->vbn) >> indx->idx2vbn_bits; indx_mark_free(indx, ni, k); if (k < trim_bit) trim_bit = k; } } fnd_clear(fnd); /*fnd->root_de = NULL;*/ /* * Re-insert the entry into the tree. * Find the spot the tree where we want to insert the new entry. */ err = indx_insert_entry(indx, ni, me, ctx, fnd, 0); kfree(me); if (err) goto out; if (trim_bit != -1) indx_shrink(indx, ni, trim_bit); } else { /* * This tree needs to be collapsed down to an empty root. * Recreate the index root as an empty leaf and free all * the bits the index allocation bitmap. 
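 *
 * Hedged sketch of the resulting root: a bare INDEX_ROOT followed by one
 * end entry and nothing else,
 *
 *	e->size  = sizeof(struct NTFS_DE);
 *	e->flags = NTFS_IE_LAST;
 *	hdr->used = hdr->total
 *		  = new_root_size - offsetof(struct INDEX_ROOT, ihdr);
 *
 * while $INDEX_ALLOCATION and $BITMAP are truncated to zero and removed.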
*/ fnd_clear(fnd); fnd_clear(fnd2); in = &s_index_names[indx->type]; err = attr_set_size(ni, ATTR_ALLOC, in->name, in->name_len, &indx->alloc_run, 0, NULL, false, NULL); if (in->name == I30_NAME) i_size_write(&ni->vfs_inode, 0); err = ni_remove_attr(ni, ATTR_ALLOC, in->name, in->name_len, false, NULL); run_close(&indx->alloc_run); err = attr_set_size(ni, ATTR_BITMAP, in->name, in->name_len, &indx->bitmap_run, 0, NULL, false, NULL); err = ni_remove_attr(ni, ATTR_BITMAP, in->name, in->name_len, false, NULL); run_close(&indx->bitmap_run); root = indx_get_root(indx, ni, &attr, &mi); if (!root) { err = -EINVAL; goto out; } root_size = le32_to_cpu(attr->res.data_size); new_root_size = sizeof(struct INDEX_ROOT) + sizeof(struct NTFS_DE); if (new_root_size != root_size && !mi_resize_attr(mi, attr, new_root_size - root_size)) { err = -EINVAL; goto out; } /* Fill first entry. */ e = (struct NTFS_DE *)(root + 1); e->ref.low = 0; e->ref.high = 0; e->ref.seq = 0; e->size = cpu_to_le16(sizeof(struct NTFS_DE)); e->flags = NTFS_IE_LAST; // 0x02 e->key_size = 0; e->res = 0; hdr = &root->ihdr; hdr->flags = 0; hdr->used = hdr->total = cpu_to_le32( new_root_size - offsetof(struct INDEX_ROOT, ihdr)); mi->dirty = true; } out: fnd_put(fnd2); out1: fnd_put(fnd); out2: return err; } /* * Update duplicated information in directory entry * 'dup' - info from MFT record */ int indx_update_dup(struct ntfs_inode *ni, struct ntfs_sb_info *sbi, const struct ATTR_FILE_NAME *fname, const struct NTFS_DUP_INFO *dup, int sync) { int err, diff; struct NTFS_DE *e = NULL; struct ATTR_FILE_NAME *e_fname; struct ntfs_fnd *fnd; struct INDEX_ROOT *root; struct mft_inode *mi; struct ntfs_index *indx = &ni->dir; fnd = fnd_get(); if (!fnd) return -ENOMEM; root = indx_get_root(indx, ni, NULL, &mi); if (!root) { err = -EINVAL; goto out; } /* Find entry in directory. */ err = indx_find(indx, ni, root, fname, fname_full_size(fname), sbi, &diff, &e, fnd); if (err) goto out; if (!e) { err = -EINVAL; goto out; } if (diff) { err = -EINVAL; goto out; } e_fname = (struct ATTR_FILE_NAME *)(e + 1); if (!memcmp(&e_fname->dup, dup, sizeof(*dup))) { /* * Nothing to update in index! Try to avoid this call. */ goto out; } memcpy(&e_fname->dup, dup, sizeof(*dup)); if (fnd->level) { /* Directory entry in index. */ err = indx_write(indx, ni, fnd->nodes[fnd->level - 1], sync); } else { /* Directory entry in directory MFT record. */ mi->dirty = true; if (sync) err = mi_write(mi, 1); else mark_inode_dirty(&ni->vfs_inode); } out: fnd_put(fnd); return err; } |
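/*
 * Illustrative caller sketch (hedged; modelled on indx_update_dup() above,
 * not copied from an actual call site): directory code drives this API with
 * &ni->dir as the index, an ATTR_FILE_NAME as the key and the sb info as
 * the compare context, e.g.
 *
 *	err = indx_delete_entry(&dir_ni->dir, dir_ni, fname,
 *				fname_full_size(fname), sbi);
 */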
// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 2006 Jens Axboe <axboe@kernel.dk>
 *
 */

#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt

#include <linux/kernel.h>
#include <linux/blkdev.h>
#include <linux/blktrace_api.h>
#include <linux/percpu.h>
#include <linux/init.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/debugfs.h>
#include <linux/export.h>
#include <linux/time.h>
#include <linux/uaccess.h>
#include <linux/list.h>
#include <linux/blk-cgroup.h>

#include "../../block/blk.h"
#include <trace/events/block.h>

#include "trace_output.h"

#ifdef CONFIG_BLK_DEV_IO_TRACE

static unsigned int blktrace_seq __read_mostly = 1;

static struct trace_array *blk_tr;
static bool blk_tracer_enabled __read_mostly;

static
LIST_HEAD(running_trace_list); static __cacheline_aligned_in_smp DEFINE_RAW_SPINLOCK(running_trace_lock); /* Select an alternative, minimalistic output than the original one */ #define TRACE_BLK_OPT_CLASSIC 0x1 #define TRACE_BLK_OPT_CGROUP 0x2 #define TRACE_BLK_OPT_CGNAME 0x4 static struct tracer_opt blk_tracer_opts[] = { /* Default disable the minimalistic output */ { TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) }, #ifdef CONFIG_BLK_CGROUP { TRACER_OPT(blk_cgroup, TRACE_BLK_OPT_CGROUP) }, { TRACER_OPT(blk_cgname, TRACE_BLK_OPT_CGNAME) }, #endif { } }; static struct tracer_flags blk_tracer_flags = { .val = 0, .opts = blk_tracer_opts, }; /* Global reference count of probes */ static DEFINE_MUTEX(blk_probe_mutex); static int blk_probes_ref; static void blk_register_tracepoints(void); static void blk_unregister_tracepoints(void); static void record_blktrace_event(struct blk_io_trace *t, pid_t pid, int cpu, sector_t sector, int bytes, u64 what, dev_t dev, int error, u64 cgid, ssize_t cgid_len, void *pdu_data, int pdu_len) { /* * These two are not needed in ftrace as they are in the * generic trace_entry, filled by tracing_generic_entry_update, * but for the trace_event->bin() synthesizer benefit we do it * here too. */ t->cpu = cpu; t->pid = pid; t->sector = sector; t->bytes = bytes; t->action = lower_32_bits(what); t->device = dev; t->error = error; t->pdu_len = pdu_len + cgid_len; if (cgid_len) memcpy((void *)t + sizeof(*t), &cgid, cgid_len); if (pdu_len) memcpy((void *)t + sizeof(*t) + cgid_len, pdu_data, pdu_len); } static void record_blktrace_event2(struct blk_io_trace2 *t2, pid_t pid, int cpu, sector_t sector, int bytes, u64 what, dev_t dev, int error, u64 cgid, ssize_t cgid_len, void *pdu_data, int pdu_len) { t2->pid = pid; t2->cpu = cpu; t2->sector = sector; t2->bytes = bytes; t2->action = what; t2->device = dev; t2->error = error; t2->pdu_len = pdu_len + cgid_len; if (cgid_len) memcpy((void *)t2 + sizeof(*t2), &cgid, cgid_len); if (pdu_len) memcpy((void *)t2 + sizeof(*t2) + cgid_len, pdu_data, pdu_len); } static void relay_blktrace_event1(struct blk_trace *bt, unsigned long sequence, pid_t pid, int cpu, sector_t sector, int bytes, u64 what, int error, u64 cgid, ssize_t cgid_len, void *pdu_data, int pdu_len) { struct blk_io_trace *t; size_t trace_len = sizeof(*t) + pdu_len + cgid_len; t = relay_reserve(bt->rchan, trace_len); if (!t) return; t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION; t->sequence = sequence; t->time = ktime_to_ns(ktime_get()); record_blktrace_event(t, pid, cpu, sector, bytes, what, bt->dev, error, cgid, cgid_len, pdu_data, pdu_len); } static void relay_blktrace_event2(struct blk_trace *bt, unsigned long sequence, pid_t pid, int cpu, sector_t sector, int bytes, u64 what, int error, u64 cgid, ssize_t cgid_len, void *pdu_data, int pdu_len) { struct blk_io_trace2 *t; size_t trace_len = sizeof(struct blk_io_trace2) + pdu_len + cgid_len; t = relay_reserve(bt->rchan, trace_len); if (!t) return; t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE2_VERSION; t->sequence = sequence; t->time = ktime_to_ns(ktime_get()); record_blktrace_event2(t, pid, cpu, sector, bytes, what, bt->dev, error, cgid, cgid_len, pdu_data, pdu_len); } static void relay_blktrace_event(struct blk_trace *bt, unsigned long sequence, pid_t pid, int cpu, sector_t sector, int bytes, u64 what, int error, u64 cgid, ssize_t cgid_len, void *pdu_data, int pdu_len) { if (bt->version == 2) return relay_blktrace_event2(bt, sequence, pid, cpu, sector, bytes, what, error, cgid, cgid_len, pdu_data, pdu_len); return 
relay_blktrace_event1(bt, sequence, pid, cpu, sector, bytes, what, error, cgid, cgid_len, pdu_data, pdu_len); } /* * Send out a notify message. */ static void trace_note(struct blk_trace *bt, pid_t pid, u64 action, const void *data, size_t len, u64 cgid) { struct ring_buffer_event *event = NULL; struct trace_buffer *buffer = NULL; unsigned int trace_ctx = 0; int cpu = smp_processor_id(); bool blk_tracer = blk_tracer_enabled; ssize_t cgid_len = cgid ? sizeof(cgid) : 0; action = lower_32_bits(action | (cgid ? __BLK_TN_CGROUP : 0)); if (blk_tracer) { struct blk_io_trace2 *t; size_t trace_len = sizeof(*t) + cgid_len + len; buffer = blk_tr->array_buffer.buffer; trace_ctx = tracing_gen_ctx_flags(0); event = trace_buffer_lock_reserve(buffer, TRACE_BLK, trace_len, trace_ctx); if (!event) return; t = ring_buffer_event_data(event); record_blktrace_event2(t, pid, cpu, 0, 0, action, bt->dev, 0, cgid, cgid_len, (void *)data, len); trace_buffer_unlock_commit(blk_tr, buffer, event, trace_ctx); return; } if (!bt->rchan) return; relay_blktrace_event(bt, 0, pid, cpu, 0, 0, action, 0, cgid, cgid_len, (void *)data, len); } /* * Send out a notify for this process, if we haven't done so since a trace * started */ static void trace_note_tsk(struct task_struct *tsk) { unsigned long flags; struct blk_trace *bt; tsk->btrace_seq = blktrace_seq; raw_spin_lock_irqsave(&running_trace_lock, flags); list_for_each_entry(bt, &running_trace_list, running_list) { trace_note(bt, tsk->pid, BLK_TN_PROCESS, tsk->comm, sizeof(tsk->comm), 0); } raw_spin_unlock_irqrestore(&running_trace_lock, flags); } static void trace_note_time(struct blk_trace *bt) { struct timespec64 now; unsigned long flags; u32 words[2]; /* need to check user space to see if this breaks in y2038 or y2106 */ ktime_get_real_ts64(&now); words[0] = (u32)now.tv_sec; words[1] = now.tv_nsec; local_irq_save(flags); trace_note(bt, 0, BLK_TN_TIMESTAMP, words, sizeof(words), 0); local_irq_restore(flags); } void __blk_trace_note_message(struct blk_trace *bt, struct cgroup_subsys_state *css, const char *fmt, ...) { int n; va_list args; unsigned long flags; char *buf; u64 cgid = 0; if (unlikely(bt->trace_state != Blktrace_running && !blk_tracer_enabled)) return; /* * If the BLK_TC_NOTIFY action mask isn't set, don't send any note * message to the trace. */ if (!(bt->act_mask & BLK_TC_NOTIFY)) return; local_irq_save(flags); buf = this_cpu_ptr(bt->msg_data); va_start(args, fmt); n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args); va_end(args); #ifdef CONFIG_BLK_CGROUP if (css && (blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP)) cgid = cgroup_id(css->cgroup); else cgid = 1; #endif trace_note(bt, current->pid, BLK_TN_MESSAGE, buf, n, cgid); local_irq_restore(flags); } EXPORT_SYMBOL_GPL(__blk_trace_note_message); static int act_log_check(struct blk_trace *bt, u64 what, sector_t sector, pid_t pid) { if (((bt->act_mask << BLK_TC_SHIFT) & what) == 0) return 1; if (sector && (sector < bt->start_lba || sector > bt->end_lba)) return 1; if (bt->pid && pid != bt->pid) return 1; return 0; } /* * Data direction bit lookup */ static const u32 ddir_act[2] = { BLK_TC_ACT(BLK_TC_READ), BLK_TC_ACT(BLK_TC_WRITE) }; #define BLK_TC_RAHEAD BLK_TC_AHEAD #define BLK_TC_PREFLUSH BLK_TC_FLUSH /* The ilog2() calls fall out because they're constant */ #define MASK_TC_BIT(rw, __name) ((__force u32)(rw & REQ_ ## __name) << \ (ilog2(BLK_TC_ ## __name) + BLK_TC_SHIFT - __REQ_ ## __name)) /* * The worker for the various blk_add_trace*() types. 
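 *
 * (Hedged note: the ddir_act[] table and MASK_TC_BIT() above translate the
 * request op and flags into BLK_TC_ACT() action-class bits, e.g.
 *
 *	what |= ddir_act[op_is_write(op) ? WRITE : READ];
 *	what |= MASK_TC_BIT(opf, SYNC);
 *
 * before the event is filtered by act_log_check() and recorded.)
 *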
Fills out a * blk_io_trace structure and places it in a per-cpu subbuffer. */ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes, const blk_opf_t opf, u64 what, int error, int pdu_len, void *pdu_data, u64 cgid) { struct task_struct *tsk = current; struct ring_buffer_event *event = NULL; struct trace_buffer *buffer = NULL; unsigned long flags = 0; unsigned long *sequence; unsigned int trace_ctx = 0; pid_t pid; int cpu; bool blk_tracer = blk_tracer_enabled; ssize_t cgid_len = cgid ? sizeof(cgid) : 0; const enum req_op op = opf & REQ_OP_MASK; size_t trace_len; if (unlikely(bt->trace_state != Blktrace_running && !blk_tracer)) return; what |= ddir_act[op_is_write(op) ? WRITE : READ]; what |= MASK_TC_BIT(opf, SYNC); what |= MASK_TC_BIT(opf, RAHEAD); what |= MASK_TC_BIT(opf, META); what |= MASK_TC_BIT(opf, PREFLUSH); what |= MASK_TC_BIT(opf, FUA); switch (op) { case REQ_OP_DISCARD: case REQ_OP_SECURE_ERASE: what |= BLK_TC_ACT(BLK_TC_DISCARD); break; case REQ_OP_FLUSH: what |= BLK_TC_ACT(BLK_TC_FLUSH); break; case REQ_OP_ZONE_APPEND: what |= BLK_TC_ACT(BLK_TC_ZONE_APPEND); break; case REQ_OP_ZONE_RESET: what |= BLK_TC_ACT(BLK_TC_ZONE_RESET); break; case REQ_OP_ZONE_RESET_ALL: what |= BLK_TC_ACT(BLK_TC_ZONE_RESET_ALL); break; case REQ_OP_ZONE_FINISH: what |= BLK_TC_ACT(BLK_TC_ZONE_FINISH); break; case REQ_OP_ZONE_OPEN: what |= BLK_TC_ACT(BLK_TC_ZONE_OPEN); break; case REQ_OP_ZONE_CLOSE: what |= BLK_TC_ACT(BLK_TC_ZONE_CLOSE); break; case REQ_OP_WRITE_ZEROES: what |= BLK_TC_ACT(BLK_TC_WRITE_ZEROES); break; default: break; } /* Drop trace events for zone operations with blktrace v1 */ if (bt->version == 1 && (what >> BLK_TC_SHIFT) > BLK_TC_END_V1) { pr_debug_ratelimited("blktrace v1 cannot trace zone operation 0x%llx\n", (unsigned long long)what); return; } if (cgid) what |= __BLK_TA_CGROUP; pid = tsk->pid; if (act_log_check(bt, what, sector, pid)) return; cpu = raw_smp_processor_id(); if (blk_tracer) { tracing_record_cmdline(current); buffer = blk_tr->array_buffer.buffer; trace_ctx = tracing_gen_ctx_flags(0); switch (bt->version) { case 1: trace_len = sizeof(struct blk_io_trace); break; case 2: default: /* * ftrace always uses v2 (blk_io_trace2) format. * * For sysfs-enabled tracing path (enabled via * /sys/block/DEV/trace/enable), blk_trace_setup_queue() * never initializes bt->version, leaving it 0 from * kzalloc(). We must handle version==0 safely here. * * Fall through to default to ensure we never hit the * old bug where default set trace_len=0, causing * buffer underflow and memory corruption. * * Always use v2 format for ftrace and normalize * bt->version to 2 when uninitialized. */ trace_len = sizeof(struct blk_io_trace2); if (bt->version == 0) bt->version = 2; break; } trace_len += pdu_len + cgid_len; event = trace_buffer_lock_reserve(buffer, TRACE_BLK, trace_len, trace_ctx); if (!event) return; switch (bt->version) { case 1: record_blktrace_event(ring_buffer_event_data(event), pid, cpu, sector, bytes, what, bt->dev, error, cgid, cgid_len, pdu_data, pdu_len); break; case 2: default: /* * Use v2 recording function (record_blktrace_event2) * which writes blk_io_trace2 structure with correct * field layout: * - 32-bit pid at offset 28 * - 64-bit action at offset 32 * * Fall through to default handles version==0 case * (from sysfs path), ensuring we always use correct * v2 recording function to match the v2 buffer * allocated above. 
*/ record_blktrace_event2(ring_buffer_event_data(event), pid, cpu, sector, bytes, what, bt->dev, error, cgid, cgid_len, pdu_data, pdu_len); break; } trace_buffer_unlock_commit(blk_tr, buffer, event, trace_ctx); return; } if (unlikely(tsk->btrace_seq != blktrace_seq)) trace_note_tsk(tsk); /* * A word about the locking here - we disable interrupts to reserve * some space in the relay per-cpu buffer, to prevent an irq * from coming in and stepping on our toes. */ local_irq_save(flags); sequence = per_cpu_ptr(bt->sequence, cpu); (*sequence)++; relay_blktrace_event(bt, *sequence, pid, cpu, sector, bytes, what, error, cgid, cgid_len, pdu_data, pdu_len); local_irq_restore(flags); } static void blk_trace_free(struct request_queue *q, struct blk_trace *bt) { relay_close(bt->rchan); /* * If 'bt->dir' is not set, then both 'dropped' and 'msg' are created * under 'q->debugfs_dir', thus lookup and remove them. */ if (!bt->dir) { debugfs_lookup_and_remove("dropped", q->debugfs_dir); debugfs_lookup_and_remove("msg", q->debugfs_dir); } else { debugfs_remove(bt->dir); } free_percpu(bt->sequence); free_percpu(bt->msg_data); kfree(bt); } static void get_probe_ref(void) { mutex_lock(&blk_probe_mutex); if (++blk_probes_ref == 1) blk_register_tracepoints(); mutex_unlock(&blk_probe_mutex); } static void put_probe_ref(void) { mutex_lock(&blk_probe_mutex); if (!--blk_probes_ref) blk_unregister_tracepoints(); mutex_unlock(&blk_probe_mutex); } static int blk_trace_start(struct blk_trace *bt) { if (bt->trace_state != Blktrace_setup && bt->trace_state != Blktrace_stopped) return -EINVAL; blktrace_seq++; smp_mb(); bt->trace_state = Blktrace_running; raw_spin_lock_irq(&running_trace_lock); list_add(&bt->running_list, &running_trace_list); raw_spin_unlock_irq(&running_trace_lock); trace_note_time(bt); return 0; } static int blk_trace_stop(struct blk_trace *bt) { if (bt->trace_state != Blktrace_running) return -EINVAL; bt->trace_state = Blktrace_stopped; raw_spin_lock_irq(&running_trace_lock); list_del_init(&bt->running_list); raw_spin_unlock_irq(&running_trace_lock); relay_flush(bt->rchan); return 0; } static void blk_trace_cleanup(struct request_queue *q, struct blk_trace *bt) { blk_trace_stop(bt); synchronize_rcu(); blk_trace_free(q, bt); put_probe_ref(); } static int __blk_trace_remove(struct request_queue *q) { struct blk_trace *bt; bt = rcu_replace_pointer(q->blk_trace, NULL, lockdep_is_held(&q->debugfs_mutex)); if (!bt) return -EINVAL; blk_trace_cleanup(q, bt); return 0; } int blk_trace_remove(struct request_queue *q) { int ret; mutex_lock(&q->debugfs_mutex); ret = __blk_trace_remove(q); mutex_unlock(&q->debugfs_mutex); return ret; } EXPORT_SYMBOL_GPL(blk_trace_remove); static ssize_t blk_dropped_read(struct file *filp, char __user *buffer, size_t count, loff_t *ppos) { struct blk_trace *bt = filp->private_data; size_t dropped = relay_stats(bt->rchan, RELAY_STATS_BUF_FULL); char buf[16]; snprintf(buf, sizeof(buf), "%zu\n", dropped); return simple_read_from_buffer(buffer, count, ppos, buf, strlen(buf)); } static const struct file_operations blk_dropped_fops = { .owner = THIS_MODULE, .open = simple_open, .read = blk_dropped_read, .llseek = default_llseek, }; static ssize_t blk_msg_write(struct file *filp, const char __user *buffer, size_t count, loff_t *ppos) { char *msg; struct blk_trace *bt; if (count >= BLK_TN_MAX_MSG) return -EINVAL; msg = memdup_user_nul(buffer, count); if (IS_ERR(msg)) return PTR_ERR(msg); bt = filp->private_data; __blk_trace_note_message(bt, NULL, "%s", msg); kfree(msg); return count; } static 
const struct file_operations blk_msg_fops = { .owner = THIS_MODULE, .open = simple_open, .write = blk_msg_write, .llseek = noop_llseek, }; static int blk_remove_buf_file_callback(struct dentry *dentry) { debugfs_remove(dentry); return 0; } static struct dentry *blk_create_buf_file_callback(const char *filename, struct dentry *parent, umode_t mode, struct rchan_buf *buf, int *is_global) { return debugfs_create_file(filename, mode, parent, buf, &relay_file_operations); } static const struct rchan_callbacks blk_relay_callbacks = { .create_buf_file = blk_create_buf_file_callback, .remove_buf_file = blk_remove_buf_file_callback, }; static void blk_trace_setup_lba(struct blk_trace *bt, struct block_device *bdev) { if (bdev) { bt->start_lba = bdev->bd_start_sect; bt->end_lba = bdev->bd_start_sect + bdev_nr_sectors(bdev); } else { bt->start_lba = 0; bt->end_lba = -1ULL; } } /* * Setup everything required to start tracing */ static struct blk_trace *blk_trace_setup_prepare(struct request_queue *q, char *name, dev_t dev, u32 buf_size, u32 buf_nr, struct block_device *bdev) { struct blk_trace *bt = NULL; struct dentry *dir = NULL; int ret; lockdep_assert_held(&q->debugfs_mutex); /* * bdev can be NULL, as with scsi-generic, this is a helpful as * we can be. */ if (rcu_dereference_protected(q->blk_trace, lockdep_is_held(&q->debugfs_mutex))) { pr_warn("Concurrent blktraces are not allowed on %s\n", name); return ERR_PTR(-EBUSY); } bt = kzalloc(sizeof(*bt), GFP_KERNEL); if (!bt) return ERR_PTR(-ENOMEM); ret = -ENOMEM; bt->sequence = alloc_percpu(unsigned long); if (!bt->sequence) goto err; bt->msg_data = __alloc_percpu(BLK_TN_MAX_MSG, __alignof__(char)); if (!bt->msg_data) goto err; /* * When tracing the whole disk reuse the existing debugfs directory * created by the block layer on init. For partitions block devices, * and scsi-generic block devices we create a temporary new debugfs * directory that will be removed once the trace ends. */ if (bdev && !bdev_is_partition(bdev)) dir = q->debugfs_dir; else bt->dir = dir = debugfs_create_dir(name, blk_debugfs_root); /* * As blktrace relies on debugfs for its interface the debugfs directory * is required, contrary to the usual mantra of not checking for debugfs * files or directories. 
*/ if (IS_ERR_OR_NULL(dir)) { pr_warn("debugfs_dir not present for %s so skipping\n", name); ret = -ENOENT; goto err; } bt->dev = dev; INIT_LIST_HEAD(&bt->running_list); ret = -EIO; debugfs_create_file("dropped", 0444, dir, bt, &blk_dropped_fops); debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops); bt->rchan = relay_open("trace", dir, buf_size, buf_nr, &blk_relay_callbacks, bt); if (!bt->rchan) goto err; blk_trace_setup_lba(bt, bdev); return bt; err: blk_trace_free(q, bt); return ERR_PTR(ret); } static void blk_trace_setup_finalize(struct request_queue *q, char *name, int version, struct blk_trace *bt, struct blk_user_trace_setup2 *buts) { strscpy_pad(buts->name, name, BLKTRACE_BDEV_SIZE2); /* * some device names have larger paths - convert the slashes * to underscores for this to work as expected */ strreplace(buts->name, '/', '_'); bt->version = version; bt->act_mask = buts->act_mask; if (!bt->act_mask) bt->act_mask = (u16) -1; /* overwrite with user settings */ if (buts->start_lba) bt->start_lba = buts->start_lba; if (buts->end_lba) bt->end_lba = buts->end_lba; bt->pid = buts->pid; bt->trace_state = Blktrace_setup; rcu_assign_pointer(q->blk_trace, bt); get_probe_ref(); } int blk_trace_setup(struct request_queue *q, char *name, dev_t dev, struct block_device *bdev, char __user *arg) { struct blk_user_trace_setup2 buts2; struct blk_user_trace_setup buts; struct blk_trace *bt; int ret; ret = copy_from_user(&buts, arg, sizeof(buts)); if (ret) return -EFAULT; if (!buts.buf_size || !buts.buf_nr) return -EINVAL; buts2 = (struct blk_user_trace_setup2) { .act_mask = buts.act_mask, .buf_size = buts.buf_size, .buf_nr = buts.buf_nr, .start_lba = buts.start_lba, .end_lba = buts.end_lba, .pid = buts.pid, }; mutex_lock(&q->debugfs_mutex); bt = blk_trace_setup_prepare(q, name, dev, buts.buf_size, buts.buf_nr, bdev); if (IS_ERR(bt)) { mutex_unlock(&q->debugfs_mutex); return PTR_ERR(bt); } blk_trace_setup_finalize(q, name, 1, bt, &buts2); strcpy(buts.name, buts2.name); mutex_unlock(&q->debugfs_mutex); if (copy_to_user(arg, &buts, sizeof(buts))) { blk_trace_remove(q); return -EFAULT; } return 0; } EXPORT_SYMBOL_GPL(blk_trace_setup); static int blk_trace_setup2(struct request_queue *q, char *name, dev_t dev, struct block_device *bdev, char __user *arg) { struct blk_user_trace_setup2 buts2; struct blk_trace *bt; if (copy_from_user(&buts2, arg, sizeof(buts2))) return -EFAULT; if (!buts2.buf_size || !buts2.buf_nr) return -EINVAL; if (buts2.flags != 0) return -EINVAL; mutex_lock(&q->debugfs_mutex); bt = blk_trace_setup_prepare(q, name, dev, buts2.buf_size, buts2.buf_nr, bdev); if (IS_ERR(bt)) { mutex_unlock(&q->debugfs_mutex); return PTR_ERR(bt); } blk_trace_setup_finalize(q, name, 2, bt, &buts2); mutex_unlock(&q->debugfs_mutex); if (copy_to_user(arg, &buts2, sizeof(buts2))) { blk_trace_remove(q); return -EFAULT; } return 0; } #if defined(CONFIG_COMPAT) && defined(CONFIG_X86_64) static int compat_blk_trace_setup(struct request_queue *q, char *name, dev_t dev, struct block_device *bdev, char __user *arg) { struct blk_user_trace_setup2 buts2; struct compat_blk_user_trace_setup cbuts; struct blk_trace *bt; if (copy_from_user(&cbuts, arg, sizeof(cbuts))) return -EFAULT; if (!cbuts.buf_size || !cbuts.buf_nr) return -EINVAL; buts2 = (struct blk_user_trace_setup2) { .act_mask = cbuts.act_mask, .buf_size = cbuts.buf_size, .buf_nr = cbuts.buf_nr, .start_lba = cbuts.start_lba, .end_lba = cbuts.end_lba, .pid = cbuts.pid, }; mutex_lock(&q->debugfs_mutex); bt = blk_trace_setup_prepare(q, name, dev, buts2.buf_size, 
buts2.buf_nr, bdev); if (IS_ERR(bt)) { mutex_unlock(&q->debugfs_mutex); return PTR_ERR(bt); } blk_trace_setup_finalize(q, name, 1, bt, &buts2); mutex_unlock(&q->debugfs_mutex); if (copy_to_user(arg, &buts2.name, ARRAY_SIZE(buts2.name))) { blk_trace_remove(q); return -EFAULT; } return 0; } #endif static int __blk_trace_startstop(struct request_queue *q, int start) { struct blk_trace *bt; bt = rcu_dereference_protected(q->blk_trace, lockdep_is_held(&q->debugfs_mutex)); if (bt == NULL) return -EINVAL; if (start) return blk_trace_start(bt); else return blk_trace_stop(bt); } int blk_trace_startstop(struct request_queue *q, int start) { int ret; mutex_lock(&q->debugfs_mutex); ret = __blk_trace_startstop(q, start); mutex_unlock(&q->debugfs_mutex); return ret; } EXPORT_SYMBOL_GPL(blk_trace_startstop); /* * When reading or writing the blktrace sysfs files, the references to the * opened sysfs or device files should prevent the underlying block device * from being removed. So no further delete protection is really needed. */ /** * blk_trace_ioctl - handle the ioctls associated with tracing * @bdev: the block device * @cmd: the ioctl cmd * @arg: the argument data, if any * **/ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg) { struct request_queue *q = bdev_get_queue(bdev); int ret, start = 0; char b[BDEVNAME_SIZE]; switch (cmd) { case BLKTRACESETUP2: snprintf(b, sizeof(b), "%pg", bdev); ret = blk_trace_setup2(q, b, bdev->bd_dev, bdev, arg); break; case BLKTRACESETUP: snprintf(b, sizeof(b), "%pg", bdev); ret = blk_trace_setup(q, b, bdev->bd_dev, bdev, arg); break; #if defined(CONFIG_COMPAT) && defined(CONFIG_X86_64) case BLKTRACESETUP32: snprintf(b, sizeof(b), "%pg", bdev); ret = compat_blk_trace_setup(q, b, bdev->bd_dev, bdev, arg); break; #endif case BLKTRACESTART: start = 1; fallthrough; case BLKTRACESTOP: ret = blk_trace_startstop(q, start); break; case BLKTRACETEARDOWN: ret = blk_trace_remove(q); break; default: ret = -ENOTTY; break; } return ret; } /** * blk_trace_shutdown - stop and cleanup trace structures * @q: the request queue associated with the device * **/ void blk_trace_shutdown(struct request_queue *q) { if (rcu_dereference_protected(q->blk_trace, lockdep_is_held(&q->debugfs_mutex))) __blk_trace_remove(q); } #ifdef CONFIG_BLK_CGROUP static u64 blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio) { struct cgroup_subsys_state *blkcg_css; struct blk_trace *bt; /* We don't use the 'bt' value here except as an optimization... */ bt = rcu_dereference_protected(q->blk_trace, 1); if (!bt || !(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP)) return 0; blkcg_css = bio_blkcg_css(bio); if (!blkcg_css) return 0; return cgroup_id(blkcg_css->cgroup); } #else static u64 blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio) { return 0; } #endif static u64 blk_trace_request_get_cgid(struct request *rq) { if (!rq->bio) return 0; /* Use the first bio */ return blk_trace_bio_get_cgid(rq->q, rq->bio); } /* * blktrace probes */ /** * blk_add_trace_rq - Add a trace for a request oriented action * @rq: the source request * @error: return status to log * @nr_bytes: number of completed bytes * @what: the action * @cgid: the cgroup info * * Description: * Records an action against a request. Will log the bio offset + size. 
* **/ static void blk_add_trace_rq(struct request *rq, blk_status_t error, unsigned int nr_bytes, u64 what, u64 cgid) { struct blk_trace *bt; rcu_read_lock(); bt = rcu_dereference(rq->q->blk_trace); if (likely(!bt)) { rcu_read_unlock(); return; } if (blk_rq_is_passthrough(rq)) what |= BLK_TC_ACT(BLK_TC_PC); else what |= BLK_TC_ACT(BLK_TC_FS); __blk_add_trace(bt, blk_rq_trace_sector(rq), nr_bytes, rq->cmd_flags, what, blk_status_to_errno(error), 0, NULL, cgid); rcu_read_unlock(); } static void blk_add_trace_rq_insert(void *ignore, struct request *rq) { blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_INSERT, blk_trace_request_get_cgid(rq)); } static void blk_add_trace_rq_issue(void *ignore, struct request *rq) { blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_ISSUE, blk_trace_request_get_cgid(rq)); } static void blk_add_trace_rq_merge(void *ignore, struct request *rq) { blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_BACKMERGE, blk_trace_request_get_cgid(rq)); } static void blk_add_trace_rq_requeue(void *ignore, struct request *rq) { blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_REQUEUE, blk_trace_request_get_cgid(rq)); } static void blk_add_trace_rq_complete(void *ignore, struct request *rq, blk_status_t error, unsigned int nr_bytes) { blk_add_trace_rq(rq, error, nr_bytes, BLK_TA_COMPLETE, blk_trace_request_get_cgid(rq)); } static void blk_add_trace_zone_update_request(void *ignore, struct request *rq) { struct blk_trace *bt; rcu_read_lock(); bt = rcu_dereference(rq->q->blk_trace); if (likely(!bt) || bt->version < 2) { rcu_read_unlock(); return; } rcu_read_unlock(); blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_ZONE_APPEND, blk_trace_request_get_cgid(rq)); } /** * blk_add_trace_bio - Add a trace for a bio oriented action * @q: queue the io is for * @bio: the source bio * @what: the action * @error: error, if any * * Description: * Records an action against a bio. Will log the bio offset + size. 
* **/ static void blk_add_trace_bio(struct request_queue *q, struct bio *bio, u64 what, int error) { struct blk_trace *bt; rcu_read_lock(); bt = rcu_dereference(q->blk_trace); if (likely(!bt)) { rcu_read_unlock(); return; } __blk_add_trace(bt, bio->bi_iter.bi_sector, bio->bi_iter.bi_size, bio->bi_opf, what, error, 0, NULL, blk_trace_bio_get_cgid(q, bio)); rcu_read_unlock(); } static void blk_add_trace_bio_complete(void *ignore, struct request_queue *q, struct bio *bio) { blk_add_trace_bio(q, bio, BLK_TA_COMPLETE, blk_status_to_errno(bio->bi_status)); } static void blk_add_trace_bio_backmerge(void *ignore, struct bio *bio) { blk_add_trace_bio(bio->bi_bdev->bd_disk->queue, bio, BLK_TA_BACKMERGE, 0); } static void blk_add_trace_bio_frontmerge(void *ignore, struct bio *bio) { blk_add_trace_bio(bio->bi_bdev->bd_disk->queue, bio, BLK_TA_FRONTMERGE, 0); } static void blk_add_trace_bio_queue(void *ignore, struct bio *bio) { blk_add_trace_bio(bio->bi_bdev->bd_disk->queue, bio, BLK_TA_QUEUE, 0); } static void blk_add_trace_getrq(void *ignore, struct bio *bio) { blk_add_trace_bio(bio->bi_bdev->bd_disk->queue, bio, BLK_TA_GETRQ, 0); } static void blk_add_trace_plug(void *ignore, struct request_queue *q) { struct blk_trace *bt; rcu_read_lock(); bt = rcu_dereference(q->blk_trace); if (bt) __blk_add_trace(bt, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL, 0); rcu_read_unlock(); } static void blk_add_trace_unplug(void *ignore, struct request_queue *q, unsigned int depth, bool explicit) { struct blk_trace *bt; rcu_read_lock(); bt = rcu_dereference(q->blk_trace); if (bt) { __be64 rpdu = cpu_to_be64(depth); u64 what; if (explicit) what = BLK_TA_UNPLUG_IO; else what = BLK_TA_UNPLUG_TIMER; __blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu, 0); } rcu_read_unlock(); } static void blk_add_trace_zone_plug(void *ignore, struct request_queue *q, unsigned int zno, sector_t sector, unsigned int sectors) { struct blk_trace *bt; rcu_read_lock(); bt = rcu_dereference(q->blk_trace); if (bt && bt->version >= 2) __blk_add_trace(bt, sector, sectors << SECTOR_SHIFT, 0, BLK_TA_ZONE_PLUG, 0, 0, NULL, 0); rcu_read_unlock(); return; } static void blk_add_trace_zone_unplug(void *ignore, struct request_queue *q, unsigned int zno, sector_t sector, unsigned int sectors) { struct blk_trace *bt; rcu_read_lock(); bt = rcu_dereference(q->blk_trace); if (bt && bt->version >= 2) __blk_add_trace(bt, sector, sectors << SECTOR_SHIFT, 0, BLK_TA_ZONE_UNPLUG, 0, 0, NULL, 0); rcu_read_unlock(); return; } static void blk_add_trace_split(void *ignore, struct bio *bio, unsigned int pdu) { struct request_queue *q = bio->bi_bdev->bd_disk->queue; struct blk_trace *bt; rcu_read_lock(); bt = rcu_dereference(q->blk_trace); if (bt) { __be64 rpdu = cpu_to_be64(pdu); __blk_add_trace(bt, bio->bi_iter.bi_sector, bio->bi_iter.bi_size, bio->bi_opf, BLK_TA_SPLIT, blk_status_to_errno(bio->bi_status), sizeof(rpdu), &rpdu, blk_trace_bio_get_cgid(q, bio)); } rcu_read_unlock(); } /** * blk_add_trace_bio_remap - Add a trace for a bio-remap operation * @ignore: trace callback data parameter (not used) * @bio: the source bio * @dev: source device * @from: source sector * * Called after a bio is remapped to a different device and/or sector. 
**/ static void blk_add_trace_bio_remap(void *ignore, struct bio *bio, dev_t dev, sector_t from) { struct request_queue *q = bio->bi_bdev->bd_disk->queue; struct blk_trace *bt; struct blk_io_trace_remap r; rcu_read_lock(); bt = rcu_dereference(q->blk_trace); if (likely(!bt)) { rcu_read_unlock(); return; } r.device_from = cpu_to_be32(dev); r.device_to = cpu_to_be32(bio_dev(bio)); r.sector_from = cpu_to_be64(from); __blk_add_trace(bt, bio->bi_iter.bi_sector, bio->bi_iter.bi_size, bio->bi_opf, BLK_TA_REMAP, blk_status_to_errno(bio->bi_status), sizeof(r), &r, blk_trace_bio_get_cgid(q, bio)); rcu_read_unlock(); } /** * blk_add_trace_rq_remap - Add a trace for a request-remap operation * @ignore: trace callback data parameter (not used) * @rq: the source request * @dev: target device * @from: source sector * * Description: * Device mapper remaps request to other devices. * Add a trace for that action. * **/ static void blk_add_trace_rq_remap(void *ignore, struct request *rq, dev_t dev, sector_t from) { struct blk_trace *bt; struct blk_io_trace_remap r; rcu_read_lock(); bt = rcu_dereference(rq->q->blk_trace); if (likely(!bt)) { rcu_read_unlock(); return; } r.device_from = cpu_to_be32(dev); r.device_to = cpu_to_be32(disk_devt(rq->q->disk)); r.sector_from = cpu_to_be64(from); __blk_add_trace(bt, blk_rq_pos(rq), blk_rq_bytes(rq), rq->cmd_flags, BLK_TA_REMAP, 0, sizeof(r), &r, blk_trace_request_get_cgid(rq)); rcu_read_unlock(); } /** * blk_add_driver_data - Add binary message with driver-specific data * @rq: io request * @data: driver-specific data * @len: length of driver-specific data * * Description: * Some drivers might want to write driver-specific data per request. * **/ void blk_add_driver_data(struct request *rq, void *data, size_t len) { struct blk_trace *bt; rcu_read_lock(); bt = rcu_dereference(rq->q->blk_trace); if (likely(!bt)) { rcu_read_unlock(); return; } __blk_add_trace(bt, blk_rq_trace_sector(rq), blk_rq_bytes(rq), 0, BLK_TA_DRV_DATA, 0, len, data, blk_trace_request_get_cgid(rq)); rcu_read_unlock(); } EXPORT_SYMBOL_GPL(blk_add_driver_data); static void blk_register_tracepoints(void) { int ret; ret = register_trace_block_rq_insert(blk_add_trace_rq_insert, NULL); WARN_ON(ret); ret = register_trace_block_rq_issue(blk_add_trace_rq_issue, NULL); WARN_ON(ret); ret = register_trace_block_rq_merge(blk_add_trace_rq_merge, NULL); WARN_ON(ret); ret = register_trace_block_rq_requeue(blk_add_trace_rq_requeue, NULL); WARN_ON(ret); ret = register_trace_block_rq_complete(blk_add_trace_rq_complete, NULL); WARN_ON(ret); ret = register_trace_block_bio_complete(blk_add_trace_bio_complete, NULL); WARN_ON(ret); ret = register_trace_block_bio_backmerge(blk_add_trace_bio_backmerge, NULL); WARN_ON(ret); ret = register_trace_block_bio_frontmerge(blk_add_trace_bio_frontmerge, NULL); WARN_ON(ret); ret = register_trace_block_bio_queue(blk_add_trace_bio_queue, NULL); WARN_ON(ret); ret = register_trace_block_getrq(blk_add_trace_getrq, NULL); WARN_ON(ret); ret = register_trace_blk_zone_append_update_request_bio( blk_add_trace_zone_update_request, NULL); WARN_ON(ret); ret = register_trace_disk_zone_wplug_add_bio(blk_add_trace_zone_plug, NULL); WARN_ON(ret); ret = register_trace_blk_zone_wplug_bio(blk_add_trace_zone_unplug, NULL); WARN_ON(ret); ret = register_trace_block_plug(blk_add_trace_plug, NULL); WARN_ON(ret); ret = register_trace_block_unplug(blk_add_trace_unplug, NULL); WARN_ON(ret); ret = register_trace_block_split(blk_add_trace_split, NULL); WARN_ON(ret); ret = 
register_trace_block_bio_remap(blk_add_trace_bio_remap, NULL); WARN_ON(ret); ret = register_trace_block_rq_remap(blk_add_trace_rq_remap, NULL); WARN_ON(ret); } static void blk_unregister_tracepoints(void) { unregister_trace_block_rq_remap(blk_add_trace_rq_remap, NULL); unregister_trace_block_bio_remap(blk_add_trace_bio_remap, NULL); unregister_trace_block_split(blk_add_trace_split, NULL); unregister_trace_block_unplug(blk_add_trace_unplug, NULL); unregister_trace_block_plug(blk_add_trace_plug, NULL); unregister_trace_blk_zone_wplug_bio(blk_add_trace_zone_unplug, NULL); unregister_trace_disk_zone_wplug_add_bio(blk_add_trace_zone_plug, NULL); unregister_trace_blk_zone_append_update_request_bio( blk_add_trace_zone_update_request, NULL); unregister_trace_block_getrq(blk_add_trace_getrq, NULL); unregister_trace_block_bio_queue(blk_add_trace_bio_queue, NULL); unregister_trace_block_bio_frontmerge(blk_add_trace_bio_frontmerge, NULL); unregister_trace_block_bio_backmerge(blk_add_trace_bio_backmerge, NULL); unregister_trace_block_bio_complete(blk_add_trace_bio_complete, NULL); unregister_trace_block_rq_complete(blk_add_trace_rq_complete, NULL); unregister_trace_block_rq_requeue(blk_add_trace_rq_requeue, NULL); unregister_trace_block_rq_merge(blk_add_trace_rq_merge, NULL); unregister_trace_block_rq_issue(blk_add_trace_rq_issue, NULL); unregister_trace_block_rq_insert(blk_add_trace_rq_insert, NULL); tracepoint_synchronize_unregister(); } /* * struct blk_io_tracer formatting routines */ static void fill_rwbs(char *rwbs, const struct blk_io_trace2 *t) { int i = 0; int tc = t->action >> BLK_TC_SHIFT; if ((t->action & ~__BLK_TN_CGROUP) == BLK_TN_MESSAGE) { rwbs[i++] = 'N'; goto out; } if (tc & BLK_TC_FLUSH) rwbs[i++] = 'F'; if (tc & BLK_TC_DISCARD) rwbs[i++] = 'D'; else if (tc & BLK_TC_WRITE_ZEROES) { rwbs[i++] = 'W'; rwbs[i++] = 'Z'; } else if (tc & BLK_TC_WRITE) rwbs[i++] = 'W'; else if (t->bytes) rwbs[i++] = 'R'; else rwbs[i++] = 'N'; if (tc & BLK_TC_FUA) rwbs[i++] = 'F'; if (tc & BLK_TC_AHEAD) rwbs[i++] = 'A'; if (tc & BLK_TC_SYNC) rwbs[i++] = 'S'; if (tc & BLK_TC_META) rwbs[i++] = 'M'; out: rwbs[i] = '\0'; } static inline const struct blk_io_trace2 *te_blk_io_trace(const struct trace_entry *ent) { return (const struct blk_io_trace2 *)ent; } static inline const void *pdu_start(const struct trace_entry *ent, bool has_cg) { return (void *)(te_blk_io_trace(ent) + 1) + (has_cg ? sizeof(u64) : 0); } static inline u64 t_cgid(const struct trace_entry *ent) { return *(u64 *)(te_blk_io_trace(ent) + 1); } static inline int pdu_real_len(const struct trace_entry *ent, bool has_cg) { return te_blk_io_trace(ent)->pdu_len - (has_cg ? 
sizeof(u64) : 0); } static inline u32 t_action(const struct trace_entry *ent) { return te_blk_io_trace(ent)->action; } static inline u32 t_bytes(const struct trace_entry *ent) { return te_blk_io_trace(ent)->bytes; } static inline u32 t_sec(const struct trace_entry *ent) { return te_blk_io_trace(ent)->bytes >> 9; } static inline unsigned long long t_sector(const struct trace_entry *ent) { return te_blk_io_trace(ent)->sector; } static inline __u16 t_error(const struct trace_entry *ent) { return te_blk_io_trace(ent)->error; } static __u64 get_pdu_int(const struct trace_entry *ent, bool has_cg) { const __be64 *val = pdu_start(ent, has_cg); return be64_to_cpu(*val); } typedef void (blk_log_action_t) (struct trace_iterator *iter, const char *act, bool has_cg); static void blk_log_action_classic(struct trace_iterator *iter, const char *act, bool has_cg) { char rwbs[RWBS_LEN]; unsigned long long ts = iter->ts; unsigned long nsec_rem = do_div(ts, NSEC_PER_SEC); unsigned secs = (unsigned long)ts; const struct blk_io_trace2 *t = te_blk_io_trace(iter->ent); fill_rwbs(rwbs, t); trace_seq_printf(&iter->seq, "%3d,%-3d %2d %5d.%09lu %5u %2s %3s ", MAJOR(t->device), MINOR(t->device), iter->cpu, secs, nsec_rem, iter->ent->pid, act, rwbs); } static void blk_log_action(struct trace_iterator *iter, const char *act, bool has_cg) { char rwbs[RWBS_LEN]; const struct blk_io_trace2 *t = te_blk_io_trace(iter->ent); fill_rwbs(rwbs, t); if (has_cg) { u64 id = t_cgid(iter->ent); if (blk_tracer_flags.val & TRACE_BLK_OPT_CGNAME) { char blkcg_name_buf[NAME_MAX + 1] = "<...>"; cgroup_path_from_kernfs_id(id, blkcg_name_buf, sizeof(blkcg_name_buf)); trace_seq_printf(&iter->seq, "%3d,%-3d %s %2s %3s ", MAJOR(t->device), MINOR(t->device), blkcg_name_buf, act, rwbs); } else { /* * The cgid portion used to be "INO,GEN". Userland * builds a FILEID_INO32_GEN fid out of them and * opens the cgroup using open_by_handle_at(2). * While 32bit ino setups are still the same, 64bit * ones now use the 64bit ino as the whole ID and * no longer use generation. * * Regardless of the content, always output * "LOW32,HIGH32" so that FILEID_INO32_GEN fid can * be mapped back to @id on both 64 and 32bit ino * setups. See __kernfs_fh_to_dentry(). */ trace_seq_printf(&iter->seq, "%3d,%-3d %llx,%-llx %2s %3s ", MAJOR(t->device), MINOR(t->device), id & U32_MAX, id >> 32, act, rwbs); } } else trace_seq_printf(&iter->seq, "%3d,%-3d %2s %3s ", MAJOR(t->device), MINOR(t->device), act, rwbs); } static void blk_log_dump_pdu(struct trace_seq *s, const struct trace_entry *ent, bool has_cg) { const unsigned char *pdu_buf; int pdu_len; int i, end; pdu_buf = pdu_start(ent, has_cg); pdu_len = pdu_real_len(ent, has_cg); if (!pdu_len) return; /* find the last zero that needs to be printed */ for (end = pdu_len - 1; end >= 0; end--) if (pdu_buf[end]) break; end++; trace_seq_putc(s, '('); for (i = 0; i < pdu_len; i++) { trace_seq_printf(s, "%s%02x", i == 0 ? "" : " ", pdu_buf[i]); /* * stop when the rest is just zeros and indicate so * with a ".." appended */ if (i == end && end != pdu_len - 1) { trace_seq_puts(s, " ..) 
"); return; } } trace_seq_puts(s, ") "); } static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent, bool has_cg) { char cmd[TASK_COMM_LEN]; trace_find_cmdline(ent->pid, cmd); if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) { trace_seq_printf(s, "%u ", t_bytes(ent)); blk_log_dump_pdu(s, ent, has_cg); trace_seq_printf(s, "[%s]\n", cmd); } else { if (t_sec(ent)) trace_seq_printf(s, "%llu + %u [%s]\n", t_sector(ent), t_sec(ent), cmd); else trace_seq_printf(s, "[%s]\n", cmd); } } static void blk_log_with_error(struct trace_seq *s, const struct trace_entry *ent, bool has_cg) { if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) { blk_log_dump_pdu(s, ent, has_cg); trace_seq_printf(s, "[%d]\n", t_error(ent)); } else { if (t_sec(ent)) trace_seq_printf(s, "%llu + %u [%d]\n", t_sector(ent), t_sec(ent), t_error(ent)); else trace_seq_printf(s, "%llu [%d]\n", t_sector(ent), t_error(ent)); } } static void blk_log_remap(struct trace_seq *s, const struct trace_entry *ent, bool has_cg) { const struct blk_io_trace_remap *__r = pdu_start(ent, has_cg); trace_seq_printf(s, "%llu + %u <- (%d,%d) %llu\n", t_sector(ent), t_sec(ent), MAJOR(be32_to_cpu(__r->device_from)), MINOR(be32_to_cpu(__r->device_from)), be64_to_cpu(__r->sector_from)); } static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg) { char cmd[TASK_COMM_LEN]; trace_find_cmdline(ent->pid, cmd); trace_seq_printf(s, "[%s]\n", cmd); } static void blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg) { char cmd[TASK_COMM_LEN]; trace_find_cmdline(ent->pid, cmd); trace_seq_printf(s, "[%s] %llu\n", cmd, get_pdu_int(ent, has_cg)); } static void blk_log_split(struct trace_seq *s, const struct trace_entry *ent, bool has_cg) { char cmd[TASK_COMM_LEN]; trace_find_cmdline(ent->pid, cmd); trace_seq_printf(s, "%llu / %llu [%s]\n", t_sector(ent), get_pdu_int(ent, has_cg), cmd); } static void blk_log_msg(struct trace_seq *s, const struct trace_entry *ent, bool has_cg) { trace_seq_putmem(s, pdu_start(ent, has_cg), pdu_real_len(ent, has_cg)); trace_seq_putc(s, '\n'); } /* * struct tracer operations */ static void blk_tracer_print_header(struct seq_file *m) { if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC)) return; seq_puts(m, "# DEV CPU TIMESTAMP PID ACT FLG\n" "# | | | | | |\n"); } static void blk_tracer_start(struct trace_array *tr) { blk_tracer_enabled = true; } static int blk_tracer_init(struct trace_array *tr) { blk_tr = tr; blk_tracer_start(tr); return 0; } static void blk_tracer_stop(struct trace_array *tr) { blk_tracer_enabled = false; } static void blk_tracer_reset(struct trace_array *tr) { blk_tracer_stop(tr); } static const struct { const char *act[2]; void (*print)(struct trace_seq *s, const struct trace_entry *ent, bool has_cg); } what2act[] = { [__BLK_TA_QUEUE] = {{ "Q", "queue" }, blk_log_generic }, [__BLK_TA_BACKMERGE] = {{ "M", "backmerge" }, blk_log_generic }, [__BLK_TA_FRONTMERGE] = {{ "F", "frontmerge" }, blk_log_generic }, [__BLK_TA_GETRQ] = {{ "G", "getrq" }, blk_log_generic }, [__BLK_TA_SLEEPRQ] = {{ "S", "sleeprq" }, blk_log_generic }, [__BLK_TA_REQUEUE] = {{ "R", "requeue" }, blk_log_with_error }, [__BLK_TA_ISSUE] = {{ "D", "issue" }, blk_log_generic }, [__BLK_TA_COMPLETE] = {{ "C", "complete" }, blk_log_with_error }, [__BLK_TA_PLUG] = {{ "P", "plug" }, blk_log_plug }, [__BLK_TA_UNPLUG_IO] = {{ "U", "unplug_io" }, blk_log_unplug }, [__BLK_TA_UNPLUG_TIMER] = {{ "UT", "unplug_timer" }, blk_log_unplug }, [__BLK_TA_INSERT] = {{ "I", "insert" }, blk_log_generic }, 
[__BLK_TA_SPLIT] = {{ "X", "split" }, blk_log_split }, [__BLK_TA_REMAP] = {{ "A", "remap" }, blk_log_remap }, }; static enum print_line_t print_one_line(struct trace_iterator *iter, bool classic) { struct trace_array *tr = iter->tr; struct trace_seq *s = &iter->seq; const struct blk_io_trace2 *t; u16 what; bool long_act; blk_log_action_t *log_action; bool has_cg; t = te_blk_io_trace(iter->ent); what = (t->action & ((1 << BLK_TC_SHIFT) - 1)) & ~__BLK_TA_CGROUP; long_act = !!(tr->trace_flags & TRACE_ITER(VERBOSE)); log_action = classic ? &blk_log_action_classic : &blk_log_action; has_cg = t->action & __BLK_TA_CGROUP; if ((t->action & ~__BLK_TN_CGROUP) == BLK_TN_MESSAGE) { log_action(iter, long_act ? "message" : "m", has_cg); blk_log_msg(s, iter->ent, has_cg); return trace_handle_return(s); } if (unlikely(what == 0 || what >= ARRAY_SIZE(what2act))) trace_seq_printf(s, "Unknown action %x\n", what); else { log_action(iter, what2act[what].act[long_act], has_cg); what2act[what].print(s, iter->ent, has_cg); } return trace_handle_return(s); } static enum print_line_t blk_trace_event_print(struct trace_iterator *iter, int flags, struct trace_event *event) { return print_one_line(iter, false); } static void blk_trace_synthesize_old_trace(struct trace_iterator *iter) { struct trace_seq *s = &iter->seq; struct blk_io_trace2 *t = (struct blk_io_trace2 *)iter->ent; const int offset = offsetof(struct blk_io_trace2, sector); struct blk_io_trace old = { .magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION, .time = iter->ts, }; trace_seq_putmem(s, &old, offset); trace_seq_putmem(s, &t->sector, sizeof(old) - offset + t->pdu_len); } static enum print_line_t blk_trace_event_print_binary(struct trace_iterator *iter, int flags, struct trace_event *event) { blk_trace_synthesize_old_trace(iter); return trace_handle_return(&iter->seq); } static enum print_line_t blk_tracer_print_line(struct trace_iterator *iter) { if ((iter->ent->type != TRACE_BLK) || !(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC)) return TRACE_TYPE_UNHANDLED; return print_one_line(iter, true); } static int blk_tracer_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set) { /* don't output context-info for blk_classic output */ if (bit == TRACE_BLK_OPT_CLASSIC) { if (set) tr->trace_flags &= ~TRACE_ITER(CONTEXT_INFO); else tr->trace_flags |= TRACE_ITER(CONTEXT_INFO); } return 0; } static struct tracer blk_tracer __read_mostly = { .name = "blk", .init = blk_tracer_init, .reset = blk_tracer_reset, .start = blk_tracer_start, .stop = blk_tracer_stop, .print_header = blk_tracer_print_header, .print_line = blk_tracer_print_line, .flags = &blk_tracer_flags, .set_flag = blk_tracer_set_flag, }; static struct trace_event_functions trace_blk_event_funcs = { .trace = blk_trace_event_print, .binary = blk_trace_event_print_binary, }; static struct trace_event trace_blk_event = { .type = TRACE_BLK, .funcs = &trace_blk_event_funcs, }; static int __init init_blk_tracer(void) { if (!register_trace_event(&trace_blk_event)) { pr_warn("Warning: could not register block events\n"); return 1; } if (register_tracer(&blk_tracer) != 0) { pr_warn("Warning: could not register the block tracer\n"); unregister_trace_event(&trace_blk_event); return 1; } BUILD_BUG_ON(__alignof__(struct blk_user_trace_setup2) % __alignof__(long)); BUILD_BUG_ON(__alignof__(struct blk_io_trace2) % __alignof__(long)); return 0; } device_initcall(init_blk_tracer); static int blk_trace_remove_queue(struct request_queue *q) { struct blk_trace *bt; bt = rcu_replace_pointer(q->blk_trace, NULL, 
lockdep_is_held(&q->debugfs_mutex)); if (bt == NULL) return -EINVAL; blk_trace_stop(bt); put_probe_ref(); synchronize_rcu(); blk_trace_free(q, bt); return 0; } /* * Setup everything required to start tracing */ static int blk_trace_setup_queue(struct request_queue *q, struct block_device *bdev) { struct blk_trace *bt = NULL; int ret = -ENOMEM; bt = kzalloc(sizeof(*bt), GFP_KERNEL); if (!bt) return -ENOMEM; bt->msg_data = __alloc_percpu(BLK_TN_MAX_MSG, __alignof__(char)); if (!bt->msg_data) goto free_bt; bt->dev = bdev->bd_dev; bt->act_mask = (u16)-1; blk_trace_setup_lba(bt, bdev); rcu_assign_pointer(q->blk_trace, bt); get_probe_ref(); return 0; free_bt: blk_trace_free(q, bt); return ret; } /* * sysfs interface to enable and configure tracing */ static ssize_t sysfs_blk_trace_attr_show(struct device *dev, struct device_attribute *attr, char *buf); static ssize_t sysfs_blk_trace_attr_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count); #define BLK_TRACE_DEVICE_ATTR(_name) \ DEVICE_ATTR(_name, S_IRUGO | S_IWUSR, \ sysfs_blk_trace_attr_show, \ sysfs_blk_trace_attr_store) static BLK_TRACE_DEVICE_ATTR(enable); static BLK_TRACE_DEVICE_ATTR(act_mask); static BLK_TRACE_DEVICE_ATTR(pid); static BLK_TRACE_DEVICE_ATTR(start_lba); static BLK_TRACE_DEVICE_ATTR(end_lba); static struct attribute *blk_trace_attrs[] = { &dev_attr_enable.attr, &dev_attr_act_mask.attr, &dev_attr_pid.attr, &dev_attr_start_lba.attr, &dev_attr_end_lba.attr, NULL }; struct attribute_group blk_trace_attr_group = { .name = "trace", .attrs = blk_trace_attrs, }; static const struct { int mask; const char *str; } mask_maps[] = { { BLK_TC_READ, "read" }, { BLK_TC_WRITE, "write" }, { BLK_TC_FLUSH, "flush" }, { BLK_TC_SYNC, "sync" }, { BLK_TC_QUEUE, "queue" }, { BLK_TC_REQUEUE, "requeue" }, { BLK_TC_ISSUE, "issue" }, { BLK_TC_COMPLETE, "complete" }, { BLK_TC_FS, "fs" }, { BLK_TC_PC, "pc" }, { BLK_TC_NOTIFY, "notify" }, { BLK_TC_AHEAD, "ahead" }, { BLK_TC_META, "meta" }, { BLK_TC_DISCARD, "discard" }, { BLK_TC_DRV_DATA, "drv_data" }, { BLK_TC_FUA, "fua" }, { BLK_TC_WRITE_ZEROES, "write-zeroes" }, }; static int blk_trace_str2mask(const char *str) { int i; int mask = 0; char *buf, *s, *token; buf = kstrdup(str, GFP_KERNEL); if (buf == NULL) return -ENOMEM; s = strstrip(buf); while (1) { token = strsep(&s, ","); if (token == NULL) break; if (*token == '\0') continue; for (i = 0; i < ARRAY_SIZE(mask_maps); i++) { if (strcasecmp(token, mask_maps[i].str) == 0) { mask |= mask_maps[i].mask; break; } } if (i == ARRAY_SIZE(mask_maps)) { mask = -EINVAL; break; } } kfree(buf); return mask; } static ssize_t blk_trace_mask2str(char *buf, int mask) { int i; char *p = buf; for (i = 0; i < ARRAY_SIZE(mask_maps); i++) { if (mask & mask_maps[i].mask) { p += sprintf(p, "%s%s", (p == buf) ? 
"" : ",", mask_maps[i].str); } } *p++ = '\n'; return p - buf; } static ssize_t sysfs_blk_trace_attr_show(struct device *dev, struct device_attribute *attr, char *buf) { struct block_device *bdev = dev_to_bdev(dev); struct request_queue *q = bdev_get_queue(bdev); struct blk_trace *bt; ssize_t ret = -ENXIO; mutex_lock(&q->debugfs_mutex); bt = rcu_dereference_protected(q->blk_trace, lockdep_is_held(&q->debugfs_mutex)); if (attr == &dev_attr_enable) { ret = sprintf(buf, "%u\n", !!bt); goto out_unlock_bdev; } if (bt == NULL) ret = sprintf(buf, "disabled\n"); else if (attr == &dev_attr_act_mask) ret = blk_trace_mask2str(buf, bt->act_mask); else if (attr == &dev_attr_pid) ret = sprintf(buf, "%u\n", bt->pid); else if (attr == &dev_attr_start_lba) ret = sprintf(buf, "%llu\n", bt->start_lba); else if (attr == &dev_attr_end_lba) ret = sprintf(buf, "%llu\n", bt->end_lba); out_unlock_bdev: mutex_unlock(&q->debugfs_mutex); return ret; } static ssize_t sysfs_blk_trace_attr_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct block_device *bdev = dev_to_bdev(dev); struct request_queue *q = bdev_get_queue(bdev); struct blk_trace *bt; u64 value; ssize_t ret = -EINVAL; if (count == 0) goto out; if (attr == &dev_attr_act_mask) { if (kstrtoull(buf, 0, &value)) { /* Assume it is a list of trace category names */ ret = blk_trace_str2mask(buf); if (ret < 0) goto out; value = ret; } } else { if (kstrtoull(buf, 0, &value)) goto out; } mutex_lock(&q->debugfs_mutex); bt = rcu_dereference_protected(q->blk_trace, lockdep_is_held(&q->debugfs_mutex)); if (attr == &dev_attr_enable) { if (!!value == !!bt) { ret = 0; goto out_unlock_bdev; } if (value) ret = blk_trace_setup_queue(q, bdev); else ret = blk_trace_remove_queue(q); goto out_unlock_bdev; } ret = 0; if (bt == NULL) { ret = blk_trace_setup_queue(q, bdev); bt = rcu_dereference_protected(q->blk_trace, lockdep_is_held(&q->debugfs_mutex)); } if (ret == 0) { if (attr == &dev_attr_act_mask) bt->act_mask = value; else if (attr == &dev_attr_pid) bt->pid = value; else if (attr == &dev_attr_start_lba) bt->start_lba = value; else if (attr == &dev_attr_end_lba) bt->end_lba = value; } out_unlock_bdev: mutex_unlock(&q->debugfs_mutex); out: return ret ? ret : count; } #endif /* CONFIG_BLK_DEV_IO_TRACE */ #ifdef CONFIG_EVENT_TRACING /** * blk_fill_rwbs - Fill the buffer rwbs by mapping op to character string. * @rwbs: buffer to be filled * @opf: request operation type (REQ_OP_XXX) and flags for the tracepoint * * Description: * Maps each request operation and flag to a single character and fills the * buffer provided by the caller with resulting string. 
* **/ void blk_fill_rwbs(char *rwbs, blk_opf_t opf) { int i = 0; if (opf & REQ_PREFLUSH) rwbs[i++] = 'F'; switch (opf & REQ_OP_MASK) { case REQ_OP_WRITE: rwbs[i++] = 'W'; break; case REQ_OP_DISCARD: rwbs[i++] = 'D'; break; case REQ_OP_SECURE_ERASE: rwbs[i++] = 'D'; rwbs[i++] = 'E'; break; case REQ_OP_FLUSH: rwbs[i++] = 'F'; break; case REQ_OP_READ: rwbs[i++] = 'R'; break; case REQ_OP_ZONE_APPEND: rwbs[i++] = 'Z'; rwbs[i++] = 'A'; break; case REQ_OP_ZONE_RESET: case REQ_OP_ZONE_RESET_ALL: rwbs[i++] = 'Z'; rwbs[i++] = 'R'; if ((opf & REQ_OP_MASK) == REQ_OP_ZONE_RESET_ALL) rwbs[i++] = 'A'; break; case REQ_OP_ZONE_FINISH: rwbs[i++] = 'Z'; rwbs[i++] = 'F'; break; case REQ_OP_ZONE_OPEN: rwbs[i++] = 'Z'; rwbs[i++] = 'O'; break; case REQ_OP_ZONE_CLOSE: rwbs[i++] = 'Z'; rwbs[i++] = 'C'; break; case REQ_OP_WRITE_ZEROES: rwbs[i++] = 'W'; rwbs[i++] = 'Z'; break; default: rwbs[i++] = 'N'; } if (opf & REQ_FUA) rwbs[i++] = 'F'; if (opf & REQ_RAHEAD) rwbs[i++] = 'A'; if (opf & REQ_SYNC) rwbs[i++] = 'S'; if (opf & REQ_META) rwbs[i++] = 'M'; if (opf & REQ_ATOMIC) rwbs[i++] = 'U'; WARN_ON_ONCE(i >= RWBS_LEN); rwbs[i] = '\0'; } EXPORT_SYMBOL_GPL(blk_fill_rwbs); #endif /* CONFIG_EVENT_TRACING */
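/*
 * A minimal user-space sketch of how the ioctl interface above
 * (blk_trace_ioctl() -> blk_trace_setup()/blk_trace_startstop()) is typically
 * driven, in the style of blktrace(8). The /dev/sda path, buffer sizing and
 * lack of error handling are illustrative assumptions; it uses the
 * long-standing v1 BLKTRACESETUP interface, while the BLKTRACESETUP2 /
 * blk_user_trace_setup2 variant added in this file follows the same pattern
 * with a wider structure.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>            /* BLKTRACESETUP, BLKTRACESTART, ... */
#include <linux/blktrace_api.h>  /* struct blk_user_trace_setup */

int main(void)
{
	struct blk_user_trace_setup buts;
	int fd = open("/dev/sda", O_RDONLY | O_NONBLOCK);  /* assumed device */

	if (fd < 0)
		return 1;

	memset(&buts, 0, sizeof(buts));
	buts.buf_size = 512 * 1024;  /* bytes per relay sub-buffer */
	buts.buf_nr = 4;             /* sub-buffers per CPU */
	buts.act_mask = 0;           /* 0 is widened to "all actions" by the kernel */

	if (ioctl(fd, BLKTRACESETUP, &buts) < 0)  /* allocates struct blk_trace */
		return 1;

	/* the kernel fills in buts.name; relay files appear under debugfs */
	printf("read /sys/kernel/debug/block/%s/trace<cpu> while tracing\n",
	       buts.name);

	ioctl(fd, BLKTRACESTART);    /* trace_state -> Blktrace_running */
	sleep(1);                    /* consume the per-CPU trace files here */
	ioctl(fd, BLKTRACESTOP);
	ioctl(fd, BLKTRACETEARDOWN); /* stop and free the trace again */
	close(fd);
	return 0;
}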
// SPDX-License-Identifier: GPL-2.0-or-later /* * Functions for initializing, allocating, freeing and duplicating VMAs. Shared * between CONFIG_MMU and non-CONFIG_MMU kernel configurations. */ #include "vma_internal.h" #include "vma.h" /* SLAB cache for vm_area_struct structures */ static struct kmem_cache *vm_area_cachep; void __init vma_state_init(void) { struct kmem_cache_args args = { .use_freeptr_offset = true, .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr), .sheaf_capacity = 32, }; vm_area_cachep = kmem_cache_create("vm_area_struct", sizeof(struct vm_area_struct), &args, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| SLAB_ACCOUNT); } struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) { struct vm_area_struct *vma; vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); if (!vma) return NULL; vma_init(vma, mm); return vma; } static void vm_area_init_from(const struct vm_area_struct *src, struct vm_area_struct *dest) { dest->vm_mm = src->vm_mm; dest->vm_ops = src->vm_ops; dest->vm_start = src->vm_start; dest->vm_end = src->vm_end; dest->anon_vma = src->anon_vma; dest->vm_pgoff = src->vm_pgoff; dest->vm_file = src->vm_file; dest->vm_private_data = src->vm_private_data; vm_flags_init(dest, src->vm_flags); memcpy(&dest->vm_page_prot, &src->vm_page_prot, sizeof(dest->vm_page_prot)); /* * src->shared.rb may be modified concurrently when called from * dup_mmap(), but the clone will reinitialize it. */ data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared))); memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx, sizeof(dest->vm_userfaultfd_ctx)); #ifdef CONFIG_ANON_VMA_NAME dest->anon_name = src->anon_name; #endif #ifdef CONFIG_SWAP memcpy(&dest->swap_readahead_info, &src->swap_readahead_info, sizeof(dest->swap_readahead_info)); #endif #ifndef CONFIG_MMU dest->vm_region = src->vm_region; #endif #ifdef CONFIG_NUMA dest->vm_policy = src->vm_policy; #endif #ifdef __HAVE_PFNMAP_TRACKING dest->pfnmap_track_ctx = NULL; #endif } #ifdef __HAVE_PFNMAP_TRACKING static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig, struct vm_area_struct *new) { struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx; if (likely(!ctx)) return 0; /* * We don't expect to ever hit this. If ever required, we would have * to duplicate the tracking.
*/ if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX)) return -ENOMEM; kref_get(&ctx->kref); new->pfnmap_track_ctx = ctx; return 0; } static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma) { struct pfnmap_track_ctx *ctx = vma->pfnmap_track_ctx; if (likely(!ctx)) return; kref_put(&ctx->kref, pfnmap_track_ctx_release); vma->pfnmap_track_ctx = NULL; } #else static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig, struct vm_area_struct *new) { return 0; } static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma) { } #endif struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) { struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); if (!new) return NULL; ASSERT_EXCLUSIVE_WRITER(orig->vm_flags); ASSERT_EXCLUSIVE_WRITER(orig->vm_file); vm_area_init_from(orig, new); if (vma_pfnmap_track_ctx_dup(orig, new)) { kmem_cache_free(vm_area_cachep, new); return NULL; } vma_lock_init(new, true); INIT_LIST_HEAD(&new->anon_vma_chain); vma_numab_state_init(new); dup_anon_vma_name(orig, new); return new; } void vm_area_free(struct vm_area_struct *vma) { /* The vma should be detached while being destroyed. */ vma_assert_detached(vma); vma_numab_state_free(vma); free_anon_vma_name(vma); vma_pfnmap_track_ctx_release(vma); kmem_cache_free(vm_area_cachep, vma); }
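/*
 * A hedged sketch of the allocate/install/free contract the helpers above
 * provide. Real callers live in the mmap and fork paths; the locking they
 * hold is elided here, and hypothetical_install() is a made-up stand-in for
 * the VMA-tree insertion those callers perform.
 */
static int hypothetical_map_one(struct mm_struct *mm, unsigned long addr,
				unsigned long len, pgoff_t pgoff)
{
	struct vm_area_struct *vma = vm_area_alloc(mm);	/* from vm_area_cachep */

	if (!vma)
		return -ENOMEM;

	vma->vm_start = addr;
	vma->vm_end = addr + len;
	vma->vm_pgoff = pgoff;

	if (hypothetical_install(mm, vma)) {
		/* never attached to the mm, so still detached: freeing is safe */
		vm_area_free(vma);
		return -ENOMEM;
	}
	return 0;
}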
// SPDX-License-Identifier: GPL-2.0-or-later /* * Scatterlist Cryptographic API. * * Procfs information. * * Copyright (c) 2002 James Morris <jmorris@intercode.com.au> * Copyright (c) 2005 Herbert Xu <herbert@gondor.apana.org.au> */ #include <linux/atomic.h> #include <linux/init.h> #include <linux/crypto.h> #include <linux/fips.h> #include <linux/module.h> /* for module_name() */ #include <linux/rwsem.h> #include <linux/proc_fs.h> #include <linux/seq_file.h> #include "internal.h" static void *c_start(struct seq_file *m, loff_t *pos) { down_read(&crypto_alg_sem); return seq_list_start(&crypto_alg_list, *pos); } static void *c_next(struct seq_file *m, void *p, loff_t *pos) { return seq_list_next(p, &crypto_alg_list, pos); } static void c_stop(struct seq_file *m, void *p) { up_read(&crypto_alg_sem); } static int c_show(struct seq_file *m, void *p) { struct crypto_alg *alg = list_entry(p, struct crypto_alg, cra_list); seq_printf(m, "name : %s\n", alg->cra_name); seq_printf(m, "driver : %s\n", alg->cra_driver_name); seq_printf(m, "module : %s\n", module_name(alg->cra_module)); seq_printf(m, "priority : %d\n", alg->cra_priority); seq_printf(m, "refcnt : %u\n", refcount_read(&alg->cra_refcnt)); seq_printf(m, "selftest : %s\n", (alg->cra_flags & CRYPTO_ALG_TESTED) ? "passed" : "unknown"); seq_printf(m, "internal : %s\n", str_yes_no(alg->cra_flags & CRYPTO_ALG_INTERNAL)); if (fips_enabled) seq_printf(m, "fips : %s\n", str_no_yes(alg->cra_flags & CRYPTO_ALG_FIPS_INTERNAL)); if (alg->cra_flags & CRYPTO_ALG_LARVAL) { seq_printf(m, "type : larval\n"); seq_printf(m, "flags : 0x%x\n", alg->cra_flags); goto out; } if (alg->cra_type && alg->cra_type->show) { alg->cra_type->show(m, alg); goto out; } switch (alg->cra_flags & CRYPTO_ALG_TYPE_MASK) { case CRYPTO_ALG_TYPE_CIPHER: seq_printf(m, "type : cipher\n"); seq_printf(m, "blocksize : %u\n", alg->cra_blocksize); seq_printf(m, "min keysize : %u\n", alg->cra_cipher.cia_min_keysize); seq_printf(m, "max keysize : %u\n", alg->cra_cipher.cia_max_keysize); break; default: seq_printf(m, "type : unknown\n"); break; } out: seq_putc(m, '\n'); return 0; } static const struct seq_operations crypto_seq_ops = { .start = c_start, .next = c_next, .stop = c_stop, .show = c_show }; void __init crypto_init_proc(void) { proc_create_seq("crypto", 0, NULL, &crypto_seq_ops); } void __exit crypto_exit_proc(void) { remove_proc_entry("crypto", NULL); }
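/*
 * Illustrative sample of one /proc/crypto record as emitted by c_show()
 * above, assuming the built-in aes-generic cipher. The algorithm chosen and
 * the exact field padding are assumptions, not output captured from this
 * code; a "fips" line additionally appears when fips_enabled is set.
 *
 *   name         : aes
 *   driver       : aes-generic
 *   module       : kernel
 *   priority     : 100
 *   refcnt       : 1
 *   selftest     : passed
 *   internal     : no
 *   type         : cipher
 *   blocksize    : 16
 *   min keysize  : 16
 *   max keysize  : 32
 */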
// SPDX-License-Identifier: GPL-2.0-or-later /* Basic authentication token and access key management * * Copyright (C) 2004-2008 Red Hat, Inc. All Rights Reserved. * Written by David Howells (dhowells@redhat.com) */ #include <linux/export.h> #include <linux/init.h> #include <linux/poison.h> #include <linux/sched.h> #include <linux/slab.h> #include <linux/security.h> #include <linux/workqueue.h> #include <linux/random.h> #include <linux/err.h> #include "internal.h" struct kmem_cache *key_jar; struct rb_root key_serial_tree; /* tree of keys indexed by serial */ DEFINE_SPINLOCK(key_serial_lock); struct rb_root key_user_tree; /* tree of quota records indexed by UID */ DEFINE_SPINLOCK(key_user_lock); unsigned int key_quota_root_maxkeys = 1000000; /* root's key count quota */ unsigned int key_quota_root_maxbytes = 25000000; /* root's key space quota */ unsigned int key_quota_maxkeys = 200; /* general key count quota */ unsigned int key_quota_maxbytes = 20000; /* general key space quota */ static LIST_HEAD(key_types_list); static DECLARE_RWSEM(key_types_sem); /* We serialise key instantiation and link */ DEFINE_MUTEX(key_construction_mutex); #ifdef KEY_DEBUGGING void __key_check(const struct key *key) { printk("__key_check: key %p {%08x} should be {%08x}\n", key, key->magic, KEY_DEBUG_MAGIC); BUG(); } #endif /* * Get the key quota record for a user, allocating a new record if one doesn't * already exist.
*/ struct key_user *key_user_lookup(kuid_t uid) { struct key_user *candidate = NULL, *user; struct rb_node *parent, **p; try_again: parent = NULL; p = &key_user_tree.rb_node; spin_lock(&key_user_lock); /* search the tree for a user record with a matching UID */ while (*p) { parent = *p; user = rb_entry(parent, struct key_user, node); if (uid_lt(uid, user->uid)) p = &(*p)->rb_left; else if (uid_gt(uid, user->uid)) p = &(*p)->rb_right; else goto found; } /* if we get here, we failed to find a match in the tree */ if (!candidate) { /* allocate a candidate user record if we don't already have * one */ spin_unlock(&key_user_lock); user = NULL; candidate = kmalloc(sizeof(struct key_user), GFP_KERNEL); if (unlikely(!candidate)) goto out; /* the allocation may have scheduled, so we need to repeat the * search lest someone else added the record whilst we were * asleep */ goto try_again; } /* if we get here, then the user record still hadn't appeared on the * second pass - so we use the candidate record */ refcount_set(&candidate->usage, 1); atomic_set(&candidate->nkeys, 0); atomic_set(&candidate->nikeys, 0); candidate->uid = uid; candidate->qnkeys = 0; candidate->qnbytes = 0; spin_lock_init(&candidate->lock); mutex_init(&candidate->cons_lock); rb_link_node(&candidate->node, parent, p); rb_insert_color(&candidate->node, &key_user_tree); spin_unlock(&key_user_lock); user = candidate; goto out; /* okay - we found a user record for this UID */ found: refcount_inc(&user->usage); spin_unlock(&key_user_lock); kfree(candidate); out: return user; } /* * Dispose of a user structure */ void key_user_put(struct key_user *user) { if (refcount_dec_and_lock(&user->usage, &key_user_lock)) { rb_erase(&user->node, &key_user_tree); spin_unlock(&key_user_lock); kfree(user); } } /* * Allocate a serial number for a key. These are assigned randomly to avoid * security issues through covert channel problems. */ static inline void key_alloc_serial(struct key *key) { struct rb_node *parent, **p; struct key *xkey; /* propose a random serial number and look for a hole for it in the * serial number tree */ do { get_random_bytes(&key->serial, sizeof(key->serial)); key->serial >>= 1; /* negative numbers are not permitted */ } while (key->serial < 3); spin_lock(&key_serial_lock); attempt_insertion: parent = NULL; p = &key_serial_tree.rb_node; while (*p) { parent = *p; xkey = rb_entry(parent, struct key, serial_node); if (key->serial < xkey->serial) p = &(*p)->rb_left; else if (key->serial > xkey->serial) p = &(*p)->rb_right; else goto serial_exists; } /* we've found a suitable hole - arrange for this key to occupy it */ rb_link_node(&key->serial_node, parent, p); rb_insert_color(&key->serial_node, &key_serial_tree); spin_unlock(&key_serial_lock); return; /* we found a key with the proposed serial number - walk the tree from * that point looking for the next unused serial number */ serial_exists: for (;;) { key->serial++; if (key->serial < 3) { key->serial = 3; goto attempt_insertion; } parent = rb_next(parent); if (!parent) goto attempt_insertion; xkey = rb_entry(parent, struct key, serial_node); if (key->serial < xkey->serial) goto attempt_insertion; } } /** * key_alloc - Allocate a key of the specified type. * @type: The type of key to allocate. * @desc: The key description to allow the key to be searched out. * @uid: The owner of the new key. * @gid: The group ID for the new key's group permissions. * @cred: The credentials specifying UID namespace. * @perm: The permissions mask of the new key. 
* @flags: Flags specifying quota properties. * @restrict_link: Optional link restriction for new keyrings. * * Allocate a key of the specified type with the attributes given. The key is * returned in an uninstantiated state and the caller needs to instantiate the * key before returning. * * The restrict_link structure (if not NULL) will be freed when the * keyring is destroyed, so it must be dynamically allocated. * * The user's key count quota is updated to reflect the creation of the key and * the user's key data quota has the default for the key type reserved. The * instantiation function should amend this as necessary. If insufficient * quota is available, -EDQUOT will be returned. * * The LSM security modules can prevent a key being created, in which case * -EACCES will be returned. * * Returns a pointer to the new key if successful and an error code otherwise. * * Note that the caller needs to ensure the key type isn't uninstantiated. * Internally this can be done by locking key_types_sem. Externally, this can * be done by either never unregistering the key type, or making sure * key_alloc() calls don't race with module unloading. */ struct key *key_alloc(struct key_type *type, const char *desc, kuid_t uid, kgid_t gid, const struct cred *cred, key_perm_t perm, unsigned long flags, struct key_restriction *restrict_link) { struct key_user *user = NULL; struct key *key; size_t desclen, quotalen; int ret; unsigned long irqflags; key = ERR_PTR(-EINVAL); if (!desc || !*desc) goto error; if (type->vet_description) { ret = type->vet_description(desc); if (ret < 0) { key = ERR_PTR(ret); goto error; } } desclen = strlen(desc); quotalen = desclen + 1 + type->def_datalen; /* get hold of the key tracking for this user */ user = key_user_lookup(uid); if (!user) goto no_memory_1; /* check that the user's quota permits allocation of another key and * its description */ if (!(flags & KEY_ALLOC_NOT_IN_QUOTA)) { unsigned maxkeys = uid_eq(uid, GLOBAL_ROOT_UID) ? key_quota_root_maxkeys : key_quota_maxkeys; unsigned maxbytes = uid_eq(uid, GLOBAL_ROOT_UID) ? 
key_quota_root_maxbytes : key_quota_maxbytes; spin_lock_irqsave(&user->lock, irqflags); if (!(flags & KEY_ALLOC_QUOTA_OVERRUN)) { if (user->qnkeys + 1 > maxkeys || user->qnbytes + quotalen > maxbytes || user->qnbytes + quotalen < user->qnbytes) goto no_quota; } user->qnkeys++; user->qnbytes += quotalen; spin_unlock_irqrestore(&user->lock, irqflags); } /* allocate and initialise the key and its description */ key = kmem_cache_zalloc(key_jar, GFP_KERNEL); if (!key) goto no_memory_2; key->index_key.desc_len = desclen; key->index_key.description = kmemdup(desc, desclen + 1, GFP_KERNEL); if (!key->index_key.description) goto no_memory_3; key->index_key.type = type; key_set_index_key(&key->index_key); refcount_set(&key->usage, 1); init_rwsem(&key->sem); lockdep_set_class(&key->sem, &type->lock_class); key->user = user; key->quotalen = quotalen; key->datalen = type->def_datalen; key->uid = uid; key->gid = gid; key->perm = perm; key->expiry = TIME64_MAX; key->restrict_link = restrict_link; key->last_used_at = ktime_get_real_seconds(); key->flags |= 1 << KEY_FLAG_USER_ALIVE; if (!(flags & KEY_ALLOC_NOT_IN_QUOTA)) key->flags |= 1 << KEY_FLAG_IN_QUOTA; if (flags & KEY_ALLOC_BUILT_IN) key->flags |= 1 << KEY_FLAG_BUILTIN; if (flags & KEY_ALLOC_UID_KEYRING) key->flags |= 1 << KEY_FLAG_UID_KEYRING; if (flags & KEY_ALLOC_SET_KEEP) key->flags |= 1 << KEY_FLAG_KEEP; #ifdef KEY_DEBUGGING key->magic = KEY_DEBUG_MAGIC; #endif /* let the security module know about the key */ ret = security_key_alloc(key, cred, flags); if (ret < 0) goto security_error; /* publish the key by giving it a serial number */ refcount_inc(&key->domain_tag->usage); atomic_inc(&user->nkeys); key_alloc_serial(key); error: return key; security_error: kfree(key->description); kmem_cache_free(key_jar, key); if (!(flags & KEY_ALLOC_NOT_IN_QUOTA)) { spin_lock_irqsave(&user->lock, irqflags); user->qnkeys--; user->qnbytes -= quotalen; spin_unlock_irqrestore(&user->lock, irqflags); } key_user_put(user); key = ERR_PTR(ret); goto error; no_memory_3: kmem_cache_free(key_jar, key); no_memory_2: if (!(flags & KEY_ALLOC_NOT_IN_QUOTA)) { spin_lock_irqsave(&user->lock, irqflags); user->qnkeys--; user->qnbytes -= quotalen; spin_unlock_irqrestore(&user->lock, irqflags); } key_user_put(user); no_memory_1: key = ERR_PTR(-ENOMEM); goto error; no_quota: spin_unlock_irqrestore(&user->lock, irqflags); key_user_put(user); key = ERR_PTR(-EDQUOT); goto error; } EXPORT_SYMBOL(key_alloc); /** * key_payload_reserve - Adjust data quota reservation for the key's payload * @key: The key to make the reservation for. * @datalen: The amount of data payload the caller now wants. * * Adjust the amount of the owning user's key data quota that a key reserves. * If the amount is increased, then -EDQUOT may be returned if there isn't * enough free quota available. * * If successful, 0 is returned. */ int key_payload_reserve(struct key *key, size_t datalen) { int delta = (int)datalen - key->datalen; int ret = 0; key_check(key); /* contemplate the quota adjustment */ if (delta != 0 && test_bit(KEY_FLAG_IN_QUOTA, &key->flags)) { unsigned maxbytes = uid_eq(key->user->uid, GLOBAL_ROOT_UID) ? 
key_quota_root_maxbytes : key_quota_maxbytes; unsigned long flags; spin_lock_irqsave(&key->user->lock, flags); if (delta > 0 && (key->user->qnbytes + delta > maxbytes || key->user->qnbytes + delta < key->user->qnbytes)) { ret = -EDQUOT; } else { key->user->qnbytes += delta; key->quotalen += delta; } spin_unlock_irqrestore(&key->user->lock, flags); } /* change the recorded data length if that didn't generate an error */ if (ret == 0) key->datalen = datalen; return ret; } EXPORT_SYMBOL(key_payload_reserve); /* * Change the key state to being instantiated. */ static void mark_key_instantiated(struct key *key, int reject_error) { /* Commit the payload before setting the state; barrier versus * key_read_state(). */ smp_store_release(&key->state, (reject_error < 0) ? reject_error : KEY_IS_POSITIVE); } /* * Instantiate a key and link it into the target keyring atomically. Must be * called with the target keyring's semaphore writelocked. The target key's * semaphore need not be locked as instantiation is serialised by * key_construction_mutex. */ static int __key_instantiate_and_link(struct key *key, struct key_preparsed_payload *prep, struct key *keyring, struct key *authkey, struct assoc_array_edit **_edit) { int ret, awaken; key_check(key); key_check(keyring); awaken = 0; ret = -EBUSY; mutex_lock(&key_construction_mutex); /* can't instantiate twice */ if (key->state == KEY_IS_UNINSTANTIATED) { /* instantiate the key */ ret = key->type->instantiate(key, prep); if (ret == 0) { /* mark the key as being instantiated */ atomic_inc(&key->user->nikeys); mark_key_instantiated(key, 0); notify_key(key, NOTIFY_KEY_INSTANTIATED, 0); if (test_and_clear_bit(KEY_FLAG_USER_CONSTRUCT, &key->flags)) awaken = 1; /* and link it into the destination keyring */ if (keyring) { if (test_bit(KEY_FLAG_KEEP, &keyring->flags)) set_bit(KEY_FLAG_KEEP, &key->flags); __key_link(keyring, key, _edit); } /* disable the authorisation key */ if (authkey) key_invalidate(authkey); if (prep->expiry != TIME64_MAX) key_set_expiry(key, prep->expiry); } } mutex_unlock(&key_construction_mutex); /* wake up anyone waiting for a key to be constructed */ if (awaken) wake_up_bit(&key->flags, KEY_FLAG_USER_CONSTRUCT); return ret; } /** * key_instantiate_and_link - Instantiate a key and link it into the keyring. * @key: The key to instantiate. * @data: The data to use to instantiate the keyring. * @datalen: The length of @data. * @keyring: Keyring to create a link in on success (or NULL). * @authkey: The authorisation token permitting instantiation. * * Instantiate a key that's in the uninstantiated state using the provided data * and, if successful, link it in to the destination keyring if one is * supplied. * * If successful, 0 is returned, the authorisation token is revoked and anyone * waiting for the key is woken up. If the key was already instantiated, * -EBUSY will be returned. 
*/ int key_instantiate_and_link(struct key *key, const void *data, size_t datalen, struct key *keyring, struct key *authkey) { struct key_preparsed_payload prep; struct assoc_array_edit *edit = NULL; int ret; memset(&prep, 0, sizeof(prep)); prep.orig_description = key->description; prep.data = data; prep.datalen = datalen; prep.quotalen = key->type->def_datalen; prep.expiry = TIME64_MAX; if (key->type->preparse) { ret = key->type->preparse(&prep); if (ret < 0) goto error; } if (keyring) { ret = __key_link_lock(keyring, &key->index_key); if (ret < 0) goto error; ret = __key_link_begin(keyring, &key->index_key, &edit); if (ret < 0) goto error_link_end; if (keyring->restrict_link && keyring->restrict_link->check) { struct key_restriction *keyres = keyring->restrict_link; ret = keyres->check(keyring, key->type, &prep.payload, keyres->key); if (ret < 0) goto error_link_end; } } ret = __key_instantiate_and_link(key, &prep, keyring, authkey, &edit); error_link_end: if (keyring) __key_link_end(keyring, &key->index_key, edit); error: if (key->type->preparse) key->type->free_preparse(&prep); return ret; } EXPORT_SYMBOL(key_instantiate_and_link); /** * key_reject_and_link - Negatively instantiate a key and link it into the keyring. * @key: The key to instantiate. * @timeout: The timeout on the negative key. * @error: The error to return when the key is hit. * @keyring: Keyring to create a link in on success (or NULL). * @authkey: The authorisation token permitting instantiation. * * Negatively instantiate a key that's in the uninstantiated state and, if * successful, set its timeout and stored error and link it in to the * destination keyring if one is supplied. The key and any links to the key * will be automatically garbage collected after the timeout expires. * * Negative keys are used to rate limit repeated request_key() calls by causing * them to return the stored error code (typically ENOKEY) until the negative * key expires. * * If successful, 0 is returned, the authorisation token is revoked and anyone * waiting for the key is woken up. If the key was already instantiated, * -EBUSY will be returned. */ int key_reject_and_link(struct key *key, unsigned timeout, unsigned error, struct key *keyring, struct key *authkey) { struct assoc_array_edit *edit = NULL; int ret, awaken, link_ret = 0; key_check(key); key_check(keyring); awaken = 0; ret = -EBUSY; if (keyring) { if (keyring->restrict_link) return -EPERM; link_ret = __key_link_lock(keyring, &key->index_key); if (link_ret == 0) { link_ret = __key_link_begin(keyring, &key->index_key, &edit); if (link_ret < 0) __key_link_end(keyring, &key->index_key, edit); } } mutex_lock(&key_construction_mutex); /* can't instantiate twice */ if (key->state == KEY_IS_UNINSTANTIATED) { /* mark the key as being negatively instantiated */ atomic_inc(&key->user->nikeys); mark_key_instantiated(key, -error); notify_key(key, NOTIFY_KEY_INSTANTIATED, -error); key_set_expiry(key, ktime_get_real_seconds() + timeout); if (test_and_clear_bit(KEY_FLAG_USER_CONSTRUCT, &key->flags)) awaken = 1; ret = 0; /* and link it into the destination keyring */ if (keyring && link_ret == 0) __key_link(keyring, key, &edit); /* disable the authorisation key */ if (authkey) key_invalidate(authkey); } mutex_unlock(&key_construction_mutex); if (keyring && link_ret == 0) __key_link_end(keyring, &key->index_key, edit); /* wake up anyone waiting for a key to be constructed */ if (awaken) wake_up_bit(&key->flags, KEY_FLAG_USER_CONSTRUCT); return ret == 0 ? 
link_ret : ret; } EXPORT_SYMBOL(key_reject_and_link); /** * key_put - Discard a reference to a key. * @key: The key to discard a reference from. * * Discard a reference to a key, and when all the references are gone, we * schedule the cleanup task to come and pull it out of the tree in process * context at some later time. */ void key_put(struct key *key) { if (key) { key_check(key); if (refcount_dec_and_test(&key->usage)) { unsigned long flags; /* deal with the user's key tracking and quota */ if (test_bit(KEY_FLAG_IN_QUOTA, &key->flags)) { spin_lock_irqsave(&key->user->lock, flags); key->user->qnkeys--; key->user->qnbytes -= key->quotalen; spin_unlock_irqrestore(&key->user->lock, flags); } /* Mark key as safe for GC after key->user done. */ clear_bit_unlock(KEY_FLAG_USER_ALIVE, &key->flags); schedule_work(&key_gc_work); } } } EXPORT_SYMBOL(key_put); /* * Find a key by its serial number. */ struct key *key_lookup(key_serial_t id) { struct rb_node *n; struct key *key; spin_lock(&key_serial_lock); /* search the tree for the specified key */ n = key_serial_tree.rb_node; while (n) { key = rb_entry(n, struct key, serial_node); if (id < key->serial) n = n->rb_left; else if (id > key->serial) n = n->rb_right; else goto found; } not_found: key = ERR_PTR(-ENOKEY); goto error; found: /* A key is allowed to be looked up only if someone still owns a * reference to it - otherwise it's awaiting the gc. */ if (!refcount_inc_not_zero(&key->usage)) goto not_found; error: spin_unlock(&key_serial_lock); return key; } EXPORT_SYMBOL(key_lookup); /* * Find and lock the specified key type against removal. * * We return with the sem read-locked if successful. If the type wasn't * available -ENOKEY is returned instead. */ struct key_type *key_type_lookup(const char *type) { struct key_type *ktype; down_read(&key_types_sem); /* look up the key type to see if it's one of the registered kernel * types */ list_for_each_entry(ktype, &key_types_list, link) { if (strcmp(ktype->name, type) == 0) goto found_kernel_type; } up_read(&key_types_sem); ktype = ERR_PTR(-ENOKEY); found_kernel_type: return ktype; } void key_set_timeout(struct key *key, unsigned timeout) { time64_t expiry = TIME64_MAX; /* make the changes with the locks held to prevent races */ down_write(&key->sem); if (timeout > 0) expiry = ktime_get_real_seconds() + timeout; key_set_expiry(key, expiry); up_write(&key->sem); } EXPORT_SYMBOL_GPL(key_set_timeout); /* * Unlock a key type locked by key_type_lookup(). */ void key_type_put(struct key_type *ktype) { up_read(&key_types_sem); } /* * Attempt to update an existing key. * * The key is given to us with an incremented refcount that we need to discard * if we get an error. */ static inline key_ref_t __key_update(key_ref_t key_ref, struct key_preparsed_payload *prep) { struct key *key = key_ref_to_ptr(key_ref); int ret; /* need write permission on the key to update it */ ret = key_permission(key_ref, KEY_NEED_WRITE); if (ret < 0) goto error; ret = -EEXIST; if (!key->type->update) goto error; down_write(&key->sem); ret = key->type->update(key, prep); if (ret == 0) { /* Updating a negative key positively instantiates it */ mark_key_instantiated(key, 0); notify_key(key, NOTIFY_KEY_UPDATED, 0); } up_write(&key->sem); if (ret < 0) goto error; out: return key_ref; error: key_put(key); key_ref = ERR_PTR(ret); goto out; } /* * Create or potentially update a key. 
The combined logic behind * key_create_or_update() and key_create() */ static key_ref_t __key_create_or_update(key_ref_t keyring_ref, const char *type, const char *description, const void *payload, size_t plen, key_perm_t perm, unsigned long flags, bool allow_update) { struct keyring_index_key index_key = { .description = description, }; struct key_preparsed_payload prep; struct assoc_array_edit *edit = NULL; const struct cred *cred = current_cred(); struct key *keyring, *key = NULL; key_ref_t key_ref; int ret; struct key_restriction *restrict_link = NULL; /* look up the key type to see if it's one of the registered kernel * types */ index_key.type = key_type_lookup(type); if (IS_ERR(index_key.type)) { key_ref = ERR_PTR(-ENODEV); goto error; } key_ref = ERR_PTR(-EINVAL); if (!index_key.type->instantiate || (!index_key.description && !index_key.type->preparse)) goto error_put_type; keyring = key_ref_to_ptr(keyring_ref); key_check(keyring); if (!(flags & KEY_ALLOC_BYPASS_RESTRICTION)) restrict_link = keyring->restrict_link; key_ref = ERR_PTR(-ENOTDIR); if (keyring->type != &key_type_keyring) goto error_put_type; memset(&prep, 0, sizeof(prep)); prep.orig_description = description; prep.data = payload; prep.datalen = plen; prep.quotalen = index_key.type->def_datalen; prep.expiry = TIME64_MAX; if (index_key.type->preparse) { ret = index_key.type->preparse(&prep); if (ret < 0) { key_ref = ERR_PTR(ret); goto error_free_prep; } if (!index_key.description) index_key.description = prep.description; key_ref = ERR_PTR(-EINVAL); if (!index_key.description) goto error_free_prep; } index_key.desc_len = strlen(index_key.description); key_set_index_key(&index_key); ret = __key_link_lock(keyring, &index_key); if (ret < 0) { key_ref = ERR_PTR(ret); goto error_free_prep; } ret = __key_link_begin(keyring, &index_key, &edit); if (ret < 0) { key_ref = ERR_PTR(ret); goto error_link_end; } if (restrict_link && restrict_link->check) { ret = restrict_link->check(keyring, index_key.type, &prep.payload, restrict_link->key); if (ret < 0) { key_ref = ERR_PTR(ret); goto error_link_end; } } /* if we're going to allocate a new key, we're going to have * to modify the keyring */ ret = key_permission(keyring_ref, KEY_NEED_WRITE); if (ret < 0) { key_ref = ERR_PTR(ret); goto error_link_end; } /* if it's requested and possible to update this type of key, search * for an existing key of the same type and description in the * destination keyring and update that instead if possible */ if (allow_update) { if (index_key.type->update) { key_ref = find_key_to_update(keyring_ref, &index_key); if (key_ref) goto found_matching_key; } } else { key_ref = find_key_to_update(keyring_ref, &index_key); if (key_ref) { key_ref_put(key_ref); key_ref = ERR_PTR(-EEXIST); goto error_link_end; } } /* if the client doesn't provide, decide on the permissions we want */ if (perm == KEY_PERM_UNDEF) { perm = KEY_POS_VIEW | KEY_POS_SEARCH | KEY_POS_LINK | KEY_POS_SETATTR; perm |= KEY_USR_VIEW; if (index_key.type->read) perm |= KEY_POS_READ; if (index_key.type == &key_type_keyring || index_key.type->update) perm |= KEY_POS_WRITE; } /* allocate a new key */ key = key_alloc(index_key.type, index_key.description, cred->fsuid, cred->fsgid, cred, perm, flags, NULL); if (IS_ERR(key)) { key_ref = ERR_CAST(key); goto error_link_end; } /* instantiate it and link it into the target keyring */ ret = __key_instantiate_and_link(key, &prep, keyring, NULL, &edit); if (ret < 0) { key_put(key); key_ref = ERR_PTR(ret); goto error_link_end; } 
security_key_post_create_or_update(keyring, key, payload, plen, flags, true); key_ref = make_key_ref(key, is_key_possessed(keyring_ref)); error_link_end: __key_link_end(keyring, &index_key, edit); error_free_prep: if (index_key.type->preparse) index_key.type->free_preparse(&prep); error_put_type: key_type_put(index_key.type); error: return key_ref; found_matching_key: /* we found a matching key, so we're going to try to update it * - we can drop the locks first as we have the key pinned */ __key_link_end(keyring, &index_key, edit); key = key_ref_to_ptr(key_ref); if (test_bit(KEY_FLAG_USER_CONSTRUCT, &key->flags)) { ret = wait_for_key_construction(key, true); if (ret < 0) { key_ref_put(key_ref); key_ref = ERR_PTR(ret); goto error_free_prep; } } key_ref = __key_update(key_ref, &prep); if (!IS_ERR(key_ref)) security_key_post_create_or_update(keyring, key, payload, plen, flags, false); goto error_free_prep; } /** * key_create_or_update - Update or create and instantiate a key. * @keyring_ref: A pointer to the destination keyring with possession flag. * @type: The type of key. * @description: The searchable description for the key. * @payload: The data to use to instantiate or update the key. * @plen: The length of @payload. * @perm: The permissions mask for a new key. * @flags: The quota flags for a new key. * * Search the destination keyring for a key of the same description and if one * is found, update it, otherwise create and instantiate a new one and create a * link to it from that keyring. * * If perm is KEY_PERM_UNDEF then an appropriate key permissions mask will be * concocted. * * Returns a pointer to the new key if successful, -ENODEV if the key type * wasn't available, -ENOTDIR if the keyring wasn't a keyring, -EACCES if the * caller isn't permitted to modify the keyring or the LSM did not permit * creation of the key. * * On success, the possession flag from the keyring ref will be tacked on to * the key ref before it is returned. */ key_ref_t key_create_or_update(key_ref_t keyring_ref, const char *type, const char *description, const void *payload, size_t plen, key_perm_t perm, unsigned long flags) { return __key_create_or_update(keyring_ref, type, description, payload, plen, perm, flags, true); } EXPORT_SYMBOL(key_create_or_update); /** * key_create - Create and instantiate a key. * @keyring_ref: A pointer to the destination keyring with possession flag. * @type: The type of key. * @description: The searchable description for the key. * @payload: The data to use to instantiate or update the key. * @plen: The length of @payload. * @perm: The permissions mask for a new key. * @flags: The quota flags for a new key. * * Create and instantiate a new key and link to it from the destination keyring. * * If perm is KEY_PERM_UNDEF then an appropriate key permissions mask will be * concocted. * * Returns a pointer to the new key if successful, -EEXIST if a key with the * same description already exists, -ENODEV if the key type wasn't available, * -ENOTDIR if the keyring wasn't a keyring, -EACCES if the caller isn't * permitted to modify the keyring or the LSM did not permit creation of the * key. * * On success, the possession flag from the keyring ref will be tacked on to * the key ref before it is returned. 
*/ key_ref_t key_create(key_ref_t keyring_ref, const char *type, const char *description, const void *payload, size_t plen, key_perm_t perm, unsigned long flags) { return __key_create_or_update(keyring_ref, type, description, payload, plen, perm, flags, false); } EXPORT_SYMBOL(key_create); /** * key_update - Update a key's contents. * @key_ref: The pointer (plus possession flag) to the key. * @payload: The data to be used to update the key. * @plen: The length of @payload. * * Attempt to update the contents of a key with the given payload data. The * caller must be granted Write permission on the key. Negative keys can be * instantiated by this method. * * Returns 0 on success, -EACCES if not permitted and -EOPNOTSUPP if the key * type does not support updating. The key type may return other errors. */ int key_update(key_ref_t key_ref, const void *payload, size_t plen) { struct key_preparsed_payload prep; struct key *key = key_ref_to_ptr(key_ref); int ret; key_check(key); /* the key must be writable */ ret = key_permission(key_ref, KEY_NEED_WRITE); if (ret < 0) return ret; /* attempt to update it if supported */ if (!key->type->update) return -EOPNOTSUPP; memset(&prep, 0, sizeof(prep)); prep.data = payload; prep.datalen = plen; prep.quotalen = key->type->def_datalen; prep.expiry = TIME64_MAX; if (key->type->preparse) { ret = key->type->preparse(&prep); if (ret < 0) goto error; } down_write(&key->sem); ret = key->type->update(key, &prep); if (ret == 0) { /* Updating a negative key positively instantiates it */ mark_key_instantiated(key, 0); notify_key(key, NOTIFY_KEY_UPDATED, 0); } up_write(&key->sem); error: if (key->type->preparse) key->type->free_preparse(&prep); return ret; } EXPORT_SYMBOL(key_update); /** * key_revoke - Revoke a key. * @key: The key to be revoked. * * Mark a key as being revoked and ask the type to free up its resources. The * revocation timeout is set and the key and all its links will be * automatically garbage collected after key_gc_delay amount of time if they * are not manually dealt with first. */ void key_revoke(struct key *key) { time64_t time; key_check(key); /* make sure no one's trying to change or use the key when we mark it * - we tell lockdep that we might nest because we might be revoking an * authorisation key whilst holding the sem on a key we've just * instantiated */ down_write_nested(&key->sem, 1); if (!test_and_set_bit(KEY_FLAG_REVOKED, &key->flags)) { notify_key(key, NOTIFY_KEY_REVOKED, 0); if (key->type->revoke) key->type->revoke(key); /* set the death time to no more than the expiry time */ time = ktime_get_real_seconds(); if (key->revoked_at == 0 || key->revoked_at > time) { key->revoked_at = time; key_schedule_gc(key->revoked_at + key_gc_delay); } } up_write(&key->sem); } EXPORT_SYMBOL(key_revoke); /** * key_invalidate - Invalidate a key. * @key: The key to be invalidated. * * Mark a key as being invalidated and have it cleaned up immediately. The key * is ignored by all searches and other operations from this point. */ void key_invalidate(struct key *key) { kenter("%d", key_serial(key)); key_check(key); if (!test_bit(KEY_FLAG_INVALIDATED, &key->flags)) { down_write_nested(&key->sem, 1); if (!test_and_set_bit(KEY_FLAG_INVALIDATED, &key->flags)) { notify_key(key, NOTIFY_KEY_INVALIDATED, 0); key_schedule_gc_links(); } up_write(&key->sem); } } EXPORT_SYMBOL(key_invalidate); /** * generic_key_instantiate - Simple instantiation of a key from preparsed data * @key: The key to be instantiated * @prep: The preparsed data to load. 
* * Instantiate a key from preparsed data. We assume we can just copy the data * in directly and clear the old pointers. * * This can be pointed to directly by the key type instantiate op pointer. */ int generic_key_instantiate(struct key *key, struct key_preparsed_payload *prep) { int ret; pr_devel("==>%s()\n", __func__); ret = key_payload_reserve(key, prep->quotalen); if (ret == 0) { rcu_assign_keypointer(key, prep->payload.data[0]); key->payload.data[1] = prep->payload.data[1]; key->payload.data[2] = prep->payload.data[2]; key->payload.data[3] = prep->payload.data[3]; prep->payload.data[0] = NULL; prep->payload.data[1] = NULL; prep->payload.data[2] = NULL; prep->payload.data[3] = NULL; } pr_devel("<==%s() = %d\n", __func__, ret); return ret; } EXPORT_SYMBOL(generic_key_instantiate); /** * register_key_type - Register a type of key. * @ktype: The new key type. * * Register a new key type. * * Returns 0 on success or -EEXIST if a type of this name already exists. */ int register_key_type(struct key_type *ktype) { struct key_type *p; int ret; memset(&ktype->lock_class, 0, sizeof(ktype->lock_class)); ret = -EEXIST; down_write(&key_types_sem); /* disallow key types with the same name */ list_for_each_entry(p, &key_types_list, link) { if (strcmp(p->name, ktype->name) == 0) goto out; } /* store the type */ list_add(&ktype->link, &key_types_list); pr_notice("Key type %s registered\n", ktype->name); ret = 0; out: up_write(&key_types_sem); return ret; } EXPORT_SYMBOL(register_key_type); /** * unregister_key_type - Unregister a type of key. * @ktype: The key type. * * Unregister a key type and mark all the extant keys of this type as dead. * Those keys of this type are then destroyed to get rid of their payloads and * they and their links will be garbage collected as soon as possible. */ void unregister_key_type(struct key_type *ktype) { down_write(&key_types_sem); list_del_init(&ktype->link); downgrade_write(&key_types_sem); key_gc_keytype(ktype); pr_notice("Key type %s unregistered\n", ktype->name); up_read(&key_types_sem); } EXPORT_SYMBOL(unregister_key_type); /* * Initialise the key management state. */ void __init key_init(void) { /* allocate a slab in which we can store keys */ key_jar = kmem_cache_create("key_jar", sizeof(struct key), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); /* add the special key types */ list_add_tail(&key_type_keyring.link, &key_types_list); list_add_tail(&key_type_dead.link, &key_types_list); list_add_tail(&key_type_user.link, &key_types_list); list_add_tail(&key_type_logon.link, &key_types_list); /* record the root user tracking */ rb_link_node(&root_key_user.node, NULL, &key_user_tree.rb_node); rb_insert_color(&root_key_user.node, &key_user_tree); } |
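/*
 * A minimal, hypothetical sketch of how the API above is typically consumed:
 * a new key type is declared and registered with register_key_type(), and its
 * payload handling is delegated to generic_key_instantiate().  The "example"
 * type name, the 4 KiB payload cap and the example_* helpers are invented for
 * illustration only; the entry points and the key_preparsed_payload fields
 * used here are the ones documented above.
 */
#include <linux/module.h>
#include <linux/key-type.h>
#include <linux/seq_file.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/errno.h>

static int example_preparse(struct key_preparsed_payload *prep)
{
	/* Reject empty or oversized payloads (arbitrary 4 KiB cap). */
	if (!prep->data || prep->datalen == 0 || prep->datalen > 4096)
		return -EINVAL;

	/* Stash a copy; generic_key_instantiate() moves it into the key. */
	prep->payload.data[0] = kmemdup(prep->data, prep->datalen, GFP_KERNEL);
	if (!prep->payload.data[0])
		return -ENOMEM;

	/* Charge the real payload size against the owner's quota. */
	prep->quotalen = prep->datalen;
	return 0;
}

static void example_free_preparse(struct key_preparsed_payload *prep)
{
	/* NULL after a successful instantiation; the copy on error paths. */
	kfree(prep->payload.data[0]);
}

static void example_destroy(struct key *key)
{
	kfree(key->payload.data[0]);
}

static void example_describe(const struct key *key, struct seq_file *m)
{
	seq_puts(m, key->description);
}

static struct key_type key_type_example = {
	.name		= "example",
	.preparse	= example_preparse,
	.free_preparse	= example_free_preparse,
	.instantiate	= generic_key_instantiate,
	.destroy	= example_destroy,
	.describe	= example_describe,
};

static int __init example_key_init(void)
{
	return register_key_type(&key_type_example);
}

static void __exit example_key_exit(void)
{
	unregister_key_type(&key_type_example);
}

module_init(example_key_init);
module_exit(example_key_exit);
MODULE_LICENSE("GPL");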
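/*
 * A companion, caller-side sketch (also hypothetical): given a keyring the
 * caller already holds a reference to, key_create_or_update() either updates
 * a matching key or allocates, instantiates and links a new one, following
 * the return conventions documented above.  The function name, description
 * string and permission mask are placeholders; "user" is the key type
 * registered by key_init().
 */
#include <linux/key.h>
#include <linux/err.h>

static int example_store_blob(struct key *keyring, const void *blob, size_t len)
{
	key_ref_t kref;

	/* Treat the keyring as possessed so the KEY_POS_* bits apply. */
	kref = key_create_or_update(make_key_ref(keyring, true),
				    "user", "example:blob", blob, len,
				    KEY_POS_ALL | KEY_USR_VIEW | KEY_USR_READ,
				    KEY_ALLOC_IN_QUOTA);
	if (IS_ERR(kref))
		return PTR_ERR(kref);

	/* Drop our reference; the link from the keyring keeps the key alive. */
	key_ref_put(kref);
	return 0;
}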
// SPDX-License-Identifier: GPL-2.0-only /* * Memory merging support. * * This code enables dynamic sharing of identical pages found in different * memory areas, even if they are not shared by fork() * * Copyright (C) 2008-2009 Red Hat, Inc. * Authors: * Izik Eidus * Andrea Arcangeli * Chris Wright * Hugh Dickins */ #include <linux/errno.h> #include <linux/mm.h> #include <linux/mm_inline.h> #include <linux/fs.h> #include <linux/mman.h> #include <linux/sched.h> #include <linux/sched/mm.h> #include <linux/sched/cputime.h> #include <linux/rwsem.h> #include <linux/pagemap.h> #include <linux/rmap.h> #include <linux/spinlock.h> #include <linux/xxhash.h> #include <linux/delay.h> #include <linux/kthread.h> #include <linux/wait.h> #include <linux/slab.h> #include <linux/rbtree.h> #include <linux/memory.h> #include <linux/mmu_notifier.h> #include <linux/swap.h> #include <linux/ksm.h> #include <linux/hashtable.h> #include <linux/freezer.h> #include <linux/oom.h> #include <linux/numa.h> #include <linux/pagewalk.h> #include <asm/tlbflush.h> #include "internal.h" #include "mm_slot.h" #define CREATE_TRACE_POINTS #include <trace/events/ksm.h> #ifdef CONFIG_NUMA #define NUMA(x) (x) #define DO_NUMA(x) do { (x); } while (0) #else #define NUMA(x) (0) #define DO_NUMA(x) do { } while (0) #endif typedef u8 rmap_age_t; /** * DOC: Overview * * A few notes about the KSM scanning process, * to make it easier to understand the data structures below: * * In order to reduce excessive scanning, KSM sorts the memory pages by their * contents into a data structure that holds pointers to the pages' locations. * * Since the contents of the pages may change at any moment, KSM cannot just * insert the pages into a normal sorted tree and expect it to find anything. * Therefore KSM uses two data structures - the stable and the unstable tree. * * The stable tree holds pointers to all the merged pages (ksm pages), sorted * by their contents. Because each such page is write-protected, searching on * this tree is fully assured to be working (except when pages are unmapped), * and therefore this tree is called the stable tree. * * The stable tree node includes information required for reverse * mapping from a KSM page to virtual addresses that map this page.
* * In order to avoid large latencies of the rmap walks on KSM pages, * KSM maintains two types of nodes in the stable tree: * * * the regular nodes that keep the reverse mapping structures in a * linked list * * the "chains" that link nodes ("dups") that represent the same * write protected memory content, but each "dup" corresponds to a * different KSM page copy of that content * * Internally, the regular nodes, "dups" and "chains" are represented * using the same struct ksm_stable_node structure. * * In addition to the stable tree, KSM uses a second data structure called the * unstable tree: this tree holds pointers to pages which have been found to * be "unchanged for a period of time". The unstable tree sorts these pages * by their contents, but since they are not write-protected, KSM cannot rely * upon the unstable tree to work correctly - the unstable tree is liable to * be corrupted as its contents are modified, and so it is called unstable. * * KSM solves this problem by several techniques: * * 1) The unstable tree is flushed every time KSM completes scanning all * memory areas, and then the tree is rebuilt again from the beginning. * 2) KSM will only insert into the unstable tree, pages whose hash value * has not changed since the previous scan of all memory areas. * 3) The unstable tree is a RedBlack Tree - so its balancing is based on the * colors of the nodes and not on their contents, assuring that even when * the tree gets "corrupted" it won't get out of balance, so scanning time * remains the same (also, searching and inserting nodes in an rbtree uses * the same algorithm, so we have no overhead when we flush and rebuild). * 4) KSM never flushes the stable tree, which means that even if it were to * take 10 attempts to find a page in the unstable tree, once it is found, * it is secured in the stable tree. (When we scan a new page, we first * compare it against the stable tree, and then against the unstable tree.) * * If the merge_across_nodes tunable is unset, then KSM maintains multiple * stable trees and multiple unstable trees: one of each for each NUMA node. */ /** * struct ksm_mm_slot - ksm information per mm that is being scanned * @slot: hash lookup from mm to mm_slot * @rmap_list: head for this mm_slot's singly-linked list of rmap_items */ struct ksm_mm_slot { struct mm_slot slot; struct ksm_rmap_item *rmap_list; }; /** * struct ksm_scan - cursor for scanning * @mm_slot: the current mm_slot we are scanning * @address: the next address inside that to be scanned * @rmap_list: link to the next rmap to be scanned in the rmap_list * @seqnr: count of completed full scans (needed when removing unstable node) * * There is only the one ksm_scan instance of this cursor structure. 
*/ struct ksm_scan { struct ksm_mm_slot *mm_slot; unsigned long address; struct ksm_rmap_item **rmap_list; unsigned long seqnr; }; /** * struct ksm_stable_node - node of the stable rbtree * @node: rb node of this ksm page in the stable tree * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list * @hlist_dup: linked into the stable_node->hlist with a stable_node chain * @list: linked into migrate_nodes, pending placement in the proper node tree * @hlist: hlist head of rmap_items using this ksm page * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid) * @chain_prune_time: time of the last full garbage collection * @rmap_hlist_len: number of rmap_item entries in hlist or STABLE_NODE_CHAIN * @nid: NUMA node id of stable tree in which linked (may not match kpfn) */ struct ksm_stable_node { union { struct rb_node node; /* when node of stable tree */ struct { /* when listed for migration */ struct list_head *head; struct { struct hlist_node hlist_dup; struct list_head list; }; }; }; struct hlist_head hlist; union { unsigned long kpfn; unsigned long chain_prune_time; }; /* * STABLE_NODE_CHAIN can be any negative number in * rmap_hlist_len negative range, but better not -1 to be able * to reliably detect underflows. */ #define STABLE_NODE_CHAIN -1024 int rmap_hlist_len; #ifdef CONFIG_NUMA int nid; #endif }; /** * struct ksm_rmap_item - reverse mapping item for virtual addresses * @rmap_list: next rmap_item in mm_slot's singly-linked rmap_list * @anon_vma: pointer to anon_vma for this mm,address, when in stable tree * @nid: NUMA node id of unstable tree in which linked (may not match page) * @mm: the memory structure this rmap_item is pointing into * @address: the virtual address this rmap_item tracks (+ flags in low bits) * @oldchecksum: previous checksum of the page at that virtual address * @node: rb node of this rmap_item in the unstable tree * @head: pointer to stable_node heading this list in the stable tree * @hlist: link into hlist of rmap_items hanging off that stable_node * @age: number of scan iterations since creation * @remaining_skips: how many scans to skip */ struct ksm_rmap_item { struct ksm_rmap_item *rmap_list; union { struct anon_vma *anon_vma; /* when stable */ #ifdef CONFIG_NUMA int nid; /* when node of unstable tree */ #endif }; struct mm_struct *mm; unsigned long address; /* + low bits used for flags below */ unsigned int oldchecksum; /* when unstable */ rmap_age_t age; rmap_age_t remaining_skips; union { struct rb_node node; /* when node of unstable tree */ struct { /* when listed from stable tree */ struct ksm_stable_node *head; struct hlist_node hlist; }; }; }; #define SEQNR_MASK 0x0ff /* low bits of unstable tree seqnr */ #define UNSTABLE_FLAG 0x100 /* is a node of the unstable tree */ #define STABLE_FLAG 0x200 /* is listed from the stable tree */ /* The stable and unstable tree heads */ static struct rb_root one_stable_tree[1] = { RB_ROOT }; static struct rb_root one_unstable_tree[1] = { RB_ROOT }; static struct rb_root *root_stable_tree = one_stable_tree; static struct rb_root *root_unstable_tree = one_unstable_tree; /* Recently migrated nodes of stable tree, pending proper placement */ static LIST_HEAD(migrate_nodes); #define STABLE_NODE_DUP_HEAD ((struct list_head *)&migrate_nodes.prev) #define MM_SLOTS_HASH_BITS 10 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); static struct ksm_mm_slot ksm_mm_head = { .slot.mm_node = LIST_HEAD_INIT(ksm_mm_head.slot.mm_node), }; static struct ksm_scan ksm_scan = { 
.mm_slot = &ksm_mm_head, }; static struct kmem_cache *rmap_item_cache; static struct kmem_cache *stable_node_cache; static struct kmem_cache *mm_slot_cache; /* Default number of pages to scan per batch */ #define DEFAULT_PAGES_TO_SCAN 100 /* The number of pages scanned */ static unsigned long ksm_pages_scanned; /* The number of nodes in the stable tree */ static unsigned long ksm_pages_shared; /* The number of page slots additionally sharing those nodes */ static unsigned long ksm_pages_sharing; /* The number of nodes in the unstable tree */ static unsigned long ksm_pages_unshared; /* The number of rmap_items in use: to calculate pages_volatile */ static unsigned long ksm_rmap_items; /* The number of stable_node chains */ static unsigned long ksm_stable_node_chains; /* The number of stable_node dups linked to the stable_node chains */ static unsigned long ksm_stable_node_dups; /* Delay in pruning stale stable_node_dups in the stable_node_chains */ static unsigned int ksm_stable_node_chains_prune_millisecs = 2000; /* Maximum number of page slots sharing a stable node */ static int ksm_max_page_sharing = 256; /* Number of pages ksmd should scan in one batch */ static unsigned int ksm_thread_pages_to_scan = DEFAULT_PAGES_TO_SCAN; /* Milliseconds ksmd should sleep between batches */ static unsigned int ksm_thread_sleep_millisecs = 20; /* Checksum of an empty (zeroed) page */ static unsigned int zero_checksum __read_mostly; /* Whether to merge empty (zeroed) pages with actual zero pages */ static bool ksm_use_zero_pages __read_mostly; /* Skip pages that couldn't be de-duplicated previously */ /* Default to true at least temporarily, for testing */ static bool ksm_smart_scan = true; /* The number of zero pages which is placed by KSM */ atomic_long_t ksm_zero_pages = ATOMIC_LONG_INIT(0); /* The number of pages that have been skipped due to "smart scanning" */ static unsigned long ksm_pages_skipped; /* Don't scan more than max pages per batch. */ static unsigned long ksm_advisor_max_pages_to_scan = 30000; /* Min CPU for scanning pages per scan */ #define KSM_ADVISOR_MIN_CPU 10 /* Max CPU for scanning pages per scan */ static unsigned int ksm_advisor_max_cpu = 70; /* Target scan time in seconds to analyze all KSM candidate pages. */ static unsigned long ksm_advisor_target_scan_time = 200; /* Exponentially weighted moving average. */ #define EWMA_WEIGHT 30 /** * struct advisor_ctx - metadata for KSM advisor * @start_scan: start time of the current scan * @scan_time: scan time of previous scan * @change: change in percent to pages_to_scan parameter * @cpu_time: cpu time consumed by the ksmd thread in the previous scan */ struct advisor_ctx { ktime_t start_scan; unsigned long scan_time; unsigned long change; unsigned long long cpu_time; }; static struct advisor_ctx advisor_ctx; /* Define different advisor's */ enum ksm_advisor_type { KSM_ADVISOR_NONE, KSM_ADVISOR_SCAN_TIME, }; static enum ksm_advisor_type ksm_advisor; #ifdef CONFIG_SYSFS /* * Only called through the sysfs control interface: */ /* At least scan this many pages per batch. 
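* Used as the starting pages_to_scan value whenever the scan time advisor is * enabled; see set_advisor_defaults() below.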
*/ static unsigned long ksm_advisor_min_pages_to_scan = 500; static void set_advisor_defaults(void) { if (ksm_advisor == KSM_ADVISOR_NONE) { ksm_thread_pages_to_scan = DEFAULT_PAGES_TO_SCAN; } else if (ksm_advisor == KSM_ADVISOR_SCAN_TIME) { advisor_ctx = (const struct advisor_ctx){ 0 }; ksm_thread_pages_to_scan = ksm_advisor_min_pages_to_scan; } } #endif /* CONFIG_SYSFS */ static inline void advisor_start_scan(void) { if (ksm_advisor == KSM_ADVISOR_SCAN_TIME) advisor_ctx.start_scan = ktime_get(); } /* * Use previous scan time if available, otherwise use current scan time as an * approximation for the previous scan time. */ static inline unsigned long prev_scan_time(struct advisor_ctx *ctx, unsigned long scan_time) { return ctx->scan_time ? ctx->scan_time : scan_time; } /* Calculate exponential weighted moving average */ static unsigned long ewma(unsigned long prev, unsigned long curr) { return ((100 - EWMA_WEIGHT) * prev + EWMA_WEIGHT * curr) / 100; } /* * The scan time advisor is based on the current scan rate and the target * scan rate. * * new_pages_to_scan = pages_to_scan * (scan_time / target_scan_time) * * To avoid perturbations it calculates a change factor of previous changes. * A new change factor is calculated for each iteration and it uses an * exponentially weighted moving average. The new pages_to_scan value is * multiplied with that change factor: * * new_pages_to_scan *= change factor * * The new_pages_to_scan value is limited by the cpu min and max values. It * calculates the cpu percent for the last scan and calculates the new * estimated cpu percent cost for the next scan. That value is capped by the * cpu min and max setting. * * In addition the new pages_to_scan value is capped by the max and min * limits. */ static void scan_time_advisor(void) { unsigned int cpu_percent; unsigned long cpu_time; unsigned long cpu_time_diff; unsigned long cpu_time_diff_ms; unsigned long pages; unsigned long per_page_cost; unsigned long factor; unsigned long change; unsigned long last_scan_time; unsigned long scan_time; /* Convert scan time to seconds */ scan_time = div_s64(ktime_ms_delta(ktime_get(), advisor_ctx.start_scan), MSEC_PER_SEC); scan_time = scan_time ? scan_time : 1; /* Calculate CPU consumption of ksmd background thread */ cpu_time = task_sched_runtime(current); cpu_time_diff = cpu_time - advisor_ctx.cpu_time; cpu_time_diff_ms = cpu_time_diff / 1000 / 1000; cpu_percent = (cpu_time_diff_ms * 100) / (scan_time * 1000); cpu_percent = cpu_percent ? cpu_percent : 1; last_scan_time = prev_scan_time(&advisor_ctx, scan_time); /* Calculate scan time as percentage of target scan time */ factor = ksm_advisor_target_scan_time * 100 / scan_time; factor = factor ? factor : 1; /* * Calculate scan time as percentage of last scan time and use * exponentially weighted average to smooth it */ change = scan_time * 100 / last_scan_time; change = change ? change : 1; change = ewma(advisor_ctx.change, change); /* Calculate new scan rate based on target scan rate. */ pages = ksm_thread_pages_to_scan * 100 / factor; /* Update pages_to_scan by weighted change percentage. */ pages = pages * change / 100; /* Cap new pages_to_scan value */ per_page_cost = ksm_thread_pages_to_scan / cpu_percent; per_page_cost = per_page_cost ? 
per_page_cost : 1; pages = min(pages, per_page_cost * ksm_advisor_max_cpu); pages = max(pages, per_page_cost * KSM_ADVISOR_MIN_CPU); pages = min(pages, ksm_advisor_max_pages_to_scan); /* Update advisor context */ advisor_ctx.change = change; advisor_ctx.scan_time = scan_time; advisor_ctx.cpu_time = cpu_time; ksm_thread_pages_to_scan = pages; trace_ksm_advisor(scan_time, pages, cpu_percent); } static void advisor_stop_scan(void) { if (ksm_advisor == KSM_ADVISOR_SCAN_TIME) scan_time_advisor(); } #ifdef CONFIG_NUMA /* Zeroed when merging across nodes is not allowed */ static unsigned int ksm_merge_across_nodes = 1; static int ksm_nr_node_ids = 1; #else #define ksm_merge_across_nodes 1U #define ksm_nr_node_ids 1 #endif #define KSM_RUN_STOP 0 #define KSM_RUN_MERGE 1 #define KSM_RUN_UNMERGE 2 #define KSM_RUN_OFFLINE 4 static unsigned long ksm_run = KSM_RUN_STOP; static void wait_while_offlining(void); static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait); static DECLARE_WAIT_QUEUE_HEAD(ksm_iter_wait); static DEFINE_MUTEX(ksm_thread_mutex); static DEFINE_SPINLOCK(ksm_mmlist_lock); static int __init ksm_slab_init(void) { rmap_item_cache = KMEM_CACHE(ksm_rmap_item, 0); if (!rmap_item_cache) goto out; stable_node_cache = KMEM_CACHE(ksm_stable_node, 0); if (!stable_node_cache) goto out_free1; mm_slot_cache = KMEM_CACHE(ksm_mm_slot, 0); if (!mm_slot_cache) goto out_free2; return 0; out_free2: kmem_cache_destroy(stable_node_cache); out_free1: kmem_cache_destroy(rmap_item_cache); out: return -ENOMEM; } static void __init ksm_slab_free(void) { kmem_cache_destroy(mm_slot_cache); kmem_cache_destroy(stable_node_cache); kmem_cache_destroy(rmap_item_cache); mm_slot_cache = NULL; } static __always_inline bool is_stable_node_chain(struct ksm_stable_node *chain) { return chain->rmap_hlist_len == STABLE_NODE_CHAIN; } static __always_inline bool is_stable_node_dup(struct ksm_stable_node *dup) { return dup->head == STABLE_NODE_DUP_HEAD; } static inline void stable_node_chain_add_dup(struct ksm_stable_node *dup, struct ksm_stable_node *chain) { VM_BUG_ON(is_stable_node_dup(dup)); dup->head = STABLE_NODE_DUP_HEAD; VM_BUG_ON(!is_stable_node_chain(chain)); hlist_add_head(&dup->hlist_dup, &chain->hlist); ksm_stable_node_dups++; } static inline void __stable_node_dup_del(struct ksm_stable_node *dup) { VM_BUG_ON(!is_stable_node_dup(dup)); hlist_del(&dup->hlist_dup); ksm_stable_node_dups--; } static inline void stable_node_dup_del(struct ksm_stable_node *dup) { VM_BUG_ON(is_stable_node_chain(dup)); if (is_stable_node_dup(dup)) __stable_node_dup_del(dup); else rb_erase(&dup->node, root_stable_tree + NUMA(dup->nid)); #ifdef CONFIG_DEBUG_VM dup->head = NULL; #endif } static inline struct ksm_rmap_item *alloc_rmap_item(void) { struct ksm_rmap_item *rmap_item; rmap_item = kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN); if (rmap_item) ksm_rmap_items++; return rmap_item; } static inline void free_rmap_item(struct ksm_rmap_item *rmap_item) { ksm_rmap_items--; rmap_item->mm->ksm_rmap_items--; rmap_item->mm = NULL; /* debug safety */ kmem_cache_free(rmap_item_cache, rmap_item); } static inline struct ksm_stable_node *alloc_stable_node(void) { /* * The allocation can take too long with GFP_KERNEL when memory is under * pressure, which may lead to hung task warnings. Adding __GFP_HIGH * grants access to memory reserves, helping to avoid this problem. 
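* Callers are prepared for failure anyway (stable_tree_insert() and * alloc_stable_node_chain() simply bail out when this returns NULL), so no * retry loop is needed here.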
*/ return kmem_cache_alloc(stable_node_cache, GFP_KERNEL | __GFP_HIGH); } static inline void free_stable_node(struct ksm_stable_node *stable_node) { VM_BUG_ON(stable_node->rmap_hlist_len && !is_stable_node_chain(stable_node)); kmem_cache_free(stable_node_cache, stable_node); } /* * ksmd, and unmerge_and_remove_all_rmap_items(), must not touch an mm's * page tables after it has passed through ksm_exit() - which, if necessary, * takes mmap_lock briefly to serialize against them. ksm_exit() does not set * a special flag: they can just back out as soon as mm_users goes to zero. * ksm_test_exit() is used throughout to make this test for exit: in some * places for correctness, in some places just to avoid unnecessary work. */ static inline bool ksm_test_exit(struct mm_struct *mm) { return atomic_read(&mm->mm_users) == 0; } static int break_ksm_pmd_entry(pmd_t *pmdp, unsigned long addr, unsigned long end, struct mm_walk *walk) { unsigned long *found_addr = (unsigned long *) walk->private; struct mm_struct *mm = walk->mm; pte_t *start_ptep, *ptep; spinlock_t *ptl; int found = 0; if (ksm_test_exit(walk->mm)) return 0; if (signal_pending(current)) return -ERESTARTSYS; start_ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); if (!start_ptep) return 0; for (ptep = start_ptep; addr < end; ptep++, addr += PAGE_SIZE) { pte_t pte = ptep_get(ptep); struct folio *folio = NULL; if (pte_present(pte)) { folio = vm_normal_folio(walk->vma, addr, pte); } else if (!pte_none(pte)) { const softleaf_t entry = softleaf_from_pte(pte); /* * As KSM pages remain KSM pages until freed, no need to wait * here for migration to end. */ if (softleaf_is_migration(entry)) folio = softleaf_to_folio(entry); } /* return 1 if the page is an normal ksm page or KSM-placed zero page */ found = (folio && folio_test_ksm(folio)) || (pte_present(pte) && is_ksm_zero_pte(pte)); if (found) { *found_addr = addr; goto out_unlock; } } out_unlock: pte_unmap_unlock(start_ptep, ptl); return found; } static const struct mm_walk_ops break_ksm_ops = { .pmd_entry = break_ksm_pmd_entry, .walk_lock = PGWALK_RDLOCK, }; static const struct mm_walk_ops break_ksm_lock_vma_ops = { .pmd_entry = break_ksm_pmd_entry, .walk_lock = PGWALK_WRLOCK, }; /* * Though it's very tempting to unmerge rmap_items from stable tree rather * than check every pte of a given vma, the locking doesn't quite work for * that - an rmap_item is assigned to the stable tree after inserting ksm * page and upping mmap_lock. Nor does it fit with the way we skip dup'ing * rmap_items from parent to child at fork time (so as not to waste time * if exit comes before the next scan reaches it). * * Similarly, although we'd like to remove rmap_items (so updating counts * and freeing memory) when unmerging an area, it's easier to leave that * to the next pass of ksmd - consider, for example, how ksmd might be * in cmp_and_merge_page on one of the rmap_items we would be removing. * * We use break_ksm to break COW on a ksm page by triggering unsharing, * such that the ksm page will get replaced by an exclusive anonymous page. * * We take great care only to touch a ksm page, in a VM_MERGEABLE vma, * in case the application has unmapped and remapped mm,addr meanwhile. * Could a ksm page appear anywhere else? Actually yes, in a VM_PFNMAP * mmap of /dev/mem, where we would not want to touch it. * * FAULT_FLAG_REMOTE/FOLL_REMOTE are because we do this outside the context * of the process that owns 'vma'. We also do not want to enforce * protection keys here anyway. 
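* * The loop in break_ksm() below alternates between walking the range for the * next KSM pte and faulting it with FAULT_FLAG_UNSHARE, until no KSM page is * left in the range or the fault fails.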
*/ static int break_ksm(struct vm_area_struct *vma, unsigned long addr, unsigned long end, bool lock_vma) { vm_fault_t ret = 0; const struct mm_walk_ops *ops = lock_vma ? &break_ksm_lock_vma_ops : &break_ksm_ops; do { int ksm_page; cond_resched(); ksm_page = walk_page_range_vma(vma, addr, end, ops, &addr); if (ksm_page <= 0) return ksm_page; ret = handle_mm_fault(vma, addr, FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE, NULL); } while (!(ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | VM_FAULT_OOM))); /* * We must loop until we no longer find a KSM page because * handle_mm_fault() may back out if there's any difficulty e.g. if * pte accessed bit gets updated concurrently. * * VM_FAULT_SIGBUS could occur if we race with truncation of the * backing file, which also invalidates anonymous pages: that's * okay, that truncation will have unmapped the KSM page for us. * * VM_FAULT_OOM: at the time of writing (late July 2009), setting * aside mem_cgroup limits, VM_FAULT_OOM would only be set if the * current task has TIF_MEMDIE set, and will be OOM killed on return * to user; and ksmd, having no mm, would never be chosen for that. * * But if the mm is in a limited mem_cgroup, then the fault may fail * with VM_FAULT_OOM even if the current task is not TIF_MEMDIE; and * even ksmd can fail in this way - though it's usually breaking ksm * just to undo a merge it made a moment before, so unlikely to oom. * * That's a pity: we might therefore have more kernel pages allocated * than we're counting as nodes in the stable tree; but ksm_do_scan * will retry to break_cow on each pass, so should recover the page * in due course. The important thing is to not let VM_MERGEABLE * be cleared while any such pages might remain in the area. */ return (ret & VM_FAULT_OOM) ? -ENOMEM : 0; } static bool ksm_compatible(const struct file *file, vm_flags_t vm_flags) { if (vm_flags & (VM_SHARED | VM_MAYSHARE | VM_SPECIAL | VM_HUGETLB | VM_DROPPABLE)) return false; /* just ignore the advice */ if (file_is_dax(file)) return false; #ifdef VM_SAO if (vm_flags & VM_SAO) return false; #endif #ifdef VM_SPARC_ADI if (vm_flags & VM_SPARC_ADI) return false; #endif return true; } static bool vma_ksm_compatible(struct vm_area_struct *vma) { return ksm_compatible(vma->vm_file, vma->vm_flags); } static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm, unsigned long addr) { struct vm_area_struct *vma; if (ksm_test_exit(mm)) return NULL; vma = vma_lookup(mm, addr); if (!vma || !(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma) return NULL; return vma; } static void break_cow(struct ksm_rmap_item *rmap_item) { struct mm_struct *mm = rmap_item->mm; unsigned long addr = rmap_item->address; struct vm_area_struct *vma; /* * It is not an accident that whenever we want to break COW * to undo, we also need to drop a reference to the anon_vma. 
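* (That reference was taken in try_to_merge_with_ksm_page() when this * rmap_item was moved to the stable tree; undoing the merge must release it.)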
*/ put_anon_vma(rmap_item->anon_vma); mmap_read_lock(mm); vma = find_mergeable_vma(mm, addr); if (vma) break_ksm(vma, addr, addr + PAGE_SIZE, false); mmap_read_unlock(mm); } static struct page *get_mergeable_page(struct ksm_rmap_item *rmap_item) { struct mm_struct *mm = rmap_item->mm; unsigned long addr = rmap_item->address; struct vm_area_struct *vma; struct page *page = NULL; struct folio_walk fw; struct folio *folio; mmap_read_lock(mm); vma = find_mergeable_vma(mm, addr); if (!vma) goto out; folio = folio_walk_start(&fw, vma, addr, 0); if (folio) { if (!folio_is_zone_device(folio) && folio_test_anon(folio)) { folio_get(folio); page = fw.page; } folio_walk_end(&fw, vma); } out: if (page) { flush_anon_page(vma, page, addr); flush_dcache_page(page); } mmap_read_unlock(mm); return page; } /* * This helper is used for getting right index into array of tree roots. * When merge_across_nodes knob is set to 1, there are only two rb-trees for * stable and unstable pages from all nodes with roots in index 0. Otherwise, * every node has its own stable and unstable tree. */ static inline int get_kpfn_nid(unsigned long kpfn) { return ksm_merge_across_nodes ? 0 : NUMA(pfn_to_nid(kpfn)); } static struct ksm_stable_node *alloc_stable_node_chain(struct ksm_stable_node *dup, struct rb_root *root) { struct ksm_stable_node *chain = alloc_stable_node(); VM_BUG_ON(is_stable_node_chain(dup)); if (likely(chain)) { INIT_HLIST_HEAD(&chain->hlist); chain->chain_prune_time = jiffies; chain->rmap_hlist_len = STABLE_NODE_CHAIN; #if defined (CONFIG_DEBUG_VM) && defined(CONFIG_NUMA) chain->nid = NUMA_NO_NODE; /* debug */ #endif ksm_stable_node_chains++; /* * Put the stable node chain in the first dimension of * the stable tree and at the same time remove the old * stable node. */ rb_replace_node(&dup->node, &chain->node, root); /* * Move the old stable node to the second dimension * queued in the hlist_dup. The invariant is that all * dup stable_nodes in the chain->hlist point to pages * that are write protected and have the exact same * content. */ stable_node_chain_add_dup(dup, chain); } return chain; } static inline void free_stable_node_chain(struct ksm_stable_node *chain, struct rb_root *root) { rb_erase(&chain->node, root); free_stable_node(chain); ksm_stable_node_chains--; } static void remove_node_from_stable_tree(struct ksm_stable_node *stable_node) { struct ksm_rmap_item *rmap_item; /* check it's not STABLE_NODE_CHAIN or negative */ BUG_ON(stable_node->rmap_hlist_len < 0); hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) { if (rmap_item->hlist.next) { ksm_pages_sharing--; trace_ksm_remove_rmap_item(stable_node->kpfn, rmap_item, rmap_item->mm); } else { ksm_pages_shared--; } rmap_item->mm->ksm_merging_pages--; VM_BUG_ON(stable_node->rmap_hlist_len <= 0); stable_node->rmap_hlist_len--; put_anon_vma(rmap_item->anon_vma); rmap_item->address &= PAGE_MASK; cond_resched(); } /* * We need the second aligned pointer of the migrate_nodes * list_head to stay clear from the rb_parent_color union * (aligned and different than any node) and also different * from &migrate_nodes. This will verify that future list.h changes * don't break STABLE_NODE_DUP_HEAD. Only recent gcc can handle it. 
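* (STABLE_NODE_DUP_HEAD is the address of migrate_nodes.prev, the second * pointer inside the migrate_nodes list_head, which is how it stays distinct * from &migrate_nodes itself and from any real list node.)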
*/ BUILD_BUG_ON(STABLE_NODE_DUP_HEAD <= &migrate_nodes); BUILD_BUG_ON(STABLE_NODE_DUP_HEAD >= &migrate_nodes + 1); trace_ksm_remove_ksm_page(stable_node->kpfn); if (stable_node->head == &migrate_nodes) list_del(&stable_node->list); else stable_node_dup_del(stable_node); free_stable_node(stable_node); } enum ksm_get_folio_flags { KSM_GET_FOLIO_NOLOCK, KSM_GET_FOLIO_LOCK, KSM_GET_FOLIO_TRYLOCK }; /* * ksm_get_folio: checks if the page indicated by the stable node * is still its ksm page, despite having held no reference to it. * In which case we can trust the content of the page, and it * returns the gotten page; but if the page has now been zapped, * remove the stale node from the stable tree and return NULL. * But beware, the stable node's page might be being migrated. * * You would expect the stable_node to hold a reference to the ksm page. * But if it increments the page's count, swapping out has to wait for * ksmd to come around again before it can free the page, which may take * seconds or even minutes: much too unresponsive. So instead we use a * "keyhole reference": access to the ksm page from the stable node peeps * out through its keyhole to see if that page still holds the right key, * pointing back to this stable node. This relies on freeing a PageAnon * page to reset its page->mapping to NULL, and relies on no other use of * a page to put something that might look like our key in page->mapping. * is on its way to being freed; but it is an anomaly to bear in mind. */ static struct folio *ksm_get_folio(struct ksm_stable_node *stable_node, enum ksm_get_folio_flags flags) { struct folio *folio; void *expected_mapping; unsigned long kpfn; expected_mapping = (void *)((unsigned long)stable_node | FOLIO_MAPPING_KSM); again: kpfn = READ_ONCE(stable_node->kpfn); /* Address dependency. */ folio = pfn_folio(kpfn); if (READ_ONCE(folio->mapping) != expected_mapping) goto stale; /* * We cannot do anything with the page while its refcount is 0. * Usually 0 means free, or tail of a higher-order page: in which * case this node is no longer referenced, and should be freed; * however, it might mean that the page is under page_ref_freeze(). * The __remove_mapping() case is easy, again the node is now stale; * the same is in reuse_ksm_page() case; but if page is swapcache * in folio_migrate_mapping(), it might still be our page, * in which case it's essential to keep the node. */ while (!folio_try_get(folio)) { /* * Another check for folio->mapping != expected_mapping * would work here too. We have chosen to test the * swapcache flag to optimize the common case, when the * folio is or is about to be freed: the swapcache flag * is cleared (under spin_lock_irq) in the ref_freeze * section of __remove_mapping(); but anon folio->mapping * is reset to NULL later, in free_pages_prepare(). */ if (!folio_test_swapcache(folio)) goto stale; cpu_relax(); } if (READ_ONCE(folio->mapping) != expected_mapping) { folio_put(folio); goto stale; } if (flags == KSM_GET_FOLIO_TRYLOCK) { if (!folio_trylock(folio)) { folio_put(folio); return ERR_PTR(-EBUSY); } } else if (flags == KSM_GET_FOLIO_LOCK) folio_lock(folio); if (flags != KSM_GET_FOLIO_NOLOCK) { if (READ_ONCE(folio->mapping) != expected_mapping) { folio_unlock(folio); folio_put(folio); goto stale; } } return folio; stale: /* * We come here from above when folio->mapping or the swapcache flag * suggests that the node is stale; but it might be under migration. 
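* (Migration updates stable_node->kpfn to the new folio's pfn in * folio_migrate_ksm().)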
* We need smp_rmb(), matching the smp_wmb() in folio_migrate_ksm(), * before checking whether node->kpfn has been changed. */ smp_rmb(); if (READ_ONCE(stable_node->kpfn) != kpfn) goto again; remove_node_from_stable_tree(stable_node); return NULL; } /* * Removing rmap_item from stable or unstable tree. * This function will clean the information from the stable/unstable tree. */ static void remove_rmap_item_from_tree(struct ksm_rmap_item *rmap_item) { if (rmap_item->address & STABLE_FLAG) { struct ksm_stable_node *stable_node; struct folio *folio; stable_node = rmap_item->head; folio = ksm_get_folio(stable_node, KSM_GET_FOLIO_LOCK); if (!folio) goto out; hlist_del(&rmap_item->hlist); folio_unlock(folio); folio_put(folio); if (!hlist_empty(&stable_node->hlist)) ksm_pages_sharing--; else ksm_pages_shared--; rmap_item->mm->ksm_merging_pages--; VM_BUG_ON(stable_node->rmap_hlist_len <= 0); stable_node->rmap_hlist_len--; put_anon_vma(rmap_item->anon_vma); rmap_item->head = NULL; rmap_item->address &= PAGE_MASK; } else if (rmap_item->address & UNSTABLE_FLAG) { unsigned char age; /* * Usually ksmd can and must skip the rb_erase, because * root_unstable_tree was already reset to RB_ROOT. * But be careful when an mm is exiting: do the rb_erase * if this rmap_item was inserted by this scan, rather * than left over from before. */ age = (unsigned char)(ksm_scan.seqnr - rmap_item->address); BUG_ON(age > 1); if (!age) rb_erase(&rmap_item->node, root_unstable_tree + NUMA(rmap_item->nid)); ksm_pages_unshared--; rmap_item->address &= PAGE_MASK; } out: cond_resched(); /* we're called from many long loops */ } static void remove_trailing_rmap_items(struct ksm_rmap_item **rmap_list) { while (*rmap_list) { struct ksm_rmap_item *rmap_item = *rmap_list; *rmap_list = rmap_item->rmap_list; remove_rmap_item_from_tree(rmap_item); free_rmap_item(rmap_item); } } static inline struct ksm_stable_node *folio_stable_node(const struct folio *folio) { return folio_test_ksm(folio) ? folio_raw_mapping(folio) : NULL; } static inline void folio_set_stable_node(struct folio *folio, struct ksm_stable_node *stable_node) { VM_WARN_ON_FOLIO(folio_test_anon(folio) && PageAnonExclusive(&folio->page), folio); folio->mapping = (void *)((unsigned long)stable_node | FOLIO_MAPPING_KSM); } #ifdef CONFIG_SYSFS /* * Only called through the sysfs control interface: */ static int remove_stable_node(struct ksm_stable_node *stable_node) { struct folio *folio; int err; folio = ksm_get_folio(stable_node, KSM_GET_FOLIO_LOCK); if (!folio) { /* * ksm_get_folio did remove_node_from_stable_tree itself. */ return 0; } /* * Page could be still mapped if this races with __mmput() running in * between ksm_exit() and exit_mmap(). Just refuse to let * merge_across_nodes/max_page_sharing be switched. */ err = -EBUSY; if (!folio_mapped(folio)) { /* * The stable node did not yet appear stale to ksm_get_folio(), * since that allows for an unmapped ksm folio to be recognized * right up until it is freed; but the node is safe to remove. * This folio might be in an LRU cache waiting to be freed, * or it might be in the swapcache (perhaps under writeback), * or it might have been removed from swapcache a moment ago. 
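* Either way the folio is no longer mapped, so first detach it from the * stable node below, ensuring nothing can reach the node through the folio * once the node has been freed.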
*/ folio_set_stable_node(folio, NULL); remove_node_from_stable_tree(stable_node); err = 0; } folio_unlock(folio); folio_put(folio); return err; } static int remove_stable_node_chain(struct ksm_stable_node *stable_node, struct rb_root *root) { struct ksm_stable_node *dup; struct hlist_node *hlist_safe; if (!is_stable_node_chain(stable_node)) { VM_BUG_ON(is_stable_node_dup(stable_node)); if (remove_stable_node(stable_node)) return true; else return false; } hlist_for_each_entry_safe(dup, hlist_safe, &stable_node->hlist, hlist_dup) { VM_BUG_ON(!is_stable_node_dup(dup)); if (remove_stable_node(dup)) return true; } BUG_ON(!hlist_empty(&stable_node->hlist)); free_stable_node_chain(stable_node, root); return false; } static int remove_all_stable_nodes(void) { struct ksm_stable_node *stable_node, *next; int nid; int err = 0; for (nid = 0; nid < ksm_nr_node_ids; nid++) { while (root_stable_tree[nid].rb_node) { stable_node = rb_entry(root_stable_tree[nid].rb_node, struct ksm_stable_node, node); if (remove_stable_node_chain(stable_node, root_stable_tree + nid)) { err = -EBUSY; break; /* proceed to next nid */ } cond_resched(); } } list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) { if (remove_stable_node(stable_node)) err = -EBUSY; cond_resched(); } return err; } static int unmerge_and_remove_all_rmap_items(void) { struct ksm_mm_slot *mm_slot; struct mm_slot *slot; struct mm_struct *mm; struct vm_area_struct *vma; int err = 0; spin_lock(&ksm_mmlist_lock); slot = list_entry(ksm_mm_head.slot.mm_node.next, struct mm_slot, mm_node); ksm_scan.mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot); spin_unlock(&ksm_mmlist_lock); for (mm_slot = ksm_scan.mm_slot; mm_slot != &ksm_mm_head; mm_slot = ksm_scan.mm_slot) { VMA_ITERATOR(vmi, mm_slot->slot.mm, 0); mm = mm_slot->slot.mm; mmap_read_lock(mm); /* * Exit right away if mm is exiting to avoid lockdep issue in * the maple tree */ if (ksm_test_exit(mm)) goto mm_exiting; for_each_vma(vmi, vma) { if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma) continue; err = break_ksm(vma, vma->vm_start, vma->vm_end, false); if (err) goto error; } mm_exiting: remove_trailing_rmap_items(&mm_slot->rmap_list); mmap_read_unlock(mm); spin_lock(&ksm_mmlist_lock); slot = list_entry(mm_slot->slot.mm_node.next, struct mm_slot, mm_node); ksm_scan.mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot); if (ksm_test_exit(mm)) { hash_del(&mm_slot->slot.hash); list_del(&mm_slot->slot.mm_node); spin_unlock(&ksm_mmlist_lock); mm_slot_free(mm_slot_cache, mm_slot); mm_flags_clear(MMF_VM_MERGEABLE, mm); mm_flags_clear(MMF_VM_MERGE_ANY, mm); mmdrop(mm); } else spin_unlock(&ksm_mmlist_lock); } /* Clean up stable nodes, but don't worry if some are still busy */ remove_all_stable_nodes(); ksm_scan.seqnr = 0; return 0; error: mmap_read_unlock(mm); spin_lock(&ksm_mmlist_lock); ksm_scan.mm_slot = &ksm_mm_head; spin_unlock(&ksm_mmlist_lock); return err; } #endif /* CONFIG_SYSFS */ static u32 calc_checksum(struct page *page) { u32 checksum; void *addr = kmap_local_page(page); checksum = xxhash(addr, PAGE_SIZE, 0); kunmap_local(addr); return checksum; } static int write_protect_page(struct vm_area_struct *vma, struct folio *folio, pte_t *orig_pte) { struct mm_struct *mm = vma->vm_mm; DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, 0, 0); int swapped; int err = -EFAULT; struct mmu_notifier_range range; bool anon_exclusive; pte_t entry; if (WARN_ON_ONCE(folio_test_large(folio))) return err; pvmw.address = page_address_in_vma(folio, folio_page(folio, 0), vma); if (pvmw.address == -EFAULT) 
goto out; mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, pvmw.address, pvmw.address + PAGE_SIZE); mmu_notifier_invalidate_range_start(&range); if (!page_vma_mapped_walk(&pvmw)) goto out_mn; if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?")) goto out_unlock; entry = ptep_get(pvmw.pte); /* * Handle PFN swap PTEs, such as device-exclusive ones, that actually * map pages: give up just like the next folio_walk would. */ if (unlikely(!pte_present(entry))) goto out_unlock; anon_exclusive = PageAnonExclusive(&folio->page); if (pte_write(entry) || pte_dirty(entry) || anon_exclusive || mm_tlb_flush_pending(mm)) { swapped = folio_test_swapcache(folio); flush_cache_page(vma, pvmw.address, folio_pfn(folio)); /* * Ok this is tricky, when get_user_pages_fast() run it doesn't * take any lock, therefore the check that we are going to make * with the pagecount against the mapcount is racy and * O_DIRECT can happen right after the check. * So we clear the pte and flush the tlb before the check * this assure us that no O_DIRECT can happen after the check * or in the middle of the check. * * No need to notify as we are downgrading page table to read * only not changing it to point to a new page. * * See Documentation/mm/mmu_notifier.rst */ entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); /* * Check that no O_DIRECT or similar I/O is in progress on the * page */ if (folio_mapcount(folio) + 1 + swapped != folio_ref_count(folio)) { set_pte_at(mm, pvmw.address, pvmw.pte, entry); goto out_unlock; } /* See folio_try_share_anon_rmap_pte(): clear PTE first. */ if (anon_exclusive && folio_try_share_anon_rmap_pte(folio, &folio->page)) { set_pte_at(mm, pvmw.address, pvmw.pte, entry); goto out_unlock; } if (pte_dirty(entry)) folio_mark_dirty(folio); entry = pte_mkclean(entry); if (pte_write(entry)) entry = pte_wrprotect(entry); set_pte_at(mm, pvmw.address, pvmw.pte, entry); } *orig_pte = entry; err = 0; out_unlock: page_vma_mapped_walk_done(&pvmw); out_mn: mmu_notifier_invalidate_range_end(&range); out: return err; } /** * replace_page - replace page in vma by new ksm page * @vma: vma that holds the pte pointing to page * @page: the page we are replacing by kpage * @kpage: the ksm page we replace page by * @orig_pte: the original value of the pte * * Returns 0 on success, -EFAULT on failure. */ static int replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage, pte_t orig_pte) { struct folio *kfolio = page_folio(kpage); struct mm_struct *mm = vma->vm_mm; struct folio *folio = page_folio(page); pmd_t *pmd; pmd_t pmde; pte_t *ptep; pte_t newpte; spinlock_t *ptl; unsigned long addr; int err = -EFAULT; struct mmu_notifier_range range; addr = page_address_in_vma(folio, page, vma); if (addr == -EFAULT) goto out; pmd = mm_find_pmd(mm, addr); if (!pmd) goto out; /* * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at() * without holding anon_vma lock for write. So when looking for a * genuine pmde (in which to find pte), test present and !THP together. 
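* Hence the combined pmd_present() && !pmd_trans_huge() test on the value * read locklessly below.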
*/ pmde = pmdp_get_lockless(pmd); if (!pmd_present(pmde) || pmd_trans_huge(pmde)) goto out; mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr, addr + PAGE_SIZE); mmu_notifier_invalidate_range_start(&range); ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); if (!ptep) goto out_mn; if (!pte_same(ptep_get(ptep), orig_pte)) { pte_unmap_unlock(ptep, ptl); goto out_mn; } VM_BUG_ON_PAGE(PageAnonExclusive(page), page); VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage), kfolio); /* * No need to check ksm_use_zero_pages here: we can only have a * zero_page here if ksm_use_zero_pages was enabled already. */ if (!is_zero_pfn(page_to_pfn(kpage))) { folio_get(kfolio); folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE); newpte = mk_pte(kpage, vma->vm_page_prot); } else { /* * Use pte_mkdirty to mark the zero page mapped by KSM, and then * we can easily track all KSM-placed zero pages by checking if * the dirty bit in zero page's PTE is set. */ newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage), vma->vm_page_prot))); ksm_map_zero_page(mm); /* * We're replacing an anonymous page with a zero page, which is * not anonymous. We need to do proper accounting otherwise we * will get wrong values in /proc, and a BUG message in dmesg * when tearing down the mm. */ dec_mm_counter(mm, MM_ANONPAGES); } flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep))); /* * No need to notify as we are replacing a read only page with another * read only page with the same content. * * See Documentation/mm/mmu_notifier.rst */ ptep_clear_flush(vma, addr, ptep); set_pte_at(mm, addr, ptep, newpte); folio_remove_rmap_pte(folio, page, vma); if (!folio_mapped(folio)) folio_free_swap(folio); folio_put(folio); pte_unmap_unlock(ptep, ptl); err = 0; out_mn: mmu_notifier_invalidate_range_end(&range); out: return err; } /* * try_to_merge_one_page - take two pages and merge them into one * @vma: the vma that holds the pte pointing to page * @page: the PageAnon page that we want to replace with kpage * @kpage: the KSM page that we want to map instead of page, * or NULL the first time when we want to use page as kpage. * * This function returns 0 if the pages were merged, -EFAULT otherwise. */ static int try_to_merge_one_page(struct vm_area_struct *vma, struct page *page, struct page *kpage) { struct folio *folio = page_folio(page); pte_t orig_pte = __pte(0); int err = -EFAULT; if (page == kpage) /* ksm page forked */ return 0; if (!folio_test_anon(folio)) goto out; /* * We need the folio lock to read a stable swapcache flag in * write_protect_page(). We trylock because we don't want to wait * here - we prefer to continue scanning and merging different * pages, then come back to this page when it is unlocked. */ if (!folio_trylock(folio)) goto out; if (folio_test_large(folio)) { if (split_huge_page(page)) goto out_unlock; folio = page_folio(page); } /* * If this anonymous page is mapped only here, its pte may need * to be write-protected. If it's mapped elsewhere, all of its * ptes are necessarily already write-protected. But in either * case, we need to lock and check page_count is not raised. */ if (write_protect_page(vma, folio, &orig_pte) == 0) { if (!kpage) { /* * While we hold folio lock, upgrade folio from * anon to a NULL stable_node with the KSM flag set: * stable_tree_insert() will update stable_node. */ folio_set_stable_node(folio, NULL); folio_mark_accessed(folio); /* * Page reclaim just frees a clean folio with no dirty * ptes: make sure that the ksm page would be swapped. 
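* Marking it dirty here keeps the folio swap-backed instead of letting * reclaim discard it as clean.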
*/ if (!folio_test_dirty(folio)) folio_mark_dirty(folio); err = 0; } else if (pages_identical(page, kpage)) err = replace_page(vma, page, kpage, orig_pte); } out_unlock: folio_unlock(folio); out: return err; } /* * This function returns 0 if the pages were merged or if they are * no longer merging candidates (e.g., VMA stale), -EFAULT otherwise. */ static int try_to_merge_with_zero_page(struct ksm_rmap_item *rmap_item, struct page *page) { struct mm_struct *mm = rmap_item->mm; int err = -EFAULT; /* * Same checksum as an empty page. We attempt to merge it with the * appropriate zero page if the user enabled this via sysfs. */ if (ksm_use_zero_pages && (rmap_item->oldchecksum == zero_checksum)) { struct vm_area_struct *vma; mmap_read_lock(mm); vma = find_mergeable_vma(mm, rmap_item->address); if (vma) { err = try_to_merge_one_page(vma, page, ZERO_PAGE(rmap_item->address)); trace_ksm_merge_one_page( page_to_pfn(ZERO_PAGE(rmap_item->address)), rmap_item, mm, err); } else { /* * If the vma is out of date, we do not need to * continue. */ err = 0; } mmap_read_unlock(mm); } return err; } /* * try_to_merge_with_ksm_page - like try_to_merge_two_pages, * but no new kernel page is allocated: kpage must already be a ksm page. * * This function returns 0 if the pages were merged, -EFAULT otherwise. */ static int try_to_merge_with_ksm_page(struct ksm_rmap_item *rmap_item, struct page *page, struct page *kpage) { struct mm_struct *mm = rmap_item->mm; struct vm_area_struct *vma; int err = -EFAULT; mmap_read_lock(mm); vma = find_mergeable_vma(mm, rmap_item->address); if (!vma) goto out; err = try_to_merge_one_page(vma, page, kpage); if (err) goto out; /* Unstable nid is in union with stable anon_vma: remove first */ remove_rmap_item_from_tree(rmap_item); /* Must get reference to anon_vma while still holding mmap_lock */ rmap_item->anon_vma = vma->anon_vma; get_anon_vma(vma->anon_vma); out: mmap_read_unlock(mm); trace_ksm_merge_with_ksm_page(kpage, page_to_pfn(kpage ? kpage : page), rmap_item, mm, err); return err; } /* * try_to_merge_two_pages - take two identical pages and prepare them * to be merged into one page. * * This function returns the kpage if we successfully merged two identical * pages into one ksm page, NULL otherwise. * * Note that this function upgrades page to ksm page: if one of the pages * is already a ksm page, try_to_merge_with_ksm_page should be used. */ static struct folio *try_to_merge_two_pages(struct ksm_rmap_item *rmap_item, struct page *page, struct ksm_rmap_item *tree_rmap_item, struct page *tree_page) { int err; err = try_to_merge_with_ksm_page(rmap_item, page, NULL); if (!err) { err = try_to_merge_with_ksm_page(tree_rmap_item, tree_page, page); /* * If that fails, we have a ksm page with only one pte * pointing to it: so break it. */ if (err) break_cow(rmap_item); } return err ? NULL : page_folio(page); } static __always_inline bool __is_page_sharing_candidate(struct ksm_stable_node *stable_node, int offset) { VM_BUG_ON(stable_node->rmap_hlist_len < 0); /* * Check that at least one mapping still exists, otherwise * there's no much point to merge and share with this * stable_node, as the underlying tree_page of the other * sharer is going to be freed soon. 
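* Also require that adding @offset more sharers stays below the * ksm_max_page_sharing limit: with the default limit of 256, a dup whose * rmap_hlist_len is 255 is still a candidate at offset 0 but not at offset 1.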
*/ return stable_node->rmap_hlist_len && stable_node->rmap_hlist_len + offset < ksm_max_page_sharing; } static __always_inline bool is_page_sharing_candidate(struct ksm_stable_node *stable_node) { return __is_page_sharing_candidate(stable_node, 0); } static struct folio *stable_node_dup(struct ksm_stable_node **_stable_node_dup, struct ksm_stable_node **_stable_node, struct rb_root *root, bool prune_stale_stable_nodes) { struct ksm_stable_node *dup, *found = NULL, *stable_node = *_stable_node; struct hlist_node *hlist_safe; struct folio *folio, *tree_folio = NULL; int found_rmap_hlist_len; if (!prune_stale_stable_nodes || time_before(jiffies, stable_node->chain_prune_time + msecs_to_jiffies( ksm_stable_node_chains_prune_millisecs))) prune_stale_stable_nodes = false; else stable_node->chain_prune_time = jiffies; hlist_for_each_entry_safe(dup, hlist_safe, &stable_node->hlist, hlist_dup) { cond_resched(); /* * We must walk all stable_node_dup to prune the stale * stable nodes during lookup. * * ksm_get_folio can drop the nodes from the * stable_node->hlist if they point to freed pages * (that's why we do a _safe walk). The "dup" * stable_node parameter itself will be freed from * under us if it returns NULL. */ folio = ksm_get_folio(dup, KSM_GET_FOLIO_NOLOCK); if (!folio) continue; /* Pick the best candidate if possible. */ if (!found || (is_page_sharing_candidate(dup) && (!is_page_sharing_candidate(found) || dup->rmap_hlist_len > found_rmap_hlist_len))) { if (found) folio_put(tree_folio); found = dup; found_rmap_hlist_len = found->rmap_hlist_len; tree_folio = folio; /* skip put_page for found candidate */ if (!prune_stale_stable_nodes && is_page_sharing_candidate(found)) break; continue; } folio_put(folio); } if (found) { if (hlist_is_singular_node(&found->hlist_dup, &stable_node->hlist)) { /* * If there's not just one entry it would * corrupt memory, better BUG_ON. In KSM * context with no lock held it's not even * fatal. */ BUG_ON(stable_node->hlist.first->next); /* * There's just one entry and it is below the * deduplication limit so drop the chain. */ rb_replace_node(&stable_node->node, &found->node, root); free_stable_node(stable_node); ksm_stable_node_chains--; ksm_stable_node_dups--; /* * NOTE: the caller depends on the stable_node * to be equal to stable_node_dup if the chain * was collapsed. */ *_stable_node = found; /* * Just for robustness, as stable_node is * otherwise left as a stable pointer, the * compiler shall optimize it away at build * time. */ stable_node = NULL; } else if (stable_node->hlist.first != &found->hlist_dup && __is_page_sharing_candidate(found, 1)) { /* * If the found stable_node dup can accept one * more future merge (in addition to the one * that is underway) and is not at the head of * the chain, put it there so next search will * be quicker in the !prune_stale_stable_nodes * case. * * NOTE: it would be inaccurate to use nr > 1 * instead of checking the hlist.first pointer * directly, because in the * prune_stale_stable_nodes case "nr" isn't * the position of the found dup in the chain, * but the total number of dups in the chain. */ hlist_del(&found->hlist_dup); hlist_add_head(&found->hlist_dup, &stable_node->hlist); } } else { /* Its hlist must be empty if no one found. */ free_stable_node_chain(stable_node, root); } *_stable_node_dup = found; return tree_folio; } /* * Like for ksm_get_folio, this function can free the *_stable_node and * *_stable_node_dup if the returned tree_page is NULL. 
* * It can also free and overwrite *_stable_node with the found * stable_node_dup if the chain is collapsed (in which case * *_stable_node will be equal to *_stable_node_dup like if the chain * never existed). It's up to the caller to verify tree_page is not * NULL before dereferencing *_stable_node or *_stable_node_dup. * * *_stable_node_dup is really a second output parameter of this * function and will be overwritten in all cases, the caller doesn't * need to initialize it. */ static struct folio *__stable_node_chain(struct ksm_stable_node **_stable_node_dup, struct ksm_stable_node **_stable_node, struct rb_root *root, bool prune_stale_stable_nodes) { struct ksm_stable_node *stable_node = *_stable_node; if (!is_stable_node_chain(stable_node)) { *_stable_node_dup = stable_node; return ksm_get_folio(stable_node, KSM_GET_FOLIO_NOLOCK); } return stable_node_dup(_stable_node_dup, _stable_node, root, prune_stale_stable_nodes); } static __always_inline struct folio *chain_prune(struct ksm_stable_node **s_n_d, struct ksm_stable_node **s_n, struct rb_root *root) { return __stable_node_chain(s_n_d, s_n, root, true); } static __always_inline struct folio *chain(struct ksm_stable_node **s_n_d, struct ksm_stable_node **s_n, struct rb_root *root) { return __stable_node_chain(s_n_d, s_n, root, false); } /* * stable_tree_search - search for page inside the stable tree * * This function checks if there is a page inside the stable tree * with identical content to the page that we are scanning right now. * * This function returns the stable tree node of identical content if found, * -EBUSY if the stable node's page is being migrated, NULL otherwise. */ static struct folio *stable_tree_search(struct page *page) { int nid; struct rb_root *root; struct rb_node **new; struct rb_node *parent; struct ksm_stable_node *stable_node, *stable_node_dup; struct ksm_stable_node *page_node; struct folio *folio; folio = page_folio(page); page_node = folio_stable_node(folio); if (page_node && page_node->head != &migrate_nodes) { /* ksm page forked */ folio_get(folio); return folio; } nid = get_kpfn_nid(folio_pfn(folio)); root = root_stable_tree + nid; again: new = &root->rb_node; parent = NULL; while (*new) { struct folio *tree_folio; int ret; cond_resched(); stable_node = rb_entry(*new, struct ksm_stable_node, node); tree_folio = chain_prune(&stable_node_dup, &stable_node, root); if (!tree_folio) { /* * If we walked over a stale stable_node, * ksm_get_folio() will call rb_erase() and it * may rebalance the tree from under us. So * restart the search from scratch. Returning * NULL would be safe too, but we'd generate * false negative insertions just because some * stable_node was stale. */ goto again; } ret = memcmp_pages(page, &tree_folio->page); folio_put(tree_folio); parent = *new; if (ret < 0) new = &parent->rb_left; else if (ret > 0) new = &parent->rb_right; else { if (page_node) { VM_BUG_ON(page_node->head != &migrate_nodes); /* * If the mapcount of our migrated KSM folio is * at most 1, we can merge it with another * KSM folio where we know that we have space * for one more mapping without exceeding the * ksm_max_page_sharing limit: see * chain_prune(). This way, we can avoid adding * this stable node to the chain. 
*/ if (folio_mapcount(folio) > 1) goto chain_append; } if (!is_page_sharing_candidate(stable_node_dup)) { /* * If the stable_node is a chain and * we got a payload match in memcmp * but we cannot merge the scanned * page in any of the existing * stable_node dups because they're * all full, we need to wait the * scanned page to find itself a match * in the unstable tree to create a * brand new KSM page to add later to * the dups of this stable_node. */ return NULL; } /* * Lock and unlock the stable_node's page (which * might already have been migrated) so that page * migration is sure to notice its raised count. * It would be more elegant to return stable_node * than kpage, but that involves more changes. */ tree_folio = ksm_get_folio(stable_node_dup, KSM_GET_FOLIO_TRYLOCK); if (PTR_ERR(tree_folio) == -EBUSY) return ERR_PTR(-EBUSY); if (unlikely(!tree_folio)) /* * The tree may have been rebalanced, * so re-evaluate parent and new. */ goto again; folio_unlock(tree_folio); if (get_kpfn_nid(stable_node_dup->kpfn) != NUMA(stable_node_dup->nid)) { folio_put(tree_folio); goto replace; } return tree_folio; } } if (!page_node) return NULL; list_del(&page_node->list); DO_NUMA(page_node->nid = nid); rb_link_node(&page_node->node, parent, new); rb_insert_color(&page_node->node, root); out: if (is_page_sharing_candidate(page_node)) { folio_get(folio); return folio; } else return NULL; replace: /* * If stable_node was a chain and chain_prune collapsed it, * stable_node has been updated to be the new regular * stable_node. A collapse of the chain is indistinguishable * from the case there was no chain in the stable * rbtree. Otherwise stable_node is the chain and * stable_node_dup is the dup to replace. */ if (stable_node_dup == stable_node) { VM_BUG_ON(is_stable_node_chain(stable_node_dup)); VM_BUG_ON(is_stable_node_dup(stable_node_dup)); /* there is no chain */ if (page_node) { VM_BUG_ON(page_node->head != &migrate_nodes); list_del(&page_node->list); DO_NUMA(page_node->nid = nid); rb_replace_node(&stable_node_dup->node, &page_node->node, root); if (is_page_sharing_candidate(page_node)) folio_get(folio); else folio = NULL; } else { rb_erase(&stable_node_dup->node, root); folio = NULL; } } else { VM_BUG_ON(!is_stable_node_chain(stable_node)); __stable_node_dup_del(stable_node_dup); if (page_node) { VM_BUG_ON(page_node->head != &migrate_nodes); list_del(&page_node->list); DO_NUMA(page_node->nid = nid); stable_node_chain_add_dup(page_node, stable_node); if (is_page_sharing_candidate(page_node)) folio_get(folio); else folio = NULL; } else { folio = NULL; } } stable_node_dup->head = &migrate_nodes; list_add(&stable_node_dup->list, stable_node_dup->head); return folio; chain_append: /* * If stable_node was a chain and chain_prune collapsed it, * stable_node has been updated to be the new regular * stable_node. A collapse of the chain is indistinguishable * from the case there was no chain in the stable * rbtree. Otherwise stable_node is the chain and * stable_node_dup is the dup to replace. */ if (stable_node_dup == stable_node) { VM_BUG_ON(is_stable_node_dup(stable_node_dup)); /* chain is missing so create it */ stable_node = alloc_stable_node_chain(stable_node_dup, root); if (!stable_node) return NULL; } /* * Add this stable_node dup that was * migrated to the stable_node chain * of the current nid for this page * content. 
*/ VM_BUG_ON(!is_stable_node_dup(stable_node_dup)); VM_BUG_ON(page_node->head != &migrate_nodes); list_del(&page_node->list); DO_NUMA(page_node->nid = nid); stable_node_chain_add_dup(page_node, stable_node); goto out; } /* * stable_tree_insert - insert stable tree node pointing to new ksm page * into the stable tree. * * This function returns the stable tree node just allocated on success, * NULL otherwise. */ static struct ksm_stable_node *stable_tree_insert(struct folio *kfolio) { int nid; unsigned long kpfn; struct rb_root *root; struct rb_node **new; struct rb_node *parent; struct ksm_stable_node *stable_node, *stable_node_dup; bool need_chain = false; kpfn = folio_pfn(kfolio); nid = get_kpfn_nid(kpfn); root = root_stable_tree + nid; again: parent = NULL; new = &root->rb_node; while (*new) { struct folio *tree_folio; int ret; cond_resched(); stable_node = rb_entry(*new, struct ksm_stable_node, node); tree_folio = chain(&stable_node_dup, &stable_node, root); if (!tree_folio) { /* * If we walked over a stale stable_node, * ksm_get_folio() will call rb_erase() and it * may rebalance the tree from under us. So * restart the search from scratch. Returning * NULL would be safe too, but we'd generate * false negative insertions just because some * stable_node was stale. */ goto again; } ret = memcmp_pages(&kfolio->page, &tree_folio->page); folio_put(tree_folio); parent = *new; if (ret < 0) new = &parent->rb_left; else if (ret > 0) new = &parent->rb_right; else { need_chain = true; break; } } stable_node_dup = alloc_stable_node(); if (!stable_node_dup) return NULL; INIT_HLIST_HEAD(&stable_node_dup->hlist); stable_node_dup->kpfn = kpfn; stable_node_dup->rmap_hlist_len = 0; DO_NUMA(stable_node_dup->nid = nid); if (!need_chain) { rb_link_node(&stable_node_dup->node, parent, new); rb_insert_color(&stable_node_dup->node, root); } else { if (!is_stable_node_chain(stable_node)) { struct ksm_stable_node *orig = stable_node; /* chain is missing so create it */ stable_node = alloc_stable_node_chain(orig, root); if (!stable_node) { free_stable_node(stable_node_dup); return NULL; } } stable_node_chain_add_dup(stable_node_dup, stable_node); } folio_set_stable_node(kfolio, stable_node_dup); return stable_node_dup; } /* * unstable_tree_search_insert - search for identical page, * else insert rmap_item into the unstable tree. * * This function searches for a page in the unstable tree identical to the * page currently being scanned; and if no identical page is found in the * tree, we insert rmap_item as a new object into the unstable tree. * * This function returns pointer to rmap_item found to be identical * to the currently scanned page, NULL otherwise. * * This function does both searching and inserting, because they share * the same walking algorithm in an rbtree. */ static struct ksm_rmap_item *unstable_tree_search_insert(struct ksm_rmap_item *rmap_item, struct page *page, struct page **tree_pagep) { struct rb_node **new; struct rb_root *root; struct rb_node *parent = NULL; int nid; nid = get_kpfn_nid(page_to_pfn(page)); root = root_unstable_tree + nid; new = &root->rb_node; while (*new) { struct ksm_rmap_item *tree_rmap_item; struct page *tree_page; int ret; cond_resched(); tree_rmap_item = rb_entry(*new, struct ksm_rmap_item, node); tree_page = get_mergeable_page(tree_rmap_item); if (!tree_page) return NULL; /* * Don't substitute a ksm page for a forked page. 
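* (If the scanned page and the tree page are literally the same struct page, * both rmap_items already share that page and there is nothing to merge.)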
*/ if (page == tree_page) { put_page(tree_page); return NULL; } ret = memcmp_pages(page, tree_page); parent = *new; if (ret < 0) { put_page(tree_page); new = &parent->rb_left; } else if (ret > 0) { put_page(tree_page); new = &parent->rb_right; } else if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) { /* * If tree_page has been migrated to another NUMA node, * it will be flushed out and put in the right unstable * tree next time: only merge with it when across_nodes. */ put_page(tree_page); return NULL; } else { *tree_pagep = tree_page; return tree_rmap_item; } } rmap_item->address |= UNSTABLE_FLAG; rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK); DO_NUMA(rmap_item->nid = nid); rb_link_node(&rmap_item->node, parent, new); rb_insert_color(&rmap_item->node, root); ksm_pages_unshared++; return NULL; } /* * stable_tree_append - add another rmap_item to the linked list of * rmap_items hanging off a given node of the stable tree, all sharing * the same ksm page. */ static void stable_tree_append(struct ksm_rmap_item *rmap_item, struct ksm_stable_node *stable_node, bool max_page_sharing_bypass) { /* * rmap won't find this mapping if we don't insert the * rmap_item in the right stable_node * duplicate. page_migration could break later if rmap breaks, * so we can as well crash here. We really need to check for * rmap_hlist_len == STABLE_NODE_CHAIN, but we can as well check * for other negative values as an underflow if detected here * for the first time (and not when decreasing rmap_hlist_len) * would be sign of memory corruption in the stable_node. */ BUG_ON(stable_node->rmap_hlist_len < 0); stable_node->rmap_hlist_len++; if (!max_page_sharing_bypass) /* possibly non fatal but unexpected overflow, only warn */ WARN_ON_ONCE(stable_node->rmap_hlist_len > ksm_max_page_sharing); rmap_item->head = stable_node; rmap_item->address |= STABLE_FLAG; hlist_add_head(&rmap_item->hlist, &stable_node->hlist); if (rmap_item->hlist.next) ksm_pages_sharing++; else ksm_pages_shared++; rmap_item->mm->ksm_merging_pages++; } /* * cmp_and_merge_page - first see if page can be merged into the stable tree; * if not, compare checksum to previous and if it's the same, see if page can * be inserted into the unstable tree, or merged with a page already there and * both transferred to the stable tree. * * @page: the page that we are searching identical page to. * @rmap_item: the reverse mapping into the virtual address of this page */ static void cmp_and_merge_page(struct page *page, struct ksm_rmap_item *rmap_item) { struct folio *folio = page_folio(page); struct ksm_rmap_item *tree_rmap_item; struct page *tree_page = NULL; struct ksm_stable_node *stable_node; struct folio *kfolio; unsigned int checksum; int err; bool max_page_sharing_bypass = false; stable_node = folio_stable_node(folio); if (stable_node) { if (stable_node->head != &migrate_nodes && get_kpfn_nid(READ_ONCE(stable_node->kpfn)) != NUMA(stable_node->nid)) { stable_node_dup_del(stable_node); stable_node->head = &migrate_nodes; list_add(&stable_node->list, stable_node->head); } if (stable_node->head != &migrate_nodes && rmap_item->head == stable_node) return; /* * If it's a KSM fork, allow it to go over the sharing limit * without warnings. 
*/ if (!is_page_sharing_candidate(stable_node)) max_page_sharing_bypass = true; } else { remove_rmap_item_from_tree(rmap_item); /* * If the hash value of the page has changed from the last time * we calculated it, this page is changing frequently: therefore we * don't want to insert it in the unstable tree, and we don't want * to waste our time searching for something identical to it there. */ checksum = calc_checksum(page); if (rmap_item->oldchecksum != checksum) { rmap_item->oldchecksum = checksum; return; } if (!try_to_merge_with_zero_page(rmap_item, page)) return; } /* Start by searching for the folio in the stable tree */ kfolio = stable_tree_search(page); if (kfolio == folio && rmap_item->head == stable_node) { folio_put(kfolio); return; } remove_rmap_item_from_tree(rmap_item); if (kfolio) { if (kfolio == ERR_PTR(-EBUSY)) return; err = try_to_merge_with_ksm_page(rmap_item, page, &kfolio->page); if (!err) { /* * The page was successfully merged: * add its rmap_item to the stable tree. */ folio_lock(kfolio); stable_tree_append(rmap_item, folio_stable_node(kfolio), max_page_sharing_bypass); folio_unlock(kfolio); } folio_put(kfolio); return; } tree_rmap_item = unstable_tree_search_insert(rmap_item, page, &tree_page); if (tree_rmap_item) { bool split; kfolio = try_to_merge_two_pages(rmap_item, page, tree_rmap_item, tree_page); /* * If both pages we tried to merge belong to the same compound * page, then we actually ended up increasing the reference * count of the same compound page twice, and split_huge_page * failed. * Here we set a flag if that happened, and we use it later to * try split_huge_page again. Since we call put_page right * afterwards, the reference count will be correct and * split_huge_page should succeed. */ split = PageTransCompound(page) && compound_head(page) == compound_head(tree_page); put_page(tree_page); if (kfolio) { /* * The pages were successfully merged: insert new * node in the stable tree and add both rmap_items. */ folio_lock(kfolio); stable_node = stable_tree_insert(kfolio); if (stable_node) { stable_tree_append(tree_rmap_item, stable_node, false); stable_tree_append(rmap_item, stable_node, false); } folio_unlock(kfolio); /* * If we fail to insert the page into the stable tree, * we will have 2 virtual addresses that are pointing * to a ksm page left outside the stable tree, * in which case we need to break_cow on both. */ if (!stable_node) { break_cow(tree_rmap_item); break_cow(rmap_item); } } else if (split) { /* * We are here if we tried to merge two pages and * failed because they both belonged to the same * compound page. We will split the page now, but no * merging will take place. * We do not want to add the cost of a full lock; if * the page is locked, it is better to skip it and * perhaps try again later. 
*/ if (!folio_trylock(folio)) return; split_huge_page(page); folio = page_folio(page); folio_unlock(folio); } } } static struct ksm_rmap_item *get_next_rmap_item(struct ksm_mm_slot *mm_slot, struct ksm_rmap_item **rmap_list, unsigned long addr) { struct ksm_rmap_item *rmap_item; while (*rmap_list) { rmap_item = *rmap_list; if ((rmap_item->address & PAGE_MASK) == addr) return rmap_item; if (rmap_item->address > addr) break; *rmap_list = rmap_item->rmap_list; remove_rmap_item_from_tree(rmap_item); free_rmap_item(rmap_item); } rmap_item = alloc_rmap_item(); if (rmap_item) { /* It has already been zeroed */ rmap_item->mm = mm_slot->slot.mm; rmap_item->mm->ksm_rmap_items++; rmap_item->address = addr; rmap_item->rmap_list = *rmap_list; *rmap_list = rmap_item; } return rmap_item; } /* * Calculate skip age for the ksm page age. The age determines how often * de-duplicating has already been tried unsuccessfully. If the age is * smaller, the scanning of this page is skipped for less scans. * * @age: rmap_item age of page */ static unsigned int skip_age(rmap_age_t age) { if (age <= 3) return 1; if (age <= 5) return 2; if (age <= 8) return 4; return 8; } /* * Determines if a page should be skipped for the current scan. * * @folio: folio containing the page to check * @rmap_item: associated rmap_item of page */ static bool should_skip_rmap_item(struct folio *folio, struct ksm_rmap_item *rmap_item) { rmap_age_t age; if (!ksm_smart_scan) return false; /* * Never skip pages that are already KSM; pages cmp_and_merge_page() * will essentially ignore them, but we still have to process them * properly. */ if (folio_test_ksm(folio)) return false; age = rmap_item->age; if (age != U8_MAX) rmap_item->age++; /* * Smaller ages are not skipped, they need to get a chance to go * through the different phases of the KSM merging. */ if (age < 3) return false; /* * Are we still allowed to skip? If not, then don't skip it * and determine how much more often we are allowed to skip next. 
*/ if (!rmap_item->remaining_skips) { rmap_item->remaining_skips = skip_age(age); return false; } /* Skip this page */ ksm_pages_skipped++; rmap_item->remaining_skips--; remove_rmap_item_from_tree(rmap_item); return true; } struct ksm_next_page_arg { struct folio *folio; struct page *page; unsigned long addr; }; static int ksm_next_page_pmd_entry(pmd_t *pmdp, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct ksm_next_page_arg *private = walk->private; struct vm_area_struct *vma = walk->vma; pte_t *start_ptep = NULL, *ptep, pte; struct mm_struct *mm = walk->mm; struct folio *folio; struct page *page; spinlock_t *ptl; pmd_t pmd; if (ksm_test_exit(mm)) return 0; cond_resched(); pmd = pmdp_get_lockless(pmdp); if (!pmd_present(pmd)) return 0; if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && pmd_leaf(pmd)) { ptl = pmd_lock(mm, pmdp); pmd = pmdp_get(pmdp); if (!pmd_present(pmd)) { goto not_found_unlock; } else if (pmd_leaf(pmd)) { page = vm_normal_page_pmd(vma, addr, pmd); if (!page) goto not_found_unlock; folio = page_folio(page); if (folio_is_zone_device(folio) || !folio_test_anon(folio)) goto not_found_unlock; page += ((addr & (PMD_SIZE - 1)) >> PAGE_SHIFT); goto found_unlock; } spin_unlock(ptl); } start_ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); if (!start_ptep) return 0; for (ptep = start_ptep; addr < end; ptep++, addr += PAGE_SIZE) { pte = ptep_get(ptep); if (!pte_present(pte)) continue; page = vm_normal_page(vma, addr, pte); if (!page) continue; folio = page_folio(page); if (folio_is_zone_device(folio) || !folio_test_anon(folio)) continue; goto found_unlock; } not_found_unlock: spin_unlock(ptl); if (start_ptep) pte_unmap(start_ptep); return 0; found_unlock: folio_get(folio); spin_unlock(ptl); if (start_ptep) pte_unmap(start_ptep); private->page = page; private->folio = folio; private->addr = addr; return 1; } static struct mm_walk_ops ksm_next_page_ops = { .pmd_entry = ksm_next_page_pmd_entry, .walk_lock = PGWALK_RDLOCK, }; static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page) { struct mm_struct *mm; struct ksm_mm_slot *mm_slot; struct mm_slot *slot; struct vm_area_struct *vma; struct ksm_rmap_item *rmap_item; struct vma_iterator vmi; int nid; if (list_empty(&ksm_mm_head.slot.mm_node)) return NULL; mm_slot = ksm_scan.mm_slot; if (mm_slot == &ksm_mm_head) { advisor_start_scan(); trace_ksm_start_scan(ksm_scan.seqnr, ksm_rmap_items); /* * A number of pages can hang around indefinitely in per-cpu * LRU cache, raised page count preventing write_protect_page * from merging them. Though it doesn't really matter much, * it is puzzling to see some stuck in pages_volatile until * other activity jostles them out, and they also prevented * LTP's KSM test from succeeding deterministically; so drain * them here (here rather than on entry to ksm_do_scan(), * so we don't IPI too often when pages_to_scan is set low). */ lru_add_drain_all(); /* * Whereas stale stable_nodes on the stable_tree itself * get pruned in the regular course of stable_tree_search(), * those moved out to the migrate_nodes list can accumulate: * so prune them once before each full scan. 
*/ if (!ksm_merge_across_nodes) { struct ksm_stable_node *stable_node, *next; struct folio *folio; list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) { folio = ksm_get_folio(stable_node, KSM_GET_FOLIO_NOLOCK); if (folio) folio_put(folio); cond_resched(); } } for (nid = 0; nid < ksm_nr_node_ids; nid++) root_unstable_tree[nid] = RB_ROOT; spin_lock(&ksm_mmlist_lock); slot = list_entry(mm_slot->slot.mm_node.next, struct mm_slot, mm_node); mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot); ksm_scan.mm_slot = mm_slot; spin_unlock(&ksm_mmlist_lock); /* * Although we tested list_empty() above, a racing __ksm_exit * of the last mm on the list may have removed it since then. */ if (mm_slot == &ksm_mm_head) return NULL; next_mm: ksm_scan.address = 0; ksm_scan.rmap_list = &mm_slot->rmap_list; } slot = &mm_slot->slot; mm = slot->mm; vma_iter_init(&vmi, mm, ksm_scan.address); mmap_read_lock(mm); if (ksm_test_exit(mm)) goto no_vmas; for_each_vma(vmi, vma) { if (!(vma->vm_flags & VM_MERGEABLE)) continue; if (ksm_scan.address < vma->vm_start) ksm_scan.address = vma->vm_start; if (!vma->anon_vma) ksm_scan.address = vma->vm_end; while (ksm_scan.address < vma->vm_end) { struct ksm_next_page_arg ksm_next_page_arg; struct page *tmp_page = NULL; struct folio *folio; if (ksm_test_exit(mm)) break; int found; found = walk_page_range_vma(vma, ksm_scan.address, vma->vm_end, &ksm_next_page_ops, &ksm_next_page_arg); if (found > 0) { folio = ksm_next_page_arg.folio; tmp_page = ksm_next_page_arg.page; ksm_scan.address = ksm_next_page_arg.addr; } else { VM_WARN_ON_ONCE(found < 0); ksm_scan.address = vma->vm_end - PAGE_SIZE; } if (tmp_page) { flush_anon_page(vma, tmp_page, ksm_scan.address); flush_dcache_page(tmp_page); rmap_item = get_next_rmap_item(mm_slot, ksm_scan.rmap_list, ksm_scan.address); if (rmap_item) { ksm_scan.rmap_list = &rmap_item->rmap_list; if (should_skip_rmap_item(folio, rmap_item)) { folio_put(folio); goto next_page; } ksm_scan.address += PAGE_SIZE; *page = tmp_page; } else { folio_put(folio); } mmap_read_unlock(mm); return rmap_item; } next_page: ksm_scan.address += PAGE_SIZE; cond_resched(); } } if (ksm_test_exit(mm)) { no_vmas: ksm_scan.address = 0; ksm_scan.rmap_list = &mm_slot->rmap_list; } /* * Nuke all the rmap_items that are above this current rmap: * because there were no VM_MERGEABLE vmas with such addresses. */ remove_trailing_rmap_items(ksm_scan.rmap_list); spin_lock(&ksm_mmlist_lock); slot = list_entry(mm_slot->slot.mm_node.next, struct mm_slot, mm_node); ksm_scan.mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot); if (ksm_scan.address == 0) { /* * We've completed a full scan of all vmas, holding mmap_lock * throughout, and found no VM_MERGEABLE: so do the same as * __ksm_exit does to remove this mm from all our lists now. * This applies either when cleaning up after __ksm_exit * (but beware: we can reach here even before __ksm_exit), * or when all VM_MERGEABLE areas have been unmapped (and * mmap_lock then protects against race with MADV_MERGEABLE). */ hash_del(&mm_slot->slot.hash); list_del(&mm_slot->slot.mm_node); spin_unlock(&ksm_mmlist_lock); mm_slot_free(mm_slot_cache, mm_slot); /* * Only clear MMF_VM_MERGEABLE. We must not clear * MMF_VM_MERGE_ANY, because for those MMF_VM_MERGE_ANY process, * perhaps their mm_struct has just been added to ksm_mm_slot * list, and its process has not yet officially started running * or has not yet performed mmap/brk to allocate anonymous VMAS. 
*/ mm_flags_clear(MMF_VM_MERGEABLE, mm); mmap_read_unlock(mm); mmdrop(mm); } else { mmap_read_unlock(mm); /* * mmap_read_unlock(mm) first because after * spin_unlock(&ksm_mmlist_lock) run, the "mm" may * already have been freed under us by __ksm_exit() * because the "mm_slot" is still hashed and * ksm_scan.mm_slot doesn't point to it anymore. */ spin_unlock(&ksm_mmlist_lock); } /* Repeat until we've completed scanning the whole list */ mm_slot = ksm_scan.mm_slot; if (mm_slot != &ksm_mm_head) goto next_mm; advisor_stop_scan(); trace_ksm_stop_scan(ksm_scan.seqnr, ksm_rmap_items); ksm_scan.seqnr++; return NULL; } /** * ksm_do_scan - the ksm scanner main worker function. * @scan_npages: number of pages we want to scan before we return. */ static void ksm_do_scan(unsigned int scan_npages) { struct ksm_rmap_item *rmap_item; struct page *page; while (scan_npages-- && likely(!freezing(current))) { cond_resched(); rmap_item = scan_get_next_rmap_item(&page); if (!rmap_item) return; cmp_and_merge_page(page, rmap_item); put_page(page); ksm_pages_scanned++; } } static int ksmd_should_run(void) { return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.slot.mm_node); } static int ksm_scan_thread(void *nothing) { unsigned int sleep_ms; set_freezable(); set_user_nice(current, 5); while (!kthread_should_stop()) { mutex_lock(&ksm_thread_mutex); wait_while_offlining(); if (ksmd_should_run()) ksm_do_scan(ksm_thread_pages_to_scan); mutex_unlock(&ksm_thread_mutex); if (ksmd_should_run()) { sleep_ms = READ_ONCE(ksm_thread_sleep_millisecs); wait_event_freezable_timeout(ksm_iter_wait, sleep_ms != READ_ONCE(ksm_thread_sleep_millisecs), msecs_to_jiffies(sleep_ms)); } else { wait_event_freezable(ksm_thread_wait, ksmd_should_run() || kthread_should_stop()); } } return 0; } static bool __ksm_should_add_vma(const struct file *file, vm_flags_t vm_flags) { if (vm_flags & VM_MERGEABLE) return false; return ksm_compatible(file, vm_flags); } static void __ksm_add_vma(struct vm_area_struct *vma) { if (__ksm_should_add_vma(vma->vm_file, vma->vm_flags)) vm_flags_set(vma, VM_MERGEABLE); } static int __ksm_del_vma(struct vm_area_struct *vma) { int err; if (!(vma->vm_flags & VM_MERGEABLE)) return 0; if (vma->anon_vma) { err = break_ksm(vma, vma->vm_start, vma->vm_end, true); if (err) return err; } vm_flags_clear(vma, VM_MERGEABLE); return 0; } /** * ksm_vma_flags - Update VMA flags to mark as mergeable if compatible * * @mm: Proposed VMA's mm_struct * @file: Proposed VMA's file-backed mapping, if any. * @vm_flags: Proposed VMA"s flags. * * Returns: @vm_flags possibly updated to mark mergeable. */ vm_flags_t ksm_vma_flags(struct mm_struct *mm, const struct file *file, vm_flags_t vm_flags) { if (mm_flags_test(MMF_VM_MERGE_ANY, mm) && __ksm_should_add_vma(file, vm_flags)) { vm_flags |= VM_MERGEABLE; /* * Generally, the flags here always include MMF_VM_MERGEABLE. * However, in rare cases, this flag may be cleared by ksmd who * scans a cycle without finding any mergeable vma. 
*/ if (unlikely(!mm_flags_test(MMF_VM_MERGEABLE, mm))) __ksm_enter(mm); } return vm_flags; } static void ksm_add_vmas(struct mm_struct *mm) { struct vm_area_struct *vma; VMA_ITERATOR(vmi, mm, 0); for_each_vma(vmi, vma) __ksm_add_vma(vma); } static int ksm_del_vmas(struct mm_struct *mm) { struct vm_area_struct *vma; int err; VMA_ITERATOR(vmi, mm, 0); for_each_vma(vmi, vma) { err = __ksm_del_vma(vma); if (err) return err; } return 0; } /** * ksm_enable_merge_any - Add mm to mm ksm list and enable merging on all * compatible VMA's * * @mm: Pointer to mm * * Returns 0 on success, otherwise error code */ int ksm_enable_merge_any(struct mm_struct *mm) { int err; if (mm_flags_test(MMF_VM_MERGE_ANY, mm)) return 0; if (!mm_flags_test(MMF_VM_MERGEABLE, mm)) { err = __ksm_enter(mm); if (err) return err; } mm_flags_set(MMF_VM_MERGE_ANY, mm); ksm_add_vmas(mm); return 0; } /** * ksm_disable_merge_any - Disable merging on all compatible VMA's of the mm, * previously enabled via ksm_enable_merge_any(). * * Disabling merging implies unmerging any merged pages, like setting * MADV_UNMERGEABLE would. If unmerging fails, the whole operation fails and * merging on all compatible VMA's remains enabled. * * @mm: Pointer to mm * * Returns 0 on success, otherwise error code */ int ksm_disable_merge_any(struct mm_struct *mm) { int err; if (!mm_flags_test(MMF_VM_MERGE_ANY, mm)) return 0; err = ksm_del_vmas(mm); if (err) { ksm_add_vmas(mm); return err; } mm_flags_clear(MMF_VM_MERGE_ANY, mm); return 0; } int ksm_disable(struct mm_struct *mm) { mmap_assert_write_locked(mm); if (!mm_flags_test(MMF_VM_MERGEABLE, mm)) return 0; if (mm_flags_test(MMF_VM_MERGE_ANY, mm)) return ksm_disable_merge_any(mm); return ksm_del_vmas(mm); } int ksm_madvise(struct vm_area_struct *vma, unsigned long start, unsigned long end, int advice, vm_flags_t *vm_flags) { struct mm_struct *mm = vma->vm_mm; int err; switch (advice) { case MADV_MERGEABLE: if (vma->vm_flags & VM_MERGEABLE) return 0; if (!vma_ksm_compatible(vma)) return 0; if (!mm_flags_test(MMF_VM_MERGEABLE, mm)) { err = __ksm_enter(mm); if (err) return err; } *vm_flags |= VM_MERGEABLE; break; case MADV_UNMERGEABLE: if (!(*vm_flags & VM_MERGEABLE)) return 0; /* just ignore the advice */ if (vma->anon_vma) { err = break_ksm(vma, start, end, true); if (err) return err; } *vm_flags &= ~VM_MERGEABLE; break; } return 0; } EXPORT_SYMBOL_GPL(ksm_madvise); int __ksm_enter(struct mm_struct *mm) { struct ksm_mm_slot *mm_slot; struct mm_slot *slot; int needs_wakeup; mm_slot = mm_slot_alloc(mm_slot_cache); if (!mm_slot) return -ENOMEM; slot = &mm_slot->slot; /* Check ksm_run too? Would need tighter locking */ needs_wakeup = list_empty(&ksm_mm_head.slot.mm_node); spin_lock(&ksm_mmlist_lock); mm_slot_insert(mm_slots_hash, mm, slot); /* * When KSM_RUN_MERGE (or KSM_RUN_STOP), * insert just behind the scanning cursor, to let the area settle * down a little; when fork is followed by immediate exec, we don't * want ksmd to waste time setting up and tearing down an rmap_list. * * But when KSM_RUN_UNMERGE, it's important to insert ahead of its * scanning cursor, otherwise KSM pages in newly forked mms will be * missed: then we might as well insert at the end of the list. 
*/ if (ksm_run & KSM_RUN_UNMERGE) list_add_tail(&slot->mm_node, &ksm_mm_head.slot.mm_node); else list_add_tail(&slot->mm_node, &ksm_scan.mm_slot->slot.mm_node); spin_unlock(&ksm_mmlist_lock); mm_flags_set(MMF_VM_MERGEABLE, mm); mmgrab(mm); if (needs_wakeup) wake_up_interruptible(&ksm_thread_wait); trace_ksm_enter(mm); return 0; } void __ksm_exit(struct mm_struct *mm) { struct ksm_mm_slot *mm_slot = NULL; struct mm_slot *slot; int easy_to_free = 0; /* * This process is exiting: if it's straightforward (as is the * case when ksmd was never running), free mm_slot immediately. * But if it's at the cursor or has rmap_items linked to it, use * mmap_lock to synchronize with any break_cows before pagetables * are freed, and leave the mm_slot on the list for ksmd to free. * Beware: ksm may already have noticed it exiting and freed the slot. */ spin_lock(&ksm_mmlist_lock); slot = mm_slot_lookup(mm_slots_hash, mm); if (!slot) goto unlock; mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot); if (ksm_scan.mm_slot == mm_slot) goto unlock; if (!mm_slot->rmap_list) { hash_del(&slot->hash); list_del(&slot->mm_node); easy_to_free = 1; } else { list_move(&slot->mm_node, &ksm_scan.mm_slot->slot.mm_node); } unlock: spin_unlock(&ksm_mmlist_lock); if (easy_to_free) { mm_slot_free(mm_slot_cache, mm_slot); mm_flags_clear(MMF_VM_MERGE_ANY, mm); mm_flags_clear(MMF_VM_MERGEABLE, mm); mmdrop(mm); } else if (mm_slot) { mmap_write_lock(mm); mmap_write_unlock(mm); } trace_ksm_exit(mm); } struct folio *ksm_might_need_to_copy(struct folio *folio, struct vm_area_struct *vma, unsigned long addr) { struct page *page = folio_page(folio, 0); struct anon_vma *anon_vma = folio_anon_vma(folio); struct folio *new_folio; if (folio_test_large(folio)) return folio; if (folio_test_ksm(folio)) { if (folio_stable_node(folio) && !(ksm_run & KSM_RUN_UNMERGE)) return folio; /* no need to copy it */ } else if (!anon_vma) { return folio; /* no need to copy it */ } else if (folio->index == linear_page_index(vma, addr) && anon_vma->root == vma->anon_vma->root) { return folio; /* still no need to copy it */ } if (PageHWPoison(page)) return ERR_PTR(-EHWPOISON); if (!folio_test_uptodate(folio)) return folio; /* let do_swap_page report the error */ new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); if (new_folio && mem_cgroup_charge(new_folio, vma->vm_mm, GFP_KERNEL)) { folio_put(new_folio); new_folio = NULL; } if (new_folio) { if (copy_mc_user_highpage(folio_page(new_folio, 0), page, addr, vma)) { folio_put(new_folio); return ERR_PTR(-EHWPOISON); } folio_set_dirty(new_folio); __folio_mark_uptodate(new_folio); __folio_set_locked(new_folio); #ifdef CONFIG_SWAP count_vm_event(KSM_SWPIN_COPY); #endif } return new_folio; } void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc) { struct ksm_stable_node *stable_node; struct ksm_rmap_item *rmap_item; int search_new_forks = 0; VM_BUG_ON_FOLIO(!folio_test_ksm(folio), folio); /* * Rely on the page lock to protect against concurrent modifications * to that page's node of the stable tree. 
*/ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); stable_node = folio_stable_node(folio); if (!stable_node) return; again: hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) { struct anon_vma *anon_vma = rmap_item->anon_vma; struct anon_vma_chain *vmac; struct vm_area_struct *vma; cond_resched(); if (!anon_vma_trylock_read(anon_vma)) { if (rwc->try_lock) { rwc->contended = true; return; } anon_vma_lock_read(anon_vma); } anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root, 0, ULONG_MAX) { unsigned long addr; cond_resched(); vma = vmac->vma; /* Ignore the stable/unstable/sqnr flags */ addr = rmap_item->address & PAGE_MASK; if (addr < vma->vm_start || addr >= vma->vm_end) continue; /* * Initially we examine only the vma which covers this * rmap_item; but later, if there is still work to do, * we examine covering vmas in other mms: in case they * were forked from the original since ksmd passed. */ if ((rmap_item->mm == vma->vm_mm) == search_new_forks) continue; if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg)) continue; if (!rwc->rmap_one(folio, vma, addr, rwc->arg)) { anon_vma_unlock_read(anon_vma); return; } if (rwc->done && rwc->done(folio)) { anon_vma_unlock_read(anon_vma); return; } } anon_vma_unlock_read(anon_vma); } if (!search_new_forks++) goto again; } #ifdef CONFIG_MEMORY_FAILURE /* * Collect processes when the error hit an ksm page. */ void collect_procs_ksm(const struct folio *folio, const struct page *page, struct list_head *to_kill, int force_early) { struct ksm_stable_node *stable_node; struct ksm_rmap_item *rmap_item; struct vm_area_struct *vma; struct task_struct *tsk; stable_node = folio_stable_node(folio); if (!stable_node) return; hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) { struct anon_vma *av = rmap_item->anon_vma; anon_vma_lock_read(av); rcu_read_lock(); for_each_process(tsk) { struct anon_vma_chain *vmac; unsigned long addr; struct task_struct *t = task_early_kill(tsk, force_early); if (!t) continue; anon_vma_interval_tree_foreach(vmac, &av->rb_root, 0, ULONG_MAX) { vma = vmac->vma; if (vma->vm_mm == t->mm) { addr = rmap_item->address & PAGE_MASK; add_to_kill_ksm(t, page, vma, to_kill, addr); } } } rcu_read_unlock(); anon_vma_unlock_read(av); } } #endif #ifdef CONFIG_MIGRATION void folio_migrate_ksm(struct folio *newfolio, struct folio *folio) { struct ksm_stable_node *stable_node; VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(!folio_test_locked(newfolio), newfolio); VM_BUG_ON_FOLIO(newfolio->mapping != folio->mapping, newfolio); stable_node = folio_stable_node(folio); if (stable_node) { VM_BUG_ON_FOLIO(stable_node->kpfn != folio_pfn(folio), folio); stable_node->kpfn = folio_pfn(newfolio); /* * newfolio->mapping was set in advance; now we need smp_wmb() * to make sure that the new stable_node->kpfn is visible * to ksm_get_folio() before it can see that folio->mapping * has gone stale (or that the swapcache flag has been cleared). 
*/ smp_wmb(); folio_set_stable_node(folio, NULL); } } #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_MEMORY_HOTREMOVE static void wait_while_offlining(void) { while (ksm_run & KSM_RUN_OFFLINE) { mutex_unlock(&ksm_thread_mutex); wait_on_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE), TASK_UNINTERRUPTIBLE); mutex_lock(&ksm_thread_mutex); } } static bool stable_node_dup_remove_range(struct ksm_stable_node *stable_node, unsigned long start_pfn, unsigned long end_pfn) { if (stable_node->kpfn >= start_pfn && stable_node->kpfn < end_pfn) { /* * Don't ksm_get_folio, page has already gone: * which is why we keep kpfn instead of page* */ remove_node_from_stable_tree(stable_node); return true; } return false; } static bool stable_node_chain_remove_range(struct ksm_stable_node *stable_node, unsigned long start_pfn, unsigned long end_pfn, struct rb_root *root) { struct ksm_stable_node *dup; struct hlist_node *hlist_safe; if (!is_stable_node_chain(stable_node)) { VM_BUG_ON(is_stable_node_dup(stable_node)); return stable_node_dup_remove_range(stable_node, start_pfn, end_pfn); } hlist_for_each_entry_safe(dup, hlist_safe, &stable_node->hlist, hlist_dup) { VM_BUG_ON(!is_stable_node_dup(dup)); stable_node_dup_remove_range(dup, start_pfn, end_pfn); } if (hlist_empty(&stable_node->hlist)) { free_stable_node_chain(stable_node, root); return true; /* notify caller that tree was rebalanced */ } else return false; } static void ksm_check_stable_tree(unsigned long start_pfn, unsigned long end_pfn) { struct ksm_stable_node *stable_node, *next; struct rb_node *node; int nid; for (nid = 0; nid < ksm_nr_node_ids; nid++) { node = rb_first(root_stable_tree + nid); while (node) { stable_node = rb_entry(node, struct ksm_stable_node, node); if (stable_node_chain_remove_range(stable_node, start_pfn, end_pfn, root_stable_tree + nid)) node = rb_first(root_stable_tree + nid); else node = rb_next(node); cond_resched(); } } list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) { if (stable_node->kpfn >= start_pfn && stable_node->kpfn < end_pfn) remove_node_from_stable_tree(stable_node); cond_resched(); } } static int ksm_memory_callback(struct notifier_block *self, unsigned long action, void *arg) { struct memory_notify *mn = arg; switch (action) { case MEM_GOING_OFFLINE: /* * Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items() * and remove_all_stable_nodes() while memory is going offline: * it is unsafe for them to touch the stable tree at this time. * But break_ksm(), rmap lookups and other entry points * which do not need the ksm_thread_mutex are all safe. */ mutex_lock(&ksm_thread_mutex); ksm_run |= KSM_RUN_OFFLINE; mutex_unlock(&ksm_thread_mutex); break; case MEM_OFFLINE: /* * Most of the work is done by page migration; but there might * be a few stable_nodes left over, still pointing to struct * pages which have been offlined: prune those from the tree, * otherwise ksm_get_folio() might later try to access a * non-existent struct page. */ ksm_check_stable_tree(mn->start_pfn, mn->start_pfn + mn->nr_pages); fallthrough; case MEM_CANCEL_OFFLINE: mutex_lock(&ksm_thread_mutex); ksm_run &= ~KSM_RUN_OFFLINE; mutex_unlock(&ksm_thread_mutex); smp_mb(); /* wake_up_bit advises this */ wake_up_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE)); break; } return NOTIFY_OK; } #else static void wait_while_offlining(void) { } #endif /* CONFIG_MEMORY_HOTREMOVE */ #ifdef CONFIG_PROC_FS /* * The process is mergeable only if any VMA is currently * applicable to KSM. * * The mmap lock must be held in read mode. 
*/ bool ksm_process_mergeable(struct mm_struct *mm) { struct vm_area_struct *vma; mmap_assert_locked(mm); VMA_ITERATOR(vmi, mm, 0); for_each_vma(vmi, vma) if (vma->vm_flags & VM_MERGEABLE) return true; return false; } long ksm_process_profit(struct mm_struct *mm) { return (long)(mm->ksm_merging_pages + mm_ksm_zero_pages(mm)) * PAGE_SIZE - mm->ksm_rmap_items * sizeof(struct ksm_rmap_item); } #endif /* CONFIG_PROC_FS */ #ifdef CONFIG_SYSFS /* * This all compiles without CONFIG_SYSFS, but is a waste of space. */ #define KSM_ATTR_RO(_name) \ static struct kobj_attribute _name##_attr = __ATTR_RO(_name) #define KSM_ATTR(_name) \ static struct kobj_attribute _name##_attr = __ATTR_RW(_name) static ssize_t sleep_millisecs_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_thread_sleep_millisecs); } static ssize_t sleep_millisecs_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { unsigned int msecs; int err; err = kstrtouint(buf, 10, &msecs); if (err) return -EINVAL; ksm_thread_sleep_millisecs = msecs; wake_up_interruptible(&ksm_iter_wait); return count; } KSM_ATTR(sleep_millisecs); static ssize_t pages_to_scan_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_thread_pages_to_scan); } static ssize_t pages_to_scan_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { unsigned int nr_pages; int err; if (ksm_advisor != KSM_ADVISOR_NONE) return -EINVAL; err = kstrtouint(buf, 10, &nr_pages); if (err) return -EINVAL; ksm_thread_pages_to_scan = nr_pages; return count; } KSM_ATTR(pages_to_scan); static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_run); } static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { unsigned int flags; int err; err = kstrtouint(buf, 10, &flags); if (err) return -EINVAL; if (flags > KSM_RUN_UNMERGE) return -EINVAL; /* * KSM_RUN_MERGE sets ksmd running, and 0 stops it running. * KSM_RUN_UNMERGE stops it running and unmerges all rmap_items, * breaking COW to free the pages_shared (but leaves mm_slots * on the list for when ksmd may be set running again). */ mutex_lock(&ksm_thread_mutex); wait_while_offlining(); if (ksm_run != flags) { ksm_run = flags; if (flags & KSM_RUN_UNMERGE) { set_current_oom_origin(); err = unmerge_and_remove_all_rmap_items(); clear_current_oom_origin(); if (err) { ksm_run = KSM_RUN_STOP; count = err; } } } mutex_unlock(&ksm_thread_mutex); if (flags & KSM_RUN_MERGE) wake_up_interruptible(&ksm_thread_wait); return count; } KSM_ATTR(run); #ifdef CONFIG_NUMA static ssize_t merge_across_nodes_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_merge_across_nodes); } static ssize_t merge_across_nodes_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; unsigned long knob; err = kstrtoul(buf, 10, &knob); if (err) return err; if (knob > 1) return -EINVAL; mutex_lock(&ksm_thread_mutex); wait_while_offlining(); if (ksm_merge_across_nodes != knob) { if (ksm_pages_shared || remove_all_stable_nodes()) err = -EBUSY; else if (root_stable_tree == one_stable_tree) { struct rb_root *buf; /* * This is the first time that we switch away from the * default of merging across nodes: must now allocate * a buffer to hold as many roots as may be needed. 
* Allocate stable and unstable together: * MAXSMP NODES_SHIFT 10 will use 16kB. */ buf = kcalloc(nr_node_ids + nr_node_ids, sizeof(*buf), GFP_KERNEL); /* Let us assume that RB_ROOT is NULL is zero */ if (!buf) err = -ENOMEM; else { root_stable_tree = buf; root_unstable_tree = buf + nr_node_ids; /* Stable tree is empty but not the unstable */ root_unstable_tree[0] = one_unstable_tree[0]; } } if (!err) { ksm_merge_across_nodes = knob; ksm_nr_node_ids = knob ? 1 : nr_node_ids; } } mutex_unlock(&ksm_thread_mutex); return err ? err : count; } KSM_ATTR(merge_across_nodes); #endif static ssize_t use_zero_pages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_use_zero_pages); } static ssize_t use_zero_pages_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; bool value; err = kstrtobool(buf, &value); if (err) return -EINVAL; ksm_use_zero_pages = value; return count; } KSM_ATTR(use_zero_pages); static ssize_t max_page_sharing_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_max_page_sharing); } static ssize_t max_page_sharing_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; int knob; err = kstrtoint(buf, 10, &knob); if (err) return err; /* * When a KSM page is created it is shared by 2 mappings. This * being a signed comparison, it implicitly verifies it's not * negative. */ if (knob < 2) return -EINVAL; if (READ_ONCE(ksm_max_page_sharing) == knob) return count; mutex_lock(&ksm_thread_mutex); wait_while_offlining(); if (ksm_max_page_sharing != knob) { if (ksm_pages_shared || remove_all_stable_nodes()) err = -EBUSY; else ksm_max_page_sharing = knob; } mutex_unlock(&ksm_thread_mutex); return err ? err : count; } KSM_ATTR(max_page_sharing); static ssize_t pages_scanned_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_pages_scanned); } KSM_ATTR_RO(pages_scanned); static ssize_t pages_shared_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_pages_shared); } KSM_ATTR_RO(pages_shared); static ssize_t pages_sharing_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_pages_sharing); } KSM_ATTR_RO(pages_sharing); static ssize_t pages_unshared_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_pages_unshared); } KSM_ATTR_RO(pages_unshared); static ssize_t pages_volatile_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { long ksm_pages_volatile; ksm_pages_volatile = ksm_rmap_items - ksm_pages_shared - ksm_pages_sharing - ksm_pages_unshared; /* * It was not worth any locking to calculate that statistic, * but it might therefore sometimes be negative: conceal that. 
*/ if (ksm_pages_volatile < 0) ksm_pages_volatile = 0; return sysfs_emit(buf, "%ld\n", ksm_pages_volatile); } KSM_ATTR_RO(pages_volatile); static ssize_t pages_skipped_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_pages_skipped); } KSM_ATTR_RO(pages_skipped); static ssize_t ksm_zero_pages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%ld\n", atomic_long_read(&ksm_zero_pages)); } KSM_ATTR_RO(ksm_zero_pages); static ssize_t general_profit_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { long general_profit; general_profit = (ksm_pages_sharing + atomic_long_read(&ksm_zero_pages)) * PAGE_SIZE - ksm_rmap_items * sizeof(struct ksm_rmap_item); return sysfs_emit(buf, "%ld\n", general_profit); } KSM_ATTR_RO(general_profit); static ssize_t stable_node_dups_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_stable_node_dups); } KSM_ATTR_RO(stable_node_dups); static ssize_t stable_node_chains_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_stable_node_chains); } KSM_ATTR_RO(stable_node_chains); static ssize_t stable_node_chains_prune_millisecs_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_stable_node_chains_prune_millisecs); } static ssize_t stable_node_chains_prune_millisecs_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { unsigned int msecs; int err; err = kstrtouint(buf, 10, &msecs); if (err) return -EINVAL; ksm_stable_node_chains_prune_millisecs = msecs; return count; } KSM_ATTR(stable_node_chains_prune_millisecs); static ssize_t full_scans_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_scan.seqnr); } KSM_ATTR_RO(full_scans); static ssize_t smart_scan_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_smart_scan); } static ssize_t smart_scan_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; bool value; err = kstrtobool(buf, &value); if (err) return -EINVAL; ksm_smart_scan = value; return count; } KSM_ATTR(smart_scan); static ssize_t advisor_mode_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { const char *output; if (ksm_advisor == KSM_ADVISOR_SCAN_TIME) output = "none [scan-time]"; else output = "[none] scan-time"; return sysfs_emit(buf, "%s\n", output); } static ssize_t advisor_mode_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { enum ksm_advisor_type curr_advisor = ksm_advisor; if (sysfs_streq("scan-time", buf)) ksm_advisor = KSM_ADVISOR_SCAN_TIME; else if (sysfs_streq("none", buf)) ksm_advisor = KSM_ADVISOR_NONE; else return -EINVAL; /* Set advisor default values */ if (curr_advisor != ksm_advisor) set_advisor_defaults(); return count; } KSM_ATTR(advisor_mode); static ssize_t advisor_max_cpu_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_advisor_max_cpu); } static ssize_t advisor_max_cpu_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; unsigned long value; err = kstrtoul(buf, 10, &value); if (err) return -EINVAL; ksm_advisor_max_cpu = value; return count; } KSM_ATTR(advisor_max_cpu); static ssize_t advisor_min_pages_to_scan_show(struct 
kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_advisor_min_pages_to_scan); } static ssize_t advisor_min_pages_to_scan_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; unsigned long value; err = kstrtoul(buf, 10, &value); if (err) return -EINVAL; ksm_advisor_min_pages_to_scan = value; return count; } KSM_ATTR(advisor_min_pages_to_scan); static ssize_t advisor_max_pages_to_scan_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_advisor_max_pages_to_scan); } static ssize_t advisor_max_pages_to_scan_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; unsigned long value; err = kstrtoul(buf, 10, &value); if (err) return -EINVAL; ksm_advisor_max_pages_to_scan = value; return count; } KSM_ATTR(advisor_max_pages_to_scan); static ssize_t advisor_target_scan_time_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_advisor_target_scan_time); } static ssize_t advisor_target_scan_time_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; unsigned long value; err = kstrtoul(buf, 10, &value); if (err) return -EINVAL; if (value < 1) return -EINVAL; ksm_advisor_target_scan_time = value; return count; } KSM_ATTR(advisor_target_scan_time); static struct attribute *ksm_attrs[] = { &sleep_millisecs_attr.attr, &pages_to_scan_attr.attr, &run_attr.attr, &pages_scanned_attr.attr, &pages_shared_attr.attr, &pages_sharing_attr.attr, &pages_unshared_attr.attr, &pages_volatile_attr.attr, &pages_skipped_attr.attr, &ksm_zero_pages_attr.attr, &full_scans_attr.attr, #ifdef CONFIG_NUMA &merge_across_nodes_attr.attr, #endif &max_page_sharing_attr.attr, &stable_node_chains_attr.attr, &stable_node_dups_attr.attr, &stable_node_chains_prune_millisecs_attr.attr, &use_zero_pages_attr.attr, &general_profit_attr.attr, &smart_scan_attr.attr, &advisor_mode_attr.attr, &advisor_max_cpu_attr.attr, &advisor_min_pages_to_scan_attr.attr, &advisor_max_pages_to_scan_attr.attr, &advisor_target_scan_time_attr.attr, NULL, }; static const struct attribute_group ksm_attr_group = { .attrs = ksm_attrs, .name = "ksm", }; #endif /* CONFIG_SYSFS */ static int __init ksm_init(void) { struct task_struct *ksm_thread; int err; /* The correct value depends on page size and endianness */ zero_checksum = calc_checksum(ZERO_PAGE(0)); /* Default to false for backwards compatibility */ ksm_use_zero_pages = false; err = ksm_slab_init(); if (err) goto out; ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd"); if (IS_ERR(ksm_thread)) { pr_err("ksm: creating kthread failed\n"); err = PTR_ERR(ksm_thread); goto out_free; } #ifdef CONFIG_SYSFS err = sysfs_create_group(mm_kobj, &ksm_attr_group); if (err) { pr_err("ksm: register sysfs failed\n"); kthread_stop(ksm_thread); goto out_free; } #else ksm_run = KSM_RUN_MERGE; /* no way for user to start it */ #endif /* CONFIG_SYSFS */ #ifdef CONFIG_MEMORY_HOTREMOVE /* There is no significance to this priority 100 */ hotplug_memory_notifier(ksm_memory_callback, KSM_CALLBACK_PRI); #endif return 0; out_free: ksm_slab_free(); out: return err; } subsys_initcall(ksm_init); |
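The sysfs attributes registered above under mm_kobj (run, pages_to_scan, sleep_millisecs, pages_shared, pages_sharing, ...) together with ksm_madvise()/MADV_MERGEABLE form the userspace-facing surface of KSM. The following is a minimal illustrative userspace sketch, not part of ksm.c, assuming a kernel built with CONFIG_KSM and privileges to write /sys/kernel/mm/ksm/run; it opts an anonymous mapping into scanning and then reads the counters exported by pages_shared_show()/pages_sharing_show().

/*
 * Hypothetical userspace demo: populate identical anonymous pages,
 * advise them as mergeable, start ksmd, and inspect the KSM counters.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 256 * 4096;	/* 256 pages with identical contents */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Non-zero fill so pages are not candidates for the zero page. */
	memset(buf, 0x5a, len);

	/* Opt this range into KSM scanning; handled by ksm_madvise(). */
	if (madvise(buf, len, MADV_MERGEABLE)) {
		perror("madvise(MADV_MERGEABLE)");
		return 1;
	}

	/* KSM_RUN_MERGE: set ksmd running (requires root). */
	system("echo 1 > /sys/kernel/mm/ksm/run");

	/* After a few scan cycles these counters should rise. */
	system("grep . /sys/kernel/mm/ksm/pages_shared"
	       " /sys/kernel/mm/ksm/pages_sharing");
	return 0;
}

Calling madvise(..., MADV_UNMERGEABLE) on the same range, or writing 2 (KSM_RUN_UNMERGE) to the run file, undoes the merging by breaking COW on the shared pages, as implemented in ksm_madvise() and run_store() above.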
// SPDX-License-Identifier: GPL-2.0 /* * NVMe over Fabrics common host code. * Copyright (c) 2015-2016 HGST, a Western Digital Company.
*/ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/init.h> #include <linux/miscdevice.h> #include <linux/module.h> #include <linux/mutex.h> #include <linux/parser.h> #include <linux/seq_file.h> #include "nvme.h" #include "fabrics.h" #include <linux/nvme-keyring.h> static LIST_HEAD(nvmf_transports); static DECLARE_RWSEM(nvmf_transports_rwsem); static LIST_HEAD(nvmf_hosts); static DEFINE_MUTEX(nvmf_hosts_mutex); static struct nvmf_host *nvmf_default_host; static struct nvmf_host *nvmf_host_alloc(const char *hostnqn, uuid_t *id) { struct nvmf_host *host; host = kmalloc(sizeof(*host), GFP_KERNEL); if (!host) return NULL; kref_init(&host->ref); uuid_copy(&host->id, id); strscpy(host->nqn, hostnqn, NVMF_NQN_SIZE); return host; } static struct nvmf_host *nvmf_host_add(const char *hostnqn, uuid_t *id) { struct nvmf_host *host; mutex_lock(&nvmf_hosts_mutex); /* * We have defined a host as how it is perceived by the target. * Therefore, we don't allow different Host NQNs with the same Host ID. * Similarly, we do not allow the usage of the same Host NQN with * different Host IDs. This'll maintain unambiguous host identification. */ list_for_each_entry(host, &nvmf_hosts, list) { bool same_hostnqn = !strcmp(host->nqn, hostnqn); bool same_hostid = uuid_equal(&host->id, id); if (same_hostnqn && same_hostid) { kref_get(&host->ref); goto out_unlock; } if (same_hostnqn) { pr_err("found same hostnqn %s but different hostid %pUb\n", hostnqn, id); host = ERR_PTR(-EINVAL); goto out_unlock; } if (same_hostid) { pr_err("found same hostid %pUb but different hostnqn %s\n", id, hostnqn); host = ERR_PTR(-EINVAL); goto out_unlock; } } host = nvmf_host_alloc(hostnqn, id); if (!host) { host = ERR_PTR(-ENOMEM); goto out_unlock; } list_add_tail(&host->list, &nvmf_hosts); out_unlock: mutex_unlock(&nvmf_hosts_mutex); return host; } static struct nvmf_host *nvmf_host_default(void) { struct nvmf_host *host; char nqn[NVMF_NQN_SIZE]; uuid_t id; uuid_gen(&id); snprintf(nqn, NVMF_NQN_SIZE, "nqn.2014-08.org.nvmexpress:uuid:%pUb", &id); host = nvmf_host_alloc(nqn, &id); if (!host) return NULL; mutex_lock(&nvmf_hosts_mutex); list_add_tail(&host->list, &nvmf_hosts); mutex_unlock(&nvmf_hosts_mutex); return host; } static void nvmf_host_destroy(struct kref *ref) { struct nvmf_host *host = container_of(ref, struct nvmf_host, ref); mutex_lock(&nvmf_hosts_mutex); list_del(&host->list); mutex_unlock(&nvmf_hosts_mutex); kfree(host); } static void nvmf_host_put(struct nvmf_host *host) { if (host) kref_put(&host->ref, nvmf_host_destroy); } /** * nvmf_get_address() - Get address/port * @ctrl: Host NVMe controller instance which we got the address * @buf: OUTPUT parameter that will contain the address/port * @size: buffer size */ int nvmf_get_address(struct nvme_ctrl *ctrl, char *buf, int size) { int len = 0; if (ctrl->opts->mask & NVMF_OPT_TRADDR) len += scnprintf(buf, size, "traddr=%s", ctrl->opts->traddr); if (ctrl->opts->mask & NVMF_OPT_TRSVCID) len += scnprintf(buf + len, size - len, "%strsvcid=%s", (len) ? "," : "", ctrl->opts->trsvcid); if (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR) len += scnprintf(buf + len, size - len, "%shost_traddr=%s", (len) ? "," : "", ctrl->opts->host_traddr); if (ctrl->opts->mask & NVMF_OPT_HOST_IFACE) len += scnprintf(buf + len, size - len, "%shost_iface=%s", (len) ? "," : "", ctrl->opts->host_iface); len += scnprintf(buf + len, size - len, "\n"); return len; } EXPORT_SYMBOL_GPL(nvmf_get_address); /** * nvmf_reg_read32() - NVMe Fabrics "Property Get" API function. 
* @ctrl: Host NVMe controller instance maintaining the admin * queue used to submit the property read command to * the allocated NVMe controller resource on the target system. * @off: Starting offset value of the targeted property * register (see the fabrics section of the NVMe standard). * @val: OUTPUT parameter that will contain the value of * the property after a successful read. * * Used by the host system to retrieve a 32-bit capsule property value * from an NVMe controller on the target system. * * ("Capsule property" is an "PCIe register concept" applied to the * NVMe fabrics space.) * * Return: * 0: successful read * > 0: NVMe error status code * < 0: Linux errno error code */ int nvmf_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val) { struct nvme_command cmd = { }; union nvme_result res; int ret; cmd.prop_get.opcode = nvme_fabrics_command; cmd.prop_get.fctype = nvme_fabrics_type_property_get; cmd.prop_get.offset = cpu_to_le32(off); ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, NULL, 0, NVME_QID_ANY, NVME_SUBMIT_RESERVED); if (ret >= 0) *val = le64_to_cpu(res.u64); if (unlikely(ret != 0)) dev_err(ctrl->device, "Property Get error: %d, offset %#x\n", ret > 0 ? ret & ~NVME_STATUS_DNR : ret, off); return ret; } EXPORT_SYMBOL_GPL(nvmf_reg_read32); /** * nvmf_reg_read64() - NVMe Fabrics "Property Get" API function. * @ctrl: Host NVMe controller instance maintaining the admin * queue used to submit the property read command to * the allocated controller resource on the target system. * @off: Starting offset value of the targeted property * register (see the fabrics section of the NVMe standard). * @val: OUTPUT parameter that will contain the value of * the property after a successful read. * * Used by the host system to retrieve a 64-bit capsule property value * from an NVMe controller on the target system. * * ("Capsule property" is an "PCIe register concept" applied to the * NVMe fabrics space.) * * Return: * 0: successful read * > 0: NVMe error status code * < 0: Linux errno error code */ int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val) { struct nvme_command cmd = { }; union nvme_result res; int ret; cmd.prop_get.opcode = nvme_fabrics_command; cmd.prop_get.fctype = nvme_fabrics_type_property_get; cmd.prop_get.attrib = 1; cmd.prop_get.offset = cpu_to_le32(off); ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, NULL, 0, NVME_QID_ANY, NVME_SUBMIT_RESERVED); if (ret >= 0) *val = le64_to_cpu(res.u64); if (unlikely(ret != 0)) dev_err(ctrl->device, "Property Get error: %d, offset %#x\n", ret > 0 ? ret & ~NVME_STATUS_DNR : ret, off); return ret; } EXPORT_SYMBOL_GPL(nvmf_reg_read64); /** * nvmf_reg_write32() - NVMe Fabrics "Property Write" API function. * @ctrl: Host NVMe controller instance maintaining the admin * queue used to submit the property read command to * the allocated NVMe controller resource on the target system. * @off: Starting offset value of the targeted property * register (see the fabrics section of the NVMe standard). * @val: Input parameter that contains the value to be * written to the property. * * Used by the NVMe host system to write a 32-bit capsule property value * to an NVMe controller on the target system. * * ("Capsule property" is an "PCIe register concept" applied to the * NVMe fabrics space.) 
* * Return: * 0: successful write * > 0: NVMe error status code * < 0: Linux errno error code */ int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val) { struct nvme_command cmd = { }; int ret; cmd.prop_set.opcode = nvme_fabrics_command; cmd.prop_set.fctype = nvme_fabrics_type_property_set; cmd.prop_set.attrib = 0; cmd.prop_set.offset = cpu_to_le32(off); cmd.prop_set.value = cpu_to_le64(val); ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, NULL, NULL, 0, NVME_QID_ANY, NVME_SUBMIT_RESERVED); if (unlikely(ret)) dev_err(ctrl->device, "Property Set error: %d, offset %#x\n", ret > 0 ? ret & ~NVME_STATUS_DNR : ret, off); return ret; } EXPORT_SYMBOL_GPL(nvmf_reg_write32); int nvmf_subsystem_reset(struct nvme_ctrl *ctrl) { int ret; if (!nvme_wait_reset(ctrl)) return -EBUSY; ret = ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, NVME_SUBSYS_RESET); if (ret) return ret; return nvme_try_sched_reset(ctrl); } EXPORT_SYMBOL_GPL(nvmf_subsystem_reset); /** * nvmf_log_connect_error() - Error-parsing-diagnostic print out function for * connect() errors. * @ctrl: The specific /dev/nvmeX device that had the error. * @errval: Error code to be decoded in a more human-friendly * printout. * @offset: For use with the NVMe error code * NVME_SC_CONNECT_INVALID_PARAM. * @cmd: This is the SQE portion of a submission capsule. * @data: This is the "Data" portion of a submission capsule. */ static void nvmf_log_connect_error(struct nvme_ctrl *ctrl, int errval, int offset, struct nvme_command *cmd, struct nvmf_connect_data *data) { int err_sctype = errval & ~NVME_STATUS_DNR; if (errval < 0) { dev_err(ctrl->device, "Connect command failed, errno: %d\n", errval); return; } switch (err_sctype) { case NVME_SC_CONNECT_INVALID_PARAM: if (offset >> 16) { char *inv_data = "Connect Invalid Data Parameter"; switch (offset & 0xffff) { case (offsetof(struct nvmf_connect_data, cntlid)): dev_err(ctrl->device, "%s, cntlid: %d\n", inv_data, data->cntlid); break; case (offsetof(struct nvmf_connect_data, hostnqn)): dev_err(ctrl->device, "%s, hostnqn \"%s\"\n", inv_data, data->hostnqn); break; case (offsetof(struct nvmf_connect_data, subsysnqn)): dev_err(ctrl->device, "%s, subsysnqn \"%s\"\n", inv_data, data->subsysnqn); break; default: dev_err(ctrl->device, "%s, starting byte offset: %d\n", inv_data, offset & 0xffff); break; } } else { char *inv_sqe = "Connect Invalid SQE Parameter"; switch (offset) { case (offsetof(struct nvmf_connect_command, qid)): dev_err(ctrl->device, "%s, qid %d\n", inv_sqe, cmd->connect.qid); break; default: dev_err(ctrl->device, "%s, starting byte offset: %d\n", inv_sqe, offset); } } break; case NVME_SC_CONNECT_INVALID_HOST: dev_err(ctrl->device, "Connect for subsystem %s is not allowed, hostnqn: %s\n", data->subsysnqn, data->hostnqn); break; case NVME_SC_CONNECT_CTRL_BUSY: dev_err(ctrl->device, "Connect command failed: controller is busy or not available\n"); break; case NVME_SC_CONNECT_FORMAT: dev_err(ctrl->device, "Connect incompatible format: %d", cmd->connect.recfmt); break; case NVME_SC_HOST_PATH_ERROR: dev_err(ctrl->device, "Connect command failed: host path error\n"); break; case NVME_SC_AUTH_REQUIRED: dev_err(ctrl->device, "Connect command failed: authentication required\n"); break; default: dev_err(ctrl->device, "Connect command failed, error wo/DNR bit: %d\n", err_sctype); break; } } static struct nvmf_connect_data *nvmf_connect_data_prep(struct nvme_ctrl *ctrl, u16 cntlid) { struct nvmf_connect_data *data; data = kzalloc(sizeof(*data), GFP_KERNEL); if (!data) return NULL; 
uuid_copy(&data->hostid, &ctrl->opts->host->id); data->cntlid = cpu_to_le16(cntlid); strscpy(data->subsysnqn, ctrl->opts->subsysnqn, NVMF_NQN_SIZE); strscpy(data->hostnqn, ctrl->opts->host->nqn, NVMF_NQN_SIZE); return data; } static void nvmf_connect_cmd_prep(struct nvme_ctrl *ctrl, u16 qid, struct nvme_command *cmd) { cmd->connect.opcode = nvme_fabrics_command; cmd->connect.fctype = nvme_fabrics_type_connect; cmd->connect.qid = cpu_to_le16(qid); if (qid) { cmd->connect.sqsize = cpu_to_le16(ctrl->sqsize); } else { cmd->connect.sqsize = cpu_to_le16(NVME_AQ_DEPTH - 1); /* * set keep-alive timeout in seconds granularity (ms * 1000) */ cmd->connect.kato = cpu_to_le32(ctrl->kato * 1000); } if (ctrl->opts->disable_sqflow) cmd->connect.cattr |= NVME_CONNECT_DISABLE_SQFLOW; } /** * nvmf_connect_admin_queue() - NVMe Fabrics Admin Queue "Connect" * API function. * @ctrl: Host nvme controller instance used to request * a new NVMe controller allocation on the target * system and establish an NVMe Admin connection to * that controller. * * This function enables an NVMe host device to request a new allocation of * an NVMe controller resource on a target system as well establish a * fabrics-protocol connection of the NVMe Admin queue between the * host system device and the allocated NVMe controller on the * target system via a NVMe Fabrics "Connect" command. */ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl) { struct nvme_command cmd = { }; union nvme_result res; struct nvmf_connect_data *data; int ret; u32 result; nvmf_connect_cmd_prep(ctrl, 0, &cmd); data = nvmf_connect_data_prep(ctrl, 0xffff); if (!data) return -ENOMEM; ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, data, sizeof(*data), NVME_QID_ANY, NVME_SUBMIT_AT_HEAD | NVME_SUBMIT_NOWAIT | NVME_SUBMIT_RESERVED); if (ret) { nvmf_log_connect_error(ctrl, ret, le32_to_cpu(res.u32), &cmd, data); goto out_free_data; } result = le32_to_cpu(res.u32); ctrl->cntlid = result & 0xFFFF; if (result & (NVME_CONNECT_AUTHREQ_ATR | NVME_CONNECT_AUTHREQ_ASCR)) { /* Check for secure concatenation */ if ((result & NVME_CONNECT_AUTHREQ_ASCR) && !ctrl->opts->concat) { dev_warn(ctrl->device, "qid 0: secure concatenation is not supported\n"); ret = -EOPNOTSUPP; goto out_free_data; } /* Authentication required */ ret = nvme_auth_negotiate(ctrl, 0); if (ret) { dev_warn(ctrl->device, "qid 0: authentication setup failed\n"); goto out_free_data; } ret = nvme_auth_wait(ctrl, 0); if (ret) { dev_warn(ctrl->device, "qid 0: authentication failed, error %d\n", ret); } else dev_info(ctrl->device, "qid 0: authenticated\n"); } out_free_data: kfree(data); return ret; } EXPORT_SYMBOL_GPL(nvmf_connect_admin_queue); /** * nvmf_connect_io_queue() - NVMe Fabrics I/O Queue "Connect" * API function. * @ctrl: Host nvme controller instance used to establish an * NVMe I/O queue connection to the already allocated NVMe * controller on the target system. * @qid: NVMe I/O queue number for the new I/O connection between * host and target (note qid == 0 is illegal as this is * the Admin queue, per NVMe standard). * * This function issues a fabrics-protocol connection * of a NVMe I/O queue (via NVMe Fabrics "Connect" command) * between the host system device and the allocated NVMe controller * on the target system. 
* * Return: * 0: success * > 0: NVMe error status code * < 0: Linux errno error code */ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid) { struct nvme_command cmd = { }; struct nvmf_connect_data *data; union nvme_result res; int ret; u32 result; nvmf_connect_cmd_prep(ctrl, qid, &cmd); data = nvmf_connect_data_prep(ctrl, ctrl->cntlid); if (!data) return -ENOMEM; ret = __nvme_submit_sync_cmd(ctrl->connect_q, &cmd, &res, data, sizeof(*data), qid, NVME_SUBMIT_AT_HEAD | NVME_SUBMIT_RESERVED | NVME_SUBMIT_NOWAIT); if (ret) { nvmf_log_connect_error(ctrl, ret, le32_to_cpu(res.u32), &cmd, data); goto out_free_data; } result = le32_to_cpu(res.u32); if (result & (NVME_CONNECT_AUTHREQ_ATR | NVME_CONNECT_AUTHREQ_ASCR)) { /* Secure concatenation is not implemented */ if (result & NVME_CONNECT_AUTHREQ_ASCR) { dev_warn(ctrl->device, "qid %d: secure concatenation is not supported\n", qid); ret = -EOPNOTSUPP; goto out_free_data; } /* Authentication required */ ret = nvme_auth_negotiate(ctrl, qid); if (ret) { dev_warn(ctrl->device, "qid %d: authentication setup failed\n", qid); goto out_free_data; } ret = nvme_auth_wait(ctrl, qid); if (ret) { dev_warn(ctrl->device, "qid %u: authentication failed, error %d\n", qid, ret); } } out_free_data: kfree(data); return ret; } EXPORT_SYMBOL_GPL(nvmf_connect_io_queue); /* * Evaluate the status information returned by the transport in order to decide * whether a reconnect attempt should be scheduled. * * Do not retry when: * * - the DNR bit is set and the specification states no further connect * attempts with the same set of parameters should be attempted. * * - the authentication attempt fails because the key was invalid. * This error code is set on the host side. */ bool nvmf_should_reconnect(struct nvme_ctrl *ctrl, int status) { if (status > 0 && (status & NVME_STATUS_DNR)) return false; if (status == -EKEYREJECTED || status == -ENOKEY) return false; if (ctrl->opts->max_reconnects == -1 || ctrl->nr_reconnects < ctrl->opts->max_reconnects) return true; return false; } EXPORT_SYMBOL_GPL(nvmf_should_reconnect); /** * nvmf_register_transport() - NVMe Fabrics Library registration function. * @ops: Transport ops instance to be registered to the * common fabrics library. * * API function that registers the type of specific transport fabric * being implemented to the common NVMe fabrics library. Part of * the overall init sequence of starting up a fabrics driver. */ int nvmf_register_transport(struct nvmf_transport_ops *ops) { if (!ops->create_ctrl) return -EINVAL; down_write(&nvmf_transports_rwsem); list_add_tail(&ops->entry, &nvmf_transports); up_write(&nvmf_transports_rwsem); return 0; } EXPORT_SYMBOL_GPL(nvmf_register_transport); /** * nvmf_unregister_transport() - NVMe Fabrics Library unregistration function. * @ops: Transport ops instance to be unregistered from the * common fabrics library. * * Fabrics API function that unregisters the type of specific transport * fabric being implemented from the common NVMe fabrics library. * Part of the overall exit sequence of unloading the implemented driver.
*/ void nvmf_unregister_transport(struct nvmf_transport_ops *ops) { down_write(&nvmf_transports_rwsem); list_del(&ops->entry); up_write(&nvmf_transports_rwsem); } EXPORT_SYMBOL_GPL(nvmf_unregister_transport); static struct nvmf_transport_ops *nvmf_lookup_transport( struct nvmf_ctrl_options *opts) { struct nvmf_transport_ops *ops; lockdep_assert_held(&nvmf_transports_rwsem); list_for_each_entry(ops, &nvmf_transports, entry) { if (strcmp(ops->name, opts->transport) == 0) return ops; } return NULL; } static struct key *nvmf_parse_key(int key_id) { struct key *key; if (!IS_ENABLED(CONFIG_NVME_TCP_TLS)) { pr_err("TLS is not supported\n"); return ERR_PTR(-EINVAL); } key = nvme_tls_key_lookup(key_id); if (IS_ERR(key)) pr_err("key id %08x not found\n", key_id); else pr_debug("Using key id %08x\n", key_id); return key; } static const match_table_t opt_tokens = { { NVMF_OPT_TRANSPORT, "transport=%s" }, { NVMF_OPT_TRADDR, "traddr=%s" }, { NVMF_OPT_TRSVCID, "trsvcid=%s" }, { NVMF_OPT_NQN, "nqn=%s" }, { NVMF_OPT_QUEUE_SIZE, "queue_size=%d" }, { NVMF_OPT_NR_IO_QUEUES, "nr_io_queues=%d" }, { NVMF_OPT_RECONNECT_DELAY, "reconnect_delay=%d" }, { NVMF_OPT_CTRL_LOSS_TMO, "ctrl_loss_tmo=%d" }, { NVMF_OPT_KATO, "keep_alive_tmo=%d" }, { NVMF_OPT_HOSTNQN, "hostnqn=%s" }, { NVMF_OPT_HOST_TRADDR, "host_traddr=%s" }, { NVMF_OPT_HOST_IFACE, "host_iface=%s" }, { NVMF_OPT_HOST_ID, "hostid=%s" }, { NVMF_OPT_DUP_CONNECT, "duplicate_connect" }, { NVMF_OPT_DISABLE_SQFLOW, "disable_sqflow" }, { NVMF_OPT_HDR_DIGEST, "hdr_digest" }, { NVMF_OPT_DATA_DIGEST, "data_digest" }, { NVMF_OPT_NR_WRITE_QUEUES, "nr_write_queues=%d" }, { NVMF_OPT_NR_POLL_QUEUES, "nr_poll_queues=%d" }, { NVMF_OPT_TOS, "tos=%d" }, #ifdef CONFIG_NVME_TCP_TLS { NVMF_OPT_KEYRING, "keyring=%d" }, { NVMF_OPT_TLS_KEY, "tls_key=%d" }, #endif { NVMF_OPT_FAIL_FAST_TMO, "fast_io_fail_tmo=%d" }, { NVMF_OPT_DISCOVERY, "discovery" }, #ifdef CONFIG_NVME_HOST_AUTH { NVMF_OPT_DHCHAP_SECRET, "dhchap_secret=%s" }, { NVMF_OPT_DHCHAP_CTRL_SECRET, "dhchap_ctrl_secret=%s" }, #endif #ifdef CONFIG_NVME_TCP_TLS { NVMF_OPT_TLS, "tls" }, { NVMF_OPT_CONCAT, "concat" }, #endif { NVMF_OPT_ERR, NULL } }; static int nvmf_parse_options(struct nvmf_ctrl_options *opts, const char *buf) { substring_t args[MAX_OPT_ARGS]; char *options, *o, *p; int token, ret = 0; size_t nqnlen = 0; int ctrl_loss_tmo = NVMF_DEF_CTRL_LOSS_TMO, key_id; uuid_t hostid; char hostnqn[NVMF_NQN_SIZE]; struct key *key; /* Set defaults */ opts->queue_size = NVMF_DEF_QUEUE_SIZE; opts->nr_io_queues = num_online_cpus(); opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY; opts->kato = 0; opts->duplicate_connect = false; opts->fast_io_fail_tmo = NVMF_DEF_FAIL_FAST_TMO; opts->hdr_digest = false; opts->data_digest = false; opts->tos = -1; /* < 0 == use transport default */ opts->tls = false; opts->tls_key = NULL; opts->keyring = NULL; opts->concat = false; options = o = kstrdup(buf, GFP_KERNEL); if (!options) return -ENOMEM; /* use default host if not given by user space */ uuid_copy(&hostid, &nvmf_default_host->id); strscpy(hostnqn, nvmf_default_host->nqn, NVMF_NQN_SIZE); while ((p = strsep(&o, ",\n")) != NULL) { if (!*p) continue; token = match_token(p, opt_tokens, args); opts->mask |= token; switch (token) { case NVMF_OPT_TRANSPORT: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } kfree(opts->transport); opts->transport = p; break; case NVMF_OPT_NQN: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } kfree(opts->subsysnqn); opts->subsysnqn = p; nqnlen = strlen(opts->subsysnqn); if (nqnlen >= 
NVMF_NQN_SIZE) { pr_err("%s needs to be < %d bytes\n", opts->subsysnqn, NVMF_NQN_SIZE); ret = -EINVAL; goto out; } opts->discovery_nqn = !(strcmp(opts->subsysnqn, NVME_DISC_SUBSYS_NAME)); break; case NVMF_OPT_TRADDR: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } kfree(opts->traddr); opts->traddr = p; break; case NVMF_OPT_TRSVCID: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } kfree(opts->trsvcid); opts->trsvcid = p; break; case NVMF_OPT_QUEUE_SIZE: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token < NVMF_MIN_QUEUE_SIZE || token > NVMF_MAX_QUEUE_SIZE) { pr_err("Invalid queue_size %d\n", token); ret = -EINVAL; goto out; } opts->queue_size = token; break; case NVMF_OPT_NR_IO_QUEUES: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token <= 0) { pr_err("Invalid number of IOQs %d\n", token); ret = -EINVAL; goto out; } if (opts->discovery_nqn) { pr_debug("Ignoring nr_io_queues value for discovery controller\n"); break; } opts->nr_io_queues = min_t(unsigned int, num_online_cpus(), token); break; case NVMF_OPT_KATO: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token < 0) { pr_err("Invalid keep_alive_tmo %d\n", token); ret = -EINVAL; goto out; } else if (token == 0 && !opts->discovery_nqn) { /* Allowed for debug */ pr_warn("keep_alive_tmo 0 won't execute keep alives!!!\n"); } opts->kato = token; break; case NVMF_OPT_CTRL_LOSS_TMO: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token < 0) pr_warn("ctrl_loss_tmo < 0 will reconnect forever\n"); ctrl_loss_tmo = token; break; case NVMF_OPT_FAIL_FAST_TMO: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token >= 0) pr_warn("I/O fail on reconnect controller after %d sec\n", token); else token = -1; opts->fast_io_fail_tmo = token; break; case NVMF_OPT_HOSTNQN: if (opts->host) { pr_err("hostnqn already user-assigned: %s\n", opts->host->nqn); ret = -EADDRINUSE; goto out; } p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } nqnlen = strlen(p); if (nqnlen >= NVMF_NQN_SIZE) { pr_err("%s needs to be < %d bytes\n", p, NVMF_NQN_SIZE); kfree(p); ret = -EINVAL; goto out; } strscpy(hostnqn, p, NVMF_NQN_SIZE); kfree(p); break; case NVMF_OPT_RECONNECT_DELAY: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token <= 0) { pr_err("Invalid reconnect_delay %d\n", token); ret = -EINVAL; goto out; } opts->reconnect_delay = token; break; case NVMF_OPT_HOST_TRADDR: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } kfree(opts->host_traddr); opts->host_traddr = p; break; case NVMF_OPT_HOST_IFACE: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } kfree(opts->host_iface); opts->host_iface = p; break; case NVMF_OPT_HOST_ID: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } ret = uuid_parse(p, &hostid); if (ret) { pr_err("Invalid hostid %s\n", p); ret = -EINVAL; kfree(p); goto out; } kfree(p); break; case NVMF_OPT_DUP_CONNECT: opts->duplicate_connect = true; break; case NVMF_OPT_DISABLE_SQFLOW: opts->disable_sqflow = true; break; case NVMF_OPT_HDR_DIGEST: opts->hdr_digest = true; break; case NVMF_OPT_DATA_DIGEST: opts->data_digest = true; break; case NVMF_OPT_NR_WRITE_QUEUES: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token <= 0) { pr_err("Invalid nr_write_queues %d\n", token); ret = -EINVAL; goto out; } opts->nr_write_queues = token; break; case NVMF_OPT_NR_POLL_QUEUES: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token <= 0) { pr_err("Invalid nr_poll_queues %d\n", token); ret = -EINVAL; 
goto out; } opts->nr_poll_queues = token; break; case NVMF_OPT_TOS: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token < 0) { pr_err("Invalid type of service %d\n", token); ret = -EINVAL; goto out; } if (token > 255) { pr_warn("Clamping type of service to 255\n"); token = 255; } opts->tos = token; break; case NVMF_OPT_KEYRING: if (match_int(args, &key_id) || key_id <= 0) { ret = -EINVAL; goto out; } key = nvmf_parse_key(key_id); if (IS_ERR(key)) { ret = PTR_ERR(key); goto out; } key_put(opts->keyring); opts->keyring = key; break; case NVMF_OPT_TLS_KEY: if (match_int(args, &key_id) || key_id <= 0) { ret = -EINVAL; goto out; } key = nvmf_parse_key(key_id); if (IS_ERR(key)) { ret = PTR_ERR(key); goto out; } key_put(opts->tls_key); opts->tls_key = key; break; case NVMF_OPT_DISCOVERY: opts->discovery_nqn = true; break; case NVMF_OPT_DHCHAP_SECRET: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } if (strlen(p) < 11 || strncmp(p, "DHHC-1:", 7)) { pr_err("Invalid DH-CHAP secret %s\n", p); ret = -EINVAL; goto out; } kfree(opts->dhchap_secret); opts->dhchap_secret = p; break; case NVMF_OPT_DHCHAP_CTRL_SECRET: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } if (strlen(p) < 11 || strncmp(p, "DHHC-1:", 7)) { pr_err("Invalid DH-CHAP secret %s\n", p); ret = -EINVAL; goto out; } kfree(opts->dhchap_ctrl_secret); opts->dhchap_ctrl_secret = p; break; case NVMF_OPT_TLS: if (!IS_ENABLED(CONFIG_NVME_TCP_TLS)) { pr_err("TLS is not supported\n"); ret = -EINVAL; goto out; } opts->tls = true; break; case NVMF_OPT_CONCAT: if (!IS_ENABLED(CONFIG_NVME_TCP_TLS)) { pr_err("TLS is not supported\n"); ret = -EINVAL; goto out; } opts->concat = true; break; default: pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n", p); ret = -EINVAL; goto out; } } if (opts->discovery_nqn) { opts->nr_io_queues = 0; opts->nr_write_queues = 0; opts->nr_poll_queues = 0; opts->duplicate_connect = true; } else { if (!opts->kato) opts->kato = NVME_DEFAULT_KATO; } if (ctrl_loss_tmo < 0) { opts->max_reconnects = -1; } else { opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo, opts->reconnect_delay); if (ctrl_loss_tmo < opts->fast_io_fail_tmo) pr_warn("failfast tmo (%d) larger than controller loss tmo (%d)\n", opts->fast_io_fail_tmo, ctrl_loss_tmo); } if (opts->concat) { if (opts->tls) { pr_err("Secure concatenation over TLS is not supported\n"); ret = -EINVAL; goto out; } if (opts->tls_key) { pr_err("Cannot specify a TLS key for secure concatenation\n"); ret = -EINVAL; goto out; } if (!opts->dhchap_secret) { pr_err("Need to enable DH-CHAP for secure concatenation\n"); ret = -EINVAL; goto out; } } opts->host = nvmf_host_add(hostnqn, &hostid); if (IS_ERR(opts->host)) { ret = PTR_ERR(opts->host); opts->host = NULL; goto out; } out: kfree(options); return ret; } void nvmf_set_io_queues(struct nvmf_ctrl_options *opts, u32 nr_io_queues, u32 io_queues[HCTX_MAX_TYPES]) { if (opts->nr_write_queues && opts->nr_io_queues < nr_io_queues) { /* * separate read/write queues * hand out dedicated default queues only after we have * sufficient read queues. */ io_queues[HCTX_TYPE_READ] = opts->nr_io_queues; nr_io_queues -= io_queues[HCTX_TYPE_READ]; io_queues[HCTX_TYPE_DEFAULT] = min(opts->nr_write_queues, nr_io_queues); nr_io_queues -= io_queues[HCTX_TYPE_DEFAULT]; } else { /* * shared read/write queues * either no write queues were requested, or we don't have * sufficient queue count to have dedicated default queues. 
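* For example, nr_io_queues=4 with no dedicated write queues puts all four * queues into HCTX_TYPE_DEFAULT; nvmf_map_queues() then points HCTX_TYPE_READ * at that same set.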
*/ io_queues[HCTX_TYPE_DEFAULT] = min(opts->nr_io_queues, nr_io_queues); nr_io_queues -= io_queues[HCTX_TYPE_DEFAULT]; } if (opts->nr_poll_queues && nr_io_queues) { /* map dedicated poll queues only if we have queues left */ io_queues[HCTX_TYPE_POLL] = min(opts->nr_poll_queues, nr_io_queues); } } EXPORT_SYMBOL_GPL(nvmf_set_io_queues); void nvmf_map_queues(struct blk_mq_tag_set *set, struct nvme_ctrl *ctrl, u32 io_queues[HCTX_MAX_TYPES]) { struct nvmf_ctrl_options *opts = ctrl->opts; if (opts->nr_write_queues && io_queues[HCTX_TYPE_READ]) { /* separate read/write queues */ set->map[HCTX_TYPE_DEFAULT].nr_queues = io_queues[HCTX_TYPE_DEFAULT]; set->map[HCTX_TYPE_DEFAULT].queue_offset = 0; set->map[HCTX_TYPE_READ].nr_queues = io_queues[HCTX_TYPE_READ]; set->map[HCTX_TYPE_READ].queue_offset = io_queues[HCTX_TYPE_DEFAULT]; } else { /* shared read/write queues */ set->map[HCTX_TYPE_DEFAULT].nr_queues = io_queues[HCTX_TYPE_DEFAULT]; set->map[HCTX_TYPE_DEFAULT].queue_offset = 0; set->map[HCTX_TYPE_READ].nr_queues = io_queues[HCTX_TYPE_DEFAULT]; set->map[HCTX_TYPE_READ].queue_offset = 0; } blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]); blk_mq_map_queues(&set->map[HCTX_TYPE_READ]); if (opts->nr_poll_queues && io_queues[HCTX_TYPE_POLL]) { /* map dedicated poll queues only if we have queues left */ set->map[HCTX_TYPE_POLL].nr_queues = io_queues[HCTX_TYPE_POLL]; set->map[HCTX_TYPE_POLL].queue_offset = io_queues[HCTX_TYPE_DEFAULT] + io_queues[HCTX_TYPE_READ]; blk_mq_map_queues(&set->map[HCTX_TYPE_POLL]); } dev_info(ctrl->device, "mapped %d/%d/%d default/read/poll queues.\n", io_queues[HCTX_TYPE_DEFAULT], io_queues[HCTX_TYPE_READ], io_queues[HCTX_TYPE_POLL]); } EXPORT_SYMBOL_GPL(nvmf_map_queues); static int nvmf_check_required_opts(struct nvmf_ctrl_options *opts, unsigned int required_opts) { if ((opts->mask & required_opts) != required_opts) { unsigned int i; for (i = 0; i < ARRAY_SIZE(opt_tokens); i++) { if ((opt_tokens[i].token & required_opts) && !(opt_tokens[i].token & opts->mask)) { pr_warn("missing parameter '%s'\n", opt_tokens[i].pattern); } } return -EINVAL; } return 0; } bool nvmf_ip_options_match(struct nvme_ctrl *ctrl, struct nvmf_ctrl_options *opts) { if (!nvmf_ctlr_matches_baseopts(ctrl, opts) || strcmp(opts->traddr, ctrl->opts->traddr) || strcmp(opts->trsvcid, ctrl->opts->trsvcid)) return false; /* * Checking the local address or host interfaces is rough. * * In most cases, none is specified and the host port or * host interface is selected by the stack. * * Assume no match if: * - local address or host interface is specified and address * or host interface is not the same * - local address or host interface is not specified but * remote is, or vice versa (admin using specific * host_traddr/host_iface when it matters). 
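* In short: for host_traddr and host_iface, both sides must either specify the * same value or both leave it unset.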
*/ if ((opts->mask & NVMF_OPT_HOST_TRADDR) && (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR)) { if (strcmp(opts->host_traddr, ctrl->opts->host_traddr)) return false; } else if ((opts->mask & NVMF_OPT_HOST_TRADDR) || (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR)) { return false; } if ((opts->mask & NVMF_OPT_HOST_IFACE) && (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)) { if (strcmp(opts->host_iface, ctrl->opts->host_iface)) return false; } else if ((opts->mask & NVMF_OPT_HOST_IFACE) || (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)) { return false; } return true; } EXPORT_SYMBOL_GPL(nvmf_ip_options_match); static int nvmf_check_allowed_opts(struct nvmf_ctrl_options *opts, unsigned int allowed_opts) { if (opts->mask & ~allowed_opts) { unsigned int i; for (i = 0; i < ARRAY_SIZE(opt_tokens); i++) { if ((opt_tokens[i].token & opts->mask) && (opt_tokens[i].token & ~allowed_opts)) { pr_warn("invalid parameter '%s'\n", opt_tokens[i].pattern); } } return -EINVAL; } return 0; } void nvmf_free_options(struct nvmf_ctrl_options *opts) { nvmf_host_put(opts->host); key_put(opts->keyring); key_put(opts->tls_key); kfree(opts->transport); kfree(opts->traddr); kfree(opts->trsvcid); kfree(opts->subsysnqn); kfree(opts->host_traddr); kfree(opts->host_iface); kfree(opts->dhchap_secret); kfree(opts->dhchap_ctrl_secret); kfree(opts); } EXPORT_SYMBOL_GPL(nvmf_free_options); #define NVMF_REQUIRED_OPTS (NVMF_OPT_TRANSPORT | NVMF_OPT_NQN) #define NVMF_ALLOWED_OPTS (NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \ NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \ NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\ NVMF_OPT_DISABLE_SQFLOW | NVMF_OPT_DISCOVERY |\ NVMF_OPT_FAIL_FAST_TMO | NVMF_OPT_DHCHAP_SECRET |\ NVMF_OPT_DHCHAP_CTRL_SECRET) static struct nvme_ctrl * nvmf_create_ctrl(struct device *dev, const char *buf) { struct nvmf_ctrl_options *opts; struct nvmf_transport_ops *ops; struct nvme_ctrl *ctrl; int ret; opts = kzalloc(sizeof(*opts), GFP_KERNEL); if (!opts) return ERR_PTR(-ENOMEM); ret = nvmf_parse_options(opts, buf); if (ret) goto out_free_opts; request_module("nvme-%s", opts->transport); /* * Check the generic options first as we need a valid transport for * the lookup below. Then clear the generic flags so that transport * drivers don't have to care about them. 
*/ ret = nvmf_check_required_opts(opts, NVMF_REQUIRED_OPTS); if (ret) goto out_free_opts; opts->mask &= ~NVMF_REQUIRED_OPTS; down_read(&nvmf_transports_rwsem); ops = nvmf_lookup_transport(opts); if (!ops) { pr_info("no handler found for transport %s.\n", opts->transport); ret = -EINVAL; goto out_unlock; } if (!try_module_get(ops->module)) { ret = -EBUSY; goto out_unlock; } up_read(&nvmf_transports_rwsem); ret = nvmf_check_required_opts(opts, ops->required_opts); if (ret) goto out_module_put; ret = nvmf_check_allowed_opts(opts, NVMF_ALLOWED_OPTS | ops->allowed_opts | ops->required_opts); if (ret) goto out_module_put; ctrl = ops->create_ctrl(dev, opts); if (IS_ERR(ctrl)) { ret = PTR_ERR(ctrl); goto out_module_put; } module_put(ops->module); return ctrl; out_module_put: module_put(ops->module); goto out_free_opts; out_unlock: up_read(&nvmf_transports_rwsem); out_free_opts: nvmf_free_options(opts); return ERR_PTR(ret); } static const struct class nvmf_class = { .name = "nvme-fabrics", }; static struct device *nvmf_device; static DEFINE_MUTEX(nvmf_dev_mutex); static ssize_t nvmf_dev_write(struct file *file, const char __user *ubuf, size_t count, loff_t *pos) { struct seq_file *seq_file = file->private_data; struct nvme_ctrl *ctrl; const char *buf; int ret = 0; if (count > PAGE_SIZE) return -ENOMEM; buf = memdup_user_nul(ubuf, count); if (IS_ERR(buf)) return PTR_ERR(buf); mutex_lock(&nvmf_dev_mutex); if (seq_file->private) { ret = -EINVAL; goto out_unlock; } ctrl = nvmf_create_ctrl(nvmf_device, buf); if (IS_ERR(ctrl)) { ret = PTR_ERR(ctrl); goto out_unlock; } seq_file->private = ctrl; out_unlock: mutex_unlock(&nvmf_dev_mutex); kfree(buf); return ret ? ret : count; } static void __nvmf_concat_opt_tokens(struct seq_file *seq_file) { const struct match_token *tok; int idx; /* * Add dummy entries for instance and cntlid to * signal an invalid/non-existing controller */ seq_puts(seq_file, "instance=-1,cntlid=-1"); for (idx = 0; idx < ARRAY_SIZE(opt_tokens); idx++) { tok = &opt_tokens[idx]; if (tok->token == NVMF_OPT_ERR) continue; seq_putc(seq_file, ','); seq_puts(seq_file, tok->pattern); } seq_putc(seq_file, '\n'); } static int nvmf_dev_show(struct seq_file *seq_file, void *private) { struct nvme_ctrl *ctrl; mutex_lock(&nvmf_dev_mutex); ctrl = seq_file->private; if (!ctrl) { __nvmf_concat_opt_tokens(seq_file); goto out_unlock; } seq_printf(seq_file, "instance=%d,cntlid=%d\n", ctrl->instance, ctrl->cntlid); out_unlock: mutex_unlock(&nvmf_dev_mutex); return 0; } static int nvmf_dev_open(struct inode *inode, struct file *file) { /* * The miscdevice code initializes file->private_data, but doesn't * make use of it later. 
*/ file->private_data = NULL; return single_open(file, nvmf_dev_show, NULL); } static int nvmf_dev_release(struct inode *inode, struct file *file) { struct seq_file *seq_file = file->private_data; struct nvme_ctrl *ctrl = seq_file->private; if (ctrl) nvme_put_ctrl(ctrl); return single_release(inode, file); } static const struct file_operations nvmf_dev_fops = { .owner = THIS_MODULE, .write = nvmf_dev_write, .read = seq_read, .open = nvmf_dev_open, .release = nvmf_dev_release, }; static struct miscdevice nvmf_misc = { .minor = MISC_DYNAMIC_MINOR, .name = "nvme-fabrics", .fops = &nvmf_dev_fops, }; static int __init nvmf_init(void) { int ret; nvmf_default_host = nvmf_host_default(); if (!nvmf_default_host) return -ENOMEM; ret = class_register(&nvmf_class); if (ret) { pr_err("couldn't register class nvme-fabrics\n"); goto out_free_host; } nvmf_device = device_create(&nvmf_class, NULL, MKDEV(0, 0), NULL, "ctl"); if (IS_ERR(nvmf_device)) { pr_err("couldn't create nvme-fabrics device!\n"); ret = PTR_ERR(nvmf_device); goto out_destroy_class; } ret = misc_register(&nvmf_misc); if (ret) { pr_err("couldn't register misc device: %d\n", ret); goto out_destroy_device; } return 0; out_destroy_device: device_destroy(&nvmf_class, MKDEV(0, 0)); out_destroy_class: class_unregister(&nvmf_class); out_free_host: nvmf_host_put(nvmf_default_host); return ret; } static void __exit nvmf_exit(void) { misc_deregister(&nvmf_misc); device_destroy(&nvmf_class, MKDEV(0, 0)); class_unregister(&nvmf_class); nvmf_host_put(nvmf_default_host); BUILD_BUG_ON(sizeof(struct nvmf_common_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_connect_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_property_get_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_property_set_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_auth_send_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_auth_receive_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_connect_data) != 1024); BUILD_BUG_ON(sizeof(struct nvmf_auth_dhchap_negotiate_data) != 8); BUILD_BUG_ON(sizeof(struct nvmf_auth_dhchap_challenge_data) != 16); BUILD_BUG_ON(sizeof(struct nvmf_auth_dhchap_reply_data) != 16); BUILD_BUG_ON(sizeof(struct nvmf_auth_dhchap_success1_data) != 16); BUILD_BUG_ON(sizeof(struct nvmf_auth_dhchap_success2_data) != 16); } MODULE_LICENSE("GPL v2"); MODULE_DESCRIPTION("NVMe host fabrics library"); module_init(nvmf_init); module_exit(nvmf_exit); |
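/*
 * Illustrative userspace sketch (not part of this driver): the write/read
 * protocol implemented by nvmf_dev_write() and nvmf_dev_show() above. A
 * controller is created by writing a comma-separated option string (see
 * opt_tokens) to /dev/nvme-fabrics, and the resulting "instance=...,cntlid=..."
 * line is read back. The transport, address and NQN values are placeholders.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *opts = "transport=tcp,traddr=192.0.2.1,trsvcid=4420,"
			   "nqn=nqn.2014-08.org.example:subsys1";
	char reply[256];
	ssize_t n;
	int fd;

	fd = open("/dev/nvme-fabrics", O_RDWR);
	if (fd < 0) {
		perror("open /dev/nvme-fabrics");
		return 1;
	}
	/* nvmf_dev_write() parses the options and calls the transport's create_ctrl(). */
	if (write(fd, opts, strlen(opts)) < 0) {
		perror("connect");
		close(fd);
		return 1;
	}
	/* nvmf_dev_show() reports the instance and cntlid of the new controller. */
	n = read(fd, reply, sizeof(reply) - 1);
	if (n > 0) {
		reply[n] = '\0';
		printf("%s", reply);
	}
	close(fd);
	return 0;
}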
// SPDX-License-Identifier: GPL-2.0-or-later /* * zswap.c - zswap driver file * * zswap is a cache that takes pages that are in the process * of being swapped out and attempts to compress and store them in a * RAM-based memory pool. This can result in a significant I/O reduction on * the swap device and, in the case where decompressing from RAM is faster * than reading from the swap device, can also improve workload performance. * * Copyright (C) 2012 Seth Jennings <sjenning@linux.vnet.ibm.com> */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/module.h> #include <linux/cpu.h> #include <linux/highmem.h> #include <linux/slab.h> #include <linux/spinlock.h> #include <linux/types.h> #include <linux/atomic.h> #include <linux/swap.h> #include <linux/crypto.h> #include <linux/scatterlist.h> #include <linux/mempolicy.h> #include <linux/mempool.h> #include <crypto/acompress.h> #include <linux/zswap.h> #include <linux/mm_types.h> #include <linux/page-flags.h> #include <linux/swapops.h> #include <linux/writeback.h> #include <linux/pagemap.h> #include <linux/workqueue.h> #include <linux/list_lru.h> #include <linux/zsmalloc.h> #include "swap.h" #include "internal.h" /********************************* * statistics **********************************/ /* The number of pages currently stored in zswap */ atomic_long_t zswap_stored_pages = ATOMIC_LONG_INIT(0); /* The number of incompressible pages currently stored in zswap */ static atomic_long_t zswap_stored_incompressible_pages = ATOMIC_LONG_INIT(0); /* * The statistics below are not protected from concurrent access for * performance reasons so they may not be a 100% accurate. However, * they do provide useful information on roughly how many times a * certain event is occurring.
*/ /* Pool limit was hit (see zswap_max_pool_percent) */ static u64 zswap_pool_limit_hit; /* Pages written back when pool limit was reached */ static u64 zswap_written_back_pages; /* Store failed due to a reclaim failure after pool limit was reached */ static u64 zswap_reject_reclaim_fail; /* Store failed due to compression algorithm failure */ static u64 zswap_reject_compress_fail; /* Compressed page was too big for the allocator to (optimally) store */ static u64 zswap_reject_compress_poor; /* Load or writeback failed due to decompression failure */ static u64 zswap_decompress_fail; /* Store failed because underlying allocator could not get memory */ static u64 zswap_reject_alloc_fail; /* Store failed because the entry metadata could not be allocated (rare) */ static u64 zswap_reject_kmemcache_fail; /* Shrinker work queue */ static struct workqueue_struct *shrink_wq; /* Pool limit was hit, we need to calm down */ static bool zswap_pool_reached_full; /********************************* * tunables **********************************/ #define ZSWAP_PARAM_UNSET "" static int zswap_setup(void); /* Enable/disable zswap */ static DEFINE_STATIC_KEY_MAYBE(CONFIG_ZSWAP_DEFAULT_ON, zswap_ever_enabled); static bool zswap_enabled = IS_ENABLED(CONFIG_ZSWAP_DEFAULT_ON); static int zswap_enabled_param_set(const char *, const struct kernel_param *); static const struct kernel_param_ops zswap_enabled_param_ops = { .set = zswap_enabled_param_set, .get = param_get_bool, }; module_param_cb(enabled, &zswap_enabled_param_ops, &zswap_enabled, 0644); /* Crypto compressor to use */ static char *zswap_compressor = CONFIG_ZSWAP_COMPRESSOR_DEFAULT; static int zswap_compressor_param_set(const char *, const struct kernel_param *); static const struct kernel_param_ops zswap_compressor_param_ops = { .set = zswap_compressor_param_set, .get = param_get_charp, .free = param_free_charp, }; module_param_cb(compressor, &zswap_compressor_param_ops, &zswap_compressor, 0644); /* The maximum percentage of memory that the compressed pool can occupy */ static unsigned int zswap_max_pool_percent = 20; module_param_named(max_pool_percent, zswap_max_pool_percent, uint, 0644); /* The threshold for accepting new pages after the max_pool_percent was hit */ static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */ module_param_named(accept_threshold_percent, zswap_accept_thr_percent, uint, 0644); /* Enable/disable memory pressure-based shrinker. */ static bool zswap_shrinker_enabled = IS_ENABLED( CONFIG_ZSWAP_SHRINKER_DEFAULT_ON); module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644); bool zswap_is_enabled(void) { return zswap_enabled; } bool zswap_never_enabled(void) { return !static_branch_maybe(CONFIG_ZSWAP_DEFAULT_ON, &zswap_ever_enabled); } /********************************* * data structures **********************************/ struct crypto_acomp_ctx { struct crypto_acomp *acomp; struct acomp_req *req; struct crypto_wait wait; u8 *buffer; struct mutex mutex; bool is_sleepable; }; /* * The lock ordering is zswap_tree.lock -> zswap_pool.lru_lock. * The only case where lru_lock is not acquired while holding tree.lock is * when a zswap_entry is taken off the lru for writeback, in that case it * needs to be verified that it's still valid in the tree. 
*/ struct zswap_pool { struct zs_pool *zs_pool; struct crypto_acomp_ctx __percpu *acomp_ctx; struct percpu_ref ref; struct list_head list; struct work_struct release_work; struct hlist_node node; char tfm_name[CRYPTO_MAX_ALG_NAME]; }; /* Global LRU lists shared by all zswap pools. */ static struct list_lru zswap_list_lru; /* The lock protects zswap_next_shrink updates. */ static DEFINE_SPINLOCK(zswap_shrink_lock); static struct mem_cgroup *zswap_next_shrink; static struct work_struct zswap_shrink_work; static struct shrinker *zswap_shrinker; /* * struct zswap_entry * * This structure contains the metadata for tracking a single compressed * page within zswap. * * swpentry - associated swap entry, the offset indexes into the xarray * length - the length in bytes of the compressed page data. Needed during * decompression. * referenced - true if the entry recently entered the zswap pool. Unset by the * writeback logic. The entry is only reclaimed by the writeback * logic if referenced is unset. See comments in the shrinker * section for context. * pool - the zswap_pool the entry's data is in * handle - zsmalloc allocation handle that stores the compressed page data * objcg - the obj_cgroup that the compressed memory is charged to * lru - handle to the pool's lru used to evict pages. */ struct zswap_entry { swp_entry_t swpentry; unsigned int length; bool referenced; struct zswap_pool *pool; unsigned long handle; struct obj_cgroup *objcg; struct list_head lru; }; static struct xarray *zswap_trees[MAX_SWAPFILES]; static unsigned int nr_zswap_trees[MAX_SWAPFILES]; /* RCU-protected iteration */ static LIST_HEAD(zswap_pools); /* protects zswap_pools list modification */ static DEFINE_SPINLOCK(zswap_pools_lock); /* pool counter to provide unique names to zsmalloc */ static atomic_t zswap_pools_count = ATOMIC_INIT(0); enum zswap_init_type { ZSWAP_UNINIT, ZSWAP_INIT_SUCCEED, ZSWAP_INIT_FAILED }; static enum zswap_init_type zswap_init_state; /* used to ensure the integrity of initialization */ static DEFINE_MUTEX(zswap_init_lock); /* init completed, but couldn't create the initial pool */ static bool zswap_has_pool; /********************************* * helpers and fwd declarations **********************************/ /* One swap address space for each 64M swap space */ #define ZSWAP_ADDRESS_SPACE_SHIFT 14 #define ZSWAP_ADDRESS_SPACE_PAGES (1 << ZSWAP_ADDRESS_SPACE_SHIFT) static inline struct xarray *swap_zswap_tree(swp_entry_t swp) { return &zswap_trees[swp_type(swp)][swp_offset(swp) >> ZSWAP_ADDRESS_SPACE_SHIFT]; } #define zswap_pool_debug(msg, p) \ pr_debug("%s pool %s\n", msg, (p)->tfm_name) /********************************* * pool functions **********************************/ static void __zswap_pool_empty(struct percpu_ref *ref); static struct zswap_pool *zswap_pool_create(char *compressor) { struct zswap_pool *pool; char name[38]; /* 'zswap' + 32 char (max) num + \0 */ int ret, cpu; if (!zswap_has_pool && !strcmp(compressor, ZSWAP_PARAM_UNSET)) return NULL; pool = kzalloc(sizeof(*pool), GFP_KERNEL); if (!pool) return NULL; /* unique name for each pool specifically required by zsmalloc */ snprintf(name, 38, "zswap%x", atomic_inc_return(&zswap_pools_count)); pool->zs_pool = zs_create_pool(name); if (!pool->zs_pool) goto error; strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name)); pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx); if (!pool->acomp_ctx) { pr_err("percpu alloc failed\n"); goto error; } for_each_possible_cpu(cpu) mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); ret = 
cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); if (ret) goto error; /* being the current pool takes 1 ref; this func expects the * caller to always add the new pool as the current pool */ ret = percpu_ref_init(&pool->ref, __zswap_pool_empty, PERCPU_REF_ALLOW_REINIT, GFP_KERNEL); if (ret) goto ref_fail; INIT_LIST_HEAD(&pool->list); zswap_pool_debug("created", pool); return pool; ref_fail: cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); error: if (pool->acomp_ctx) free_percpu(pool->acomp_ctx); if (pool->zs_pool) zs_destroy_pool(pool->zs_pool); kfree(pool); return NULL; } static struct zswap_pool *__zswap_pool_create_fallback(void) { if (!crypto_has_acomp(zswap_compressor, 0, 0) && strcmp(zswap_compressor, CONFIG_ZSWAP_COMPRESSOR_DEFAULT)) { pr_err("compressor %s not available, using default %s\n", zswap_compressor, CONFIG_ZSWAP_COMPRESSOR_DEFAULT); param_free_charp(&zswap_compressor); zswap_compressor = CONFIG_ZSWAP_COMPRESSOR_DEFAULT; } /* Default compressor should be available. Kconfig bug? */ if (WARN_ON_ONCE(!crypto_has_acomp(zswap_compressor, 0, 0))) { zswap_compressor = ZSWAP_PARAM_UNSET; return NULL; } return zswap_pool_create(zswap_compressor); } static void zswap_pool_destroy(struct zswap_pool *pool) { zswap_pool_debug("destroying", pool); cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); free_percpu(pool->acomp_ctx); zs_destroy_pool(pool->zs_pool); kfree(pool); } static void __zswap_pool_release(struct work_struct *work) { struct zswap_pool *pool = container_of(work, typeof(*pool), release_work); synchronize_rcu(); /* nobody should have been able to get a ref... */ WARN_ON(!percpu_ref_is_zero(&pool->ref)); percpu_ref_exit(&pool->ref); /* pool is now off zswap_pools list and has no references. */ zswap_pool_destroy(pool); } static struct zswap_pool *zswap_pool_current(void); static void __zswap_pool_empty(struct percpu_ref *ref) { struct zswap_pool *pool; pool = container_of(ref, typeof(*pool), ref); spin_lock_bh(&zswap_pools_lock); WARN_ON(pool == zswap_pool_current()); list_del_rcu(&pool->list); INIT_WORK(&pool->release_work, __zswap_pool_release); schedule_work(&pool->release_work); spin_unlock_bh(&zswap_pools_lock); } static int __must_check zswap_pool_tryget(struct zswap_pool *pool) { if (!pool) return 0; return percpu_ref_tryget(&pool->ref); } /* The caller must already have a reference. 
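* zswap_pool_get() only bumps that reference; callers that may race with pool * teardown use zswap_pool_tryget() instead (see zswap_pool_current_get() and * zswap_pool_find_get()).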
*/ static void zswap_pool_get(struct zswap_pool *pool) { percpu_ref_get(&pool->ref); } static void zswap_pool_put(struct zswap_pool *pool) { percpu_ref_put(&pool->ref); } static struct zswap_pool *__zswap_pool_current(void) { struct zswap_pool *pool; pool = list_first_or_null_rcu(&zswap_pools, typeof(*pool), list); WARN_ONCE(!pool && zswap_has_pool, "%s: no page storage pool!\n", __func__); return pool; } static struct zswap_pool *zswap_pool_current(void) { assert_spin_locked(&zswap_pools_lock); return __zswap_pool_current(); } static struct zswap_pool *zswap_pool_current_get(void) { struct zswap_pool *pool; rcu_read_lock(); pool = __zswap_pool_current(); if (!zswap_pool_tryget(pool)) pool = NULL; rcu_read_unlock(); return pool; } /* type and compressor must be null-terminated */ static struct zswap_pool *zswap_pool_find_get(char *compressor) { struct zswap_pool *pool; assert_spin_locked(&zswap_pools_lock); list_for_each_entry_rcu(pool, &zswap_pools, list) { if (strcmp(pool->tfm_name, compressor)) continue; /* if we can't get it, it's about to be destroyed */ if (!zswap_pool_tryget(pool)) continue; return pool; } return NULL; } static unsigned long zswap_max_pages(void) { return totalram_pages() * zswap_max_pool_percent / 100; } static unsigned long zswap_accept_thr_pages(void) { return zswap_max_pages() * zswap_accept_thr_percent / 100; } unsigned long zswap_total_pages(void) { struct zswap_pool *pool; unsigned long total = 0; rcu_read_lock(); list_for_each_entry_rcu(pool, &zswap_pools, list) total += zs_get_total_pages(pool->zs_pool); rcu_read_unlock(); return total; } static bool zswap_check_limits(void) { unsigned long cur_pages = zswap_total_pages(); unsigned long max_pages = zswap_max_pages(); if (cur_pages >= max_pages) { zswap_pool_limit_hit++; zswap_pool_reached_full = true; } else if (zswap_pool_reached_full && cur_pages <= zswap_accept_thr_pages()) { zswap_pool_reached_full = false; } return zswap_pool_reached_full; } /********************************* * param callbacks **********************************/ static int zswap_compressor_param_set(const char *val, const struct kernel_param *kp) { struct zswap_pool *pool, *put_pool = NULL; char *s = strstrip((char *)val); bool create_pool = false; int ret = 0; mutex_lock(&zswap_init_lock); switch (zswap_init_state) { case ZSWAP_UNINIT: /* Handled in zswap_setup() */ ret = param_set_charp(s, kp); break; case ZSWAP_INIT_SUCCEED: if (!zswap_has_pool || strcmp(s, *(char **)kp->arg)) create_pool = true; break; case ZSWAP_INIT_FAILED: pr_err("can't set param, initialization failed\n"); ret = -ENODEV; } mutex_unlock(&zswap_init_lock); if (!create_pool) return ret; if (!crypto_has_acomp(s, 0, 0)) { pr_err("compressor %s not available\n", s); return -ENOENT; } spin_lock_bh(&zswap_pools_lock); pool = zswap_pool_find_get(s); if (pool) { zswap_pool_debug("using existing", pool); WARN_ON(pool == zswap_pool_current()); list_del_rcu(&pool->list); } spin_unlock_bh(&zswap_pools_lock); if (!pool) pool = zswap_pool_create(s); else { /* * Restore the initial ref dropped by percpu_ref_kill() * when the pool was decommissioned and switch it again * to percpu mode. */ percpu_ref_resurrect(&pool->ref); /* Drop the ref from zswap_pool_find_get(). 
*/ zswap_pool_put(pool); } if (pool) ret = param_set_charp(s, kp); else ret = -EINVAL; spin_lock_bh(&zswap_pools_lock); if (!ret) { put_pool = zswap_pool_current(); list_add_rcu(&pool->list, &zswap_pools); zswap_has_pool = true; } else if (pool) { /* * Add the possibly pre-existing pool to the end of the pools * list; if it's new (and empty) then it'll be removed and * destroyed by the put after we drop the lock */ list_add_tail_rcu(&pool->list, &zswap_pools); put_pool = pool; } spin_unlock_bh(&zswap_pools_lock); /* * Drop the ref from either the old current pool, * or the new pool we failed to add */ if (put_pool) percpu_ref_kill(&put_pool->ref); return ret; } static int zswap_enabled_param_set(const char *val, const struct kernel_param *kp) { int ret = -ENODEV; /* if this is load-time (pre-init) param setting, only set param. */ if (system_state != SYSTEM_RUNNING) return param_set_bool(val, kp); mutex_lock(&zswap_init_lock); switch (zswap_init_state) { case ZSWAP_UNINIT: if (zswap_setup()) break; fallthrough; case ZSWAP_INIT_SUCCEED: if (!zswap_has_pool) pr_err("can't enable, no pool configured\n"); else ret = param_set_bool(val, kp); break; case ZSWAP_INIT_FAILED: pr_err("can't enable, initialization failed\n"); } mutex_unlock(&zswap_init_lock); return ret; } /********************************* * lru functions **********************************/ /* should be called under RCU */ #ifdef CONFIG_MEMCG static inline struct mem_cgroup *mem_cgroup_from_entry(struct zswap_entry *entry) { return entry->objcg ? obj_cgroup_memcg(entry->objcg) : NULL; } #else static inline struct mem_cgroup *mem_cgroup_from_entry(struct zswap_entry *entry) { return NULL; } #endif static inline int entry_to_nid(struct zswap_entry *entry) { return page_to_nid(virt_to_page(entry)); } static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry) { int nid = entry_to_nid(entry); struct mem_cgroup *memcg; /* * Note that it is safe to use rcu_read_lock() here, even in the face of * concurrent memcg offlining: * * 1. list_lru_add() is called before list_lru_one is dead. The * new entry will be reparented to memcg's parent's list_lru. * 2. list_lru_add() is called after list_lru_one is dead. The * new entry will be added directly to memcg's parent's list_lru. * * Similar reasoning holds for list_lru_del(). */ rcu_read_lock(); memcg = mem_cgroup_from_entry(entry); /* will always succeed */ list_lru_add(list_lru, &entry->lru, nid, memcg); rcu_read_unlock(); } static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry) { int nid = entry_to_nid(entry); struct mem_cgroup *memcg; rcu_read_lock(); memcg = mem_cgroup_from_entry(entry); /* will always succeed */ list_lru_del(list_lru, &entry->lru, nid, memcg); rcu_read_unlock(); } void zswap_lruvec_state_init(struct lruvec *lruvec) { atomic_long_set(&lruvec->zswap_lruvec_state.nr_disk_swapins, 0); } void zswap_folio_swapin(struct folio *folio) { struct lruvec *lruvec; if (folio) { lruvec = folio_lruvec(folio); atomic_long_inc(&lruvec->zswap_lruvec_state.nr_disk_swapins); } } /* * This function should be called when a memcg is being offlined. * * Since the global shrinker shrink_worker() may hold a reference * of the memcg, we must check and release the reference in * zswap_next_shrink. * * shrink_worker() must handle the case where this function releases * the reference of memcg being shrunk. 
*/ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) { /* lock out zswap shrinker walking memcg tree */ spin_lock(&zswap_shrink_lock); if (zswap_next_shrink == memcg) { do { zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); } while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink)); } spin_unlock(&zswap_shrink_lock); } /********************************* * zswap entry functions **********************************/ static struct kmem_cache *zswap_entry_cache; static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid) { struct zswap_entry *entry; entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid); if (!entry) return NULL; return entry; } static void zswap_entry_cache_free(struct zswap_entry *entry) { kmem_cache_free(zswap_entry_cache, entry); } /* * Carries out the common pattern of freeing an entry's zsmalloc allocation, * freeing the entry itself, and decrementing the number of stored pages. */ static void zswap_entry_free(struct zswap_entry *entry) { zswap_lru_del(&zswap_list_lru, entry); zs_free(entry->pool->zs_pool, entry->handle); zswap_pool_put(entry->pool); if (entry->objcg) { obj_cgroup_uncharge_zswap(entry->objcg, entry->length); obj_cgroup_put(entry->objcg); } if (entry->length == PAGE_SIZE) atomic_long_dec(&zswap_stored_incompressible_pages); zswap_entry_cache_free(entry); atomic_long_dec(&zswap_stored_pages); } /********************************* * compressed storage functions **********************************/ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node) { struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); struct crypto_acomp *acomp = NULL; struct acomp_req *req = NULL; u8 *buffer = NULL; int ret; buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu)); if (!buffer) { ret = -ENOMEM; goto fail; } acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); if (IS_ERR(acomp)) { pr_err("could not alloc crypto acomp %s : %ld\n", pool->tfm_name, PTR_ERR(acomp)); ret = PTR_ERR(acomp); goto fail; } req = acomp_request_alloc(acomp); if (!req) { pr_err("could not alloc crypto acomp_request %s\n", pool->tfm_name); ret = -ENOMEM; goto fail; } /* * Only hold the mutex after completing allocations, otherwise we may * recurse into zswap through reclaim and attempt to hold the mutex * again resulting in a deadlock. */ mutex_lock(&acomp_ctx->mutex); crypto_init_wait(&acomp_ctx->wait); /* * if the backend of acomp is async zip, crypto_req_done() will wakeup * crypto_wait_req(); if the backend of acomp is scomp, the callback * won't be called, crypto_wait_req() will return without blocking. 
*/ acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, crypto_req_done, &acomp_ctx->wait); acomp_ctx->buffer = buffer; acomp_ctx->acomp = acomp; acomp_ctx->is_sleepable = acomp_is_async(acomp); acomp_ctx->req = req; mutex_unlock(&acomp_ctx->mutex); return 0; fail: if (!IS_ERR_OR_NULL(acomp)) crypto_free_acomp(acomp); kfree(buffer); return ret; } static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) { struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); struct acomp_req *req; struct crypto_acomp *acomp; u8 *buffer; if (IS_ERR_OR_NULL(acomp_ctx)) return 0; mutex_lock(&acomp_ctx->mutex); req = acomp_ctx->req; acomp = acomp_ctx->acomp; buffer = acomp_ctx->buffer; acomp_ctx->req = NULL; acomp_ctx->acomp = NULL; acomp_ctx->buffer = NULL; mutex_unlock(&acomp_ctx->mutex); /* * Do the actual freeing after releasing the mutex to avoid subtle * locking dependencies causing deadlocks. */ if (!IS_ERR_OR_NULL(req)) acomp_request_free(req); if (!IS_ERR_OR_NULL(acomp)) crypto_free_acomp(acomp); kfree(buffer); return 0; } static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) { struct crypto_acomp_ctx *acomp_ctx; for (;;) { acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); mutex_lock(&acomp_ctx->mutex); if (likely(acomp_ctx->req)) return acomp_ctx; /* * It is possible that we were migrated to a different CPU after * getting the per-CPU ctx but before the mutex was acquired. If * the old CPU got offlined, zswap_cpu_comp_dead() could have * already freed ctx->req (among other things) and set it to * NULL. Just try again on the new CPU that we ended up on. */ mutex_unlock(&acomp_ctx->mutex); } } static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx) { mutex_unlock(&acomp_ctx->mutex); } static bool zswap_compress(struct page *page, struct zswap_entry *entry, struct zswap_pool *pool) { struct crypto_acomp_ctx *acomp_ctx; struct scatterlist input, output; int comp_ret = 0, alloc_ret = 0; unsigned int dlen = PAGE_SIZE; unsigned long handle; gfp_t gfp; u8 *dst; bool mapped = false; acomp_ctx = acomp_ctx_get_cpu_lock(pool); dst = acomp_ctx->buffer; sg_init_table(&input, 1); sg_set_page(&input, page, PAGE_SIZE, 0); sg_init_one(&output, dst, PAGE_SIZE); acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen); /* * it maybe looks a little bit silly that we send an asynchronous request, * then wait for its completion synchronously. This makes the process look * synchronous in fact. * Theoretically, acomp supports users send multiple acomp requests in one * acomp instance, then get those requests done simultaneously. but in this * case, zswap actually does store and load page by page, there is no * existing method to send the second page before the first page is done * in one thread doing zswap. * but in different threads running on different cpu, we have different * acomp instance, so multiple threads can do (de)compression in parallel. */ comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait); dlen = acomp_ctx->req->dlen; /* * If a page cannot be compressed into a size smaller than PAGE_SIZE, * save the content as is without a compression, to keep the LRU order * of writebacks. If writeback is disabled, reject the page since it * only adds metadata overhead. swap_writeout() will put the page back * to the active LRU list in the case. 
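* Such entries keep length == PAGE_SIZE, which is what zswap_decompress() keys * off to copy the data back verbatim and what zswap_entry_free() uses to * maintain zswap_stored_incompressible_pages.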
*/ if (comp_ret || !dlen || dlen >= PAGE_SIZE) { if (!mem_cgroup_zswap_writeback_enabled( folio_memcg(page_folio(page)))) { comp_ret = comp_ret ? comp_ret : -EINVAL; goto unlock; } comp_ret = 0; dlen = PAGE_SIZE; dst = kmap_local_page(page); mapped = true; } gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE; handle = zs_malloc(pool->zs_pool, dlen, gfp, page_to_nid(page)); if (IS_ERR_VALUE(handle)) { alloc_ret = PTR_ERR((void *)handle); goto unlock; } zs_obj_write(pool->zs_pool, handle, dst, dlen); entry->handle = handle; entry->length = dlen; unlock: if (mapped) kunmap_local(dst); if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC) zswap_reject_compress_poor++; else if (comp_ret) zswap_reject_compress_fail++; else if (alloc_ret) zswap_reject_alloc_fail++; acomp_ctx_put_unlock(acomp_ctx); return comp_ret == 0 && alloc_ret == 0; } static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio) { struct zswap_pool *pool = entry->pool; struct scatterlist input, output; struct crypto_acomp_ctx *acomp_ctx; int decomp_ret = 0, dlen = PAGE_SIZE; u8 *src, *obj; acomp_ctx = acomp_ctx_get_cpu_lock(pool); obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer); /* zswap entries of length PAGE_SIZE are not compressed. */ if (entry->length == PAGE_SIZE) { memcpy_to_folio(folio, 0, obj, entry->length); goto read_done; } /* * zs_obj_read_begin() might return a kmap address of highmem when * acomp_ctx->buffer is not used. However, sg_init_one() does not * handle highmem addresses, so copy the object to acomp_ctx->buffer. */ if (virt_addr_valid(obj)) { src = obj; } else { WARN_ON_ONCE(obj == acomp_ctx->buffer); memcpy(acomp_ctx->buffer, obj, entry->length); src = acomp_ctx->buffer; } sg_init_one(&input, src, entry->length); sg_init_table(&output, 1); sg_set_folio(&output, folio, PAGE_SIZE, 0); acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE); decomp_ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait); dlen = acomp_ctx->req->dlen; read_done: zs_obj_read_end(pool->zs_pool, entry->handle, obj); acomp_ctx_put_unlock(acomp_ctx); if (!decomp_ret && dlen == PAGE_SIZE) return true; zswap_decompress_fail++; pr_alert_ratelimited("Decompression error from zswap (%d:%lu %s %u->%d)\n", swp_type(entry->swpentry), swp_offset(entry->swpentry), entry->pool->tfm_name, entry->length, dlen); return false; } /********************************* * writeback code **********************************/ /* * Attempts to free an entry by adding a folio to the swap cache, * decompressing the entry data into the folio, and issuing a * bio write to write the folio back to the swap device. * * This can be thought of as a "resumed writeback" of the folio * to the swap device. We are basically resuming the same swap * writeback path that was intercepted with the zswap_store() * in the first place. After the folio has been decompressed into * the swap cache, the compressed version stored by zswap can be * freed. 
*/ static int zswap_writeback_entry(struct zswap_entry *entry, swp_entry_t swpentry) { struct xarray *tree; pgoff_t offset = swp_offset(swpentry); struct folio *folio; struct mempolicy *mpol; bool folio_was_allocated; struct swap_info_struct *si; int ret = 0; /* try to allocate swap cache folio */ si = get_swap_device(swpentry); if (!si) return -EEXIST; mpol = get_task_policy(current); folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol, NO_INTERLEAVE_INDEX, &folio_was_allocated, true); put_swap_device(si); if (!folio) return -ENOMEM; /* * Found an existing folio, we raced with swapin or concurrent * shrinker. We generally writeback cold folios from zswap, and * swapin means the folio just became hot, so skip this folio. * For unlikely concurrent shrinker case, it will be unlinked * and freed when invalidated by the concurrent shrinker anyway. */ if (!folio_was_allocated) { ret = -EEXIST; goto out; } /* * folio is locked, and the swapcache is now secured against * concurrent swapping to and from the slot, and concurrent * swapoff so we can safely dereference the zswap tree here. * Verify that the swap entry hasn't been invalidated and recycled * behind our backs, to avoid overwriting a new swap folio with * old compressed data. Only when this is successful can the entry * be dereferenced. */ tree = swap_zswap_tree(swpentry); if (entry != xa_load(tree, offset)) { ret = -ENOMEM; goto out; } if (!zswap_decompress(entry, folio)) { ret = -EIO; goto out; } xa_erase(tree, offset); count_vm_event(ZSWPWB); if (entry->objcg) count_objcg_events(entry->objcg, ZSWPWB, 1); zswap_entry_free(entry); /* folio is up to date */ folio_mark_uptodate(folio); /* move it to the tail of the inactive list after end_writeback */ folio_set_reclaim(folio); /* start writeback */ __swap_writepage(folio, NULL); out: if (ret && ret != -EEXIST) { swap_cache_del_folio(folio); folio_unlock(folio); } folio_put(folio); return ret; } /********************************* * shrinker functions **********************************/ /* * The dynamic shrinker is modulated by the following factors: * * 1. Each zswap entry has a referenced bit, which the shrinker unsets (giving * the entry a second chance) before rotating it in the LRU list. If the * entry is considered again by the shrinker, with its referenced bit unset, * it is written back. The writeback rate as a result is dynamically * adjusted by the pool activities - if the pool is dominated by new entries * (i.e lots of recent zswapouts), these entries will be protected and * the writeback rate will slow down. On the other hand, if the pool has a * lot of stagnant entries, these entries will be reclaimed immediately, * effectively increasing the writeback rate. * * 2. Swapins counter: If we observe swapins, it is a sign that we are * overshrinking and should slow down. We maintain a swapins counter, which * is consumed and subtract from the number of eligible objects on the LRU * in zswap_shrinker_count(). * * 3. Compression ratio. The better the workload compresses, the less gains we * can expect from writeback. We scale down the number of objects available * for reclaim by this ratio. 
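 * * Illustrative numbers (not measured from any workload): with 1000 freeable * entries, 250 pages of compressed backing memory, 1000 stored pages (a 4:1 * compression ratio) and no recent disk swapins, zswap_shrinker_count() * reports mult_frac(1000, 250, 1000) = 250 objects to the shrinker core.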
*/ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l, void *arg) { struct zswap_entry *entry = container_of(item, struct zswap_entry, lru); bool *encountered_page_in_swapcache = (bool *)arg; swp_entry_t swpentry; enum lru_status ret = LRU_REMOVED_RETRY; int writeback_result; /* * Second chance algorithm: if the entry has its referenced bit set, give it * a second chance. Only clear the referenced bit and rotate it in the * zswap's LRU list. */ if (entry->referenced) { entry->referenced = false; return LRU_ROTATE; } /* * As soon as we drop the LRU lock, the entry can be freed by * a concurrent invalidation. This means the following: * * 1. We extract the swp_entry_t to the stack, allowing * zswap_writeback_entry() to pin the swap entry and * then validate the zswap entry against that swap entry's * tree using pointer value comparison. Only when that * is successful can the entry be dereferenced. * * 2. Usually, objects are taken off the LRU for reclaim. In * this case this isn't possible, because if reclaim fails * for whatever reason, we have no means of knowing if the * entry is alive to put it back on the LRU. * * So rotate it before dropping the lock. If the entry is * written back or invalidated, the free path will unlink * it. For failures, rotation is the right thing as well. * * Temporary failures, where the same entry should be tried * again immediately, almost never happen for this shrinker. * We don't do any trylocking; -ENOMEM comes closest, * but that's extremely rare and doesn't happen spuriously * either. Don't bother distinguishing this case. */ list_move_tail(item, &l->list); /* * Once the lru lock is dropped, the entry might get freed. The * swpentry is copied to the stack, and entry isn't deref'd again * until the entry is verified to still be alive in the tree. */ swpentry = entry->swpentry; /* * It's safe to drop the lock here because we return either * LRU_REMOVED_RETRY, LRU_RETRY or LRU_STOP. */ spin_unlock(&l->lock); writeback_result = zswap_writeback_entry(entry, swpentry); if (writeback_result) { zswap_reject_reclaim_fail++; ret = LRU_RETRY; /* * Encountering a page already in swap cache is a sign that we are shrinking * into the warmer region. We should terminate shrinking (if we're in the dynamic * shrinker context). */ if (writeback_result == -EEXIST && encountered_page_in_swapcache) { ret = LRU_STOP; *encountered_page_in_swapcache = true; } } else { zswap_written_back_pages++; } return ret; } static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc) { unsigned long shrink_ret; bool encountered_page_in_swapcache = false; if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(sc->memcg)) { sc->nr_scanned = 0; return SHRINK_STOP; } shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb, &encountered_page_in_swapcache); if (encountered_page_in_swapcache) return SHRINK_STOP; return shrink_ret ? shrink_ret : SHRINK_STOP; } static unsigned long zswap_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc) { struct mem_cgroup *memcg = sc->memcg; struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid)); atomic_long_t *nr_disk_swapins = &lruvec->zswap_lruvec_state.nr_disk_swapins; unsigned long nr_backing, nr_stored, nr_freeable, nr_disk_swapins_cur, nr_remain; if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg)) return 0; /* * The shrinker resumes swap writeback, which will enter block * and may enter fs. 
XXX: Harmonize with vmscan.c __GFP_FS * rules (may_enter_fs()), which apply on a per-folio basis. */ if (!gfp_has_io_fs(sc->gfp_mask)) return 0; /* * For memcg, use the cgroup-wide ZSWAP stats since we don't * have them per-node and thus per-lruvec. Careful if memcg is * runtime-disabled: we can get sc->memcg == NULL, which is ok * for the lruvec, but not for memcg_page_state(). * * Without memcg, use the zswap pool-wide metrics. */ if (!mem_cgroup_disabled()) { mem_cgroup_flush_stats(memcg); nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT; nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED); } else { nr_backing = zswap_total_pages(); nr_stored = atomic_long_read(&zswap_stored_pages); } if (!nr_stored) return 0; nr_freeable = list_lru_shrink_count(&zswap_list_lru, sc); if (!nr_freeable) return 0; /* * Subtract from the lru size the number of pages that are recently swapped * in from disk. The idea is that had we protect the zswap's LRU by this * amount of pages, these disk swapins would not have happened. */ nr_disk_swapins_cur = atomic_long_read(nr_disk_swapins); do { if (nr_freeable >= nr_disk_swapins_cur) nr_remain = 0; else nr_remain = nr_disk_swapins_cur - nr_freeable; } while (!atomic_long_try_cmpxchg( nr_disk_swapins, &nr_disk_swapins_cur, nr_remain)); nr_freeable -= nr_disk_swapins_cur - nr_remain; if (!nr_freeable) return 0; /* * Scale the number of freeable pages by the memory saving factor. * This ensures that the better zswap compresses memory, the fewer * pages we will evict to swap (as it will otherwise incur IO for * relatively small memory saving). */ return mult_frac(nr_freeable, nr_backing, nr_stored); } static struct shrinker *zswap_alloc_shrinker(void) { struct shrinker *shrinker; shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE, "mm-zswap"); if (!shrinker) return NULL; shrinker->scan_objects = zswap_shrinker_scan; shrinker->count_objects = zswap_shrinker_count; shrinker->batch = 0; shrinker->seeks = DEFAULT_SEEKS; return shrinker; } static int shrink_memcg(struct mem_cgroup *memcg) { int nid, shrunk = 0, scanned = 0; if (!mem_cgroup_zswap_writeback_enabled(memcg)) return -ENOENT; /* * Skip zombies because their LRUs are reparented and we would be * reclaiming from the parent instead of the dead memcg. */ if (memcg && !mem_cgroup_online(memcg)) return -ENOENT; for_each_node_state(nid, N_NORMAL_MEMORY) { unsigned long nr_to_walk = 1; shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg, &shrink_memcg_cb, NULL, &nr_to_walk); scanned += 1 - nr_to_walk; } if (!scanned) return -ENOENT; return shrunk ? 0 : -EAGAIN; } static void shrink_worker(struct work_struct *w) { struct mem_cgroup *memcg; int ret, failures = 0, attempts = 0; unsigned long thr; /* Reclaim down to the accept threshold */ thr = zswap_accept_thr_pages(); /* * Global reclaim will select cgroup in a round-robin fashion from all * online memcgs, but memcgs that have no pages in zswap and * writeback-disabled memcgs (memory.zswap.writeback=0) are not * candidates for shrinking. * * Shrinking will be aborted if we encounter the following * MAX_RECLAIM_RETRIES times: * - No writeback-candidate memcgs found in a memcg tree walk. * - Shrinking a writeback-candidate memcg failed. * * We save iteration cursor memcg into zswap_next_shrink, * which can be modified by the offline memcg cleaner * zswap_memcg_offline_cleanup(). * * Since the offline cleaner is called only once, we cannot leave an * offline memcg reference in zswap_next_shrink. 
* We can rely on the cleaner only if we get online memcg under lock. * * If we get an offline memcg, we cannot determine if the cleaner has * already been called or will be called later. We must put back the * reference before returning from this function. Otherwise, the * offline memcg left in zswap_next_shrink will hold the reference * until the next run of shrink_worker(). */ do { /* * Start shrinking from the next memcg after zswap_next_shrink. * When the offline cleaner has already advanced the cursor, * advancing the cursor here overlooks one memcg, but this * should be negligibly rare. * * If we get an online memcg, keep the extra reference in case * the original one obtained by mem_cgroup_iter() is dropped by * zswap_memcg_offline_cleanup() while we are shrinking the * memcg. */ spin_lock(&zswap_shrink_lock); do { memcg = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); zswap_next_shrink = memcg; } while (memcg && !mem_cgroup_tryget_online(memcg)); spin_unlock(&zswap_shrink_lock); if (!memcg) { /* * Continue shrinking without incrementing failures if * we found candidate memcgs in the last tree walk. */ if (!attempts && ++failures == MAX_RECLAIM_RETRIES) break; attempts = 0; goto resched; } ret = shrink_memcg(memcg); /* drop the extra reference */ mem_cgroup_put(memcg); /* * There are no writeback-candidate pages in the memcg. * This is not an issue as long as we can find another memcg * with pages in zswap. Skip this without incrementing attempts * and failures. */ if (ret == -ENOENT) continue; ++attempts; if (ret && ++failures == MAX_RECLAIM_RETRIES) break; resched: cond_resched(); } while (zswap_total_pages() > thr); } /********************************* * main API **********************************/ static bool zswap_store_page(struct page *page, struct obj_cgroup *objcg, struct zswap_pool *pool) { swp_entry_t page_swpentry = page_swap_entry(page); struct zswap_entry *entry, *old; /* allocate entry */ entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page)); if (!entry) { zswap_reject_kmemcache_fail++; return false; } if (!zswap_compress(page, entry, pool)) goto compress_failed; old = xa_store(swap_zswap_tree(page_swpentry), swp_offset(page_swpentry), entry, GFP_KERNEL); if (xa_is_err(old)) { int err = xa_err(old); WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err); zswap_reject_alloc_fail++; goto store_failed; } /* * We may have had an existing entry that became stale when * the folio was redirtied and now the new version is being * swapped out. Get rid of the old. */ if (old) zswap_entry_free(old); /* * The entry is successfully compressed and stored in the tree, there is * no further possibility of failure. Grab refs to the pool and objcg, * charge zswap memory, and increment zswap_stored_pages. * The opposite actions will be performed by zswap_entry_free() * when the entry is removed from the tree. */ zswap_pool_get(pool); if (objcg) { obj_cgroup_get(objcg); obj_cgroup_charge_zswap(objcg, entry->length); } atomic_long_inc(&zswap_stored_pages); if (entry->length == PAGE_SIZE) atomic_long_inc(&zswap_stored_incompressible_pages); /* * We finish initializing the entry while it's already in xarray. * This is safe because: * * 1. Concurrent stores and invalidations are excluded by folio lock. * * 2. Writeback is excluded by the entry not being on the LRU yet. * The publishing order matters to prevent writeback from seeing * an incoherent entry. 
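 * * The zswap_lru_add() call below is therefore the final publishing step: only * after it can the shrinker find the entry and consider it for writeback.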
*/ entry->pool = pool; entry->swpentry = page_swpentry; entry->objcg = objcg; entry->referenced = true; if (entry->length) { INIT_LIST_HEAD(&entry->lru); zswap_lru_add(&zswap_list_lru, entry); } return true; store_failed: zs_free(pool->zs_pool, entry->handle); compress_failed: zswap_entry_cache_free(entry); return false; } bool zswap_store(struct folio *folio) { long nr_pages = folio_nr_pages(folio); swp_entry_t swp = folio->swap; struct obj_cgroup *objcg = NULL; struct mem_cgroup *memcg = NULL; struct zswap_pool *pool; bool ret = false; long index; VM_WARN_ON_ONCE(!folio_test_locked(folio)); VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); if (!zswap_enabled) goto check_old; objcg = get_obj_cgroup_from_folio(folio); if (objcg && !obj_cgroup_may_zswap(objcg)) { memcg = get_mem_cgroup_from_objcg(objcg); if (shrink_memcg(memcg)) { mem_cgroup_put(memcg); goto put_objcg; } mem_cgroup_put(memcg); } if (zswap_check_limits()) goto put_objcg; pool = zswap_pool_current_get(); if (!pool) goto put_objcg; if (objcg) { memcg = get_mem_cgroup_from_objcg(objcg); if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) { mem_cgroup_put(memcg); goto put_pool; } mem_cgroup_put(memcg); } for (index = 0; index < nr_pages; ++index) { struct page *page = folio_page(folio, index); if (!zswap_store_page(page, objcg, pool)) goto put_pool; } if (objcg) count_objcg_events(objcg, ZSWPOUT, nr_pages); count_vm_events(ZSWPOUT, nr_pages); ret = true; put_pool: zswap_pool_put(pool); put_objcg: obj_cgroup_put(objcg); if (!ret && zswap_pool_reached_full) queue_work(shrink_wq, &zswap_shrink_work); check_old: /* * If the zswap store fails or zswap is disabled, we must invalidate * the possibly stale entries which were previously stored at the * offsets corresponding to each page of the folio. Otherwise, * writeback could overwrite the new data in the swapfile. */ if (!ret) { unsigned type = swp_type(swp); pgoff_t offset = swp_offset(swp); struct zswap_entry *entry; struct xarray *tree; for (index = 0; index < nr_pages; ++index) { tree = swap_zswap_tree(swp_entry(type, offset + index)); entry = xa_erase(tree, offset + index); if (entry) zswap_entry_free(entry); } } return ret; } /** * zswap_load() - load a folio from zswap * @folio: folio to load * * Return: 0 on success, with the folio unlocked and marked up-to-date, or one * of the following error codes: * * -EIO: if the swapped out content was in zswap, but could not be loaded * into the page due to a decompression failure. The folio is unlocked, but * NOT marked up-to-date, so that an IO error is emitted (e.g. do_swap_page() * will SIGBUS). * * -EINVAL: if the swapped out content was in zswap, but the page belongs * to a large folio, which is not supported by zswap. The folio is unlocked, * but NOT marked up-to-date, so that an IO error is emitted (e.g. * do_swap_page() will SIGBUS). * * -ENOENT: if the swapped out content was not in zswap. The folio remains * locked on return. */ int zswap_load(struct folio *folio) { swp_entry_t swp = folio->swap; pgoff_t offset = swp_offset(swp); bool swapcache = folio_test_swapcache(folio); struct xarray *tree = swap_zswap_tree(swp); struct zswap_entry *entry; VM_WARN_ON_ONCE(!folio_test_locked(folio)); if (zswap_never_enabled()) return -ENOENT; /* * Large folios should not be swapped in while zswap is being used, as * they are not properly handled. Zswap does not properly load large * folios, and a large folio may only be partially in zswap. 
*/ if (WARN_ON_ONCE(folio_test_large(folio))) { folio_unlock(folio); return -EINVAL; } entry = xa_load(tree, offset); if (!entry) return -ENOENT; if (!zswap_decompress(entry, folio)) { folio_unlock(folio); return -EIO; } folio_mark_uptodate(folio); count_vm_event(ZSWPIN); if (entry->objcg) count_objcg_events(entry->objcg, ZSWPIN, 1); /* * When reading into the swapcache, invalidate our entry. The * swapcache can be the authoritative owner of the page and * its mappings, and the pressure that results from having two * in-memory copies outweighs any benefits of caching the * compression work. * * (Most swapins go through the swapcache. The notable * exception is the singleton fault on SWP_SYNCHRONOUS_IO * files, which reads into a private page and may free it if * the fault fails. We remain the primary owner of the entry.) */ if (swapcache) { folio_mark_dirty(folio); xa_erase(tree, offset); zswap_entry_free(entry); } folio_unlock(folio); return 0; } void zswap_invalidate(swp_entry_t swp) { pgoff_t offset = swp_offset(swp); struct xarray *tree = swap_zswap_tree(swp); struct zswap_entry *entry; if (xa_empty(tree)) return; entry = xa_erase(tree, offset); if (entry) zswap_entry_free(entry); } int zswap_swapon(int type, unsigned long nr_pages) { struct xarray *trees, *tree; unsigned int nr, i; nr = DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES); trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL); if (!trees) { pr_err("alloc failed, zswap disabled for swap type %d\n", type); return -ENOMEM; } for (i = 0; i < nr; i++) xa_init(trees + i); nr_zswap_trees[type] = nr; zswap_trees[type] = trees; return 0; } void zswap_swapoff(int type) { struct xarray *trees = zswap_trees[type]; unsigned int i; if (!trees) return; /* try_to_unuse() invalidated all the entries already */ for (i = 0; i < nr_zswap_trees[type]; i++) WARN_ON_ONCE(!xa_empty(trees + i)); kvfree(trees); nr_zswap_trees[type] = 0; zswap_trees[type] = NULL; } /********************************* * debugfs functions **********************************/ #ifdef CONFIG_DEBUG_FS #include <linux/debugfs.h> static struct dentry *zswap_debugfs_root; static int debugfs_get_total_size(void *data, u64 *val) { *val = zswap_total_pages() * PAGE_SIZE; return 0; } DEFINE_DEBUGFS_ATTRIBUTE(total_size_fops, debugfs_get_total_size, NULL, "%llu\n"); static int debugfs_get_stored_pages(void *data, u64 *val) { *val = atomic_long_read(&zswap_stored_pages); return 0; } DEFINE_DEBUGFS_ATTRIBUTE(stored_pages_fops, debugfs_get_stored_pages, NULL, "%llu\n"); static int debugfs_get_stored_incompressible_pages(void *data, u64 *val) { *val = atomic_long_read(&zswap_stored_incompressible_pages); return 0; } DEFINE_DEBUGFS_ATTRIBUTE(stored_incompressible_pages_fops, debugfs_get_stored_incompressible_pages, NULL, "%llu\n"); static int zswap_debugfs_init(void) { if (!debugfs_initialized()) return -ENODEV; zswap_debugfs_root = debugfs_create_dir("zswap", NULL); debugfs_create_u64("pool_limit_hit", 0444, zswap_debugfs_root, &zswap_pool_limit_hit); debugfs_create_u64("reject_reclaim_fail", 0444, zswap_debugfs_root, &zswap_reject_reclaim_fail); debugfs_create_u64("reject_alloc_fail", 0444, zswap_debugfs_root, &zswap_reject_alloc_fail); debugfs_create_u64("reject_kmemcache_fail", 0444, zswap_debugfs_root, &zswap_reject_kmemcache_fail); debugfs_create_u64("reject_compress_fail", 0444, zswap_debugfs_root, &zswap_reject_compress_fail); debugfs_create_u64("reject_compress_poor", 0444, zswap_debugfs_root, &zswap_reject_compress_poor); debugfs_create_u64("decompress_fail", 0444, 
zswap_debugfs_root, &zswap_decompress_fail); debugfs_create_u64("written_back_pages", 0444, zswap_debugfs_root, &zswap_written_back_pages); debugfs_create_file("pool_total_size", 0444, zswap_debugfs_root, NULL, &total_size_fops); debugfs_create_file("stored_pages", 0444, zswap_debugfs_root, NULL, &stored_pages_fops); debugfs_create_file("stored_incompressible_pages", 0444, zswap_debugfs_root, NULL, &stored_incompressible_pages_fops); return 0; } #else static int zswap_debugfs_init(void) { return 0; } #endif /********************************* * module init and exit **********************************/ static int zswap_setup(void) { struct zswap_pool *pool; int ret; zswap_entry_cache = KMEM_CACHE(zswap_entry, 0); if (!zswap_entry_cache) { pr_err("entry cache creation failed\n"); goto cache_fail; } ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE, "mm/zswap_pool:prepare", zswap_cpu_comp_prepare, zswap_cpu_comp_dead); if (ret) goto hp_fail; shrink_wq = alloc_workqueue("zswap-shrink", WQ_UNBOUND|WQ_MEM_RECLAIM, 1); if (!shrink_wq) goto shrink_wq_fail; zswap_shrinker = zswap_alloc_shrinker(); if (!zswap_shrinker) goto shrinker_fail; if (list_lru_init_memcg(&zswap_list_lru, zswap_shrinker)) goto lru_fail; shrinker_register(zswap_shrinker); INIT_WORK(&zswap_shrink_work, shrink_worker); pool = __zswap_pool_create_fallback(); if (pool) { pr_info("loaded using pool %s\n", pool->tfm_name); list_add(&pool->list, &zswap_pools); zswap_has_pool = true; static_branch_enable(&zswap_ever_enabled); } else { pr_err("pool creation failed\n"); zswap_enabled = false; } if (zswap_debugfs_init()) pr_warn("debugfs initialization failed\n"); zswap_init_state = ZSWAP_INIT_SUCCEED; return 0; lru_fail: shrinker_free(zswap_shrinker); shrinker_fail: destroy_workqueue(shrink_wq); shrink_wq_fail: cpuhp_remove_multi_state(CPUHP_MM_ZSWP_POOL_PREPARE); hp_fail: kmem_cache_destroy(zswap_entry_cache); cache_fail: /* if built-in, we aren't unloaded on failure; don't allow use */ zswap_init_state = ZSWAP_INIT_FAILED; zswap_enabled = false; return -ENOMEM; } static int __init zswap_init(void) { if (!zswap_enabled) return 0; return zswap_setup(); } /* must be late so crypto has time to come up */ late_initcall(zswap_init); MODULE_AUTHOR("Seth Jennings <sjennings@variantweb.net>"); MODULE_DESCRIPTION("Compressed cache for swap pages"); |
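/*
 * Illustrative user-space sketch (not part of this file or of the kernel
 * build): one way a monitoring tool might read the interfaces set up above.
 * It assumes CONFIG_ZSWAP and CONFIG_DEBUG_FS are enabled and that debugfs
 * is mounted at /sys/kernel/debug; error handling is intentionally minimal.
 *
 *	#include <stdio.h>
 *
 *	static void print_file(const char *path)
 *	{
 *		char buf[64];
 *		FILE *f = fopen(path, "r");
 *
 *		// Print the single line these files contain, if readable.
 *		if (f && fgets(buf, sizeof(buf), f))
 *			printf("%s: %s", path, buf);
 *		if (f)
 *			fclose(f);
 *	}
 *
 *	int main(void)
 *	{
 *		// "Y" or "N": the zswap.enabled module parameter.
 *		print_file("/sys/module/zswap/parameters/enabled");
 *		// Counters created by zswap_debugfs_init() above.
 *		print_file("/sys/kernel/debug/zswap/stored_pages");
 *		print_file("/sys/kernel/debug/zswap/written_back_pages");
 *		return 0;
 *	}
 */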
// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2017 Covalent IO, Inc. http://covalent.io */ /* A devmap's primary use is as a backend map for the XDP BPF helper call * bpf_redirect_map(). Because XDP is mostly concerned with performance we * spent some effort to ensure the datapath with redirect maps does not use * any locking. This is a quick note on the details. * * We have three possible paths to get into the devmap control plane: bpf * syscalls, bpf programs, and driver side xmit/flush operations. A bpf syscall * will invoke an update, delete, or lookup operation. To ensure updates and * deletes appear atomic from the datapath side, xchg() is used to modify the * netdev_map array. Then, because the datapath does a lookup into the netdev_map * array (read-only) from an RCU critical section, we use call_rcu() to wait for * an rcu grace period before freeing the old data structures. This ensures the * datapath always has a valid copy. However, the datapath does a "flush" * operation that pushes any pending packets in the driver outside the RCU * critical section. Each bpf_dtab_netdev tracks these pending operations using * a per-cpu flush list. The bpf_dtab_netdev object will not be destroyed until * this list is empty, indicating outstanding flush operations have completed. * * BPF syscalls may race with BPF program calls on any of the update, delete * or lookup operations. As noted above, the xchg() operation also keeps the * netdev_map consistent in this case. From the devmap side, BPF programs * calling into these operations are the same as multiple user space threads * making system calls. * * Finally, any of the above may race with a netdev_unregister notifier. The * unregister notifier must search for net devices in the map structure that * contain a reference to the net device and remove them. This is a two step * process: (a) dereference the bpf_dtab_netdev object in netdev_map and (b) * check to see if the ifindex is the same as the net_device being removed. * When removing the dev, a cmpxchg() is used to ensure the correct dev is * removed; in the case of a concurrent update or delete operation it is * possible that the initially referenced dev is no longer in the map.
As the * notifier hook walks the map we know that new dev references can not be * added by the user because core infrastructure ensures dev_get_by_index() * calls will fail at this point. * * The devmap_hash type is a map type which interprets keys as ifindexes and * indexes these using a hashmap. This allows maps that use ifindex as key to be * densely packed instead of having holes in the lookup array for unused * ifindexes. The setup and packet enqueue/send code is shared between the two * types of devmap; only the lookup and insertion is different. */ #include <linux/bpf.h> #include <net/xdp.h> #include <linux/filter.h> #include <trace/events/xdp.h> #include <linux/btf_ids.h> #define DEV_CREATE_FLAG_MASK \ (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY) struct xdp_dev_bulk_queue { struct xdp_frame *q[DEV_MAP_BULK_SIZE]; struct list_head flush_node; struct net_device *dev; struct net_device *dev_rx; struct bpf_prog *xdp_prog; unsigned int count; }; struct bpf_dtab_netdev { struct net_device *dev; /* must be first member, due to tracepoint */ struct hlist_node index_hlist; struct bpf_prog *xdp_prog; struct rcu_head rcu; unsigned int idx; struct bpf_devmap_val val; }; struct bpf_dtab { struct bpf_map map; struct bpf_dtab_netdev __rcu **netdev_map; /* DEVMAP type only */ struct list_head list; /* these are only used for DEVMAP_HASH type maps */ struct hlist_head *dev_index_head; spinlock_t index_lock; unsigned int items; u32 n_buckets; }; static DEFINE_SPINLOCK(dev_map_lock); static LIST_HEAD(dev_map_list); static struct hlist_head *dev_map_create_hash(unsigned int entries, int numa_node) { int i; struct hlist_head *hash; hash = bpf_map_area_alloc((u64) entries * sizeof(*hash), numa_node); if (hash != NULL) for (i = 0; i < entries; i++) INIT_HLIST_HEAD(&hash[i]); return hash; } static inline struct hlist_head *dev_map_index_hash(struct bpf_dtab *dtab, int idx) { return &dtab->dev_index_head[idx & (dtab->n_buckets - 1)]; } static int dev_map_alloc_check(union bpf_attr *attr) { u32 valsize = attr->value_size; /* check sanity of attributes. 
2 value sizes supported: * 4 bytes: ifindex * 8 bytes: ifindex + prog fd */ if (attr->max_entries == 0 || attr->key_size != 4 || (valsize != offsetofend(struct bpf_devmap_val, ifindex) && valsize != offsetofend(struct bpf_devmap_val, bpf_prog.fd)) || attr->map_flags & ~DEV_CREATE_FLAG_MASK) return -EINVAL; if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) { /* Hash table size must be power of 2; roundup_pow_of_two() * can overflow into UB on 32-bit arches */ if (attr->max_entries > 1UL << 31) return -EINVAL; } return 0; } static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr) { /* Lookup returns a pointer straight to dev->ifindex, so make sure the * verifier prevents writes from the BPF side */ attr->map_flags |= BPF_F_RDONLY_PROG; bpf_map_init_from_attr(&dtab->map, attr); if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) { /* Hash table size must be power of 2 */ dtab->n_buckets = roundup_pow_of_two(dtab->map.max_entries); dtab->dev_index_head = dev_map_create_hash(dtab->n_buckets, dtab->map.numa_node); if (!dtab->dev_index_head) return -ENOMEM; spin_lock_init(&dtab->index_lock); } else { dtab->netdev_map = bpf_map_area_alloc((u64) dtab->map.max_entries * sizeof(struct bpf_dtab_netdev *), dtab->map.numa_node); if (!dtab->netdev_map) return -ENOMEM; } return 0; } static struct bpf_map *dev_map_alloc(union bpf_attr *attr) { struct bpf_dtab *dtab; int err; dtab = bpf_map_area_alloc(sizeof(*dtab), NUMA_NO_NODE); if (!dtab) return ERR_PTR(-ENOMEM); err = dev_map_init_map(dtab, attr); if (err) { bpf_map_area_free(dtab); return ERR_PTR(err); } spin_lock(&dev_map_lock); list_add_tail_rcu(&dtab->list, &dev_map_list); spin_unlock(&dev_map_lock); return &dtab->map; } static void dev_map_free(struct bpf_map *map) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); u32 i; /* At this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0, * so the programs (can be more than one that used this map) were * disconnected from events. The following synchronize_rcu() guarantees * both rcu read critical sections complete and waits for * preempt-disable regions (NAPI being the relevant context here) so we * are certain there will be no further reads against the netdev_map and * all flush operations are complete. Flush operations can only be done * from NAPI context for this reason. */ spin_lock(&dev_map_lock); list_del_rcu(&dtab->list); spin_unlock(&dev_map_lock); /* bpf_redirect_info->map is assigned in __bpf_xdp_redirect_map() * during NAPI callback and cleared after the XDP redirect. There is no * explicit RCU read section which protects bpf_redirect_info->map but * local_bh_disable() also marks the beginning an RCU section. This * makes the complete softirq callback RCU protected. Thus after * following synchronize_rcu() there no bpf_redirect_info->map == map * assignment. */ synchronize_rcu(); /* Make sure prior __dev_map_entry_free() have completed. 
*/ rcu_barrier(); if (dtab->map.map_type == BPF_MAP_TYPE_DEVMAP_HASH) { for (i = 0; i < dtab->n_buckets; i++) { struct bpf_dtab_netdev *dev; struct hlist_head *head; struct hlist_node *next; head = dev_map_index_hash(dtab, i); hlist_for_each_entry_safe(dev, next, head, index_hlist) { hlist_del_rcu(&dev->index_hlist); if (dev->xdp_prog) bpf_prog_put(dev->xdp_prog); dev_put(dev->dev); kfree(dev); } } bpf_map_area_free(dtab->dev_index_head); } else { for (i = 0; i < dtab->map.max_entries; i++) { struct bpf_dtab_netdev *dev; dev = rcu_dereference_raw(dtab->netdev_map[i]); if (!dev) continue; if (dev->xdp_prog) bpf_prog_put(dev->xdp_prog); dev_put(dev->dev); kfree(dev); } bpf_map_area_free(dtab->netdev_map); } bpf_map_area_free(dtab); } static int dev_map_get_next_key(struct bpf_map *map, void *key, void *next_key) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); u32 index = key ? *(u32 *)key : U32_MAX; u32 *next = next_key; if (index >= dtab->map.max_entries) { *next = 0; return 0; } if (index == dtab->map.max_entries - 1) return -ENOENT; *next = index + 1; return 0; } /* Elements are kept alive by RCU; either by rcu_read_lock() (from syscall) or * by local_bh_disable() (from XDP calls inside NAPI). The * rcu_read_lock_bh_held() below makes lockdep accept both. */ static void *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 key) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); struct hlist_head *head = dev_map_index_hash(dtab, key); struct bpf_dtab_netdev *dev; hlist_for_each_entry_rcu(dev, head, index_hlist, lockdep_is_held(&dtab->index_lock)) if (dev->idx == key) return dev; return NULL; } static int dev_map_hash_get_next_key(struct bpf_map *map, void *key, void *next_key) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); u32 idx, *next = next_key; struct bpf_dtab_netdev *dev, *next_dev; struct hlist_head *head; int i = 0; if (!key) goto find_first; idx = *(u32 *)key; dev = __dev_map_hash_lookup_elem(map, idx); if (!dev) goto find_first; next_dev = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(&dev->index_hlist)), struct bpf_dtab_netdev, index_hlist); if (next_dev) { *next = next_dev->idx; return 0; } i = idx & (dtab->n_buckets - 1); i++; find_first: for (; i < dtab->n_buckets; i++) { head = dev_map_index_hash(dtab, i); next_dev = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)), struct bpf_dtab_netdev, index_hlist); if (next_dev) { *next = next_dev->idx; return 0; } } return -ENOENT; } static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog, struct xdp_frame **frames, int n, struct net_device *tx_dev, struct net_device *rx_dev) { struct xdp_txq_info txq = { .dev = tx_dev }; struct xdp_rxq_info rxq = { .dev = rx_dev }; struct xdp_buff xdp; int i, nframes = 0; for (i = 0; i < n; i++) { struct xdp_frame *xdpf = frames[i]; u32 act; int err; xdp_convert_frame_to_buff(xdpf, &xdp); xdp.txq = &txq; xdp.rxq = &rxq; act = bpf_prog_run_xdp(xdp_prog, &xdp); switch (act) { case XDP_PASS: err = xdp_update_frame_from_buff(&xdp, xdpf); if (unlikely(err < 0)) xdp_return_frame_rx_napi(xdpf); else frames[nframes++] = xdpf; break; default: bpf_warn_invalid_xdp_action(NULL, xdp_prog, act); fallthrough; case XDP_ABORTED: trace_xdp_exception(tx_dev, xdp_prog, act); fallthrough; case XDP_DROP: xdp_return_frame_rx_napi(xdpf); break; } } return nframes; /* sent frames count */ } static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags) { struct net_device *dev = bq->dev; unsigned int cnt = bq->count; int sent = 0, err = 0; int 
to_send = cnt; int i; if (unlikely(!cnt)) return; for (i = 0; i < cnt; i++) { struct xdp_frame *xdpf = bq->q[i]; prefetch(xdpf); } if (bq->xdp_prog) { to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev, bq->dev_rx); if (!to_send) goto out; } sent = dev->netdev_ops->ndo_xdp_xmit(dev, to_send, bq->q, flags); if (sent < 0) { /* If ndo_xdp_xmit fails with an errno, no frames have * been xmit'ed. */ err = sent; sent = 0; } /* If not all frames have been transmitted, it is our * responsibility to free them */ for (i = sent; unlikely(i < to_send); i++) xdp_return_frame_rx_napi(bq->q[i]); out: bq->count = 0; trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent, err); } /* __dev_flush is called from xdp_do_flush() which _must_ be signalled from the * driver before returning from its napi->poll() routine. See the comment above * xdp_do_flush() in filter.c. */ void __dev_flush(struct list_head *flush_list) { struct xdp_dev_bulk_queue *bq, *tmp; list_for_each_entry_safe(bq, tmp, flush_list, flush_node) { bq_xmit_all(bq, XDP_XMIT_FLUSH); bq->dev_rx = NULL; bq->xdp_prog = NULL; __list_del_clearprev(&bq->flush_node); } } /* Elements are kept alive by RCU; either by rcu_read_lock() (from syscall) or * by local_bh_disable() (from XDP calls inside NAPI). The * rcu_read_lock_bh_held() below makes lockdep accept both. */ static void *__dev_map_lookup_elem(struct bpf_map *map, u32 key) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); struct bpf_dtab_netdev *obj; if (key >= map->max_entries) return NULL; obj = rcu_dereference_check(dtab->netdev_map[key], rcu_read_lock_bh_held()); return obj; } /* Runs in NAPI, i.e., softirq under local_bh_disable(). Thus, safe percpu * variable access, and map elements stick around. See comment above * xdp_do_flush() in filter.c. */ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf, struct net_device *dev_rx, struct bpf_prog *xdp_prog) { struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq); if (unlikely(bq->count == DEV_MAP_BULK_SIZE)) bq_xmit_all(bq, 0); /* Ingress dev_rx will be the same for all xdp_frame's in * bulk_queue, because bq stored per-CPU and must be flushed * from net_device drivers NAPI func end. * * Do the same with xdp_prog and flush_list since these fields * are only ever modified together. 
*/ if (!bq->dev_rx) { struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list(); bq->dev_rx = dev_rx; bq->xdp_prog = xdp_prog; list_add(&bq->flush_node, flush_list); } bq->q[bq->count++] = xdpf; } static inline int __xdp_enqueue(struct net_device *dev, struct xdp_frame *xdpf, struct net_device *dev_rx, struct bpf_prog *xdp_prog) { int err; if (!(dev->xdp_features & NETDEV_XDP_ACT_NDO_XMIT)) return -EOPNOTSUPP; if (unlikely(!(dev->xdp_features & NETDEV_XDP_ACT_NDO_XMIT_SG) && xdp_frame_has_frags(xdpf))) return -EOPNOTSUPP; err = xdp_ok_fwd_dev(dev, xdp_get_frame_len(xdpf)); if (unlikely(err)) return err; bq_enqueue(dev, xdpf, dev_rx, xdp_prog); return 0; } static u32 dev_map_bpf_prog_run_skb(struct sk_buff *skb, struct bpf_dtab_netdev *dst) { struct xdp_txq_info txq = { .dev = dst->dev }; struct xdp_buff xdp; u32 act; if (!dst->xdp_prog) return XDP_PASS; __skb_pull(skb, skb->mac_len); xdp.txq = &txq; act = bpf_prog_run_generic_xdp(skb, &xdp, dst->xdp_prog); switch (act) { case XDP_PASS: __skb_push(skb, skb->mac_len); break; default: bpf_warn_invalid_xdp_action(NULL, dst->xdp_prog, act); fallthrough; case XDP_ABORTED: trace_xdp_exception(dst->dev, dst->xdp_prog, act); fallthrough; case XDP_DROP: kfree_skb(skb); break; } return act; } int dev_xdp_enqueue(struct net_device *dev, struct xdp_frame *xdpf, struct net_device *dev_rx) { return __xdp_enqueue(dev, xdpf, dev_rx, NULL); } int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_frame *xdpf, struct net_device *dev_rx) { struct net_device *dev = dst->dev; return __xdp_enqueue(dev, xdpf, dev_rx, dst->xdp_prog); } static bool is_valid_dst(struct bpf_dtab_netdev *obj, struct xdp_frame *xdpf) { if (!obj) return false; if (!(obj->dev->xdp_features & NETDEV_XDP_ACT_NDO_XMIT)) return false; if (unlikely(!(obj->dev->xdp_features & NETDEV_XDP_ACT_NDO_XMIT_SG) && xdp_frame_has_frags(xdpf))) return false; if (xdp_ok_fwd_dev(obj->dev, xdp_get_frame_len(xdpf))) return false; return true; } static int dev_map_enqueue_clone(struct bpf_dtab_netdev *obj, struct net_device *dev_rx, struct xdp_frame *xdpf) { struct xdp_frame *nxdpf; nxdpf = xdpf_clone(xdpf); if (!nxdpf) return -ENOMEM; bq_enqueue(obj->dev, nxdpf, dev_rx, obj->xdp_prog); return 0; } static inline bool is_ifindex_excluded(int *excluded, int num_excluded, int ifindex) { while (num_excluded--) { if (ifindex == excluded[num_excluded]) return true; } return false; } /* Get ifindex of each upper device. 'indexes' must be able to hold at * least MAX_NEST_DEV elements. * Returns the number of ifindexes added. 
*/ static int get_upper_ifindexes(struct net_device *dev, int *indexes) { struct net_device *upper; struct list_head *iter; int n = 0; netdev_for_each_upper_dev_rcu(dev, upper, iter) { indexes[n++] = upper->ifindex; } return n; } int dev_map_enqueue_multi(struct xdp_frame *xdpf, struct net_device *dev_rx, struct bpf_map *map, bool exclude_ingress) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); struct bpf_dtab_netdev *dst, *last_dst = NULL; int excluded_devices[1+MAX_NEST_DEV]; struct hlist_head *head; int num_excluded = 0; unsigned int i; int err; if (exclude_ingress) { num_excluded = get_upper_ifindexes(dev_rx, excluded_devices); excluded_devices[num_excluded++] = dev_rx->ifindex; } if (map->map_type == BPF_MAP_TYPE_DEVMAP) { for (i = 0; i < map->max_entries; i++) { dst = rcu_dereference_check(dtab->netdev_map[i], rcu_read_lock_bh_held()); if (!is_valid_dst(dst, xdpf)) continue; if (is_ifindex_excluded(excluded_devices, num_excluded, dst->dev->ifindex)) continue; /* we only need n-1 clones; last_dst enqueued below */ if (!last_dst) { last_dst = dst; continue; } err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf); if (err) return err; last_dst = dst; } } else { /* BPF_MAP_TYPE_DEVMAP_HASH */ for (i = 0; i < dtab->n_buckets; i++) { head = dev_map_index_hash(dtab, i); hlist_for_each_entry_rcu(dst, head, index_hlist, lockdep_is_held(&dtab->index_lock)) { if (!is_valid_dst(dst, xdpf)) continue; if (is_ifindex_excluded(excluded_devices, num_excluded, dst->dev->ifindex)) continue; /* we only need n-1 clones; last_dst enqueued below */ if (!last_dst) { last_dst = dst; continue; } err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf); if (err) return err; last_dst = dst; } } } /* consume the last copy of the frame */ if (last_dst) bq_enqueue(last_dst->dev, xdpf, dev_rx, last_dst->xdp_prog); else xdp_return_frame_rx_napi(xdpf); /* dtab is empty */ return 0; } int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb, const struct bpf_prog *xdp_prog) { int err; err = xdp_ok_fwd_dev(dst->dev, skb->len); if (unlikely(err)) return err; /* Redirect has already succeeded semantically at this point, so we just * return 0 even if packet is dropped. Helper below takes care of * freeing skb. 
*/ if (dev_map_bpf_prog_run_skb(skb, dst) != XDP_PASS) return 0; skb->dev = dst->dev; generic_xdp_tx(skb, xdp_prog); return 0; } static int dev_map_redirect_clone(struct bpf_dtab_netdev *dst, struct sk_buff *skb, const struct bpf_prog *xdp_prog) { struct sk_buff *nskb; int err; nskb = skb_clone(skb, GFP_ATOMIC); if (!nskb) return -ENOMEM; err = dev_map_generic_redirect(dst, nskb, xdp_prog); if (unlikely(err)) { consume_skb(nskb); return err; } return 0; } int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb, const struct bpf_prog *xdp_prog, struct bpf_map *map, bool exclude_ingress) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); struct bpf_dtab_netdev *dst, *last_dst = NULL; int excluded_devices[1+MAX_NEST_DEV]; struct hlist_head *head; struct hlist_node *next; int num_excluded = 0; unsigned int i; int err; if (exclude_ingress) { num_excluded = get_upper_ifindexes(dev, excluded_devices); excluded_devices[num_excluded++] = dev->ifindex; } if (map->map_type == BPF_MAP_TYPE_DEVMAP) { for (i = 0; i < map->max_entries; i++) { dst = rcu_dereference_check(dtab->netdev_map[i], rcu_read_lock_bh_held()); if (!dst) continue; if (is_ifindex_excluded(excluded_devices, num_excluded, dst->dev->ifindex)) continue; /* we only need n-1 clones; last_dst enqueued below */ if (!last_dst) { last_dst = dst; continue; } err = dev_map_redirect_clone(last_dst, skb, xdp_prog); if (err) return err; last_dst = dst; } } else { /* BPF_MAP_TYPE_DEVMAP_HASH */ for (i = 0; i < dtab->n_buckets; i++) { head = dev_map_index_hash(dtab, i); hlist_for_each_entry_safe(dst, next, head, index_hlist) { if (is_ifindex_excluded(excluded_devices, num_excluded, dst->dev->ifindex)) continue; /* we only need n-1 clones; last_dst enqueued below */ if (!last_dst) { last_dst = dst; continue; } err = dev_map_redirect_clone(last_dst, skb, xdp_prog); if (err) return err; last_dst = dst; } } } /* consume the first skb and return */ if (last_dst) return dev_map_generic_redirect(last_dst, skb, xdp_prog); /* dtab is empty */ consume_skb(skb); return 0; } static void *dev_map_lookup_elem(struct bpf_map *map, void *key) { struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key); return obj ? &obj->val : NULL; } static void *dev_map_hash_lookup_elem(struct bpf_map *map, void *key) { struct bpf_dtab_netdev *obj = __dev_map_hash_lookup_elem(map, *(u32 *)key); return obj ? 
&obj->val : NULL; } static void __dev_map_entry_free(struct rcu_head *rcu) { struct bpf_dtab_netdev *dev; dev = container_of(rcu, struct bpf_dtab_netdev, rcu); if (dev->xdp_prog) bpf_prog_put(dev->xdp_prog); dev_put(dev->dev); kfree(dev); } static long dev_map_delete_elem(struct bpf_map *map, void *key) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); struct bpf_dtab_netdev *old_dev; u32 k = *(u32 *)key; if (k >= map->max_entries) return -EINVAL; old_dev = unrcu_pointer(xchg(&dtab->netdev_map[k], NULL)); if (old_dev) { call_rcu(&old_dev->rcu, __dev_map_entry_free); atomic_dec((atomic_t *)&dtab->items); } return 0; } static long dev_map_hash_delete_elem(struct bpf_map *map, void *key) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); struct bpf_dtab_netdev *old_dev; u32 k = *(u32 *)key; unsigned long flags; int ret = -ENOENT; spin_lock_irqsave(&dtab->index_lock, flags); old_dev = __dev_map_hash_lookup_elem(map, k); if (old_dev) { dtab->items--; hlist_del_init_rcu(&old_dev->index_hlist); call_rcu(&old_dev->rcu, __dev_map_entry_free); ret = 0; } spin_unlock_irqrestore(&dtab->index_lock, flags); return ret; } static struct bpf_dtab_netdev *__dev_map_alloc_node(struct net *net, struct bpf_dtab *dtab, struct bpf_devmap_val *val, unsigned int idx) { struct bpf_prog *prog = NULL; struct bpf_dtab_netdev *dev; dev = bpf_map_kmalloc_node(&dtab->map, sizeof(*dev), GFP_NOWAIT, dtab->map.numa_node); if (!dev) return ERR_PTR(-ENOMEM); dev->dev = dev_get_by_index(net, val->ifindex); if (!dev->dev) goto err_out; if (val->bpf_prog.fd > 0) { prog = bpf_prog_get_type_dev(val->bpf_prog.fd, BPF_PROG_TYPE_XDP, false); if (IS_ERR(prog)) goto err_put_dev; if (prog->expected_attach_type != BPF_XDP_DEVMAP || !bpf_prog_map_compatible(&dtab->map, prog)) goto err_put_prog; } dev->idx = idx; if (prog) { dev->xdp_prog = prog; dev->val.bpf_prog.id = prog->aux->id; } else { dev->xdp_prog = NULL; dev->val.bpf_prog.id = 0; } dev->val.ifindex = val->ifindex; return dev; err_put_prog: bpf_prog_put(prog); err_put_dev: dev_put(dev->dev); err_out: kfree(dev); return ERR_PTR(-EINVAL); } static long __dev_map_update_elem(struct net *net, struct bpf_map *map, void *key, void *value, u64 map_flags) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); struct bpf_dtab_netdev *dev, *old_dev; struct bpf_devmap_val val = {}; u32 i = *(u32 *)key; if (unlikely(map_flags > BPF_EXIST)) return -EINVAL; if (unlikely(i >= dtab->map.max_entries)) return -E2BIG; if (unlikely(map_flags == BPF_NOEXIST)) return -EEXIST; /* already verified value_size <= sizeof val */ memcpy(&val, value, map->value_size); if (!val.ifindex) { dev = NULL; /* can not specify fd if ifindex is 0 */ if (val.bpf_prog.fd > 0) return -EINVAL; } else { dev = __dev_map_alloc_node(net, dtab, &val, i); if (IS_ERR(dev)) return PTR_ERR(dev); } /* Use call_rcu() here to ensure rcu critical sections have completed * Remembering the driver side flush operation will happen before the * net device is removed. 
*/ old_dev = unrcu_pointer(xchg(&dtab->netdev_map[i], RCU_INITIALIZER(dev))); if (old_dev) call_rcu(&old_dev->rcu, __dev_map_entry_free); else atomic_inc((atomic_t *)&dtab->items); return 0; } static long dev_map_update_elem(struct bpf_map *map, void *key, void *value, u64 map_flags) { return __dev_map_update_elem(current->nsproxy->net_ns, map, key, value, map_flags); } static long __dev_map_hash_update_elem(struct net *net, struct bpf_map *map, void *key, void *value, u64 map_flags) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); struct bpf_dtab_netdev *dev, *old_dev; struct bpf_devmap_val val = {}; u32 idx = *(u32 *)key; unsigned long flags; int err = -EEXIST; /* already verified value_size <= sizeof val */ memcpy(&val, value, map->value_size); if (unlikely(map_flags > BPF_EXIST || !val.ifindex)) return -EINVAL; spin_lock_irqsave(&dtab->index_lock, flags); old_dev = __dev_map_hash_lookup_elem(map, idx); if (old_dev && (map_flags & BPF_NOEXIST)) goto out_err; dev = __dev_map_alloc_node(net, dtab, &val, idx); if (IS_ERR(dev)) { err = PTR_ERR(dev); goto out_err; } if (old_dev) { hlist_del_rcu(&old_dev->index_hlist); } else { if (dtab->items >= dtab->map.max_entries) { spin_unlock_irqrestore(&dtab->index_lock, flags); call_rcu(&dev->rcu, __dev_map_entry_free); return -E2BIG; } dtab->items++; } hlist_add_head_rcu(&dev->index_hlist, dev_map_index_hash(dtab, idx)); spin_unlock_irqrestore(&dtab->index_lock, flags); if (old_dev) call_rcu(&old_dev->rcu, __dev_map_entry_free); return 0; out_err: spin_unlock_irqrestore(&dtab->index_lock, flags); return err; } static long dev_map_hash_update_elem(struct bpf_map *map, void *key, void *value, u64 map_flags) { return __dev_map_hash_update_elem(current->nsproxy->net_ns, map, key, value, map_flags); } static long dev_map_redirect(struct bpf_map *map, u64 ifindex, u64 flags) { return __bpf_xdp_redirect_map(map, ifindex, flags, BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS, __dev_map_lookup_elem); } static long dev_hash_map_redirect(struct bpf_map *map, u64 ifindex, u64 flags) { return __bpf_xdp_redirect_map(map, ifindex, flags, BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS, __dev_map_hash_lookup_elem); } static u64 dev_map_mem_usage(const struct bpf_map *map) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); u64 usage = sizeof(struct bpf_dtab); if (map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) usage += (u64)dtab->n_buckets * sizeof(struct hlist_head); else usage += (u64)map->max_entries * sizeof(struct bpf_dtab_netdev *); usage += atomic_read((atomic_t *)&dtab->items) * (u64)sizeof(struct bpf_dtab_netdev); return usage; } BTF_ID_LIST_SINGLE(dev_map_btf_ids, struct, bpf_dtab) const struct bpf_map_ops dev_map_ops = { .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = dev_map_alloc_check, .map_alloc = dev_map_alloc, .map_free = dev_map_free, .map_get_next_key = dev_map_get_next_key, .map_lookup_elem = dev_map_lookup_elem, .map_update_elem = dev_map_update_elem, .map_delete_elem = dev_map_delete_elem, .map_check_btf = map_check_no_btf, .map_mem_usage = dev_map_mem_usage, .map_btf_id = &dev_map_btf_ids[0], .map_redirect = dev_map_redirect, }; const struct bpf_map_ops dev_map_hash_ops = { .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = dev_map_alloc_check, .map_alloc = dev_map_alloc, .map_free = dev_map_free, .map_get_next_key = dev_map_hash_get_next_key, .map_lookup_elem = dev_map_hash_lookup_elem, .map_update_elem = dev_map_hash_update_elem, .map_delete_elem = dev_map_hash_delete_elem, .map_check_btf = map_check_no_btf, 
.map_mem_usage = dev_map_mem_usage, .map_btf_id = &dev_map_btf_ids[0], .map_redirect = dev_hash_map_redirect, }; static void dev_map_hash_remove_netdev(struct bpf_dtab *dtab, struct net_device *netdev) { unsigned long flags; u32 i; spin_lock_irqsave(&dtab->index_lock, flags); for (i = 0; i < dtab->n_buckets; i++) { struct bpf_dtab_netdev *dev; struct hlist_head *head; struct hlist_node *next; head = dev_map_index_hash(dtab, i); hlist_for_each_entry_safe(dev, next, head, index_hlist) { if (netdev != dev->dev) continue; dtab->items--; hlist_del_rcu(&dev->index_hlist); call_rcu(&dev->rcu, __dev_map_entry_free); } } spin_unlock_irqrestore(&dtab->index_lock, flags); } static int dev_map_notification(struct notifier_block *notifier, ulong event, void *ptr) { struct net_device *netdev = netdev_notifier_info_to_dev(ptr); struct bpf_dtab *dtab; int i, cpu; switch (event) { case NETDEV_REGISTER: if (!netdev->netdev_ops->ndo_xdp_xmit || netdev->xdp_bulkq) break; /* will be freed in free_netdev() */ netdev->xdp_bulkq = alloc_percpu(struct xdp_dev_bulk_queue); if (!netdev->xdp_bulkq) return NOTIFY_BAD; for_each_possible_cpu(cpu) per_cpu_ptr(netdev->xdp_bulkq, cpu)->dev = netdev; break; case NETDEV_UNREGISTER: /* This rcu_read_lock/unlock pair is needed because * dev_map_list is an RCU list AND to ensure a delete * operation does not free a netdev_map entry while we * are comparing it against the netdev being unregistered. */ rcu_read_lock(); list_for_each_entry_rcu(dtab, &dev_map_list, list) { if (dtab->map.map_type == BPF_MAP_TYPE_DEVMAP_HASH) { dev_map_hash_remove_netdev(dtab, netdev); continue; } for (i = 0; i < dtab->map.max_entries; i++) { struct bpf_dtab_netdev *dev, *odev; dev = rcu_dereference(dtab->netdev_map[i]); if (!dev || netdev != dev->dev) continue; odev = unrcu_pointer(cmpxchg(&dtab->netdev_map[i], RCU_INITIALIZER(dev), NULL)); if (dev == odev) { call_rcu(&dev->rcu, __dev_map_entry_free); atomic_dec((atomic_t *)&dtab->items); } } } rcu_read_unlock(); break; default: break; } return NOTIFY_OK; } static struct notifier_block dev_map_notifier = { .notifier_call = dev_map_notification, }; static int __init dev_map_init(void) { /* Assure tracepoint shadow struct _bpf_dtab_netdev is in sync */ BUILD_BUG_ON(offsetof(struct bpf_dtab_netdev, dev) != offsetof(struct _bpf_dtab_netdev, dev)); register_netdevice_notifier(&dev_map_notifier); return 0; } subsys_initcall(dev_map_init); |
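/*
 * Illustrative sketch (not part of this file or of the kernel build): a
 * minimal BPF-side and user-space use of a devmap, the map type implemented
 * above. Map and program names are invented for the example; it assumes the
 * usual libbpf conventions (SEC(), the .maps section, bpf_helpers.h).
 *
 *	// BPF object, built with clang -target bpf:
 *	#include <linux/bpf.h>
 *	#include <bpf/bpf_helpers.h>
 *
 *	struct {
 *		__uint(type, BPF_MAP_TYPE_DEVMAP);
 *		__uint(max_entries, 64);
 *		__type(key, __u32);
 *		__type(value, struct bpf_devmap_val);
 *	} tx_devices SEC(".maps");
 *
 *	SEC("xdp")
 *	int xdp_redirect_example(struct xdp_md *ctx)
 *	{
 *		// Forward every packet to the interface user space stored at
 *		// key 0; fall back to XDP_PASS if the slot is empty.
 *		return bpf_redirect_map(&tx_devices, 0, XDP_PASS);
 *	}
 *
 *	// User space, after loading the object with libbpf ('map' and
 *	// 'target_ifindex' are placeholders):
 *	__u32 key = 0;
 *	struct bpf_devmap_val val = { .ifindex = target_ifindex };
 *
 *	bpf_map_update_elem(bpf_map__fd(map), &key, &val, BPF_ANY);
 */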
// SPDX-License-Identifier: GPL-2.0-only #include <linux/blkdev.h> #include <linux/wait.h> #include <linux/rbtree.h> #include <linux/kthread.h> #include <linux/backing-dev.h> #include <linux/blk-cgroup.h> #include <linux/freezer.h> #include <linux/fs.h> #include <linux/pagemap.h> #include <linux/mm.h> #include <linux/sched/mm.h> #include <linux/sched.h> #include <linux/module.h> #include <linux/writeback.h> #include <linux/device.h> #include <trace/events/writeback.h> #include "internal.h" struct backing_dev_info noop_backing_dev_info; EXPORT_SYMBOL_GPL(noop_backing_dev_info); static const char *bdi_unknown_name = "(unknown)"; /* * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU * reader side locking.
*/ DEFINE_SPINLOCK(bdi_lock); static u64 bdi_id_cursor; static struct rb_root bdi_tree = RB_ROOT; LIST_HEAD(bdi_list); /* bdi_wq serves all asynchronous writeback tasks */ struct workqueue_struct *bdi_wq; #ifdef CONFIG_DEBUG_FS #include <linux/debugfs.h> #include <linux/seq_file.h> struct wb_stats { unsigned long nr_dirty; unsigned long nr_io; unsigned long nr_more_io; unsigned long nr_dirty_time; unsigned long nr_writeback; unsigned long nr_reclaimable; unsigned long nr_dirtied; unsigned long nr_written; unsigned long dirty_thresh; unsigned long wb_thresh; }; static struct dentry *bdi_debug_root; static void bdi_debug_init(void) { bdi_debug_root = debugfs_create_dir("bdi", NULL); } static void collect_wb_stats(struct wb_stats *stats, struct bdi_writeback *wb) { struct inode *inode; spin_lock(&wb->list_lock); list_for_each_entry(inode, &wb->b_dirty, i_io_list) stats->nr_dirty++; list_for_each_entry(inode, &wb->b_io, i_io_list) stats->nr_io++; list_for_each_entry(inode, &wb->b_more_io, i_io_list) stats->nr_more_io++; list_for_each_entry(inode, &wb->b_dirty_time, i_io_list) if (inode_state_read_once(inode) & I_DIRTY_TIME) stats->nr_dirty_time++; spin_unlock(&wb->list_lock); stats->nr_writeback += wb_stat(wb, WB_WRITEBACK); stats->nr_reclaimable += wb_stat(wb, WB_RECLAIMABLE); stats->nr_dirtied += wb_stat(wb, WB_DIRTIED); stats->nr_written += wb_stat(wb, WB_WRITTEN); stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh); } #ifdef CONFIG_CGROUP_WRITEBACK static void bdi_collect_stats(struct backing_dev_info *bdi, struct wb_stats *stats) { struct bdi_writeback *wb; rcu_read_lock(); list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) { if (!wb_tryget(wb)) continue; collect_wb_stats(stats, wb); wb_put(wb); } rcu_read_unlock(); } #else static void bdi_collect_stats(struct backing_dev_info *bdi, struct wb_stats *stats) { collect_wb_stats(stats, &bdi->wb); } #endif static int bdi_debug_stats_show(struct seq_file *m, void *v) { struct backing_dev_info *bdi = m->private; unsigned long background_thresh; unsigned long dirty_thresh; struct wb_stats stats; unsigned long tot_bw; global_dirty_limits(&background_thresh, &dirty_thresh); memset(&stats, 0, sizeof(stats)); stats.dirty_thresh = dirty_thresh; bdi_collect_stats(bdi, &stats); tot_bw = atomic_long_read(&bdi->tot_write_bandwidth); seq_printf(m, "BdiWriteback: %10lu kB\n" "BdiReclaimable: %10lu kB\n" "BdiDirtyThresh: %10lu kB\n" "DirtyThresh: %10lu kB\n" "BackgroundThresh: %10lu kB\n" "BdiDirtied: %10lu kB\n" "BdiWritten: %10lu kB\n" "BdiWriteBandwidth: %10lu kBps\n" "b_dirty: %10lu\n" "b_io: %10lu\n" "b_more_io: %10lu\n" "b_dirty_time: %10lu\n" "bdi_list: %10u\n" "state: %10lx\n", K(stats.nr_writeback), K(stats.nr_reclaimable), K(stats.wb_thresh), K(dirty_thresh), K(background_thresh), K(stats.nr_dirtied), K(stats.nr_written), K(tot_bw), stats.nr_dirty, stats.nr_io, stats.nr_more_io, stats.nr_dirty_time, !list_empty(&bdi->bdi_list), bdi->wb.state); return 0; } DEFINE_SHOW_ATTRIBUTE(bdi_debug_stats); static void wb_stats_show(struct seq_file *m, struct bdi_writeback *wb, struct wb_stats *stats) { seq_printf(m, "WbCgIno: %10lu\n" "WbWriteback: %10lu kB\n" "WbReclaimable: %10lu kB\n" "WbDirtyThresh: %10lu kB\n" "WbDirtied: %10lu kB\n" "WbWritten: %10lu kB\n" "WbWriteBandwidth: %10lu kBps\n" "b_dirty: %10lu\n" "b_io: %10lu\n" "b_more_io: %10lu\n" "b_dirty_time: %10lu\n" "state: %10lx\n\n", #ifdef CONFIG_CGROUP_WRITEBACK cgroup_ino(wb->memcg_css->cgroup), #else 1ul, #endif K(stats->nr_writeback), K(stats->nr_reclaimable), K(stats->wb_thresh), 
K(stats->nr_dirtied), K(stats->nr_written), K(wb->avg_write_bandwidth), stats->nr_dirty, stats->nr_io, stats->nr_more_io, stats->nr_dirty_time, wb->state); } static int cgwb_debug_stats_show(struct seq_file *m, void *v) { struct backing_dev_info *bdi = m->private; unsigned long background_thresh; unsigned long dirty_thresh; struct bdi_writeback *wb; global_dirty_limits(&background_thresh, &dirty_thresh); rcu_read_lock(); list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) { struct wb_stats stats = { .dirty_thresh = dirty_thresh }; if (!wb_tryget(wb)) continue; collect_wb_stats(&stats, wb); /* * Calculate thresh of wb in writeback cgroup which is min of * thresh in global domain and thresh in cgroup domain. Drop * rcu lock because cgwb_calc_thresh may sleep in * cgroup_rstat_flush. We can do so here because we have a ref. */ if (mem_cgroup_wb_domain(wb)) { rcu_read_unlock(); stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb)); rcu_read_lock(); } wb_stats_show(m, wb, &stats); wb_put(wb); } rcu_read_unlock(); return 0; } DEFINE_SHOW_ATTRIBUTE(cgwb_debug_stats); static void bdi_debug_register(struct backing_dev_info *bdi, const char *name) { bdi->debug_dir = debugfs_create_dir(name, bdi_debug_root); debugfs_create_file("stats", 0444, bdi->debug_dir, bdi, &bdi_debug_stats_fops); debugfs_create_file("wb_stats", 0444, bdi->debug_dir, bdi, &cgwb_debug_stats_fops); } static void bdi_debug_unregister(struct backing_dev_info *bdi) { debugfs_remove_recursive(bdi->debug_dir); } #else /* CONFIG_DEBUG_FS */ static inline void bdi_debug_init(void) { } static inline void bdi_debug_register(struct backing_dev_info *bdi, const char *name) { } static inline void bdi_debug_unregister(struct backing_dev_info *bdi) { } #endif /* CONFIG_DEBUG_FS */ static ssize_t read_ahead_kb_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); unsigned long read_ahead_kb; ssize_t ret; ret = kstrtoul(buf, 10, &read_ahead_kb); if (ret < 0) return ret; bdi->ra_pages = read_ahead_kb >> (PAGE_SHIFT - 10); return count; } #define BDI_SHOW(name, expr) \ static ssize_t name##_show(struct device *dev, \ struct device_attribute *attr, char *buf) \ { \ struct backing_dev_info *bdi = dev_get_drvdata(dev); \ \ return sysfs_emit(buf, "%lld\n", (long long)expr); \ } \ static DEVICE_ATTR_RW(name); BDI_SHOW(read_ahead_kb, K(bdi->ra_pages)) static ssize_t min_ratio_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); unsigned int ratio; ssize_t ret; ret = kstrtouint(buf, 10, &ratio); if (ret < 0) return ret; ret = bdi_set_min_ratio(bdi, ratio); if (!ret) ret = count; return ret; } BDI_SHOW(min_ratio, bdi->min_ratio / BDI_RATIO_SCALE) static ssize_t min_ratio_fine_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); unsigned int ratio; ssize_t ret; ret = kstrtouint(buf, 10, &ratio); if (ret < 0) return ret; ret = bdi_set_min_ratio_no_scale(bdi, ratio); if (!ret) ret = count; return ret; } BDI_SHOW(min_ratio_fine, bdi->min_ratio) static ssize_t max_ratio_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); unsigned int ratio; ssize_t ret; ret = kstrtouint(buf, 10, &ratio); if (ret < 0) return ret; ret = bdi_set_max_ratio(bdi, ratio); if (!ret) ret = count; return ret; } 
BDI_SHOW(max_ratio, bdi->max_ratio / BDI_RATIO_SCALE) static ssize_t max_ratio_fine_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); unsigned int ratio; ssize_t ret; ret = kstrtouint(buf, 10, &ratio); if (ret < 0) return ret; ret = bdi_set_max_ratio_no_scale(bdi, ratio); if (!ret) ret = count; return ret; } BDI_SHOW(max_ratio_fine, bdi->max_ratio) static ssize_t min_bytes_show(struct device *dev, struct device_attribute *attr, char *buf) { struct backing_dev_info *bdi = dev_get_drvdata(dev); return sysfs_emit(buf, "%llu\n", bdi_get_min_bytes(bdi)); } static ssize_t min_bytes_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); u64 bytes; ssize_t ret; ret = kstrtoull(buf, 10, &bytes); if (ret < 0) return ret; ret = bdi_set_min_bytes(bdi, bytes); if (!ret) ret = count; return ret; } static DEVICE_ATTR_RW(min_bytes); static ssize_t max_bytes_show(struct device *dev, struct device_attribute *attr, char *buf) { struct backing_dev_info *bdi = dev_get_drvdata(dev); return sysfs_emit(buf, "%llu\n", bdi_get_max_bytes(bdi)); } static ssize_t max_bytes_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); u64 bytes; ssize_t ret; ret = kstrtoull(buf, 10, &bytes); if (ret < 0) return ret; ret = bdi_set_max_bytes(bdi, bytes); if (!ret) ret = count; return ret; } static DEVICE_ATTR_RW(max_bytes); static ssize_t stable_pages_required_show(struct device *dev, struct device_attribute *attr, char *buf) { dev_warn_once(dev, "the stable_pages_required attribute has been removed. Use the stable_writes queue attribute instead.\n"); return sysfs_emit(buf, "%d\n", 0); } static DEVICE_ATTR_RO(stable_pages_required); static ssize_t strict_limit_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); unsigned int strict_limit; ssize_t ret; ret = kstrtouint(buf, 10, &strict_limit); if (ret < 0) return ret; ret = bdi_set_strict_limit(bdi, strict_limit); if (!ret) ret = count; return ret; } static ssize_t strict_limit_show(struct device *dev, struct device_attribute *attr, char *buf) { struct backing_dev_info *bdi = dev_get_drvdata(dev); return sysfs_emit(buf, "%d\n", !!(bdi->capabilities & BDI_CAP_STRICTLIMIT)); } static DEVICE_ATTR_RW(strict_limit); static struct attribute *bdi_dev_attrs[] = { &dev_attr_read_ahead_kb.attr, &dev_attr_min_ratio.attr, &dev_attr_min_ratio_fine.attr, &dev_attr_max_ratio.attr, &dev_attr_max_ratio_fine.attr, &dev_attr_min_bytes.attr, &dev_attr_max_bytes.attr, &dev_attr_stable_pages_required.attr, &dev_attr_strict_limit.attr, NULL, }; ATTRIBUTE_GROUPS(bdi_dev); static const struct class bdi_class = { .name = "bdi", .dev_groups = bdi_dev_groups, }; static __init int bdi_class_init(void) { int ret; ret = class_register(&bdi_class); if (ret) return ret; bdi_debug_init(); return 0; } postcore_initcall(bdi_class_init); static int __init default_bdi_init(void) { bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_SYSFS, 0); if (!bdi_wq) return -ENOMEM; return 0; } subsys_initcall(default_bdi_init); static void wb_update_bandwidth_workfn(struct work_struct *work) { struct bdi_writeback *wb = container_of(to_delayed_work(work), struct bdi_writeback, bw_dwork); wb_update_bandwidth(wb); } /* * Initial write bandwidth: 100 MB/s */ 
#define INIT_BW MB_TO_PAGES(100) static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi, gfp_t gfp) { int err; memset(wb, 0, sizeof(*wb)); wb->bdi = bdi; wb->last_old_flush = jiffies; INIT_LIST_HEAD(&wb->b_dirty); INIT_LIST_HEAD(&wb->b_io); INIT_LIST_HEAD(&wb->b_more_io); INIT_LIST_HEAD(&wb->b_dirty_time); spin_lock_init(&wb->list_lock); atomic_set(&wb->writeback_inodes, 0); wb->bw_time_stamp = jiffies; wb->balanced_dirty_ratelimit = INIT_BW; wb->dirty_ratelimit = INIT_BW; wb->write_bandwidth = INIT_BW; wb->avg_write_bandwidth = INIT_BW; spin_lock_init(&wb->work_lock); INIT_LIST_HEAD(&wb->work_list); INIT_DELAYED_WORK(&wb->dwork, wb_workfn); INIT_DELAYED_WORK(&wb->bw_dwork, wb_update_bandwidth_workfn); err = fprop_local_init_percpu(&wb->completions, gfp); if (err) return err; err = percpu_counter_init_many(wb->stat, 0, gfp, NR_WB_STAT_ITEMS); if (err) fprop_local_destroy_percpu(&wb->completions); return err; } static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb); /* * Remove bdi from the global list and shutdown any threads we have running */ static void wb_shutdown(struct bdi_writeback *wb) { /* Make sure nobody queues further work */ spin_lock_irq(&wb->work_lock); if (!test_and_clear_bit(WB_registered, &wb->state)) { spin_unlock_irq(&wb->work_lock); return; } spin_unlock_irq(&wb->work_lock); cgwb_remove_from_bdi_list(wb); /* * Drain work list and shutdown the delayed_work. !WB_registered * tells wb_workfn() that @wb is dying and its work_list needs to * be drained no matter what. */ mod_delayed_work(bdi_wq, &wb->dwork, 0); flush_delayed_work(&wb->dwork); WARN_ON(!list_empty(&wb->work_list)); flush_delayed_work(&wb->bw_dwork); } static void wb_exit(struct bdi_writeback *wb) { WARN_ON(delayed_work_pending(&wb->dwork)); percpu_counter_destroy_many(wb->stat, NR_WB_STAT_ITEMS); fprop_local_destroy_percpu(&wb->completions); } #ifdef CONFIG_CGROUP_WRITEBACK #include <linux/memcontrol.h> /* * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, offline_cgwbs and * memcg->cgwb_list. bdi->cgwb_tree is also RCU protected. 
*/ static DEFINE_SPINLOCK(cgwb_lock); static struct workqueue_struct *cgwb_release_wq; static LIST_HEAD(offline_cgwbs); static void cleanup_offline_cgwbs_workfn(struct work_struct *work); static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn); static void cgwb_free_rcu(struct rcu_head *rcu_head) { struct bdi_writeback *wb = container_of(rcu_head, struct bdi_writeback, rcu); percpu_ref_exit(&wb->refcnt); kfree(wb); } static void cgwb_release_workfn(struct work_struct *work) { struct bdi_writeback *wb = container_of(work, struct bdi_writeback, release_work); struct backing_dev_info *bdi = wb->bdi; mutex_lock(&wb->bdi->cgwb_release_mutex); wb_shutdown(wb); css_put(wb->memcg_css); css_put(wb->blkcg_css); mutex_unlock(&wb->bdi->cgwb_release_mutex); /* triggers blkg destruction if no online users left */ blkcg_unpin_online(wb->blkcg_css); fprop_local_destroy_percpu(&wb->memcg_completions); spin_lock_irq(&cgwb_lock); list_del(&wb->offline_node); spin_unlock_irq(&cgwb_lock); wb_exit(wb); bdi_put(bdi); WARN_ON_ONCE(!list_empty(&wb->b_attached)); WARN_ON_ONCE(work_pending(&wb->switch_work)); call_rcu(&wb->rcu, cgwb_free_rcu); } static void cgwb_release(struct percpu_ref *refcnt) { struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback, refcnt); queue_work(cgwb_release_wq, &wb->release_work); } static void cgwb_kill(struct bdi_writeback *wb) { lockdep_assert_held(&cgwb_lock); WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id)); list_del(&wb->memcg_node); list_del(&wb->blkcg_node); list_add(&wb->offline_node, &offline_cgwbs); percpu_ref_kill(&wb->refcnt); } static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) { spin_lock_irq(&cgwb_lock); list_del_rcu(&wb->bdi_node); spin_unlock_irq(&cgwb_lock); } static int cgwb_create(struct backing_dev_info *bdi, struct cgroup_subsys_state *memcg_css, gfp_t gfp) { struct mem_cgroup *memcg; struct cgroup_subsys_state *blkcg_css; struct list_head *memcg_cgwb_list, *blkcg_cgwb_list; struct bdi_writeback *wb; unsigned long flags; int ret = 0; memcg = mem_cgroup_from_css(memcg_css); blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &io_cgrp_subsys); memcg_cgwb_list = &memcg->cgwb_list; blkcg_cgwb_list = blkcg_get_cgwb_list(blkcg_css); /* look up again under lock and discard on blkcg mismatch */ spin_lock_irqsave(&cgwb_lock, flags); wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); if (wb && wb->blkcg_css != blkcg_css) { cgwb_kill(wb); wb = NULL; } spin_unlock_irqrestore(&cgwb_lock, flags); if (wb) goto out_put; /* need to create a new one */ wb = kmalloc(sizeof(*wb), gfp); if (!wb) { ret = -ENOMEM; goto out_put; } ret = wb_init(wb, bdi, gfp); if (ret) goto err_free; ret = percpu_ref_init(&wb->refcnt, cgwb_release, 0, gfp); if (ret) goto err_wb_exit; ret = fprop_local_init_percpu(&wb->memcg_completions, gfp); if (ret) goto err_ref_exit; wb->memcg_css = memcg_css; wb->blkcg_css = blkcg_css; INIT_LIST_HEAD(&wb->b_attached); INIT_WORK(&wb->switch_work, inode_switch_wbs_work_fn); init_llist_head(&wb->switch_wbs_ctxs); INIT_WORK(&wb->release_work, cgwb_release_workfn); set_bit(WB_registered, &wb->state); bdi_get(bdi); /* * The root wb determines the registered state of the whole bdi and * memcg_cgwb_list and blkcg_cgwb_list's next pointers indicate * whether they're still online. Don't link @wb if any is dead. * See wb_memcg_offline() and wb_blkcg_offline(). 
*/ ret = -ENODEV; spin_lock_irqsave(&cgwb_lock, flags); if (test_bit(WB_registered, &bdi->wb.state) && blkcg_cgwb_list->next && memcg_cgwb_list->next) { /* we might have raced another instance of this function */ ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb); if (!ret) { list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list); list_add(&wb->memcg_node, memcg_cgwb_list); list_add(&wb->blkcg_node, blkcg_cgwb_list); blkcg_pin_online(blkcg_css); css_get(memcg_css); css_get(blkcg_css); } } spin_unlock_irqrestore(&cgwb_lock, flags); if (ret) { if (ret == -EEXIST) ret = 0; goto err_fprop_exit; } goto out_put; err_fprop_exit: bdi_put(bdi); fprop_local_destroy_percpu(&wb->memcg_completions); err_ref_exit: percpu_ref_exit(&wb->refcnt); err_wb_exit: wb_exit(wb); err_free: kfree(wb); out_put: css_put(blkcg_css); return ret; } /** * wb_get_lookup - get wb for a given memcg * @bdi: target bdi * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref) * * Try to get the wb for @memcg_css on @bdi. The returned wb has its * refcount incremented. * * This function uses css_get() on @memcg_css and thus expects its refcnt * to be positive on invocation. IOW, rcu_read_lock() protection on * @memcg_css isn't enough. try_get it before calling this function. * * A wb is keyed by its associated memcg. As blkcg implicitly enables * memcg on the default hierarchy, memcg association is guaranteed to be * more specific (equal or descendant to the associated blkcg) and thus can * identify both the memcg and blkcg associations. * * Because the blkcg associated with a memcg may change as blkcg is enabled * and disabled closer to root in the hierarchy, each wb keeps track of * both the memcg and blkcg associated with it and verifies the blkcg on * each lookup. On mismatch, the existing wb is discarded and a new one is * created. */ struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, struct cgroup_subsys_state *memcg_css) { struct bdi_writeback *wb; if (!memcg_css->parent) return &bdi->wb; rcu_read_lock(); wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); if (wb) { struct cgroup_subsys_state *blkcg_css; /* see whether the blkcg association has changed */ blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &io_cgrp_subsys); if (unlikely(wb->blkcg_css != blkcg_css || !wb_tryget(wb))) wb = NULL; css_put(blkcg_css); } rcu_read_unlock(); return wb; } /** * wb_get_create - get wb for a given memcg, create if necessary * @bdi: target bdi * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref) * @gfp: allocation mask to use * * Try to get the wb for @memcg_css on @bdi. If it doesn't exist, try to * create one. See wb_get_lookup() for more details. 
*/ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, struct cgroup_subsys_state *memcg_css, gfp_t gfp) { struct bdi_writeback *wb; might_alloc(gfp); do { wb = wb_get_lookup(bdi, memcg_css); } while (!wb && !cgwb_create(bdi, memcg_css, gfp)); return wb; } static int cgwb_bdi_init(struct backing_dev_info *bdi) { int ret; INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC); mutex_init(&bdi->cgwb_release_mutex); init_rwsem(&bdi->wb_switch_rwsem); ret = wb_init(&bdi->wb, bdi, GFP_KERNEL); if (!ret) { bdi->wb.memcg_css = &root_mem_cgroup->css; bdi->wb.blkcg_css = blkcg_root_css; INIT_WORK(&bdi->wb.switch_work, inode_switch_wbs_work_fn); init_llist_head(&bdi->wb.switch_wbs_ctxs); } return ret; } static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { struct radix_tree_iter iter; void **slot; struct bdi_writeback *wb; WARN_ON(test_bit(WB_registered, &bdi->wb.state)); spin_lock_irq(&cgwb_lock); radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) cgwb_kill(*slot); spin_unlock_irq(&cgwb_lock); mutex_lock(&bdi->cgwb_release_mutex); spin_lock_irq(&cgwb_lock); while (!list_empty(&bdi->wb_list)) { wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); spin_unlock_irq(&cgwb_lock); wb_shutdown(wb); spin_lock_irq(&cgwb_lock); } spin_unlock_irq(&cgwb_lock); mutex_unlock(&bdi->cgwb_release_mutex); } /* * cleanup_offline_cgwbs_workfn - try to release dying cgwbs * * Try to release dying cgwbs by switching attached inodes to the nearest * living ancestor's writeback. Processed wbs are placed at the end * of the list to guarantee the forward progress. */ static void cleanup_offline_cgwbs_workfn(struct work_struct *work) { struct bdi_writeback *wb; LIST_HEAD(processed); spin_lock_irq(&cgwb_lock); while (!list_empty(&offline_cgwbs)) { wb = list_first_entry(&offline_cgwbs, struct bdi_writeback, offline_node); list_move(&wb->offline_node, &processed); /* * If wb is dirty, cleaning up the writeback by switching * attached inodes will result in an effective removal of any * bandwidth restrictions, which isn't the goal. Instead, * it can be postponed until the next time, when all io * will be likely completed. If in the meantime some inodes * will get re-dirtied, they should be eventually switched to * a new cgwb. */ if (wb_has_dirty_io(wb)) continue; if (!wb_tryget(wb)) continue; spin_unlock_irq(&cgwb_lock); while (cleanup_offline_cgwb(wb)) cond_resched(); spin_lock_irq(&cgwb_lock); wb_put(wb); } if (!list_empty(&processed)) list_splice_tail(&processed, &offline_cgwbs); spin_unlock_irq(&cgwb_lock); } /** * wb_memcg_offline - kill all wb's associated with a memcg being offlined * @memcg: memcg being offlined * * Also prevents creation of any new wb's associated with @memcg. */ void wb_memcg_offline(struct mem_cgroup *memcg) { struct list_head *memcg_cgwb_list = &memcg->cgwb_list; struct bdi_writeback *wb, *next; spin_lock_irq(&cgwb_lock); list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node) cgwb_kill(wb); memcg_cgwb_list->next = NULL; /* prevent new wb's */ spin_unlock_irq(&cgwb_lock); queue_work(system_unbound_wq, &cleanup_offline_cgwbs_work); } /** * wb_blkcg_offline - kill all wb's associated with a blkcg being offlined * @css: blkcg being offlined * * Also prevents creation of any new wb's associated with @blkcg. 
*/ void wb_blkcg_offline(struct cgroup_subsys_state *css) { struct bdi_writeback *wb, *next; struct list_head *list = blkcg_get_cgwb_list(css); spin_lock_irq(&cgwb_lock); list_for_each_entry_safe(wb, next, list, blkcg_node) cgwb_kill(wb); list->next = NULL; /* prevent new wb's */ spin_unlock_irq(&cgwb_lock); } static void cgwb_bdi_register(struct backing_dev_info *bdi) { spin_lock_irq(&cgwb_lock); list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list); spin_unlock_irq(&cgwb_lock); } static int __init cgwb_init(void) { /* * There can be many concurrent release work items overwhelming * system_wq. Put them in a separate wq and limit concurrency. * There's no point in executing many of these in parallel. */ cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1); if (!cgwb_release_wq) return -ENOMEM; return 0; } subsys_initcall(cgwb_init); #else /* CONFIG_CGROUP_WRITEBACK */ static int cgwb_bdi_init(struct backing_dev_info *bdi) { return wb_init(&bdi->wb, bdi, GFP_KERNEL); } static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { } static void cgwb_bdi_register(struct backing_dev_info *bdi) { list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list); } static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) { list_del_rcu(&wb->bdi_node); } #endif /* CONFIG_CGROUP_WRITEBACK */ int bdi_init(struct backing_dev_info *bdi) { bdi->dev = NULL; kref_init(&bdi->refcnt); bdi->min_ratio = 0; bdi->max_ratio = 100 * BDI_RATIO_SCALE; bdi->max_prop_frac = FPROP_FRAC_BASE; INIT_LIST_HEAD(&bdi->bdi_list); INIT_LIST_HEAD(&bdi->wb_list); init_waitqueue_head(&bdi->wb_waitq); bdi->last_bdp_sleep = jiffies; return cgwb_bdi_init(bdi); } struct backing_dev_info *bdi_alloc(int node_id) { struct backing_dev_info *bdi; bdi = kzalloc_node(sizeof(*bdi), GFP_KERNEL, node_id); if (!bdi) return NULL; if (bdi_init(bdi)) { kfree(bdi); return NULL; } bdi->capabilities = BDI_CAP_WRITEBACK; bdi->ra_pages = VM_READAHEAD_PAGES; bdi->io_pages = VM_READAHEAD_PAGES; timer_setup(&bdi->laptop_mode_wb_timer, laptop_mode_timer_fn, 0); return bdi; } EXPORT_SYMBOL(bdi_alloc); static struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp) { struct rb_node **p = &bdi_tree.rb_node; struct rb_node *parent = NULL; struct backing_dev_info *bdi; lockdep_assert_held(&bdi_lock); while (*p) { parent = *p; bdi = rb_entry(parent, struct backing_dev_info, rb_node); if (bdi->id > id) p = &(*p)->rb_left; else if (bdi->id < id) p = &(*p)->rb_right; else break; } if (parentp) *parentp = parent; return p; } /** * bdi_get_by_id - lookup and get bdi from its id * @id: bdi id to lookup * * Find bdi matching @id and get it. Returns NULL if the matching bdi * doesn't exist or is already unregistered. 
*/ struct backing_dev_info *bdi_get_by_id(u64 id) { struct backing_dev_info *bdi = NULL; struct rb_node **p; spin_lock_bh(&bdi_lock); p = bdi_lookup_rb_node(id, NULL); if (*p) { bdi = rb_entry(*p, struct backing_dev_info, rb_node); bdi_get(bdi); } spin_unlock_bh(&bdi_lock); return bdi; } int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) { struct device *dev; struct rb_node *parent, **p; if (bdi->dev) /* The driver needs to use separate queues per device */ return 0; vsnprintf(bdi->dev_name, sizeof(bdi->dev_name), fmt, args); dev = device_create(&bdi_class, NULL, MKDEV(0, 0), bdi, bdi->dev_name); if (IS_ERR(dev)) return PTR_ERR(dev); cgwb_bdi_register(bdi); bdi->dev = dev; bdi_debug_register(bdi, dev_name(dev)); set_bit(WB_registered, &bdi->wb.state); spin_lock_bh(&bdi_lock); bdi->id = ++bdi_id_cursor; p = bdi_lookup_rb_node(bdi->id, &parent); rb_link_node(&bdi->rb_node, parent, p); rb_insert_color(&bdi->rb_node, &bdi_tree); list_add_tail_rcu(&bdi->bdi_list, &bdi_list); spin_unlock_bh(&bdi_lock); trace_writeback_bdi_register(bdi); return 0; } int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...) { va_list args; int ret; va_start(args, fmt); ret = bdi_register_va(bdi, fmt, args); va_end(args); return ret; } EXPORT_SYMBOL(bdi_register); void bdi_set_owner(struct backing_dev_info *bdi, struct device *owner) { WARN_ON_ONCE(bdi->owner); bdi->owner = owner; get_device(owner); } /* * Remove bdi from bdi_list, and ensure that it is no longer visible */ static void bdi_remove_from_list(struct backing_dev_info *bdi) { spin_lock_bh(&bdi_lock); rb_erase(&bdi->rb_node, &bdi_tree); list_del_rcu(&bdi->bdi_list); spin_unlock_bh(&bdi_lock); synchronize_rcu_expedited(); } void bdi_unregister(struct backing_dev_info *bdi) { timer_delete_sync(&bdi->laptop_mode_wb_timer); /* make sure nobody finds us on the bdi_list anymore */ bdi_remove_from_list(bdi); wb_shutdown(&bdi->wb); cgwb_bdi_unregister(bdi); /* * If this BDI's min ratio has been set, use bdi_set_min_ratio() to * update the global bdi_min_ratio. */ if (bdi->min_ratio) bdi_set_min_ratio(bdi, 0); if (bdi->dev) { bdi_debug_unregister(bdi); device_unregister(bdi->dev); bdi->dev = NULL; } if (bdi->owner) { put_device(bdi->owner); bdi->owner = NULL; } } EXPORT_SYMBOL(bdi_unregister); static void release_bdi(struct kref *ref) { struct backing_dev_info *bdi = container_of(ref, struct backing_dev_info, refcnt); WARN_ON_ONCE(test_bit(WB_registered, &bdi->wb.state)); WARN_ON_ONCE(bdi->dev); wb_exit(&bdi->wb); kfree(bdi); } void bdi_put(struct backing_dev_info *bdi) { kref_put(&bdi->refcnt, release_bdi); } EXPORT_SYMBOL(bdi_put); struct backing_dev_info *inode_to_bdi(struct inode *inode) { struct super_block *sb; if (!inode) return &noop_backing_dev_info; sb = inode->i_sb; #ifdef CONFIG_BLOCK if (sb_is_blkdev_sb(sb)) return I_BDEV(inode)->bd_disk->bdi; #endif return sb->s_bdi; } EXPORT_SYMBOL(inode_to_bdi); const char *bdi_dev_name(struct backing_dev_info *bdi) { if (!bdi || !bdi->dev) return bdi_unknown_name; return bdi->dev_name; } EXPORT_SYMBOL_GPL(bdi_dev_name); |
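/*
 * Usage sketch (not part of backing-dev.c): the typical lifecycle a block
 * device or similar user follows with the exported bdi API above.  The
 * "example-%d" name, node id, and error handling are illustrative
 * assumptions, not code from any real driver.
 */
#include <linux/backing-dev.h>

static struct backing_dev_info *example_bdi;	/* hypothetical owner */

static int example_bdi_attach(int node_id)
{
	int err;

	example_bdi = bdi_alloc(node_id);	/* refcount starts at one */
	if (!example_bdi)
		return -ENOMEM;

	err = bdi_register(example_bdi, "example-%d", 0);
	if (err) {
		bdi_put(example_bdi);
		example_bdi = NULL;
		return err;
	}
	return 0;	/* now visible under /sys/class/bdi/ and in debugfs */
}

static void example_bdi_detach(void)
{
	bdi_unregister(example_bdi);	/* stops writeback, removes sysfs/debugfs */
	bdi_put(example_bdi);		/* drops the final reference */
	example_bdi = NULL;
}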
/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Common values for AES algorithms
 */

#ifndef _CRYPTO_AES_H
#define _CRYPTO_AES_H

#include <linux/types.h>
#include <linux/crypto.h>

#define AES_MIN_KEY_SIZE	16
#define AES_MAX_KEY_SIZE	32
#define AES_KEYSIZE_128		16
#define AES_KEYSIZE_192		24
#define AES_KEYSIZE_256		32
#define AES_BLOCK_SIZE		16
#define AES_MAX_KEYLENGTH	(15 * 16)
#define AES_MAX_KEYLENGTH_U32	(AES_MAX_KEYLENGTH / sizeof(u32))

/*
 * Please ensure that the first two fields are 16-byte aligned
 * relative to the start of the structure, i.e., don't move them!
 */
struct crypto_aes_ctx {
	u32 key_enc[AES_MAX_KEYLENGTH_U32];
	u32 key_dec[AES_MAX_KEYLENGTH_U32];
	u32 key_length;
};

extern const u32 crypto_ft_tab[4][256] ____cacheline_aligned;
extern const u32 crypto_it_tab[4][256] ____cacheline_aligned;

/*
 * validate key length for AES algorithms
 */
static inline int aes_check_keylen(unsigned int keylen)
{
	switch (keylen) {
	case AES_KEYSIZE_128:
	case AES_KEYSIZE_192:
	case AES_KEYSIZE_256:
		break;
	default:
		return -EINVAL;
	}

	return 0;
}

int crypto_aes_set_key(struct crypto_tfm *tfm, const u8 *in_key,
		unsigned int key_len);

/**
 * aes_expandkey - Expands the AES key as described in FIPS-197
 * @ctx:	The location where the computed key will be stored.
 * @in_key:	The supplied key.
 * @key_len:	The length of the supplied key.
 *
 * Returns 0 on success. The function fails only if an invalid key size (or
 * pointer) is supplied.
 * The expanded key size is 240 bytes (max of 14 rounds with a unique 16 bytes
 * key schedule plus a 16 bytes key which is used before the first round).
 * The decryption key is prepared for the "Equivalent Inverse Cipher" as
 * described in FIPS-197. The first slot (16 bytes) of each key (enc or dec) is
 * for the initial combination, the second slot for the first round and so on.
 */
int aes_expandkey(struct crypto_aes_ctx *ctx, const u8 *in_key,
		  unsigned int key_len);

/**
 * aes_encrypt - Encrypt a single AES block
 * @ctx:	Context struct containing the key schedule
 * @out:	Buffer to store the ciphertext
 * @in:		Buffer containing the plaintext
 */
void aes_encrypt(const struct crypto_aes_ctx *ctx, u8 *out, const u8 *in);

/**
 * aes_decrypt - Decrypt a single AES block
 * @ctx:	Context struct containing the key schedule
 * @out:	Buffer to store the plaintext
 * @in:		Buffer containing the ciphertext
 */
void aes_decrypt(const struct crypto_aes_ctx *ctx, u8 *out, const u8 *in);

extern const u8 crypto_aes_sbox[];
extern const u8 crypto_aes_inv_sbox[];

void aescfb_encrypt(const struct crypto_aes_ctx *ctx, u8 *dst, const u8 *src,
		    int len, const u8 iv[AES_BLOCK_SIZE]);
void aescfb_decrypt(const struct crypto_aes_ctx *ctx, u8 *dst, const u8 *src,
		    int len, const u8 iv[AES_BLOCK_SIZE]);

#endif
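/*
 * Usage sketch (not part of this header): expanding a 128-bit key and
 * encrypting one block with the library interface declared above.  The
 * all-zero key and plaintext are placeholders for illustration only.
 */
#include <crypto/aes.h>
#include <linux/string.h>

static int example_aes_one_block(void)
{
	struct crypto_aes_ctx ctx;
	const u8 key[AES_KEYSIZE_128] = { 0 };	/* illustrative key material */
	u8 in[AES_BLOCK_SIZE] = { 0 };		/* illustrative plaintext */
	u8 out[AES_BLOCK_SIZE];
	int err;

	err = aes_expandkey(&ctx, key, sizeof(key));
	if (err)
		return err;	/* only fails for an invalid key length */

	aes_encrypt(&ctx, out, in);	/* out now holds the ciphertext */
	aes_decrypt(&ctx, in, out);	/* round trip restores the plaintext */

	memzero_explicit(&ctx, sizeof(ctx));	/* wipe the key schedule */
	return 0;
}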
/* * linux/fs/nls/mac-gaelic.c * * Charset macgaelic translation tables. * Generated automatically from the Unicode and charset * tables from the Unicode Organization (www.unicode.org). * The Unicode to charset table has only exact mappings. */ /* * COPYRIGHT AND PERMISSION NOTICE * * Copyright 1991-2012 Unicode, Inc. All rights reserved. Distributed under * the Terms of Use in http://www.unicode.org/copyright.html.
* * Permission is hereby granted, free of charge, to any person obtaining a * copy of the Unicode data files and any associated documentation (the "Data * Files") or Unicode software and any associated documentation (the * "Software") to deal in the Data Files or Software without restriction, * including without limitation the rights to use, copy, modify, merge, * publish, distribute, and/or sell copies of the Data Files or Software, and * to permit persons to whom the Data Files or Software are furnished to do * so, provided that (a) the above copyright notice(s) and this permission * notice appear with all copies of the Data Files or Software, (b) both the * above copyright notice(s) and this permission notice appear in associated * documentation, and (c) there is clear notice in each modified Data File or * in the Software as well as in the documentation associated with the Data * File(s) or Software that the data or software has been modified. * * THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY * KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF * THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS * INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT * OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF * USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR * OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR * PERFORMANCE OF THE DATA FILES OR SOFTWARE. * * Except as contained in this notice, the name of a copyright holder shall * not be used in advertising or otherwise to promote the sale, use or other * dealings in these Data Files or Software without prior written * authorization of the copyright holder. 
*/ #include <linux/module.h> #include <linux/kernel.h> #include <linux/string.h> #include <linux/nls.h> #include <linux/errno.h> static const wchar_t charset2uni[256] = { /* 0x00 */ 0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007, 0x0008, 0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x000e, 0x000f, /* 0x10 */ 0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017, 0x0018, 0x0019, 0x001a, 0x001b, 0x001c, 0x001d, 0x001e, 0x001f, /* 0x20 */ 0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, 0x0028, 0x0029, 0x002a, 0x002b, 0x002c, 0x002d, 0x002e, 0x002f, /* 0x30 */ 0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037, 0x0038, 0x0039, 0x003a, 0x003b, 0x003c, 0x003d, 0x003e, 0x003f, /* 0x40 */ 0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047, 0x0048, 0x0049, 0x004a, 0x004b, 0x004c, 0x004d, 0x004e, 0x004f, /* 0x50 */ 0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057, 0x0058, 0x0059, 0x005a, 0x005b, 0x005c, 0x005d, 0x005e, 0x005f, /* 0x60 */ 0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067, 0x0068, 0x0069, 0x006a, 0x006b, 0x006c, 0x006d, 0x006e, 0x006f, /* 0x70 */ 0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077, 0x0078, 0x0079, 0x007a, 0x007b, 0x007c, 0x007d, 0x007e, 0x007f, /* 0x80 */ 0x00c4, 0x00c5, 0x00c7, 0x00c9, 0x00d1, 0x00d6, 0x00dc, 0x00e1, 0x00e0, 0x00e2, 0x00e4, 0x00e3, 0x00e5, 0x00e7, 0x00e9, 0x00e8, /* 0x90 */ 0x00ea, 0x00eb, 0x00ed, 0x00ec, 0x00ee, 0x00ef, 0x00f1, 0x00f3, 0x00f2, 0x00f4, 0x00f6, 0x00f5, 0x00fa, 0x00f9, 0x00fb, 0x00fc, /* 0xa0 */ 0x2020, 0x00b0, 0x00a2, 0x00a3, 0x00a7, 0x2022, 0x00b6, 0x00df, 0x00ae, 0x00a9, 0x2122, 0x00b4, 0x00a8, 0x2260, 0x00c6, 0x00d8, /* 0xb0 */ 0x1e02, 0x00b1, 0x2264, 0x2265, 0x1e03, 0x010a, 0x010b, 0x1e0a, 0x1e0b, 0x1e1e, 0x1e1f, 0x0120, 0x0121, 0x1e40, 0x00e6, 0x00f8, /* 0xc0 */ 0x1e41, 0x1e56, 0x1e57, 0x027c, 0x0192, 0x017f, 0x1e60, 0x00ab, 0x00bb, 0x2026, 0x00a0, 0x00c0, 0x00c3, 0x00d5, 0x0152, 0x0153, /* 0xd0 */ 0x2013, 0x2014, 0x201c, 0x201d, 0x2018, 0x2019, 0x1e61, 0x1e9b, 0x00ff, 0x0178, 0x1e6a, 0x20ac, 0x2039, 0x203a, 0x0176, 0x0177, /* 0xe0 */ 0x1e6b, 0x00b7, 0x1ef2, 0x1ef3, 0x204a, 0x00c2, 0x00ca, 0x00c1, 0x00cb, 0x00c8, 0x00cd, 0x00ce, 0x00cf, 0x00cc, 0x00d3, 0x00d4, /* 0xf0 */ 0x2663, 0x00d2, 0x00da, 0x00db, 0x00d9, 0x0131, 0x00dd, 0x00fd, 0x0174, 0x0175, 0x1e84, 0x1e85, 0x1e80, 0x1e81, 0x1e82, 0x1e83, }; static const unsigned char page00[256] = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* 0x00-0x07 */ 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, /* 0x08-0x0f */ 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, /* 0x10-0x17 */ 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, /* 0x18-0x1f */ 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, /* 0x20-0x27 */ 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, /* 0x28-0x2f */ 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, /* 0x30-0x37 */ 0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f, /* 0x38-0x3f */ 0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, /* 0x40-0x47 */ 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, /* 0x48-0x4f */ 0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, /* 0x50-0x57 */ 0x58, 0x59, 0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f, /* 0x58-0x5f */ 0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, /* 0x60-0x67 */ 0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f, /* 0x68-0x6f */ 0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, /* 0x70-0x77 */ 0x78, 0x79, 0x7a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f, /* 0x78-0x7f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x80-0x87 */ 0x00, 0x00, 0x00, 0x00, 
0x00, 0x00, 0x00, 0x00, /* 0x88-0x8f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x90-0x97 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x98-0x9f */ 0xca, 0x00, 0xa2, 0xa3, 0x00, 0x00, 0x00, 0xa4, /* 0xa0-0xa7 */ 0xac, 0xa9, 0x00, 0xc7, 0x00, 0x00, 0xa8, 0x00, /* 0xa8-0xaf */ 0xa1, 0xb1, 0x00, 0x00, 0xab, 0x00, 0xa6, 0xe1, /* 0xb0-0xb7 */ 0x00, 0x00, 0x00, 0xc8, 0x00, 0x00, 0x00, 0x00, /* 0xb8-0xbf */ 0xcb, 0xe7, 0xe5, 0xcc, 0x80, 0x81, 0xae, 0x82, /* 0xc0-0xc7 */ 0xe9, 0x83, 0xe6, 0xe8, 0xed, 0xea, 0xeb, 0xec, /* 0xc8-0xcf */ 0x00, 0x84, 0xf1, 0xee, 0xef, 0xcd, 0x85, 0x00, /* 0xd0-0xd7 */ 0xaf, 0xf4, 0xf2, 0xf3, 0x86, 0xf6, 0x00, 0xa7, /* 0xd8-0xdf */ 0x88, 0x87, 0x89, 0x8b, 0x8a, 0x8c, 0xbe, 0x8d, /* 0xe0-0xe7 */ 0x8f, 0x8e, 0x90, 0x91, 0x93, 0x92, 0x94, 0x95, /* 0xe8-0xef */ 0x00, 0x96, 0x98, 0x97, 0x99, 0x9b, 0x9a, 0x00, /* 0xf0-0xf7 */ 0xbf, 0x9d, 0x9c, 0x9e, 0x9f, 0xf7, 0x00, 0xd8, /* 0xf8-0xff */ }; static const unsigned char page01[256] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x00-0x07 */ 0x00, 0x00, 0xb5, 0xb6, 0x00, 0x00, 0x00, 0x00, /* 0x08-0x0f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x10-0x17 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x18-0x1f */ 0xbb, 0xbc, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x20-0x27 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x28-0x2f */ 0x00, 0xf5, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x30-0x37 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x38-0x3f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x40-0x47 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x48-0x4f */ 0x00, 0x00, 0xce, 0xcf, 0x00, 0x00, 0x00, 0x00, /* 0x50-0x57 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x58-0x5f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x60-0x67 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x68-0x6f */ 0x00, 0x00, 0x00, 0x00, 0xf8, 0xf9, 0xde, 0xdf, /* 0x70-0x77 */ 0xd9, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc5, /* 0x78-0x7f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x80-0x87 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x88-0x8f */ 0x00, 0x00, 0xc4, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x90-0x97 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x98-0x9f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xa0-0xa7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xa8-0xaf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb0-0xb7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb8-0xbf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc0-0xc7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc8-0xcf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd0-0xd7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd8-0xdf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe0-0xe7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe8-0xef */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xf0-0xf7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xf8-0xff */ }; static const unsigned char page02[256] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x00-0x07 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x08-0x0f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x10-0x17 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x18-0x1f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x20-0x27 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x28-0x2f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x30-0x37 */ 0x00, 0x00, 0x00, 0x00, 0x00, 
0x00, 0x00, 0x00, /* 0x38-0x3f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x40-0x47 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x48-0x4f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x50-0x57 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x58-0x5f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x60-0x67 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x68-0x6f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x70-0x77 */ 0x00, 0x00, 0x00, 0x00, 0xc3, 0x00, 0x00, 0x00, /* 0x78-0x7f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x80-0x87 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x88-0x8f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x90-0x97 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x98-0x9f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xa0-0xa7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xa8-0xaf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb0-0xb7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb8-0xbf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc0-0xc7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc8-0xcf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd0-0xd7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd8-0xdf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe0-0xe7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe8-0xef */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xf0-0xf7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xf8-0xff */ }; static const unsigned char page1e[256] = { 0x00, 0x00, 0xb0, 0xb4, 0x00, 0x00, 0x00, 0x00, /* 0x00-0x07 */ 0x00, 0x00, 0xb7, 0xb8, 0x00, 0x00, 0x00, 0x00, /* 0x08-0x0f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x10-0x17 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xb9, 0xba, /* 0x18-0x1f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x20-0x27 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x28-0x2f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x30-0x37 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x38-0x3f */ 0xbd, 0xc0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x40-0x47 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x48-0x4f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc1, 0xc2, /* 0x50-0x57 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x58-0x5f */ 0xc6, 0xd6, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x60-0x67 */ 0x00, 0x00, 0xda, 0xe0, 0x00, 0x00, 0x00, 0x00, /* 0x68-0x6f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x70-0x77 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x78-0x7f */ 0xfc, 0xfd, 0xfe, 0xff, 0xfa, 0xfb, 0x00, 0x00, /* 0x80-0x87 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x88-0x8f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x90-0x97 */ 0x00, 0x00, 0x00, 0xd7, 0x00, 0x00, 0x00, 0x00, /* 0x98-0x9f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xa0-0xa7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xa8-0xaf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb0-0xb7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb8-0xbf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc0-0xc7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc8-0xcf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd0-0xd7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd8-0xdf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe0-0xe7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe8-0xef */ 0x00, 0x00, 0xe2, 
0xe3, 0x00, 0x00, 0x00, 0x00, /* 0xf0-0xf7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xf8-0xff */ }; static const unsigned char page20[256] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x00-0x07 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x08-0x0f */ 0x00, 0x00, 0x00, 0xd0, 0xd1, 0x00, 0x00, 0x00, /* 0x10-0x17 */ 0xd4, 0xd5, 0x00, 0x00, 0xd2, 0xd3, 0x00, 0x00, /* 0x18-0x1f */ 0xa0, 0x00, 0xa5, 0x00, 0x00, 0x00, 0xc9, 0x00, /* 0x20-0x27 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x28-0x2f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x30-0x37 */ 0x00, 0xdc, 0xdd, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x38-0x3f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x40-0x47 */ 0x00, 0x00, 0xe4, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x48-0x4f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x50-0x57 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x58-0x5f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x60-0x67 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x68-0x6f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x70-0x77 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x78-0x7f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x80-0x87 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x88-0x8f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x90-0x97 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x98-0x9f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xa0-0xa7 */ 0x00, 0x00, 0x00, 0x00, 0xdb, 0x00, 0x00, 0x00, /* 0xa8-0xaf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb0-0xb7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb8-0xbf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc0-0xc7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc8-0xcf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd0-0xd7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd8-0xdf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe0-0xe7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe8-0xef */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xf0-0xf7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xf8-0xff */ }; static const unsigned char page21[256] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x00-0x07 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x08-0x0f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x10-0x17 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x18-0x1f */ 0x00, 0x00, 0xaa, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x20-0x27 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x28-0x2f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x30-0x37 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x38-0x3f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x40-0x47 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x48-0x4f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x50-0x57 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x58-0x5f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x60-0x67 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x68-0x6f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x70-0x77 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x78-0x7f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x80-0x87 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x88-0x8f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x90-0x97 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x98-0x9f */ 0x00, 0x00, 0x00, 0x00, 
0x00, 0x00, 0x00, 0x00, /* 0xa0-0xa7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xa8-0xaf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb0-0xb7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb8-0xbf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc0-0xc7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc8-0xcf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd0-0xd7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd8-0xdf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe0-0xe7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe8-0xef */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xf0-0xf7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xf8-0xff */ }; static const unsigned char page22[256] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x00-0x07 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x08-0x0f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x10-0x17 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x18-0x1f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x20-0x27 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x28-0x2f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x30-0x37 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x38-0x3f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x40-0x47 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x48-0x4f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x50-0x57 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x58-0x5f */ 0xad, 0x00, 0x00, 0x00, 0xb2, 0xb3, 0x00, 0x00, /* 0x60-0x67 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x68-0x6f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x70-0x77 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x78-0x7f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x80-0x87 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x88-0x8f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x90-0x97 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x98-0x9f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xa0-0xa7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xa8-0xaf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb0-0xb7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb8-0xbf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc0-0xc7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc8-0xcf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd0-0xd7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd8-0xdf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe0-0xe7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe8-0xef */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xf0-0xf7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xf8-0xff */ }; static const unsigned char page26[256] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x00-0x07 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x08-0x0f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x10-0x17 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x18-0x1f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x20-0x27 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x28-0x2f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x30-0x37 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x38-0x3f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x40-0x47 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x48-0x4f */ 0x00, 0x00, 0x00, 0x00, 0x00, 
0x00, 0x00, 0x00, /* 0x50-0x57 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x58-0x5f */ 0x00, 0x00, 0x00, 0xf0, 0x00, 0x00, 0x00, 0x00, /* 0x60-0x67 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x68-0x6f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x70-0x77 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x78-0x7f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x80-0x87 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x88-0x8f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x90-0x97 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x98-0x9f */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xa0-0xa7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xa8-0xaf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb0-0xb7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xb8-0xbf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc0-0xc7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xc8-0xcf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd0-0xd7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xd8-0xdf */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe0-0xe7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe8-0xef */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xf0-0xf7 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xf8-0xff */ }; static const unsigned char *const page_uni2charset[256] = { page00, page01, page02, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, page1e, NULL, page20, page21, page22, NULL, NULL, NULL, page26, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, }; static const unsigned char charset2lower[256] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x00-0x07 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x08-0x0f */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x10-0x17 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x18-0x1f */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x20-0x27 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x28-0x2f */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x30-0x37 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x38-0x3f 
*/ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x40-0x47 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x48-0x4f */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x50-0x57 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x58-0x5f */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x60-0x67 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x68-0x6f */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x70-0x77 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x78-0x7f */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x80-0x87 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x88-0x8f */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x90-0x97 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x98-0x9f */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0xa0-0xa7 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0xa8-0xaf */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0xb0-0xb7 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0xb8-0xbf */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0xc0-0xc7 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0xc8-0xcf */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0xd0-0xd7 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0xd8-0xdf */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0xe0-0xe7 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0xe8-0xef */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0xf0-0xf7 */ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0xf8-0xff */ }; static const unsigned char charset2upper[256] = { 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x00-0x07 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x08-0x0f */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x10-0x17 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x18-0x1f */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x20-0x27 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x28-0x2f */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x30-0x37 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x38-0x3f */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x40-0x47 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x48-0x4f */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x50-0x57 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x58-0x5f */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x60-0x67 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x68-0x6f */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x70-0x77 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x78-0x7f */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x80-0x87 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x88-0x8f */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x90-0x97 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0x98-0x9f */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0xa0-0xa7 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0xa8-0xaf */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0xb0-0xb7 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0xb8-0xbf */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0xc0-0xc7 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0xc8-0xcf */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0xd0-0xd7 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0xd8-0xdf */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0xe0-0xe7 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0xe8-0xef */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 
0xfe, /* 0xf0-0xf7 */ 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, 0xfe, /* 0xf8-0xff */ }; static int uni2char(wchar_t uni, unsigned char *out, int boundlen) { const unsigned char *uni2charset; unsigned char cl = uni & 0x00ff; unsigned char ch = (uni & 0xff00) >> 8; if (boundlen <= 0) return -ENAMETOOLONG; uni2charset = page_uni2charset[ch]; if (uni2charset && uni2charset[cl]) out[0] = uni2charset[cl]; else return -EINVAL; return 1; } static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) { *uni = charset2uni[*rawstring]; if (*uni == 0x0000) return -EINVAL; return 1; } static struct nls_table table = { .charset = "macgaelic", .uni2char = uni2char, .char2uni = char2uni, .charset2lower = charset2lower, .charset2upper = charset2upper, }; static int __init init_nls_macgaelic(void) { return register_nls(&table); } static void __exit exit_nls_macgaelic(void) { unregister_nls(&table); } module_init(init_nls_macgaelic) module_exit(exit_nls_macgaelic) MODULE_DESCRIPTION("NLS Codepage macgaelic"); MODULE_LICENSE("Dual BSD/GPL"); |
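/*
 * Illustrative sketch (not part of the NLS codepage source above): how the
 * two-level page_uni2charset lookup in uni2char() resolves a Unicode code
 * point to a single charset byte. The example mapping U+2260 -> 0xad is taken
 * directly from page22[0x60] in the table above; the demo_* names and the
 * tiny excerpt tables are hypothetical stand-ins for demonstration only.
 */
#include <stdio.h>

static const unsigned char demo_page22[256] = { [0x60] = 0xad };            /* excerpt of page22 */
static const unsigned char *const demo_pages[256] = { [0x22] = demo_page22 }; /* excerpt of page_uni2charset */

static int demo_uni2char(unsigned int uni, unsigned char *out)
{
    unsigned char ch = (uni >> 8) & 0xff;   /* high byte selects the page */
    unsigned char cl = uni & 0xff;          /* low byte indexes inside the page */
    const unsigned char *page = demo_pages[ch];

    if (page && page[cl]) {
        *out = page[cl];
        return 1;
    }
    return -1;                              /* the kernel code returns -EINVAL here */
}

int main(void)
{
    unsigned char c;

    if (demo_uni2char(0x2260, &c) == 1)
        printf("U+2260 -> 0x%02x\n", c);    /* prints 0xad */
    return 0;
}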
// SPDX-License-Identifier: GPL-2.0
/*
 *  linux/fs/hpfs/name.c
 *
 *  Mikulas Patocka (mikulas@artax.karlin.mff.cuni.cz), 1998-1999
 *
 *  operations with filenames
 */

#include "hpfs_fn.h"

static inline int not_allowed_char(unsigned char c)
{
    return c<' ' || c=='"' || c=='*' || c=='/' || c==':' || c=='<' ||
           c=='>' || c=='?' || c=='\\' || c=='|';
}

static inline int no_dos_char(unsigned char c)
{   /* Characters that are allowed in HPFS but not in DOS */
    return c=='+' || c==',' || c==';' || c=='=' || c=='[' || c==']';
}

static inline unsigned char upcase(unsigned char *dir, unsigned char a)
{
    if (a<128 || a==255) return a>='a' && a<='z' ? a - 0x20 : a;
    if (!dir) return a;
    return dir[a-128];
}

unsigned char hpfs_upcase(unsigned char *dir, unsigned char a)
{
    return upcase(dir, a);
}

static inline unsigned char locase(unsigned char *dir, unsigned char a)
{
    if (a<128 || a==255) return a>='A' && a<='Z' ? a + 0x20 : a;
    if (!dir) return a;
    return dir[a];
}

int hpfs_chk_name(const unsigned char *name, unsigned *len)
{
    int i;
    if (*len > 254) return -ENAMETOOLONG;
    hpfs_adjust_length(name, len);
    if (!*len) return -EINVAL;
    for (i = 0; i < *len; i++) if (not_allowed_char(name[i])) return -EINVAL;
    if (*len == 1) if (name[0] == '.') return -EINVAL;
    if (*len == 2) if (name[0] == '.' && name[1] == '.') return -EINVAL;
    return 0;
}

unsigned char *hpfs_translate_name(struct super_block *s, unsigned char *from,
                                   unsigned len, int lc, int lng)
{
    unsigned char *to;
    int i;
    if (hpfs_sb(s)->sb_chk >= 2) if (hpfs_is_name_long(from, len) != lng) {
        pr_err("Long name flag mismatch - name ");
        for (i = 0; i < len; i++) pr_cont("%c", from[i]);
        pr_cont(" misidentified as %s.\n", lng ? "short" : "long");
        pr_err("It's nothing serious. It could happen because of bug in OS/2.\nSet checks=normal to disable this message.\n");
    }
    if (!lc) return from;
    if (!(to = kmalloc(len, GFP_KERNEL))) {
        pr_err("can't allocate memory for name conversion buffer\n");
        return from;
    }
    for (i = 0; i < len; i++) to[i] = locase(hpfs_sb(s)->sb_cp_table,from[i]);
    return to;
}

int hpfs_compare_names(struct super_block *s,
                       const unsigned char *n1, unsigned l1,
                       const unsigned char *n2, unsigned l2, int last)
{
    unsigned l = l1 < l2 ? l1 : l2;
    unsigned i;
    if (last) return -1;
    for (i = 0; i < l; i++) {
        unsigned char c1 = upcase(hpfs_sb(s)->sb_cp_table,n1[i]);
        unsigned char c2 = upcase(hpfs_sb(s)->sb_cp_table,n2[i]);
        if (c1 < c2) return -1;
        if (c1 > c2) return 1;
    }
    if (l1 < l2) return -1;
    if (l1 > l2) return 1;
    return 0;
}

int hpfs_is_name_long(const unsigned char *name, unsigned len)
{
    int i,j;
    for (i = 0; i < len && name[i] != '.'; i++)
        if (no_dos_char(name[i])) return 1;
    if (!i || i > 8) return 1;
    if (i == len) return 0;
    for (j = i + 1; j < len; j++)
        if (name[j] == '.' || no_dos_char(name[i])) return 1;
    return j - i > 4;
}

/* OS/2 clears dots and spaces at the end of file name, so we have to */

void hpfs_adjust_length(const unsigned char *name, unsigned *len)
{
    if (!*len) return;
    if (*len == 1 && name[0] == '.') return;
    if (*len == 2 && name[0] == '.' && name[1] == '.') return;
    while (*len && (name[*len - 1] == '.' || name[*len - 1] == ' '))
        (*len)--;
}
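/*
 * Illustrative sketch (not part of fs/hpfs/name.c above): the trimming rule
 * implemented by hpfs_adjust_length(). Trailing dots and spaces are stripped
 * (OS/2 ignores them), while the special names "." and ".." are preserved.
 * This is a userspace re-implementation purely for demonstration.
 */
#include <stdio.h>

static void demo_adjust_length(const unsigned char *name, unsigned *len)
{
    if (!*len)
        return;
    if (*len == 1 && name[0] == '.')
        return;
    if (*len == 2 && name[0] == '.' && name[1] == '.')
        return;
    while (*len && (name[*len - 1] == '.' || name[*len - 1] == ' '))
        (*len)--;
}

int main(void)
{
    unsigned len = 8;

    demo_adjust_length((const unsigned char *)"readme..", &len);
    printf("%u\n", len);    /* prints 6: the two trailing dots are dropped */
    return 0;
}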
// SPDX-License-Identifier: GPL-2.0
/*
 * List pending timers
 *
 * Copyright(C) 2006, Red Hat, Inc., Ingo Molnar
 */
#include <linux/proc_fs.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/sched.h>
#include <linux/seq_file.h>
#include <linux/kallsyms.h>
#include <linux/nmi.h>
#include <linux/uaccess.h>

#include "tick-internal.h"

struct timer_list_iter {
    int cpu;
    bool second_pass;
    u64 now;
};

/*
 * This allows printing both to /proc/timer_list and
 * to the console (on SysRq-Q):
 */
__printf(2, 3)
static void SEQ_printf(struct seq_file *m, const char *fmt, ...)
{ va_list args; va_start(args, fmt); if (m) seq_vprintf(m, fmt, args); else vprintk(fmt, args); va_end(args); } static void print_timer(struct seq_file *m, struct hrtimer *taddr, struct hrtimer *timer, int idx, u64 now) { SEQ_printf(m, " #%d: <%p>, %ps", idx, taddr, ACCESS_PRIVATE(timer, function)); SEQ_printf(m, ", S:%02x", timer->state); SEQ_printf(m, "\n"); SEQ_printf(m, " # expires at %Lu-%Lu nsecs [in %Ld to %Ld nsecs]\n", (unsigned long long)ktime_to_ns(hrtimer_get_softexpires(timer)), (unsigned long long)ktime_to_ns(hrtimer_get_expires(timer)), (long long)(ktime_to_ns(hrtimer_get_softexpires(timer)) - now), (long long)(ktime_to_ns(hrtimer_get_expires(timer)) - now)); } static void print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base, u64 now) { struct hrtimer *timer, tmp; unsigned long next = 0, i; struct timerqueue_node *curr; unsigned long flags; next_one: i = 0; touch_nmi_watchdog(); raw_spin_lock_irqsave(&base->cpu_base->lock, flags); curr = timerqueue_getnext(&base->active); /* * Crude but we have to do this O(N*N) thing, because * we have to unlock the base when printing: */ while (curr && i < next) { curr = timerqueue_iterate_next(curr); i++; } if (curr) { timer = container_of(curr, struct hrtimer, node); tmp = *timer; raw_spin_unlock_irqrestore(&base->cpu_base->lock, flags); print_timer(m, timer, &tmp, i, now); next++; goto next_one; } raw_spin_unlock_irqrestore(&base->cpu_base->lock, flags); } static void print_base(struct seq_file *m, struct hrtimer_clock_base *base, u64 now) { SEQ_printf(m, " .base: %p\n", base); SEQ_printf(m, " .index: %d\n", base->index); SEQ_printf(m, " .resolution: %u nsecs\n", hrtimer_resolution); #ifdef CONFIG_HIGH_RES_TIMERS SEQ_printf(m, " .offset: %Lu nsecs\n", (unsigned long long) ktime_to_ns(base->offset)); #endif SEQ_printf(m, "active timers:\n"); print_active_timers(m, base, now + ktime_to_ns(base->offset)); } static void print_cpu(struct seq_file *m, int cpu, u64 now) { struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu); int i; SEQ_printf(m, "cpu: %d\n", cpu); for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) { SEQ_printf(m, " clock %d:\n", i); print_base(m, cpu_base->clock_base + i, now); } #define P(x) \ SEQ_printf(m, " .%-15s: %Lu\n", #x, \ (unsigned long long)(cpu_base->x)) #define P_ns(x) \ SEQ_printf(m, " .%-15s: %Lu nsecs\n", #x, \ (unsigned long long)(ktime_to_ns(cpu_base->x))) #ifdef CONFIG_HIGH_RES_TIMERS P_ns(expires_next); P(hres_active); P(nr_events); P(nr_retries); P(nr_hangs); P(max_hang_time); #endif #undef P #undef P_ns #ifdef CONFIG_TICK_ONESHOT # define P(x) \ SEQ_printf(m, " .%-15s: %Lu\n", #x, \ (unsigned long long)(ts->x)) # define P_ns(x) \ SEQ_printf(m, " .%-15s: %Lu nsecs\n", #x, \ (unsigned long long)(ktime_to_ns(ts->x))) # define P_flag(x, f) \ SEQ_printf(m, " .%-15s: %d\n", #x, !!(ts->flags & (f))) { struct tick_sched *ts = tick_get_tick_sched(cpu); P_flag(nohz, TS_FLAG_NOHZ); P_flag(highres, TS_FLAG_HIGHRES); P_ns(last_tick); P_flag(tick_stopped, TS_FLAG_STOPPED); P(idle_jiffies); P(idle_calls); P(idle_sleeps); P_ns(idle_entrytime); P_ns(idle_waketime); P_ns(idle_exittime); P_ns(idle_sleeptime); P_ns(iowait_sleeptime); P(last_jiffies); P(next_timer); P_ns(idle_expires); SEQ_printf(m, "jiffies: %Lu\n", (unsigned long long)jiffies); } #endif #undef P #undef P_ns SEQ_printf(m, "\n"); } #ifdef CONFIG_GENERIC_CLOCKEVENTS static void print_tickdevice(struct seq_file *m, struct tick_device *td, int cpu) { struct clock_event_device *dev = td->evtdev; touch_nmi_watchdog(); SEQ_printf(m, "Tick 
Device: mode: %d\n", td->mode); if (cpu < 0) SEQ_printf(m, "Broadcast device\n"); else SEQ_printf(m, "Per CPU device: %d\n", cpu); SEQ_printf(m, "Clock Event Device: "); if (!dev) { SEQ_printf(m, "<NULL>\n"); return; } SEQ_printf(m, "%s\n", dev->name); SEQ_printf(m, " max_delta_ns: %llu\n", (unsigned long long) dev->max_delta_ns); SEQ_printf(m, " min_delta_ns: %llu\n", (unsigned long long) dev->min_delta_ns); SEQ_printf(m, " mult: %u\n", dev->mult); SEQ_printf(m, " shift: %u\n", dev->shift); SEQ_printf(m, " mode: %d\n", clockevent_get_state(dev)); SEQ_printf(m, " next_event: %Ld nsecs\n", (unsigned long long) ktime_to_ns(dev->next_event)); SEQ_printf(m, " set_next_event: %ps\n", dev->set_next_event); if (dev->set_state_shutdown) SEQ_printf(m, " shutdown: %ps\n", dev->set_state_shutdown); if (dev->set_state_periodic) SEQ_printf(m, " periodic: %ps\n", dev->set_state_periodic); if (dev->set_state_oneshot) SEQ_printf(m, " oneshot: %ps\n", dev->set_state_oneshot); if (dev->set_state_oneshot_stopped) SEQ_printf(m, " oneshot stopped: %ps\n", dev->set_state_oneshot_stopped); if (dev->tick_resume) SEQ_printf(m, " resume: %ps\n", dev->tick_resume); SEQ_printf(m, " event_handler: %ps\n", dev->event_handler); SEQ_printf(m, "\n"); SEQ_printf(m, " retries: %lu\n", dev->retries); #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST if (cpu >= 0) { const struct clock_event_device *wd = tick_get_wakeup_device(cpu); SEQ_printf(m, "Wakeup Device: %s\n", wd ? wd->name : "<NULL>"); } #endif SEQ_printf(m, "\n"); } static void timer_list_show_tickdevices_header(struct seq_file *m) { #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST print_tickdevice(m, tick_get_broadcast_device(), -1); SEQ_printf(m, "tick_broadcast_mask: %*pb\n", cpumask_pr_args(tick_get_broadcast_mask())); #ifdef CONFIG_TICK_ONESHOT SEQ_printf(m, "tick_broadcast_oneshot_mask: %*pb\n", cpumask_pr_args(tick_get_broadcast_oneshot_mask())); #endif SEQ_printf(m, "\n"); #endif } #endif static inline void timer_list_header(struct seq_file *m, u64 now) { SEQ_printf(m, "Timer List Version: v0.10\n"); SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES); SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now); SEQ_printf(m, "\n"); } void sysrq_timer_list_show(void) { u64 now = ktime_to_ns(ktime_get()); int cpu; timer_list_header(NULL, now); for_each_online_cpu(cpu) print_cpu(NULL, cpu, now); #ifdef CONFIG_GENERIC_CLOCKEVENTS timer_list_show_tickdevices_header(NULL); for_each_online_cpu(cpu) print_tickdevice(NULL, tick_get_device(cpu), cpu); #endif return; } #ifdef CONFIG_PROC_FS static int timer_list_show(struct seq_file *m, void *v) { struct timer_list_iter *iter = v; if (iter->cpu == -1 && !iter->second_pass) timer_list_header(m, iter->now); else if (!iter->second_pass) print_cpu(m, iter->cpu, iter->now); #ifdef CONFIG_GENERIC_CLOCKEVENTS else if (iter->cpu == -1 && iter->second_pass) timer_list_show_tickdevices_header(m); else print_tickdevice(m, tick_get_device(iter->cpu), iter->cpu); #endif return 0; } static void *move_iter(struct timer_list_iter *iter, loff_t offset) { for (; offset; offset--) { iter->cpu = cpumask_next(iter->cpu, cpu_online_mask); if (iter->cpu >= nr_cpu_ids) { #ifdef CONFIG_GENERIC_CLOCKEVENTS if (!iter->second_pass) { iter->cpu = -1; iter->second_pass = true; } else return NULL; #else return NULL; #endif } } return iter; } static void *timer_list_start(struct seq_file *file, loff_t *offset) { struct timer_list_iter *iter = file->private; if (!*offset) iter->now = ktime_to_ns(ktime_get()); iter->cpu = -1; 
iter->second_pass = false; return move_iter(iter, *offset); } static void *timer_list_next(struct seq_file *file, void *v, loff_t *offset) { struct timer_list_iter *iter = file->private; ++*offset; return move_iter(iter, 1); } static void timer_list_stop(struct seq_file *seq, void *v) { } static const struct seq_operations timer_list_sops = { .start = timer_list_start, .next = timer_list_next, .stop = timer_list_stop, .show = timer_list_show, }; static int __init init_timer_list_procfs(void) { struct proc_dir_entry *pe; pe = proc_create_seq_private("timer_list", 0400, NULL, &timer_list_sops, sizeof(struct timer_list_iter), NULL); if (!pe) return -ENOMEM; return 0; } __initcall(init_timer_list_procfs); #endif |
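/*
 * Illustrative sketch (not part of kernel/time/timer_list.c above): the
 * two-pass cursor that move_iter()/timer_list_start() map onto the seq_file
 * offset -- a header, then one entry per online CPU, then (when tick devices
 * are compiled in) a wrap-around to a second pass over the same CPUs. The
 * fixed CPU count is an assumption for the demo; the kernel uses
 * cpumask_next() over cpu_online_mask.
 */
#include <stdio.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 2

struct demo_iter { int cpu; bool second_pass; };

static struct demo_iter *demo_move(struct demo_iter *it, long offset)
{
    for (; offset; offset--) {
        it->cpu++;                      /* "next online cpu" */
        if (it->cpu >= DEMO_NR_CPUS) {
            if (!it->second_pass) {     /* wrap around exactly once */
                it->cpu = -1;
                it->second_pass = true;
            } else {
                return NULL;            /* iteration finished */
            }
        }
    }
    return it;
}

int main(void)
{
    struct demo_iter it = { .cpu = -1, .second_pass = false };
    long off;

    /* offset 0 is the header entry; every later step advances the cursor by one */
    for (off = 0; demo_move(&it, off ? 1 : 0); off++)
        printf("off=%ld cpu=%d second_pass=%d\n", off, it.cpu, (int)it.second_pass);
    return 0;
}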
// SPDX-License-Identifier: GPL-2.0-only
/*
 *  linux/fs/fat/file.c
 *
 *  Written 1992,1993 by Werner Almesberger
 *
 *  regular file handling primitives for fat-based filesystems
 */

#include <linux/capability.h>
#include <linux/module.h>
#include <linux/compat.h>
#include <linux/mount.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/fsnotify.h>
#include <linux/security.h>
#include <linux/falloc.h>
#include "fat.h"

static long fat_fallocate(struct file *file, int mode,
                          loff_t offset, loff_t len);

static int fat_ioctl_get_attributes(struct inode *inode, u32 __user *user_attr)
{
    u32 attr;

    inode_lock_shared(inode);
    attr = fat_make_attrs(inode);
    inode_unlock_shared(inode);

    return put_user(attr, user_attr);
}

static int fat_ioctl_set_attributes(struct file *file, u32 __user *user_attr)
{
    struct inode *inode = file_inode(file);
    struct msdos_sb_info *sbi = MSDOS_SB(inode->i_sb);
    int is_dir = S_ISDIR(inode->i_mode);
    u32 attr, oldattr;
    struct iattr ia;
    int err;

    err = get_user(attr, user_attr);
    if (err)
        goto out;

    err = mnt_want_write_file(file);
    if (err)
        goto out;
    inode_lock(inode);

    /*
     *
ATTR_VOLUME and ATTR_DIR cannot be changed; this also * prevents the user from turning us into a VFAT * longname entry. Also, we obviously can't set * any of the NTFS attributes in the high 24 bits. */ attr &= 0xff & ~(ATTR_VOLUME | ATTR_DIR); /* Merge in ATTR_VOLUME and ATTR_DIR */ attr |= (MSDOS_I(inode)->i_attrs & ATTR_VOLUME) | (is_dir ? ATTR_DIR : 0); oldattr = fat_make_attrs(inode); /* Equivalent to a chmod() */ ia.ia_valid = ATTR_MODE | ATTR_CTIME; ia.ia_ctime = current_time(inode); if (is_dir) ia.ia_mode = fat_make_mode(sbi, attr, S_IRWXUGO); else { ia.ia_mode = fat_make_mode(sbi, attr, S_IRUGO | S_IWUGO | (inode->i_mode & S_IXUGO)); } /* The root directory has no attributes */ if (inode->i_ino == MSDOS_ROOT_INO && attr != ATTR_DIR) { err = -EINVAL; goto out_unlock_inode; } if (sbi->options.sys_immutable && ((attr | oldattr) & ATTR_SYS) && !capable(CAP_LINUX_IMMUTABLE)) { err = -EPERM; goto out_unlock_inode; } /* * The security check is questionable... We single * out the RO attribute for checking by the security * module, just because it maps to a file mode. */ err = security_inode_setattr(file_mnt_idmap(file), file->f_path.dentry, &ia); if (err) goto out_unlock_inode; /* This MUST be done before doing anything irreversible... */ err = fat_setattr(file_mnt_idmap(file), file->f_path.dentry, &ia); if (err) goto out_unlock_inode; fsnotify_change(file->f_path.dentry, ia.ia_valid); if (sbi->options.sys_immutable) { if (attr & ATTR_SYS) inode->i_flags |= S_IMMUTABLE; else inode->i_flags &= ~S_IMMUTABLE; } fat_save_attrs(inode, attr); mark_inode_dirty(inode); out_unlock_inode: inode_unlock(inode); mnt_drop_write_file(file); out: return err; } static int fat_ioctl_get_volume_id(struct inode *inode, u32 __user *user_attr) { struct msdos_sb_info *sbi = MSDOS_SB(inode->i_sb); return put_user(sbi->vol_id, user_attr); } static int fat_ioctl_fitrim(struct inode *inode, unsigned long arg) { struct super_block *sb = inode->i_sb; struct fstrim_range __user *user_range; struct fstrim_range range; int err; if (!capable(CAP_SYS_ADMIN)) return -EPERM; if (!bdev_max_discard_sectors(sb->s_bdev)) return -EOPNOTSUPP; user_range = (struct fstrim_range __user *)arg; if (copy_from_user(&range, user_range, sizeof(range))) return -EFAULT; range.minlen = max_t(unsigned int, range.minlen, bdev_discard_granularity(sb->s_bdev)); err = fat_trim_fs(inode, &range); if (err < 0) return err; if (copy_to_user(user_range, &range, sizeof(range))) return -EFAULT; return 0; } long fat_generic_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { struct inode *inode = file_inode(filp); u32 __user *user_attr = (u32 __user *)arg; switch (cmd) { case FAT_IOCTL_GET_ATTRIBUTES: return fat_ioctl_get_attributes(inode, user_attr); case FAT_IOCTL_SET_ATTRIBUTES: return fat_ioctl_set_attributes(filp, user_attr); case FAT_IOCTL_GET_VOLUME_ID: return fat_ioctl_get_volume_id(inode, user_attr); case FITRIM: return fat_ioctl_fitrim(inode, arg); default: return -ENOTTY; /* Inappropriate ioctl for device */ } } static int fat_file_release(struct inode *inode, struct file *filp) { if ((filp->f_mode & FMODE_WRITE) && MSDOS_SB(inode->i_sb)->options.flush) { fat_flush_inodes(inode->i_sb, inode, NULL); set_current_state(TASK_UNINTERRUPTIBLE); io_schedule_timeout(HZ/10); } return 0; } int fat_file_fsync(struct file *filp, loff_t start, loff_t end, int datasync) { struct inode *inode = filp->f_mapping->host; int err; err = __generic_file_fsync(filp, start, end, datasync); if (err) return err; err = 
sync_mapping_buffers(MSDOS_SB(inode->i_sb)->fat_inode->i_mapping); if (err) return err; return blkdev_issue_flush(inode->i_sb->s_bdev); } const struct file_operations fat_file_operations = { .llseek = generic_file_llseek, .read_iter = generic_file_read_iter, .write_iter = generic_file_write_iter, .mmap_prepare = generic_file_mmap_prepare, .release = fat_file_release, .unlocked_ioctl = fat_generic_ioctl, .compat_ioctl = compat_ptr_ioctl, .fsync = fat_file_fsync, .splice_read = filemap_splice_read, .splice_write = iter_file_splice_write, .fallocate = fat_fallocate, }; static int fat_cont_expand(struct inode *inode, loff_t size) { struct address_space *mapping = inode->i_mapping; loff_t start = inode->i_size, count = size - inode->i_size; int err; err = generic_cont_expand_simple(inode, size); if (err) goto out; fat_truncate_time(inode, NULL, S_CTIME|S_MTIME); mark_inode_dirty(inode); if (IS_SYNC(inode)) { int err2; /* * Opencode syncing since we don't have a file open to use * standard fsync path. */ err = filemap_fdatawrite_range(mapping, start, start + count - 1); err2 = sync_mapping_buffers(mapping); if (!err) err = err2; err2 = write_inode_now(inode, 1); if (!err) err = err2; if (!err) { err = filemap_fdatawait_range(mapping, start, start + count - 1); } } out: return err; } /* * Preallocate space for a file. This implements fat's fallocate file * operation, which gets called from sys_fallocate system call. User * space requests len bytes at offset. If FALLOC_FL_KEEP_SIZE is set * we just allocate clusters without zeroing them out. Otherwise we * allocate and zero out clusters via an expanding truncate. */ static long fat_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { int nr_cluster; /* Number of clusters to be allocated */ loff_t mm_bytes; /* Number of bytes to be allocated for file */ loff_t ondisksize; /* block aligned on-disk size in bytes*/ struct inode *inode = file->f_mapping->host; struct super_block *sb = inode->i_sb; struct msdos_sb_info *sbi = MSDOS_SB(sb); int err = 0; /* No support for hole punch or other fallocate flags. */ if (mode & ~FALLOC_FL_KEEP_SIZE) return -EOPNOTSUPP; /* No support for dir */ if (!S_ISREG(inode->i_mode)) return -EOPNOTSUPP; inode_lock(inode); if (mode & FALLOC_FL_KEEP_SIZE) { ondisksize = inode->i_blocks << 9; if ((offset + len) <= ondisksize) goto error; /* First compute the number of clusters to be allocated */ mm_bytes = offset + len - ondisksize; nr_cluster = (mm_bytes + (sbi->cluster_size - 1)) >> sbi->cluster_bits; /* Start the allocation.We are not zeroing out the clusters */ while (nr_cluster-- > 0) { err = fat_add_cluster(inode); if (err) goto error; } } else { if ((offset + len) <= i_size_read(inode)) goto error; /* This is just an expanding truncate */ err = fat_cont_expand(inode, (offset + len)); } error: inode_unlock(inode); return err; } /* Free all clusters after the skip'th cluster. */ static int fat_free(struct inode *inode, int skip) { struct super_block *sb = inode->i_sb; int err, wait, free_start, i_start, i_logstart; if (MSDOS_I(inode)->i_start == 0) return 0; fat_cache_inval_inode(inode); wait = IS_DIRSYNC(inode); i_start = free_start = MSDOS_I(inode)->i_start; i_logstart = MSDOS_I(inode)->i_logstart; /* First, we write the new file size. 
*/ if (!skip) { MSDOS_I(inode)->i_start = 0; MSDOS_I(inode)->i_logstart = 0; } MSDOS_I(inode)->i_attrs |= ATTR_ARCH; fat_truncate_time(inode, NULL, S_CTIME|S_MTIME); if (wait) { err = fat_sync_inode(inode); if (err) { MSDOS_I(inode)->i_start = i_start; MSDOS_I(inode)->i_logstart = i_logstart; return err; } } else mark_inode_dirty(inode); /* Write a new EOF, and get the remaining cluster chain for freeing. */ if (skip) { struct fat_entry fatent; int ret, fclus, dclus; ret = fat_get_cluster(inode, skip - 1, &fclus, &dclus); if (ret < 0) return ret; else if (ret == FAT_ENT_EOF) return 0; fatent_init(&fatent); ret = fat_ent_read(inode, &fatent, dclus); if (ret == FAT_ENT_EOF) { fatent_brelse(&fatent); return 0; } else if (ret == FAT_ENT_FREE) { fat_fs_error(sb, "%s: invalid cluster chain (i_pos %lld)", __func__, MSDOS_I(inode)->i_pos); ret = -EIO; } else if (ret > 0) { err = fat_ent_write(inode, &fatent, FAT_ENT_EOF, wait); if (err) ret = err; } fatent_brelse(&fatent); if (ret < 0) return ret; free_start = ret; } inode->i_blocks = skip << (MSDOS_SB(sb)->cluster_bits - 9); /* Freeing the remained cluster chain */ return fat_free_clusters(inode, free_start); } void fat_truncate_blocks(struct inode *inode, loff_t offset) { struct msdos_sb_info *sbi = MSDOS_SB(inode->i_sb); const unsigned int cluster_size = sbi->cluster_size; int nr_clusters; /* * This protects against truncating a file bigger than it was then * trying to write into the hole. */ if (MSDOS_I(inode)->mmu_private > offset) MSDOS_I(inode)->mmu_private = offset; nr_clusters = (offset + (cluster_size - 1)) >> sbi->cluster_bits; fat_free(inode, nr_clusters); fat_flush_inodes(inode->i_sb, inode, NULL); } int fat_getattr(struct mnt_idmap *idmap, const struct path *path, struct kstat *stat, u32 request_mask, unsigned int flags) { struct inode *inode = d_inode(path->dentry); struct msdos_sb_info *sbi = MSDOS_SB(inode->i_sb); generic_fillattr(idmap, request_mask, inode, stat); stat->blksize = sbi->cluster_size; if (sbi->options.nfs == FAT_NFS_NOSTALE_RO) { /* Use i_pos for ino. This is used as fileid of nfs. */ stat->ino = fat_i_pos_read(sbi, inode); } if (sbi->options.isvfat && request_mask & STATX_BTIME) { stat->result_mask |= STATX_BTIME; stat->btime = MSDOS_I(inode)->i_crtime; } return 0; } EXPORT_SYMBOL_GPL(fat_getattr); static int fat_sanitize_mode(const struct msdos_sb_info *sbi, struct inode *inode, umode_t *mode_ptr) { umode_t mask, perm; /* * Note, the basic check is already done by a caller of * (attr->ia_mode & ~FAT_VALID_MODE) */ if (S_ISREG(inode->i_mode)) mask = sbi->options.fs_fmask; else mask = sbi->options.fs_dmask; perm = *mode_ptr & ~(S_IFMT | mask); /* * Of the r and x bits, all (subject to umask) must be present. Of the * w bits, either all (subject to umask) or none must be present. * * If fat_mode_can_hold_ro(inode) is false, can't change w bits. 
*/ if ((perm & (S_IRUGO | S_IXUGO)) != (inode->i_mode & (S_IRUGO|S_IXUGO))) return -EPERM; if (fat_mode_can_hold_ro(inode)) { if ((perm & S_IWUGO) && ((perm & S_IWUGO) != (S_IWUGO & ~mask))) return -EPERM; } else { if ((perm & S_IWUGO) != (S_IWUGO & ~mask)) return -EPERM; } *mode_ptr &= S_IFMT | perm; return 0; } static int fat_allow_set_time(struct mnt_idmap *idmap, struct msdos_sb_info *sbi, struct inode *inode) { umode_t allow_utime = sbi->options.allow_utime; if (!vfsuid_eq_kuid(i_uid_into_vfsuid(idmap, inode), current_fsuid())) { if (vfsgid_in_group_p(i_gid_into_vfsgid(idmap, inode))) allow_utime >>= 3; if (allow_utime & MAY_WRITE) return 1; } /* use a default check */ return 0; } #define TIMES_SET_FLAGS (ATTR_MTIME_SET | ATTR_ATIME_SET | ATTR_TIMES_SET) /* valid file mode bits */ #define FAT_VALID_MODE (S_IFREG | S_IFDIR | S_IRWXUGO) int fat_setattr(struct mnt_idmap *idmap, struct dentry *dentry, struct iattr *attr) { struct msdos_sb_info *sbi = MSDOS_SB(dentry->d_sb); struct inode *inode = d_inode(dentry); unsigned int ia_valid; int error; /* Check for setting the inode time. */ ia_valid = attr->ia_valid; if (ia_valid & TIMES_SET_FLAGS) { if (fat_allow_set_time(idmap, sbi, inode)) attr->ia_valid &= ~TIMES_SET_FLAGS; } error = setattr_prepare(idmap, dentry, attr); attr->ia_valid = ia_valid; if (error) { if (sbi->options.quiet) error = 0; goto out; } /* * Expand the file. Since inode_setattr() updates ->i_size * before calling the ->truncate(), but FAT needs to fill the * hole before it. XXX: this is no longer true with new truncate * sequence. */ if (attr->ia_valid & ATTR_SIZE) { inode_dio_wait(inode); if (attr->ia_size > inode->i_size) { error = fat_cont_expand(inode, attr->ia_size); if (error || attr->ia_valid == ATTR_SIZE) goto out; attr->ia_valid &= ~ATTR_SIZE; } } if (((attr->ia_valid & ATTR_UID) && (!uid_eq(from_vfsuid(idmap, i_user_ns(inode), attr->ia_vfsuid), sbi->options.fs_uid))) || ((attr->ia_valid & ATTR_GID) && (!gid_eq(from_vfsgid(idmap, i_user_ns(inode), attr->ia_vfsgid), sbi->options.fs_gid))) || ((attr->ia_valid & ATTR_MODE) && (attr->ia_mode & ~FAT_VALID_MODE))) error = -EPERM; if (error) { if (sbi->options.quiet) error = 0; goto out; } /* * We don't return -EPERM here. Yes, strange, but this is too * old behavior. */ if (attr->ia_valid & ATTR_MODE) { if (fat_sanitize_mode(sbi, inode, &attr->ia_mode) < 0) attr->ia_valid &= ~ATTR_MODE; } if (attr->ia_valid & ATTR_SIZE) { error = fat_block_truncate_page(inode, attr->ia_size); if (error) goto out; down_write(&MSDOS_I(inode)->truncate_lock); truncate_setsize(inode, attr->ia_size); fat_truncate_blocks(inode, attr->ia_size); up_write(&MSDOS_I(inode)->truncate_lock); } /* * setattr_copy can't truncate these appropriately, so we'll * copy them ourselves */ if (attr->ia_valid & ATTR_ATIME) fat_truncate_time(inode, &attr->ia_atime, S_ATIME); if (attr->ia_valid & ATTR_CTIME) fat_truncate_time(inode, &attr->ia_ctime, S_CTIME); if (attr->ia_valid & ATTR_MTIME) fat_truncate_time(inode, &attr->ia_mtime, S_MTIME); attr->ia_valid &= ~(ATTR_ATIME|ATTR_CTIME|ATTR_MTIME); setattr_copy(idmap, inode, attr); mark_inode_dirty(inode); out: return error; } EXPORT_SYMBOL_GPL(fat_setattr); const struct inode_operations fat_file_inode_operations = { .setattr = fat_setattr, .getattr = fat_getattr, .update_time = fat_update_time, }; |
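/*
 * Illustrative sketch (not part of fs/fat/file.c above): the cluster
 * round-up used by fat_fallocate() when FALLOC_FL_KEEP_SIZE is set. A byte
 * count is converted into the number of whole clusters to allocate. The
 * 4 KiB cluster size (cluster_bits = 12) is a hypothetical example value,
 * not something the file above mandates.
 */
#include <stdio.h>

int main(void)
{
    unsigned int cluster_bits = 12;                       /* example: 4096-byte clusters */
    unsigned long cluster_size = 1UL << cluster_bits;
    unsigned long mm_bytes = 10000;                       /* bytes still to be allocated */
    unsigned long nr_cluster = (mm_bytes + (cluster_size - 1)) >> cluster_bits;

    printf("%lu clusters\n", nr_cluster);                 /* prints 3 (round up, not down) */
    return 0;
}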
/* SPDX-License-Identifier: GPL-2.0-only */
/*
 *  linux/fs/pnode.h
 *
 * (C) Copyright IBM Corporation 2005.
 */
#ifndef _LINUX_PNODE_H
#define _LINUX_PNODE_H

#include <linux/list.h>
#include "mount.h"

#define IS_MNT_SHARED(m) ((m)->mnt_t_flags & T_SHARED)
#define IS_MNT_SLAVE(m) ((m)->mnt_master)
#define IS_MNT_NEW(m) (!(m)->mnt_ns)
#define CLEAR_MNT_SHARED(m) ((m)->mnt_t_flags &= ~T_SHARED)
#define IS_MNT_UNBINDABLE(m) ((m)->mnt_t_flags & T_UNBINDABLE)
#define IS_MNT_MARKED(m) ((m)->mnt_t_flags & T_MARKED)
#define SET_MNT_MARK(m) ((m)->mnt_t_flags |= T_MARKED)
#define CLEAR_MNT_MARK(m) ((m)->mnt_t_flags &= ~T_MARKED)
#define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED)

#define CL_EXPIRE            0x01
#define CL_SLAVE             0x02
#define CL_COPY_UNBINDABLE   0x04
#define CL_MAKE_SHARED       0x08
#define CL_PRIVATE           0x10
#define CL_COPY_MNT_NS_FILE  0x40

/*
 * EXCL[namespace_sem]
 */
static inline void set_mnt_shared(struct mount *mnt)
{
    mnt->mnt_t_flags &= ~T_SHARED_MASK;
    mnt->mnt_t_flags |= T_SHARED;
}

static inline bool peers(const struct mount *m1, const struct mount *m2)
{
    return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
}

void change_mnt_propagation(struct mount *, int);
void bulk_make_private(struct list_head *);
int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
                  struct hlist_head *);
void propagate_umount(struct list_head *);
int propagate_mount_busy(struct mount *, int);
void propagate_mount_unlock(struct mount *);
void mnt_release_group_id(struct mount *);
int get_dominating_id(struct mount *mnt, const struct path *root);
int mnt_get_count(struct mount *mnt);
void mnt_set_mountpoint(struct mount *, struct mountpoint *, struct mount *);
void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp,
                           struct mount *mnt);
struct mount *copy_tree(struct mount *, struct dentry *, int);
bool is_path_reachable(struct mount *, struct dentry *,
                       const struct path *root);
int count_mounts(struct mnt_namespace *ns, struct mount *mnt);
bool propagation_would_overmount(const struct mount *from,
                                 const struct mount *to,
                                 const struct mountpoint *mp);
#endif /* _LINUX_PNODE_H */
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * Flexible mmap layout support
 *
 * Based on code by Ingo Molnar and Andi Kleen, copyrighted
 * as follows:
 *
 * Copyright 2003-2009 Red Hat Inc.
 * All Rights Reserved.
 * Copyright 2005 Andi Kleen, SUSE Labs.
 * Copyright 2007 Jiri Kosina, SUSE Labs.
 */

#include <linux/personality.h>
#include <linux/mm.h>
#include <linux/random.h>
#include <linux/limits.h>
#include <linux/sched/signal.h>
#include <linux/sched/mm.h>
#include <linux/compat.h>
#include <linux/elf-randomize.h>
#include <asm/elf.h>
#include <asm/io.h>

#include "physaddr.h"

struct va_alignment __read_mostly va_align = {
    .flags = -1,
};

unsigned long task_size_32bit(void)
{
    return IA32_PAGE_OFFSET;
}

unsigned long task_size_64bit(int full_addr_space)
{
    return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
}

static unsigned long stack_maxrandom_size(unsigned long task_size)
{
    unsigned long max = 0;
    if (current->flags & PF_RANDOMIZE) {
        max = (-1UL) & __STACK_RND_MASK(task_size == task_size_32bit());
        max <<= PAGE_SHIFT;
    }

    return max;
}

#ifdef CONFIG_COMPAT
# define mmap32_rnd_bits  mmap_rnd_compat_bits
# define mmap64_rnd_bits  mmap_rnd_bits
#else
# define mmap32_rnd_bits  mmap_rnd_bits
# define mmap64_rnd_bits  mmap_rnd_bits
#endif

#define SIZE_128M    (128 * 1024 * 1024UL)

static int mmap_is_legacy(void)
{
    if (current->personality & ADDR_COMPAT_LAYOUT)
        return 1;

    return sysctl_legacy_va_layout;
}

static unsigned long arch_rnd(unsigned int rndbits)
{
    if (!(current->flags & PF_RANDOMIZE))
        return 0;
    return (get_random_long() & ((1UL << rndbits) - 1)) << PAGE_SHIFT;
}

unsigned long arch_mmap_rnd(void)
{
    return arch_rnd(mmap_is_ia32() ? mmap32_rnd_bits : mmap64_rnd_bits);
}

static unsigned long mmap_base(unsigned long rnd, unsigned long task_size,
                               const struct rlimit *rlim_stack)
{
    unsigned long gap = rlim_stack->rlim_cur;
    unsigned long pad = stack_maxrandom_size(task_size) + stack_guard_gap;

    /* Values close to RLIM_INFINITY can overflow. */
    if (gap + pad > gap)
        gap += pad;

    /*
     * Top of mmap area (just below the process stack).
     * Leave an at least ~128 MB hole with possible stack randomization.
*/ gap = clamp(gap, SIZE_128M, (task_size / 6) * 5); return PAGE_ALIGN(task_size - gap - rnd); } static unsigned long mmap_legacy_base(unsigned long rnd, unsigned long task_size) { return __TASK_UNMAPPED_BASE(task_size) + rnd; } /* * This function, called very early during the creation of a new * process VM image, sets up which VM layout function to use: */ static void arch_pick_mmap_base(unsigned long *base, unsigned long *legacy_base, unsigned long random_factor, unsigned long task_size, const struct rlimit *rlim_stack) { *legacy_base = mmap_legacy_base(random_factor, task_size); if (mmap_is_legacy()) *base = *legacy_base; else *base = mmap_base(random_factor, task_size, rlim_stack); } void arch_pick_mmap_layout(struct mm_struct *mm, const struct rlimit *rlim_stack) { if (mmap_is_legacy()) mm_flags_clear(MMF_TOPDOWN, mm); else mm_flags_set(MMF_TOPDOWN, mm); arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base, arch_rnd(mmap64_rnd_bits), task_size_64bit(0), rlim_stack); #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES /* * The mmap syscall mapping base decision depends solely on the * syscall type (64-bit or compat). This applies for 64bit * applications and 32bit applications. The 64bit syscall uses * mmap_base, the compat syscall uses mmap_compat_base. */ arch_pick_mmap_base(&mm->mmap_compat_base, &mm->mmap_compat_legacy_base, arch_rnd(mmap32_rnd_bits), task_size_32bit(), rlim_stack); #endif } unsigned long get_mmap_base(int is_legacy) { struct mm_struct *mm = current->mm; #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES if (in_32bit_syscall()) { return is_legacy ? mm->mmap_compat_legacy_base : mm->mmap_compat_base; } #endif return is_legacy ? mm->mmap_legacy_base : mm->mmap_base; } /** * mmap_address_hint_valid - Validate the address hint of mmap * @addr: Address hint * @len: Mapping length * * Check whether @addr and @addr + @len result in a valid mapping. * * On 32bit this only checks whether @addr + @len is <= TASK_SIZE. * * On 64bit with 5-level page tables another sanity check is required * because mappings requested by mmap(@addr, 0) which cross the 47-bit * virtual address boundary can cause the following theoretical issue: * * An application calls mmap(addr, 0), i.e. without MAP_FIXED, where @addr * is below the border of the 47-bit address space and @addr + @len is * above the border. * * With 4-level paging this request succeeds, but the resulting mapping * address will always be within the 47-bit virtual address space, because * the hint address does not result in a valid mapping and is * ignored. Hence applications which are not prepared to handle virtual * addresses above 47-bit work correctly. * * With 5-level paging this request would be granted and result in a * mapping which crosses the border of the 47-bit virtual address * space. If the application cannot handle addresses above 47-bit this * will lead to misbehaviour and hard to diagnose failures. * * Therefore ignore address hints which would result in a mapping crossing * the 47-bit virtual address boundary. * * Note, that in the same scenario with MAP_FIXED the behaviour is * different. The request with @addr < 47-bit and @addr + @len > 47-bit * fails on a 4-level paging machine but succeeds on a 5-level paging * machine. It is reasonable to expect that an application does not rely on * the failure of such a fixed mapping request, so the restriction is not * applied. 
*/ bool mmap_address_hint_valid(unsigned long addr, unsigned long len) { if (TASK_SIZE - len < addr) return false; return (addr > DEFAULT_MAP_WINDOW) == (addr + len > DEFAULT_MAP_WINDOW); } /* Can we access it for direct reading/writing? Must be RAM: */ int valid_phys_addr_range(phys_addr_t addr, size_t count) { return addr + count - 1 <= __pa(high_memory - 1); } /* Can we access it through mmap? Must be a valid physical address: */ int valid_mmap_phys_addr_range(unsigned long pfn, size_t count) { phys_addr_t addr = (phys_addr_t)pfn << PAGE_SHIFT; return phys_addr_valid(addr + count - 1); } /* * Only allow root to set high MMIO mappings to PROT_NONE. * This prevents an unpriv. user to set them to PROT_NONE and invert * them, then pointing to valid memory for L1TF speculation. * * Note: for locked down kernels may want to disable the root override. */ bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot) { if (!boot_cpu_has_bug(X86_BUG_L1TF)) return true; if (!__pte_needs_invert(pgprot_val(prot))) return true; /* If it's real memory always allow */ if (pfn_valid(pfn)) return true; if (pfn >= l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN)) return false; return true; } |
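/*
 * Illustrative sketch (not part of arch/x86/mm/mmap.c above): the rule that
 * mmap_address_hint_valid() enforces -- a non-MAP_FIXED hint is only honoured
 * if the hinted range stays entirely on one side of the DEFAULT_MAP_WINDOW
 * (47-bit) boundary. The window value below mirrors the usual x86-64
 * definition ((1UL << 47) - PAGE_SIZE) but is hard-coded here only for the
 * demo; the extra TASK_SIZE check done by the real function is omitted.
 */
#include <stdio.h>
#include <stdbool.h>

#define DEMO_PAGE_SIZE   4096UL
#define DEMO_MAP_WINDOW  ((1UL << 47) - DEMO_PAGE_SIZE)

static bool demo_hint_valid(unsigned long addr, unsigned long len)
{
    /* Both ends must be on the same side of the 47-bit boundary. */
    return (addr > DEMO_MAP_WINDOW) == (addr + len > DEMO_MAP_WINDOW);
}

int main(void)
{
    /* Entirely below the boundary: the hint may be used -> prints 1. */
    printf("%d\n", demo_hint_valid(DEMO_MAP_WINDOW - 2 * DEMO_PAGE_SIZE, DEMO_PAGE_SIZE));
    /* Straddles the boundary: the hint is ignored -> prints 0. */
    printf("%d\n", demo_hint_valid(DEMO_MAP_WINDOW - DEMO_PAGE_SIZE, 4 * DEMO_PAGE_SIZE));
    return 0;
}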
// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright 2006, Johannes Berg <johannes@sipsolutions.net>
 */

/* just for IFNAMSIZ */
#include <linux/if.h>
#include <linux/slab.h>
#include <linux/export.h>
#include "led.h"

void ieee80211_led_assoc(struct ieee80211_local *local, bool associated)
{
    if (!atomic_read(&local->assoc_led_active))
        return;
    if (associated)
        led_trigger_event(&local->assoc_led, LED_FULL);
    else
        led_trigger_event(&local->assoc_led, LED_OFF);
}

void ieee80211_led_radio(struct ieee80211_local *local, bool enabled)
{
    if (!atomic_read(&local->radio_led_active))
        return;
    if (enabled)
        led_trigger_event(&local->radio_led, LED_FULL);
    else
        led_trigger_event(&local->radio_led, LED_OFF);
}

void ieee80211_alloc_led_names(struct ieee80211_local *local)
{
    local->rx_led.name = kasprintf(GFP_KERNEL, "%srx",
                                   wiphy_name(local->hw.wiphy));
    local->tx_led.name = kasprintf(GFP_KERNEL, "%stx",
                                   wiphy_name(local->hw.wiphy));
    local->assoc_led.name = kasprintf(GFP_KERNEL, "%sassoc",
                                      wiphy_name(local->hw.wiphy));
    local->radio_led.name = kasprintf(GFP_KERNEL, "%sradio",
                                      wiphy_name(local->hw.wiphy));
}

void ieee80211_free_led_names(struct ieee80211_local *local)
{
    kfree(local->rx_led.name);
    kfree(local->tx_led.name);
    kfree(local->assoc_led.name);
    kfree(local->radio_led.name);
}

static int ieee80211_tx_led_activate(struct led_classdev *led_cdev)
{
    struct ieee80211_local *local = container_of(led_cdev->trigger,
                                                 struct ieee80211_local,
                                                 tx_led);

    atomic_inc(&local->tx_led_active);

    return 0;
}

static void ieee80211_tx_led_deactivate(struct led_classdev *led_cdev)
{
    struct ieee80211_local *local = container_of(led_cdev->trigger,
                                                 struct ieee80211_local,
                                                 tx_led);

    atomic_dec(&local->tx_led_active);
}

static int ieee80211_rx_led_activate(struct led_classdev *led_cdev)
{
    struct ieee80211_local *local = container_of(led_cdev->trigger,
                                                 struct ieee80211_local,
                                                 rx_led);

    atomic_inc(&local->rx_led_active);

    return 0;
}

static void ieee80211_rx_led_deactivate(struct led_classdev *led_cdev)
{
    struct ieee80211_local *local = container_of(led_cdev->trigger,
                                                 struct ieee80211_local,
                                                 rx_led);

    atomic_dec(&local->rx_led_active);
}

static int
ieee80211_assoc_led_activate(struct led_classdev *led_cdev) { struct ieee80211_local *local = container_of(led_cdev->trigger, struct ieee80211_local, assoc_led); atomic_inc(&local->assoc_led_active); return 0; } static void ieee80211_assoc_led_deactivate(struct led_classdev *led_cdev) { struct ieee80211_local *local = container_of(led_cdev->trigger, struct ieee80211_local, assoc_led); atomic_dec(&local->assoc_led_active); } static int ieee80211_radio_led_activate(struct led_classdev *led_cdev) { struct ieee80211_local *local = container_of(led_cdev->trigger, struct ieee80211_local, radio_led); atomic_inc(&local->radio_led_active); return 0; } static void ieee80211_radio_led_deactivate(struct led_classdev *led_cdev) { struct ieee80211_local *local = container_of(led_cdev->trigger, struct ieee80211_local, radio_led); atomic_dec(&local->radio_led_active); } static int ieee80211_tpt_led_activate(struct led_classdev *led_cdev) { struct ieee80211_local *local = container_of(led_cdev->trigger, struct ieee80211_local, tpt_led); atomic_inc(&local->tpt_led_active); return 0; } static void ieee80211_tpt_led_deactivate(struct led_classdev *led_cdev) { struct ieee80211_local *local = container_of(led_cdev->trigger, struct ieee80211_local, tpt_led); atomic_dec(&local->tpt_led_active); } void ieee80211_led_init(struct ieee80211_local *local) { atomic_set(&local->rx_led_active, 0); local->rx_led.activate = ieee80211_rx_led_activate; local->rx_led.deactivate = ieee80211_rx_led_deactivate; if (local->rx_led.name && led_trigger_register(&local->rx_led)) { kfree(local->rx_led.name); local->rx_led.name = NULL; } atomic_set(&local->tx_led_active, 0); local->tx_led.activate = ieee80211_tx_led_activate; local->tx_led.deactivate = ieee80211_tx_led_deactivate; if (local->tx_led.name && led_trigger_register(&local->tx_led)) { kfree(local->tx_led.name); local->tx_led.name = NULL; } atomic_set(&local->assoc_led_active, 0); local->assoc_led.activate = ieee80211_assoc_led_activate; local->assoc_led.deactivate = ieee80211_assoc_led_deactivate; if (local->assoc_led.name && led_trigger_register(&local->assoc_led)) { kfree(local->assoc_led.name); local->assoc_led.name = NULL; } atomic_set(&local->radio_led_active, 0); local->radio_led.activate = ieee80211_radio_led_activate; local->radio_led.deactivate = ieee80211_radio_led_deactivate; if (local->radio_led.name && led_trigger_register(&local->radio_led)) { kfree(local->radio_led.name); local->radio_led.name = NULL; } atomic_set(&local->tpt_led_active, 0); if (local->tpt_led_trigger) { local->tpt_led.activate = ieee80211_tpt_led_activate; local->tpt_led.deactivate = ieee80211_tpt_led_deactivate; if (led_trigger_register(&local->tpt_led)) { kfree(local->tpt_led_trigger); local->tpt_led_trigger = NULL; } } } void ieee80211_led_exit(struct ieee80211_local *local) { if (local->radio_led.name) led_trigger_unregister(&local->radio_led); if (local->assoc_led.name) led_trigger_unregister(&local->assoc_led); if (local->tx_led.name) led_trigger_unregister(&local->tx_led); if (local->rx_led.name) led_trigger_unregister(&local->rx_led); if (local->tpt_led_trigger) { led_trigger_unregister(&local->tpt_led); kfree(local->tpt_led_trigger); } } const char *__ieee80211_get_radio_led_name(struct ieee80211_hw *hw) { struct ieee80211_local *local = hw_to_local(hw); return local->radio_led.name; } EXPORT_SYMBOL(__ieee80211_get_radio_led_name); const char *__ieee80211_get_assoc_led_name(struct ieee80211_hw *hw) { struct ieee80211_local *local = hw_to_local(hw); return local->assoc_led.name; } 
EXPORT_SYMBOL(__ieee80211_get_assoc_led_name); const char *__ieee80211_get_tx_led_name(struct ieee80211_hw *hw) { struct ieee80211_local *local = hw_to_local(hw); return local->tx_led.name; } EXPORT_SYMBOL(__ieee80211_get_tx_led_name); const char *__ieee80211_get_rx_led_name(struct ieee80211_hw *hw) { struct ieee80211_local *local = hw_to_local(hw); return local->rx_led.name; } EXPORT_SYMBOL(__ieee80211_get_rx_led_name); static unsigned long tpt_trig_traffic(struct ieee80211_local *local, struct tpt_led_trigger *tpt_trig) { unsigned long traffic, delta; traffic = tpt_trig->tx_bytes + tpt_trig->rx_bytes; delta = traffic - tpt_trig->prev_traffic; tpt_trig->prev_traffic = traffic; return DIV_ROUND_UP(delta, 1024 / 8); } static void tpt_trig_timer(struct timer_list *t) { struct tpt_led_trigger *tpt_trig = timer_container_of(tpt_trig, t, timer); struct ieee80211_local *local = tpt_trig->local; unsigned long on, off, tpt; int i; if (!tpt_trig->running) return; mod_timer(&tpt_trig->timer, round_jiffies(jiffies + HZ)); tpt = tpt_trig_traffic(local, tpt_trig); /* default to just solid on */ on = 1; off = 0; for (i = tpt_trig->blink_table_len - 1; i >= 0; i--) { if (tpt_trig->blink_table[i].throughput < 0 || tpt > tpt_trig->blink_table[i].throughput) { off = tpt_trig->blink_table[i].blink_time / 2; on = tpt_trig->blink_table[i].blink_time - off; break; } } led_trigger_blink(&local->tpt_led, on, off); } const char * __ieee80211_create_tpt_led_trigger(struct ieee80211_hw *hw, unsigned int flags, const struct ieee80211_tpt_blink *blink_table, unsigned int blink_table_len) { struct ieee80211_local *local = hw_to_local(hw); struct tpt_led_trigger *tpt_trig; if (WARN_ON(local->tpt_led_trigger)) return NULL; tpt_trig = kzalloc(sizeof(struct tpt_led_trigger), GFP_KERNEL); if (!tpt_trig) return NULL; snprintf(tpt_trig->name, sizeof(tpt_trig->name), "%stpt", wiphy_name(local->hw.wiphy)); local->tpt_led.name = tpt_trig->name; tpt_trig->blink_table = blink_table; tpt_trig->blink_table_len = blink_table_len; tpt_trig->want = flags; tpt_trig->local = local; timer_setup(&tpt_trig->timer, tpt_trig_timer, 0); local->tpt_led_trigger = tpt_trig; return tpt_trig->name; } EXPORT_SYMBOL(__ieee80211_create_tpt_led_trigger); static void ieee80211_start_tpt_led_trig(struct ieee80211_local *local) { struct tpt_led_trigger *tpt_trig = local->tpt_led_trigger; if (tpt_trig->running) return; /* reset traffic */ tpt_trig_traffic(local, tpt_trig); tpt_trig->running = true; tpt_trig_timer(&tpt_trig->timer); mod_timer(&tpt_trig->timer, round_jiffies(jiffies + HZ)); } static void ieee80211_stop_tpt_led_trig(struct ieee80211_local *local) { struct tpt_led_trigger *tpt_trig = local->tpt_led_trigger; if (!tpt_trig->running) return; tpt_trig->running = false; timer_delete_sync(&tpt_trig->timer); led_trigger_event(&local->tpt_led, LED_OFF); } void ieee80211_mod_tpt_led_trig(struct ieee80211_local *local, unsigned int types_on, unsigned int types_off) { struct tpt_led_trigger *tpt_trig = local->tpt_led_trigger; bool allowed; WARN_ON(types_on & types_off); if (!tpt_trig) return; tpt_trig->active &= ~types_off; tpt_trig->active |= types_on; /* * Regardless of wanted state, we shouldn't blink when * the radio is disabled -- this can happen due to some * code ordering issues with __ieee80211_recalc_idle() * being called before the radio is started. */ allowed = tpt_trig->active & IEEE80211_TPT_LEDTRIG_FL_RADIO; if (!allowed || !(tpt_trig->active & tpt_trig->want)) ieee80211_stop_tpt_led_trig(local); else ieee80211_start_tpt_led_trig(local); } |
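/*
 * Illustrative sketch (not part of net/mac80211/led.c above): the unit
 * conversion tpt_trig_traffic() performs. The byte count accumulated over
 * the roughly one-second timer period is turned into kbit by dividing by
 * 1024/8 = 128 and rounding up, so it can be compared against the
 * blink_table thresholds. The traffic figure below is an arbitrary example.
 */
#include <stdio.h>

#define DEMO_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
    unsigned long bytes_this_period = 300000;   /* ~300 kB seen in one second */
    unsigned long kbit = DEMO_DIV_ROUND_UP(bytes_this_period, 1024 / 8);

    printf("%lu kbit in the last period\n", kbit);   /* prints 2344 */
    return 0;
}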
| 1 2 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Copyright (C) 2018 - 2021, 2023 - 2024 Intel Corporation */ #include <net/cfg80211.h> #include "core.h" #include "nl80211.h" #include "rdev-ops.h" static int pmsr_parse_ftm(struct cfg80211_registered_device *rdev, struct nlattr *ftmreq, struct cfg80211_pmsr_request_peer *out, struct genl_info *info) { const struct cfg80211_pmsr_capabilities *capa = rdev->wiphy.pmsr_capa; struct nlattr *tb[NL80211_PMSR_FTM_REQ_ATTR_MAX + 1]; u32 preamble = NL80211_PREAMBLE_DMG; /* only optional in DMG */ /* validate existing data */ if (!(rdev->wiphy.pmsr_capa->ftm.bandwidths & BIT(out->chandef.width))) { NL_SET_ERR_MSG(info->extack, "FTM: unsupported bandwidth"); return -EINVAL; } /* no validation needed - was already done via nested policy */ nla_parse_nested_deprecated(tb, NL80211_PMSR_FTM_REQ_ATTR_MAX, ftmreq, NULL, NULL); if (tb[NL80211_PMSR_FTM_REQ_ATTR_PREAMBLE]) preamble = nla_get_u32(tb[NL80211_PMSR_FTM_REQ_ATTR_PREAMBLE]); /* 
set up values - struct is 0-initialized */ out->ftm.requested = true; switch (out->chandef.chan->band) { case NL80211_BAND_60GHZ: /* optional */ break; default: if (!tb[NL80211_PMSR_FTM_REQ_ATTR_PREAMBLE]) { NL_SET_ERR_MSG(info->extack, "FTM: must specify preamble"); return -EINVAL; } } if (!(capa->ftm.preambles & BIT(preamble))) { NL_SET_ERR_MSG_ATTR(info->extack, tb[NL80211_PMSR_FTM_REQ_ATTR_PREAMBLE], "FTM: invalid preamble"); return -EINVAL; } out->ftm.preamble = preamble; out->ftm.burst_period = 0; if (tb[NL80211_PMSR_FTM_REQ_ATTR_BURST_PERIOD]) out->ftm.burst_period = nla_get_u16(tb[NL80211_PMSR_FTM_REQ_ATTR_BURST_PERIOD]); out->ftm.asap = !!tb[NL80211_PMSR_FTM_REQ_ATTR_ASAP]; if (out->ftm.asap && !capa->ftm.asap) { NL_SET_ERR_MSG_ATTR(info->extack, tb[NL80211_PMSR_FTM_REQ_ATTR_ASAP], "FTM: ASAP mode not supported"); return -EINVAL; } if (!out->ftm.asap && !capa->ftm.non_asap) { NL_SET_ERR_MSG(info->extack, "FTM: non-ASAP mode not supported"); return -EINVAL; } out->ftm.num_bursts_exp = 0; if (tb[NL80211_PMSR_FTM_REQ_ATTR_NUM_BURSTS_EXP]) out->ftm.num_bursts_exp = nla_get_u8(tb[NL80211_PMSR_FTM_REQ_ATTR_NUM_BURSTS_EXP]); if (capa->ftm.max_bursts_exponent >= 0 && out->ftm.num_bursts_exp > capa->ftm.max_bursts_exponent) { NL_SET_ERR_MSG_ATTR(info->extack, tb[NL80211_PMSR_FTM_REQ_ATTR_NUM_BURSTS_EXP], "FTM: max NUM_BURSTS_EXP must be set lower than the device limit"); return -EINVAL; } out->ftm.burst_duration = 15; if (tb[NL80211_PMSR_FTM_REQ_ATTR_BURST_DURATION]) out->ftm.burst_duration = nla_get_u8(tb[NL80211_PMSR_FTM_REQ_ATTR_BURST_DURATION]); out->ftm.ftms_per_burst = 0; if (tb[NL80211_PMSR_FTM_REQ_ATTR_FTMS_PER_BURST]) out->ftm.ftms_per_burst = nla_get_u32(tb[NL80211_PMSR_FTM_REQ_ATTR_FTMS_PER_BURST]); if (capa->ftm.max_ftms_per_burst && (out->ftm.ftms_per_burst > capa->ftm.max_ftms_per_burst || out->ftm.ftms_per_burst == 0)) { NL_SET_ERR_MSG_ATTR(info->extack, tb[NL80211_PMSR_FTM_REQ_ATTR_FTMS_PER_BURST], "FTM: FTMs per burst must be set lower than the device limit but non-zero"); return -EINVAL; } out->ftm.ftmr_retries = 3; if (tb[NL80211_PMSR_FTM_REQ_ATTR_NUM_FTMR_RETRIES]) out->ftm.ftmr_retries = nla_get_u8(tb[NL80211_PMSR_FTM_REQ_ATTR_NUM_FTMR_RETRIES]); out->ftm.request_lci = !!tb[NL80211_PMSR_FTM_REQ_ATTR_REQUEST_LCI]; if (out->ftm.request_lci && !capa->ftm.request_lci) { NL_SET_ERR_MSG_ATTR(info->extack, tb[NL80211_PMSR_FTM_REQ_ATTR_REQUEST_LCI], "FTM: LCI request not supported"); } out->ftm.request_civicloc = !!tb[NL80211_PMSR_FTM_REQ_ATTR_REQUEST_CIVICLOC]; if (out->ftm.request_civicloc && !capa->ftm.request_civicloc) { NL_SET_ERR_MSG_ATTR(info->extack, tb[NL80211_PMSR_FTM_REQ_ATTR_REQUEST_CIVICLOC], "FTM: civic location request not supported"); } out->ftm.trigger_based = !!tb[NL80211_PMSR_FTM_REQ_ATTR_TRIGGER_BASED]; if (out->ftm.trigger_based && !capa->ftm.trigger_based) { NL_SET_ERR_MSG_ATTR(info->extack, tb[NL80211_PMSR_FTM_REQ_ATTR_TRIGGER_BASED], "FTM: trigger based ranging is not supported"); return -EINVAL; } out->ftm.non_trigger_based = !!tb[NL80211_PMSR_FTM_REQ_ATTR_NON_TRIGGER_BASED]; if (out->ftm.non_trigger_based && !capa->ftm.non_trigger_based) { NL_SET_ERR_MSG_ATTR(info->extack, tb[NL80211_PMSR_FTM_REQ_ATTR_NON_TRIGGER_BASED], "FTM: trigger based ranging is not supported"); return -EINVAL; } if (out->ftm.trigger_based && out->ftm.non_trigger_based) { NL_SET_ERR_MSG(info->extack, "FTM: can't set both trigger based and non trigger based"); return -EINVAL; } if (out->ftm.ftms_per_burst > 31 && !out->ftm.non_trigger_based && !out->ftm.trigger_based) { 
NL_SET_ERR_MSG_ATTR(info->extack, tb[NL80211_PMSR_FTM_REQ_ATTR_FTMS_PER_BURST], "FTM: FTMs per burst must be set lower than 31"); return -ERANGE; } if ((out->ftm.trigger_based || out->ftm.non_trigger_based) && out->ftm.preamble != NL80211_PREAMBLE_HE) { NL_SET_ERR_MSG_ATTR(info->extack, tb[NL80211_PMSR_FTM_REQ_ATTR_PREAMBLE], "FTM: non EDCA based ranging must use HE preamble"); return -EINVAL; } out->ftm.lmr_feedback = !!tb[NL80211_PMSR_FTM_REQ_ATTR_LMR_FEEDBACK]; if (!out->ftm.trigger_based && !out->ftm.non_trigger_based && out->ftm.lmr_feedback) { NL_SET_ERR_MSG_ATTR(info->extack, tb[NL80211_PMSR_FTM_REQ_ATTR_LMR_FEEDBACK], "FTM: LMR feedback set for EDCA based ranging"); return -EINVAL; } if (tb[NL80211_PMSR_FTM_REQ_ATTR_BSS_COLOR]) { if (!out->ftm.non_trigger_based && !out->ftm.trigger_based) { NL_SET_ERR_MSG_ATTR(info->extack, tb[NL80211_PMSR_FTM_REQ_ATTR_BSS_COLOR], "FTM: BSS color set for EDCA based ranging"); return -EINVAL; } out->ftm.bss_color = nla_get_u8(tb[NL80211_PMSR_FTM_REQ_ATTR_BSS_COLOR]); } return 0; } static int pmsr_parse_peer(struct cfg80211_registered_device *rdev, struct nlattr *peer, struct cfg80211_pmsr_request_peer *out, struct genl_info *info) { struct nlattr *tb[NL80211_PMSR_PEER_ATTR_MAX + 1]; struct nlattr *req[NL80211_PMSR_REQ_ATTR_MAX + 1]; struct nlattr *treq; int err, rem; /* no validation needed - was already done via nested policy */ nla_parse_nested_deprecated(tb, NL80211_PMSR_PEER_ATTR_MAX, peer, NULL, NULL); if (!tb[NL80211_PMSR_PEER_ATTR_ADDR] || !tb[NL80211_PMSR_PEER_ATTR_CHAN] || !tb[NL80211_PMSR_PEER_ATTR_REQ]) { NL_SET_ERR_MSG_ATTR(info->extack, peer, "insufficient peer data"); return -EINVAL; } memcpy(out->addr, nla_data(tb[NL80211_PMSR_PEER_ATTR_ADDR]), ETH_ALEN); /* reuse info->attrs */ memset(info->attrs, 0, sizeof(*info->attrs) * (NL80211_ATTR_MAX + 1)); err = nla_parse_nested_deprecated(info->attrs, NL80211_ATTR_MAX, tb[NL80211_PMSR_PEER_ATTR_CHAN], NULL, info->extack); if (err) return err; err = nl80211_parse_chandef(rdev, info, &out->chandef); if (err) return err; /* no validation needed - was already done via nested policy */ nla_parse_nested_deprecated(req, NL80211_PMSR_REQ_ATTR_MAX, tb[NL80211_PMSR_PEER_ATTR_REQ], NULL, NULL); if (!req[NL80211_PMSR_REQ_ATTR_DATA]) { NL_SET_ERR_MSG_ATTR(info->extack, tb[NL80211_PMSR_PEER_ATTR_REQ], "missing request type/data"); return -EINVAL; } if (req[NL80211_PMSR_REQ_ATTR_GET_AP_TSF]) out->report_ap_tsf = true; if (out->report_ap_tsf && !rdev->wiphy.pmsr_capa->report_ap_tsf) { NL_SET_ERR_MSG_ATTR(info->extack, req[NL80211_PMSR_REQ_ATTR_GET_AP_TSF], "reporting AP TSF is not supported"); return -EINVAL; } nla_for_each_nested(treq, req[NL80211_PMSR_REQ_ATTR_DATA], rem) { switch (nla_type(treq)) { case NL80211_PMSR_TYPE_FTM: err = pmsr_parse_ftm(rdev, treq, out, info); break; default: NL_SET_ERR_MSG_ATTR(info->extack, treq, "unsupported measurement type"); err = -EINVAL; } } if (err) return err; return 0; } int nl80211_pmsr_start(struct sk_buff *skb, struct genl_info *info) { struct nlattr *reqattr = info->attrs[NL80211_ATTR_PEER_MEASUREMENTS]; struct cfg80211_registered_device *rdev = info->user_ptr[0]; struct wireless_dev *wdev = info->user_ptr[1]; struct cfg80211_pmsr_request *req; struct nlattr *peers, *peer; int count, rem, err, idx; if (!rdev->wiphy.pmsr_capa) return -EOPNOTSUPP; if (!reqattr) return -EINVAL; peers = nla_find(nla_data(reqattr), nla_len(reqattr), NL80211_PMSR_ATTR_PEERS); if (!peers) return -EINVAL; count = 0; nla_for_each_nested(peer, peers, rem) { count++; if (count > 
rdev->wiphy.pmsr_capa->max_peers) { NL_SET_ERR_MSG_ATTR(info->extack, peer, "Too many peers used"); return -EINVAL; } } req = kzalloc(struct_size(req, peers, count), GFP_KERNEL); if (!req) return -ENOMEM; req->n_peers = count; if (info->attrs[NL80211_ATTR_TIMEOUT]) req->timeout = nla_get_u32(info->attrs[NL80211_ATTR_TIMEOUT]); if (info->attrs[NL80211_ATTR_MAC]) { if (!rdev->wiphy.pmsr_capa->randomize_mac_addr) { NL_SET_ERR_MSG_ATTR(info->extack, info->attrs[NL80211_ATTR_MAC], "device cannot randomize MAC address"); err = -EINVAL; goto out_err; } err = nl80211_parse_random_mac(info->attrs, req->mac_addr, req->mac_addr_mask); if (err) goto out_err; } else { memcpy(req->mac_addr, wdev_address(wdev), ETH_ALEN); eth_broadcast_addr(req->mac_addr_mask); } idx = 0; nla_for_each_nested(peer, peers, rem) { /* NB: this reuses info->attrs, but we no longer need it */ err = pmsr_parse_peer(rdev, peer, &req->peers[idx], info); if (err) goto out_err; idx++; } req->cookie = cfg80211_assign_cookie(rdev); req->nl_portid = info->snd_portid; err = rdev_start_pmsr(rdev, wdev, req); if (err) goto out_err; list_add_tail(&req->list, &wdev->pmsr_list); nl_set_extack_cookie_u64(info->extack, req->cookie); return 0; out_err: kfree(req); return err; } void cfg80211_pmsr_complete(struct wireless_dev *wdev, struct cfg80211_pmsr_request *req, gfp_t gfp) { struct cfg80211_registered_device *rdev = wiphy_to_rdev(wdev->wiphy); struct cfg80211_pmsr_request *tmp, *prev, *to_free = NULL; struct sk_buff *msg; void *hdr; trace_cfg80211_pmsr_complete(wdev->wiphy, wdev, req->cookie); msg = nlmsg_new(NLMSG_DEFAULT_SIZE, gfp); if (!msg) goto free_request; hdr = nl80211hdr_put(msg, 0, 0, 0, NL80211_CMD_PEER_MEASUREMENT_COMPLETE); if (!hdr) goto free_msg; if (nla_put_u32(msg, NL80211_ATTR_WIPHY, rdev->wiphy_idx) || nla_put_u64_64bit(msg, NL80211_ATTR_WDEV, wdev_id(wdev), NL80211_ATTR_PAD)) goto free_msg; if (nla_put_u64_64bit(msg, NL80211_ATTR_COOKIE, req->cookie, NL80211_ATTR_PAD)) goto free_msg; genlmsg_end(msg, hdr); genlmsg_unicast(wiphy_net(wdev->wiphy), msg, req->nl_portid); goto free_request; free_msg: nlmsg_free(msg); free_request: spin_lock_bh(&wdev->pmsr_lock); /* * cfg80211_pmsr_process_abort() may have already moved this request * to the free list, and will free it later. In this case, don't free * it here. 
*/ list_for_each_entry_safe(tmp, prev, &wdev->pmsr_list, list) { if (tmp == req) { list_del(&req->list); to_free = req; break; } } spin_unlock_bh(&wdev->pmsr_lock); kfree(to_free); } EXPORT_SYMBOL_GPL(cfg80211_pmsr_complete); static int nl80211_pmsr_send_ftm_res(struct sk_buff *msg, struct cfg80211_pmsr_result *res) { if (res->status == NL80211_PMSR_STATUS_FAILURE) { if (nla_put_u32(msg, NL80211_PMSR_FTM_RESP_ATTR_FAIL_REASON, res->ftm.failure_reason)) goto error; if (res->ftm.failure_reason == NL80211_PMSR_FTM_FAILURE_PEER_BUSY && res->ftm.busy_retry_time && nla_put_u32(msg, NL80211_PMSR_FTM_RESP_ATTR_BUSY_RETRY_TIME, res->ftm.busy_retry_time)) goto error; return 0; } #define PUT(tp, attr, val) \ do { \ if (nla_put_##tp(msg, \ NL80211_PMSR_FTM_RESP_ATTR_##attr, \ res->ftm.val)) \ goto error; \ } while (0) #define PUTOPT(tp, attr, val) \ do { \ if (res->ftm.val##_valid) \ PUT(tp, attr, val); \ } while (0) #define PUT_U64(attr, val) \ do { \ if (nla_put_u64_64bit(msg, \ NL80211_PMSR_FTM_RESP_ATTR_##attr,\ res->ftm.val, \ NL80211_PMSR_FTM_RESP_ATTR_PAD)) \ goto error; \ } while (0) #define PUTOPT_U64(attr, val) \ do { \ if (res->ftm.val##_valid) \ PUT_U64(attr, val); \ } while (0) if (res->ftm.burst_index >= 0) PUT(u32, BURST_INDEX, burst_index); PUTOPT(u32, NUM_FTMR_ATTEMPTS, num_ftmr_attempts); PUTOPT(u32, NUM_FTMR_SUCCESSES, num_ftmr_successes); PUT(u8, NUM_BURSTS_EXP, num_bursts_exp); PUT(u8, BURST_DURATION, burst_duration); PUT(u8, FTMS_PER_BURST, ftms_per_burst); PUTOPT(s32, RSSI_AVG, rssi_avg); PUTOPT(s32, RSSI_SPREAD, rssi_spread); if (res->ftm.tx_rate_valid && !nl80211_put_sta_rate(msg, &res->ftm.tx_rate, NL80211_PMSR_FTM_RESP_ATTR_TX_RATE)) goto error; if (res->ftm.rx_rate_valid && !nl80211_put_sta_rate(msg, &res->ftm.rx_rate, NL80211_PMSR_FTM_RESP_ATTR_RX_RATE)) goto error; PUTOPT_U64(RTT_AVG, rtt_avg); PUTOPT_U64(RTT_VARIANCE, rtt_variance); PUTOPT_U64(RTT_SPREAD, rtt_spread); PUTOPT_U64(DIST_AVG, dist_avg); PUTOPT_U64(DIST_VARIANCE, dist_variance); PUTOPT_U64(DIST_SPREAD, dist_spread); if (res->ftm.lci && res->ftm.lci_len && nla_put(msg, NL80211_PMSR_FTM_RESP_ATTR_LCI, res->ftm.lci_len, res->ftm.lci)) goto error; if (res->ftm.civicloc && res->ftm.civicloc_len && nla_put(msg, NL80211_PMSR_FTM_RESP_ATTR_CIVICLOC, res->ftm.civicloc_len, res->ftm.civicloc)) goto error; #undef PUT #undef PUTOPT #undef PUT_U64 #undef PUTOPT_U64 return 0; error: return -ENOSPC; } static int nl80211_pmsr_send_result(struct sk_buff *msg, struct cfg80211_pmsr_result *res) { struct nlattr *pmsr, *peers, *peer, *resp, *data, *typedata; pmsr = nla_nest_start_noflag(msg, NL80211_ATTR_PEER_MEASUREMENTS); if (!pmsr) goto error; peers = nla_nest_start_noflag(msg, NL80211_PMSR_ATTR_PEERS); if (!peers) goto error; peer = nla_nest_start_noflag(msg, 1); if (!peer) goto error; if (nla_put(msg, NL80211_PMSR_PEER_ATTR_ADDR, ETH_ALEN, res->addr)) goto error; resp = nla_nest_start_noflag(msg, NL80211_PMSR_PEER_ATTR_RESP); if (!resp) goto error; if (nla_put_u32(msg, NL80211_PMSR_RESP_ATTR_STATUS, res->status) || nla_put_u64_64bit(msg, NL80211_PMSR_RESP_ATTR_HOST_TIME, res->host_time, NL80211_PMSR_RESP_ATTR_PAD)) goto error; if (res->ap_tsf_valid && nla_put_u64_64bit(msg, NL80211_PMSR_RESP_ATTR_AP_TSF, res->ap_tsf, NL80211_PMSR_RESP_ATTR_PAD)) goto error; if (res->final && nla_put_flag(msg, NL80211_PMSR_RESP_ATTR_FINAL)) goto error; data = nla_nest_start_noflag(msg, NL80211_PMSR_RESP_ATTR_DATA); if (!data) goto error; typedata = nla_nest_start_noflag(msg, res->type); if (!typedata) goto error; switch (res->type) { case 
NL80211_PMSR_TYPE_FTM: if (nl80211_pmsr_send_ftm_res(msg, res)) goto error; break; default: WARN_ON(1); } nla_nest_end(msg, typedata); nla_nest_end(msg, data); nla_nest_end(msg, resp); nla_nest_end(msg, peer); nla_nest_end(msg, peers); nla_nest_end(msg, pmsr); return 0; error: return -ENOSPC; } void cfg80211_pmsr_report(struct wireless_dev *wdev, struct cfg80211_pmsr_request *req, struct cfg80211_pmsr_result *result, gfp_t gfp) { struct cfg80211_registered_device *rdev = wiphy_to_rdev(wdev->wiphy); struct sk_buff *msg; void *hdr; int err; trace_cfg80211_pmsr_report(wdev->wiphy, wdev, req->cookie, result->addr); /* * Currently, only variable items are LCI and civic location, * both of which are reasonably short so we don't need to * worry about them here for the allocation. */ msg = nlmsg_new(NLMSG_DEFAULT_SIZE, gfp); if (!msg) return; hdr = nl80211hdr_put(msg, 0, 0, 0, NL80211_CMD_PEER_MEASUREMENT_RESULT); if (!hdr) goto free; if (nla_put_u32(msg, NL80211_ATTR_WIPHY, rdev->wiphy_idx) || nla_put_u64_64bit(msg, NL80211_ATTR_WDEV, wdev_id(wdev), NL80211_ATTR_PAD)) goto free; if (nla_put_u64_64bit(msg, NL80211_ATTR_COOKIE, req->cookie, NL80211_ATTR_PAD)) goto free; err = nl80211_pmsr_send_result(msg, result); if (err) { pr_err_ratelimited("peer measurement result: message didn't fit!"); goto free; } genlmsg_end(msg, hdr); genlmsg_unicast(wiphy_net(wdev->wiphy), msg, req->nl_portid); return; free: nlmsg_free(msg); } EXPORT_SYMBOL_GPL(cfg80211_pmsr_report); static void cfg80211_pmsr_process_abort(struct wireless_dev *wdev) { struct cfg80211_registered_device *rdev = wiphy_to_rdev(wdev->wiphy); struct cfg80211_pmsr_request *req, *tmp; LIST_HEAD(free_list); lockdep_assert_wiphy(wdev->wiphy); spin_lock_bh(&wdev->pmsr_lock); list_for_each_entry_safe(req, tmp, &wdev->pmsr_list, list) { if (req->nl_portid) continue; list_move_tail(&req->list, &free_list); } spin_unlock_bh(&wdev->pmsr_lock); list_for_each_entry_safe(req, tmp, &free_list, list) { rdev_abort_pmsr(rdev, wdev, req); kfree(req); } } void cfg80211_pmsr_free_wk(struct work_struct *work) { struct wireless_dev *wdev = container_of(work, struct wireless_dev, pmsr_free_wk); guard(wiphy)(wdev->wiphy); cfg80211_pmsr_process_abort(wdev); } void cfg80211_pmsr_wdev_down(struct wireless_dev *wdev) { struct cfg80211_pmsr_request *req; bool found = false; spin_lock_bh(&wdev->pmsr_lock); list_for_each_entry(req, &wdev->pmsr_list, list) { found = true; req->nl_portid = 0; } spin_unlock_bh(&wdev->pmsr_lock); if (found) cfg80211_pmsr_process_abort(wdev); WARN_ON(!list_empty(&wdev->pmsr_list)); } void cfg80211_release_pmsr(struct wireless_dev *wdev, u32 portid) { struct cfg80211_pmsr_request *req; spin_lock_bh(&wdev->pmsr_lock); list_for_each_entry(req, &wdev->pmsr_list, list) { if (req->nl_portid == portid) { req->nl_portid = 0; schedule_work(&wdev->pmsr_free_wk); } } spin_unlock_bh(&wdev->pmsr_lock); } |
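/*
 * Hedged user-space sketch, not part of pmsr.c above: the nested netlink
 * attribute layout that nl80211_pmsr_start() and pmsr_parse_peer() expect,
 * built with libnl-3. Error checking and the NL80211_PMSR_PEER_ATTR_CHAN
 * channel definition (which the kernel requires) are omitted for brevity,
 * and build_ftm_request() itself is a hypothetical helper name.
 */
#include <stdint.h>
#include <netlink/msg.h>
#include <netlink/attr.h>
#include <linux/nl80211.h>

static void build_ftm_request(struct nl_msg *msg, const uint8_t peer_addr[6])
{
	struct nlattr *pmsr, *peers, *peer, *req, *data, *ftm;

	pmsr  = nla_nest_start(msg, NL80211_ATTR_PEER_MEASUREMENTS);
	peers = nla_nest_start(msg, NL80211_PMSR_ATTR_PEERS);
	peer  = nla_nest_start(msg, 1);			/* first (and only) peer */

	nla_put(msg, NL80211_PMSR_PEER_ATTR_ADDR, 6, peer_addr);
	/* NL80211_PMSR_PEER_ATTR_CHAN with the chandef attributes goes here */

	req  = nla_nest_start(msg, NL80211_PMSR_PEER_ATTR_REQ);
	data = nla_nest_start(msg, NL80211_PMSR_REQ_ATTR_DATA);
	ftm  = nla_nest_start(msg, NL80211_PMSR_TYPE_FTM);
	nla_put_u32(msg, NL80211_PMSR_FTM_REQ_ATTR_PREAMBLE, NL80211_PREAMBLE_HT);
	nla_put_flag(msg, NL80211_PMSR_FTM_REQ_ATTR_ASAP);
	nla_nest_end(msg, ftm);
	nla_nest_end(msg, data);
	nla_nest_end(msg, req);

	nla_nest_end(msg, peer);
	nla_nest_end(msg, peers);
	nla_nest_end(msg, pmsr);
}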
912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (c) International Business Machines Corp., 2006 * * Author: Artem Bityutskiy (Битюцкий Артём) */ /* * This file includes implementation of UBI character device operations. * * There are two kinds of character devices in UBI: UBI character devices and * UBI volume character devices. UBI character devices allow users to * manipulate whole volumes: create, remove, and re-size them. Volume character * devices provide volume I/O capabilities. * * Major and minor numbers are assigned dynamically to both UBI and volume * character devices. * * Well, there is the third kind of character devices - the UBI control * character device, which allows to manipulate by UBI devices - create and * delete them. In other words, it is used for attaching and detaching MTD * devices. */ #include <linux/module.h> #include <linux/stat.h> #include <linux/slab.h> #include <linux/ioctl.h> #include <linux/capability.h> #include <linux/uaccess.h> #include <linux/compat.h> #include <linux/math64.h> #include <mtd/ubi-user.h> #include "ubi.h" /** * get_exclusive - get exclusive access to an UBI volume. * @desc: volume descriptor * * This function changes UBI volume open mode to "exclusive". Returns previous * mode value (positive integer) in case of success and a negative error code * in case of failure. */ static int get_exclusive(struct ubi_volume_desc *desc) { int users, err; struct ubi_volume *vol = desc->vol; spin_lock(&vol->ubi->volumes_lock); users = vol->readers + vol->writers + vol->exclusive + vol->metaonly; ubi_assert(users > 0); if (users > 1) { ubi_err(vol->ubi, "%d users for volume %d", users, vol->vol_id); err = -EBUSY; } else { vol->readers = vol->writers = vol->metaonly = 0; vol->exclusive = 1; err = desc->mode; desc->mode = UBI_EXCLUSIVE; } spin_unlock(&vol->ubi->volumes_lock); return err; } /** * revoke_exclusive - revoke exclusive mode. 
* @desc: volume descriptor * @mode: new mode to switch to */ static void revoke_exclusive(struct ubi_volume_desc *desc, int mode) { struct ubi_volume *vol = desc->vol; spin_lock(&vol->ubi->volumes_lock); ubi_assert(vol->readers == 0 && vol->writers == 0 && vol->metaonly == 0); ubi_assert(vol->exclusive == 1 && desc->mode == UBI_EXCLUSIVE); vol->exclusive = 0; if (mode == UBI_READONLY) vol->readers = 1; else if (mode == UBI_READWRITE) vol->writers = 1; else if (mode == UBI_METAONLY) vol->metaonly = 1; else vol->exclusive = 1; spin_unlock(&vol->ubi->volumes_lock); desc->mode = mode; } static int vol_cdev_open(struct inode *inode, struct file *file) { struct ubi_volume_desc *desc; int vol_id = iminor(inode) - 1, mode, ubi_num; ubi_num = ubi_major2num(imajor(inode)); if (ubi_num < 0) return ubi_num; if (file->f_mode & FMODE_WRITE) mode = UBI_READWRITE; else mode = UBI_READONLY; dbg_gen("open device %d, volume %d, mode %d", ubi_num, vol_id, mode); desc = ubi_open_volume(ubi_num, vol_id, mode); if (IS_ERR(desc)) return PTR_ERR(desc); file->private_data = desc; return 0; } static int vol_cdev_release(struct inode *inode, struct file *file) { struct ubi_volume_desc *desc = file->private_data; struct ubi_volume *vol = desc->vol; dbg_gen("release device %d, volume %d, mode %d", vol->ubi->ubi_num, vol->vol_id, desc->mode); if (vol->updating) { ubi_warn(vol->ubi, "update of volume %d not finished, volume is damaged", vol->vol_id); ubi_assert(!vol->changing_leb); vol->updating = 0; vfree(vol->upd_buf); } else if (vol->changing_leb) { dbg_gen("only %lld of %lld bytes received for atomic LEB change for volume %d:%d, cancel", vol->upd_received, vol->upd_bytes, vol->ubi->ubi_num, vol->vol_id); vol->changing_leb = 0; vfree(vol->upd_buf); } ubi_close_volume(desc); return 0; } static loff_t vol_cdev_llseek(struct file *file, loff_t offset, int origin) { struct ubi_volume_desc *desc = file->private_data; struct ubi_volume *vol = desc->vol; if (vol->updating) { /* Update is in progress, seeking is prohibited */ ubi_err(vol->ubi, "updating"); return -EBUSY; } return fixed_size_llseek(file, offset, origin, vol->used_bytes); } static int vol_cdev_fsync(struct file *file, loff_t start, loff_t end, int datasync) { struct ubi_volume_desc *desc = file->private_data; struct ubi_device *ubi = desc->vol->ubi; struct inode *inode = file_inode(file); int err; inode_lock(inode); err = ubi_sync(ubi->ubi_num); inode_unlock(inode); return err; } static ssize_t vol_cdev_read(struct file *file, __user char *buf, size_t count, loff_t *offp) { struct ubi_volume_desc *desc = file->private_data; struct ubi_volume *vol = desc->vol; struct ubi_device *ubi = vol->ubi; int err, lnum, off, len, tbuf_size; size_t count_save = count; void *tbuf; dbg_gen("read %zd bytes from offset %lld of volume %d", count, *offp, vol->vol_id); if (vol->updating) { ubi_err(vol->ubi, "updating"); return -EBUSY; } if (vol->upd_marker) { ubi_err(vol->ubi, "damaged volume, update marker is set"); return -EBADF; } if (*offp == vol->used_bytes || count == 0) return 0; if (vol->corrupted) dbg_gen("read from corrupted volume %d", vol->vol_id); if (*offp + count > vol->used_bytes) count_save = count = vol->used_bytes - *offp; tbuf_size = vol->usable_leb_size; if (count < tbuf_size) tbuf_size = ALIGN(count, ubi->min_io_size); tbuf = vmalloc(tbuf_size); if (!tbuf) return -ENOMEM; len = count > tbuf_size ? 
tbuf_size : count; lnum = div_u64_rem(*offp, vol->usable_leb_size, &off); do { cond_resched(); if (off + len >= vol->usable_leb_size) len = vol->usable_leb_size - off; err = ubi_eba_read_leb(ubi, vol, lnum, tbuf, off, len, 0); if (err) break; off += len; if (off == vol->usable_leb_size) { lnum += 1; off -= vol->usable_leb_size; } count -= len; *offp += len; err = copy_to_user(buf, tbuf, len); if (err) { err = -EFAULT; break; } buf += len; len = count > tbuf_size ? tbuf_size : count; } while (count); vfree(tbuf); return err ? err : count_save - count; } /* * This function allows to directly write to dynamic UBI volumes, without * issuing the volume update operation. */ static ssize_t vol_cdev_direct_write(struct file *file, const char __user *buf, size_t count, loff_t *offp) { struct ubi_volume_desc *desc = file->private_data; struct ubi_volume *vol = desc->vol; struct ubi_device *ubi = vol->ubi; int lnum, off, len, tbuf_size, err = 0; size_t count_save = count; char *tbuf; if (!vol->direct_writes) return -EPERM; dbg_gen("requested: write %zd bytes to offset %lld of volume %u", count, *offp, vol->vol_id); if (vol->vol_type == UBI_STATIC_VOLUME) return -EROFS; lnum = div_u64_rem(*offp, vol->usable_leb_size, &off); if (off & (ubi->min_io_size - 1)) { ubi_err(ubi, "unaligned position"); return -EINVAL; } if (*offp + count > vol->used_bytes) count_save = count = vol->used_bytes - *offp; /* We can write only in fractions of the minimum I/O unit */ if (count & (ubi->min_io_size - 1)) { ubi_err(ubi, "unaligned write length"); return -EINVAL; } tbuf_size = vol->usable_leb_size; if (count < tbuf_size) tbuf_size = ALIGN(count, ubi->min_io_size); tbuf = vmalloc(tbuf_size); if (!tbuf) return -ENOMEM; len = count > tbuf_size ? tbuf_size : count; while (count) { cond_resched(); if (off + len >= vol->usable_leb_size) len = vol->usable_leb_size - off; err = copy_from_user(tbuf, buf, len); if (err) { err = -EFAULT; break; } err = ubi_eba_write_leb(ubi, vol, lnum, tbuf, off, len); if (err) break; off += len; if (off == vol->usable_leb_size) { lnum += 1; off -= vol->usable_leb_size; } count -= len; *offp += len; buf += len; len = count > tbuf_size ? tbuf_size : count; } vfree(tbuf); return err ? err : count_save - count; } static ssize_t vol_cdev_write(struct file *file, const char __user *buf, size_t count, loff_t *offp) { int err = 0; struct ubi_volume_desc *desc = file->private_data; struct ubi_volume *vol = desc->vol; struct ubi_device *ubi = vol->ubi; if (!vol->updating && !vol->changing_leb) return vol_cdev_direct_write(file, buf, count, offp); if (vol->updating) err = ubi_more_update_data(ubi, vol, buf, count); else err = ubi_more_leb_change_data(ubi, vol, buf, count); if (err < 0) { ubi_err(ubi, "cannot accept more %zd bytes of data, error %d", count, err); return err; } if (err) { /* * The operation is finished, @err contains number of actually * written bytes. */ count = err; if (vol->changing_leb) { revoke_exclusive(desc, UBI_READWRITE); return count; } /* * We voluntarily do not take into account the skip_check flag * as we want to make sure what we wrote was correctly written. 
*/ err = ubi_check_volume(ubi, vol->vol_id); if (err < 0) return err; if (err) { ubi_warn(ubi, "volume %d on UBI device %d is corrupted", vol->vol_id, ubi->ubi_num); vol->corrupted = 1; } vol->checked = 1; ubi_volume_notify(ubi, vol, UBI_VOLUME_UPDATED); revoke_exclusive(desc, UBI_READWRITE); } return count; } static long vol_cdev_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { int err = 0; struct ubi_volume_desc *desc = file->private_data; struct ubi_volume *vol = desc->vol; struct ubi_device *ubi = vol->ubi; void __user *argp = (void __user *)arg; switch (cmd) { /* Volume update command */ case UBI_IOCVOLUP: { int64_t bytes, rsvd_bytes; if (!capable(CAP_SYS_RESOURCE)) { err = -EPERM; break; } err = copy_from_user(&bytes, argp, sizeof(int64_t)); if (err) { err = -EFAULT; break; } if (desc->mode == UBI_READONLY) { err = -EROFS; break; } rsvd_bytes = (long long)vol->reserved_pebs * vol->usable_leb_size; if (bytes < 0 || bytes > rsvd_bytes) { err = -EINVAL; break; } err = get_exclusive(desc); if (err < 0) break; err = ubi_start_update(ubi, vol, bytes); if (bytes == 0) { ubi_volume_notify(ubi, vol, UBI_VOLUME_UPDATED); revoke_exclusive(desc, UBI_READWRITE); } break; } /* Atomic logical eraseblock change command */ case UBI_IOCEBCH: { struct ubi_leb_change_req req; err = copy_from_user(&req, argp, sizeof(struct ubi_leb_change_req)); if (err) { err = -EFAULT; break; } if (desc->mode == UBI_READONLY || vol->vol_type == UBI_STATIC_VOLUME) { err = -EROFS; break; } /* Validate the request */ err = -EINVAL; if (!ubi_leb_valid(vol, req.lnum) || req.bytes < 0 || req.bytes > vol->usable_leb_size) break; err = get_exclusive(desc); if (err < 0) break; err = ubi_start_leb_change(ubi, vol, &req); if (req.bytes == 0) revoke_exclusive(desc, UBI_READWRITE); break; } /* Logical eraseblock erasure command */ case UBI_IOCEBER: { int32_t lnum; err = get_user(lnum, (__user int32_t *)argp); if (err) { err = -EFAULT; break; } if (desc->mode == UBI_READONLY || vol->vol_type == UBI_STATIC_VOLUME) { err = -EROFS; break; } if (!ubi_leb_valid(vol, lnum)) { err = -EINVAL; break; } dbg_gen("erase LEB %d:%d", vol->vol_id, lnum); err = ubi_eba_unmap_leb(ubi, vol, lnum); if (err) break; err = ubi_wl_flush(ubi, UBI_ALL, UBI_ALL); break; } /* Logical eraseblock map command */ case UBI_IOCEBMAP: { struct ubi_map_req req; err = copy_from_user(&req, argp, sizeof(struct ubi_map_req)); if (err) { err = -EFAULT; break; } err = ubi_leb_map(desc, req.lnum); break; } /* Logical eraseblock un-map command */ case UBI_IOCEBUNMAP: { int32_t lnum; err = get_user(lnum, (__user int32_t *)argp); if (err) { err = -EFAULT; break; } err = ubi_leb_unmap(desc, lnum); break; } /* Check if logical eraseblock is mapped command */ case UBI_IOCEBISMAP: { int32_t lnum; err = get_user(lnum, (__user int32_t *)argp); if (err) { err = -EFAULT; break; } err = ubi_is_mapped(desc, lnum); break; } /* Set volume property command */ case UBI_IOCSETVOLPROP: { struct ubi_set_vol_prop_req req; err = copy_from_user(&req, argp, sizeof(struct ubi_set_vol_prop_req)); if (err) { err = -EFAULT; break; } switch (req.property) { case UBI_VOL_PROP_DIRECT_WRITE: mutex_lock(&ubi->device_mutex); desc->vol->direct_writes = !!req.value; mutex_unlock(&ubi->device_mutex); break; default: err = -EINVAL; break; } break; } /* Create a R/O block device on top of the UBI volume */ case UBI_IOCVOLCRBLK: { struct ubi_volume_info vi; ubi_get_volume_info(desc, &vi); err = ubiblock_create(&vi); break; } /* Remove the R/O block device */ case UBI_IOCVOLRMBLK: { struct 
ubi_volume_info vi; ubi_get_volume_info(desc, &vi); err = ubiblock_remove(&vi); break; } default: err = -ENOTTY; break; } return err; } /** * verify_mkvol_req - verify volume creation request. * @ubi: UBI device description object * @req: the request to check * * This function zero if the request is correct, and %-EINVAL if not. */ static int verify_mkvol_req(const struct ubi_device *ubi, const struct ubi_mkvol_req *req) { int n, err = -EINVAL; if (req->bytes < 0 || req->alignment < 0 || req->vol_type < 0 || req->name_len < 0) goto bad; if ((req->vol_id < 0 || req->vol_id >= ubi->vtbl_slots) && req->vol_id != UBI_VOL_NUM_AUTO) goto bad; if (req->alignment == 0) goto bad; if (req->bytes == 0) goto bad; if (req->vol_type != UBI_DYNAMIC_VOLUME && req->vol_type != UBI_STATIC_VOLUME) goto bad; if (req->flags & ~UBI_VOL_VALID_FLGS) goto bad; if (req->flags & UBI_VOL_SKIP_CRC_CHECK_FLG && req->vol_type != UBI_STATIC_VOLUME) goto bad; if (req->alignment > ubi->leb_size) goto bad; n = req->alignment & (ubi->min_io_size - 1); if (req->alignment != 1 && n) goto bad; if (!req->name[0] || !req->name_len) goto bad; if (req->name_len > UBI_VOL_NAME_MAX) { err = -ENAMETOOLONG; goto bad; } n = strnlen(req->name, req->name_len + 1); if (n != req->name_len) goto bad; return 0; bad: ubi_err(ubi, "bad volume creation request"); ubi_dump_mkvol_req(req); return err; } /** * verify_rsvol_req - verify volume re-size request. * @ubi: UBI device description object * @req: the request to check * * This function returns zero if the request is correct, and %-EINVAL if not. */ static int verify_rsvol_req(const struct ubi_device *ubi, const struct ubi_rsvol_req *req) { if (req->bytes <= 0) return -EINVAL; if (req->vol_id < 0 || req->vol_id >= ubi->vtbl_slots) return -EINVAL; return 0; } /** * rename_volumes - rename UBI volumes. * @ubi: UBI device description object * @req: volumes re-name request * * This is a helper function for the volume re-name IOCTL which validates the * request, opens the volume and calls corresponding volumes management * function. Returns zero in case of success and a negative error code in case * of failure. 
*/ static int rename_volumes(struct ubi_device *ubi, struct ubi_rnvol_req *req) { int i, n, err; struct list_head rename_list; struct ubi_rename_entry *re, *re1; if (req->count < 0 || req->count > UBI_MAX_RNVOL) return -EINVAL; if (req->count == 0) return 0; /* Validate volume IDs and names in the request */ for (i = 0; i < req->count; i++) { if (req->ents[i].vol_id < 0 || req->ents[i].vol_id >= ubi->vtbl_slots) return -EINVAL; if (req->ents[i].name_len < 0) return -EINVAL; if (req->ents[i].name_len > UBI_VOL_NAME_MAX) return -ENAMETOOLONG; req->ents[i].name[req->ents[i].name_len] = '\0'; n = strlen(req->ents[i].name); if (n != req->ents[i].name_len) return -EINVAL; } /* Make sure volume IDs and names are unique */ for (i = 0; i < req->count - 1; i++) { for (n = i + 1; n < req->count; n++) { if (req->ents[i].vol_id == req->ents[n].vol_id) { ubi_err(ubi, "duplicated volume id %d", req->ents[i].vol_id); return -EINVAL; } if (!strcmp(req->ents[i].name, req->ents[n].name)) { ubi_err(ubi, "duplicated volume name \"%s\"", req->ents[i].name); return -EINVAL; } } } /* Create the re-name list */ INIT_LIST_HEAD(&rename_list); for (i = 0; i < req->count; i++) { int vol_id = req->ents[i].vol_id; int name_len = req->ents[i].name_len; const char *name = req->ents[i].name; re = kzalloc(sizeof(struct ubi_rename_entry), GFP_KERNEL); if (!re) { err = -ENOMEM; goto out_free; } re->desc = ubi_open_volume(ubi->ubi_num, vol_id, UBI_METAONLY); if (IS_ERR(re->desc)) { err = PTR_ERR(re->desc); ubi_err(ubi, "cannot open volume %d, error %d", vol_id, err); kfree(re); goto out_free; } /* Skip this re-naming if the name does not really change */ if (re->desc->vol->name_len == name_len && !memcmp(re->desc->vol->name, name, name_len)) { ubi_close_volume(re->desc); kfree(re); continue; } re->new_name_len = name_len; memcpy(re->new_name, name, name_len); list_add_tail(&re->list, &rename_list); dbg_gen("will rename volume %d from \"%s\" to \"%s\"", vol_id, re->desc->vol->name, name); } if (list_empty(&rename_list)) return 0; /* Find out the volumes which have to be removed */ list_for_each_entry(re, &rename_list, list) { struct ubi_volume_desc *desc; int no_remove_needed = 0; /* * Volume @re->vol_id is going to be re-named to * @re->new_name, while its current name is @name. If a volume * with name @re->new_name currently exists, it has to be * removed, unless it is also re-named in the request (@req). */ list_for_each_entry(re1, &rename_list, list) { if (re->new_name_len == re1->desc->vol->name_len && !memcmp(re->new_name, re1->desc->vol->name, re1->desc->vol->name_len)) { no_remove_needed = 1; break; } } if (no_remove_needed) continue; /* * It seems we need to remove volume with name @re->new_name, * if it exists. 
*/ desc = ubi_open_volume_nm(ubi->ubi_num, re->new_name, UBI_EXCLUSIVE); if (IS_ERR(desc)) { err = PTR_ERR(desc); if (err == -ENODEV) /* Re-naming into a non-existing volume name */ continue; /* The volume exists but busy, or an error occurred */ ubi_err(ubi, "cannot open volume \"%s\", error %d", re->new_name, err); goto out_free; } re1 = kzalloc(sizeof(struct ubi_rename_entry), GFP_KERNEL); if (!re1) { err = -ENOMEM; ubi_close_volume(desc); goto out_free; } re1->remove = 1; re1->desc = desc; list_add(&re1->list, &rename_list); dbg_gen("will remove volume %d, name \"%s\"", re1->desc->vol->vol_id, re1->desc->vol->name); } mutex_lock(&ubi->device_mutex); err = ubi_rename_volumes(ubi, &rename_list); mutex_unlock(&ubi->device_mutex); out_free: list_for_each_entry_safe(re, re1, &rename_list, list) { ubi_close_volume(re->desc); list_del(&re->list); kfree(re); } return err; } static int ubi_get_ec_info(struct ubi_device *ubi, struct ubi_ecinfo_req __user *ureq) { struct ubi_ecinfo_req req; struct ubi_wl_entry *wl; int read_cnt; int peb; int end_peb; /* Copy the input arguments */ if (copy_from_user(&req, ureq, sizeof(struct ubi_ecinfo_req))) return -EFAULT; /* Check input arguments */ if (req.length <= 0 || req.start < 0 || req.start >= ubi->peb_count) return -EINVAL; if (check_add_overflow(req.start, req.length, &end_peb)) return -EINVAL; if (end_peb > ubi->peb_count) end_peb = ubi->peb_count; /* Check access rights before filling erase_counters array */ if (!access_ok((void __user *)ureq->erase_counters, (end_peb-req.start) * sizeof(int32_t))) return -EFAULT; /* Fill erase counter array */ read_cnt = 0; for (peb = req.start; peb < end_peb; read_cnt++, peb++) { int ec; if (ubi_io_is_bad(ubi, peb)) { if (__put_user(UBI_UNKNOWN, ureq->erase_counters+read_cnt)) return -EFAULT; continue; } spin_lock(&ubi->wl_lock); wl = ubi->lookuptbl[peb]; if (wl) ec = wl->ec; else ec = UBI_UNKNOWN; spin_unlock(&ubi->wl_lock); if (__put_user(ec, ureq->erase_counters+read_cnt)) return -EFAULT; } /* Return actual read length */ req.read_length = read_cnt; /* Copy everything except erase counter array */ if (copy_to_user(ureq, &req, sizeof(struct ubi_ecinfo_req))) return -EFAULT; return 0; } static long ubi_cdev_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { int err = 0; struct ubi_device *ubi; struct ubi_volume_desc *desc; void __user *argp = (void __user *)arg; if (!capable(CAP_SYS_RESOURCE)) return -EPERM; ubi = ubi_get_by_major(imajor(file->f_mapping->host)); if (!ubi) return -ENODEV; switch (cmd) { /* Create volume command */ case UBI_IOCMKVOL: { struct ubi_mkvol_req req; dbg_gen("create volume"); err = copy_from_user(&req, argp, sizeof(struct ubi_mkvol_req)); if (err) { err = -EFAULT; break; } err = verify_mkvol_req(ubi, &req); if (err) break; mutex_lock(&ubi->device_mutex); err = ubi_create_volume(ubi, &req); mutex_unlock(&ubi->device_mutex); if (err) break; err = put_user(req.vol_id, (__user int32_t *)argp); if (err) err = -EFAULT; break; } /* Remove volume command */ case UBI_IOCRMVOL: { int vol_id; dbg_gen("remove volume"); err = get_user(vol_id, (__user int32_t *)argp); if (err) { err = -EFAULT; break; } desc = ubi_open_volume(ubi->ubi_num, vol_id, UBI_EXCLUSIVE); if (IS_ERR(desc)) { err = PTR_ERR(desc); break; } mutex_lock(&ubi->device_mutex); err = ubi_remove_volume(desc, 0); mutex_unlock(&ubi->device_mutex); /* * The volume is deleted (unless an error occurred), and the * 'struct ubi_volume' object will be freed when * 'ubi_close_volume()' will call 'put_device()'. 
*/ ubi_close_volume(desc); break; } /* Re-size volume command */ case UBI_IOCRSVOL: { int pebs; struct ubi_rsvol_req req; dbg_gen("re-size volume"); err = copy_from_user(&req, argp, sizeof(struct ubi_rsvol_req)); if (err) { err = -EFAULT; break; } err = verify_rsvol_req(ubi, &req); if (err) break; desc = ubi_open_volume(ubi->ubi_num, req.vol_id, UBI_EXCLUSIVE); if (IS_ERR(desc)) { err = PTR_ERR(desc); break; } pebs = div_u64(req.bytes + desc->vol->usable_leb_size - 1, desc->vol->usable_leb_size); mutex_lock(&ubi->device_mutex); err = ubi_resize_volume(desc, pebs); mutex_unlock(&ubi->device_mutex); ubi_close_volume(desc); break; } /* Re-name volumes command */ case UBI_IOCRNVOL: { struct ubi_rnvol_req *req; dbg_gen("re-name volumes"); req = kmalloc(sizeof(struct ubi_rnvol_req), GFP_KERNEL); if (!req) { err = -ENOMEM; break; } err = copy_from_user(req, argp, sizeof(struct ubi_rnvol_req)); if (err) { err = -EFAULT; kfree(req); break; } err = rename_volumes(ubi, req); kfree(req); break; } /* Check a specific PEB for bitflips and scrub it if needed */ case UBI_IOCRPEB: { int pnum; err = get_user(pnum, (__user int32_t *)argp); if (err) { err = -EFAULT; break; } err = ubi_bitflip_check(ubi, pnum, 0); break; } /* Force scrubbing for a specific PEB */ case UBI_IOCSPEB: { int pnum; err = get_user(pnum, (__user int32_t *)argp); if (err) { err = -EFAULT; break; } err = ubi_bitflip_check(ubi, pnum, 1); break; } case UBI_IOCECNFO: { err = ubi_get_ec_info(ubi, argp); break; } default: err = -ENOTTY; break; } ubi_put_device(ubi); return err; } static long ctrl_cdev_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { int err = 0; void __user *argp = (void __user *)arg; if (!capable(CAP_SYS_RESOURCE)) return -EPERM; switch (cmd) { /* Attach an MTD device command */ case UBI_IOCATT: { struct ubi_attach_req req; struct mtd_info *mtd; dbg_gen("attach MTD device"); err = copy_from_user(&req, argp, sizeof(struct ubi_attach_req)); if (err) { err = -EFAULT; break; } if (req.mtd_num < 0 || (req.ubi_num < 0 && req.ubi_num != UBI_DEV_NUM_AUTO)) { err = -EINVAL; break; } mtd = get_mtd_device(NULL, req.mtd_num); if (IS_ERR(mtd)) { err = PTR_ERR(mtd); break; } /* * Note, further request verification is done by * 'ubi_attach_mtd_dev()'. 
*/ mutex_lock(&ubi_devices_mutex); err = ubi_attach_mtd_dev(mtd, req.ubi_num, req.vid_hdr_offset, req.max_beb_per1024, !!req.disable_fm, !!req.need_resv_pool); mutex_unlock(&ubi_devices_mutex); if (err < 0) put_mtd_device(mtd); else /* @err contains UBI device number */ err = put_user(err, (__user int32_t *)argp); break; } /* Detach an MTD device command */ case UBI_IOCDET: { int ubi_num; dbg_gen("detach MTD device"); err = get_user(ubi_num, (__user int32_t *)argp); if (err) { err = -EFAULT; break; } mutex_lock(&ubi_devices_mutex); err = ubi_detach_mtd_dev(ubi_num, 0); mutex_unlock(&ubi_devices_mutex); break; } default: err = -ENOTTY; break; } return err; } /* UBI volume character device operations */ const struct file_operations ubi_vol_cdev_operations = { .owner = THIS_MODULE, .open = vol_cdev_open, .release = vol_cdev_release, .llseek = vol_cdev_llseek, .read = vol_cdev_read, .write = vol_cdev_write, .fsync = vol_cdev_fsync, .unlocked_ioctl = vol_cdev_ioctl, .compat_ioctl = compat_ptr_ioctl, }; /* UBI character device operations */ const struct file_operations ubi_cdev_operations = { .owner = THIS_MODULE, .unlocked_ioctl = ubi_cdev_ioctl, .compat_ioctl = compat_ptr_ioctl, }; /* UBI control character device operations */ const struct file_operations ubi_ctrl_cdev_operations = { .owner = THIS_MODULE, .unlocked_ioctl = ctrl_cdev_ioctl, .compat_ioctl = compat_ptr_ioctl, }; |
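/*
 * Hedged user-space sketch, not part of cdev.c above: creating a dynamic
 * volume through the UBI_IOCMKVOL ioctl handled by ubi_cdev_ioctl(), roughly
 * what ubimkvol(8) does. The device path, size and volume name are
 * placeholders.
 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/ubi-user.h>

static int example_mkvol(void)
{
	struct ubi_mkvol_req req;
	int fd, err;

	fd = open("/dev/ubi0", O_RDWR);		/* UBI device node, not a volume */
	if (fd < 0)
		return -1;

	memset(&req, 0, sizeof(req));
	req.vol_id = UBI_VOL_NUM_AUTO;		/* let UBI pick the volume id */
	req.alignment = 1;
	req.bytes = 16 * 1024 * 1024;		/* placeholder size: 16 MiB */
	req.vol_type = UBI_DYNAMIC_VOLUME;
	strcpy(req.name, "example");
	req.name_len = strlen(req.name);

	err = ioctl(fd, UBI_IOCMKVOL, &req);	/* on success req.vol_id holds the new id */
	close(fd);
	return err;
}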
/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM vb2

#if !defined(_TRACE_VB2_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_VB2_H

#include <linux/tracepoint.h>
#include <media/videobuf2-core.h>

DECLARE_EVENT_CLASS(vb2_event_class,
	TP_PROTO(struct vb2_queue *q, struct vb2_buffer *vb),
	TP_ARGS(q, vb),

	TP_STRUCT__entry(
		__field(void *, owner)
		__field(u32, queued_count)
		__field(int, owned_by_drv_count)
		__field(u32, index)
		__field(u32, type)
		__field(u32, bytesused)
		__field(u64, timestamp)
	),

	TP_fast_assign(
		__entry->owner = q->owner;
		__entry->queued_count = q->queued_count;
		__entry->owned_by_drv_count =
			atomic_read(&q->owned_by_drv_count);
		__entry->index = vb->index;
		__entry->type = vb->type;
		__entry->bytesused = vb->planes[0].bytesused;
		__entry->timestamp = vb->timestamp;
	),

	TP_printk("owner = %p, queued = %u, owned_by_drv = %d, index = %u, "
		  "type = %u, bytesused = %u, timestamp = %llu",
		  __entry->owner, __entry->queued_count,
		  __entry->owned_by_drv_count, __entry->index, __entry->type,
		  __entry->bytesused, __entry->timestamp
	)
)

DEFINE_EVENT(vb2_event_class, vb2_buf_done,
	TP_PROTO(struct vb2_queue *q, struct vb2_buffer *vb),
	TP_ARGS(q, vb)
);

DEFINE_EVENT(vb2_event_class, vb2_buf_queue,
	TP_PROTO(struct vb2_queue *q, struct vb2_buffer *vb),
	TP_ARGS(q, vb)
);

DEFINE_EVENT(vb2_event_class, vb2_dqbuf,
	TP_PROTO(struct vb2_queue *q, struct vb2_buffer *vb),
	TP_ARGS(q, vb)
);

DEFINE_EVENT(vb2_event_class, vb2_qbuf,
	TP_PROTO(struct vb2_queue *q, struct vb2_buffer *vb),
	TP_ARGS(q, vb)
);

#endif /* if !defined(_TRACE_VB2_H) || defined(TRACE_HEADER_MULTI_READ) */

/* This part must be outside protection */
#include <trace/define_trace.h>
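/*
 * Hedged usage note, not part of the trace header above: each DEFINE_EVENT()
 * generates a trace_<name>() call that vb2 core invokes at the matching
 * points; the wrapper below only illustrates the generated signature, and
 * example_trace_done() is a hypothetical name. At run time the events can be
 * enabled through tracefs, e.g. /sys/kernel/tracing/events/vb2/vb2_buf_done/enable.
 */
#include <trace/events/vb2.h>

static inline void example_trace_done(struct vb2_queue *q, struct vb2_buffer *vb)
{
	trace_vb2_buf_done(q, vb);	/* from DEFINE_EVENT(vb2_event_class, vb2_buf_done, ...) */
}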
| 4 3 1 4 1 2 1 2 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Misc and compatibility things * Copyright (c) by Jaroslav Kysela <perex@perex.cz> */ #include <linux/init.h> #include <linux/export.h> #include <linux/moduleparam.h> #include <linux/time.h> #include <linux/slab.h> #include <linux/ioport.h> #include <linux/fs.h> #include <sound/core.h> void release_and_free_resource(struct resource *res) { if (res) { release_resource(res); kfree(res); } } EXPORT_SYMBOL(release_and_free_resource); #ifdef CONFIG_PCI #include <linux/pci.h> /** * snd_pci_quirk_lookup_id - look up a PCI SSID quirk list * @vendor: PCI SSV id * @device: PCI SSD id * @list: quirk list, terminated by a null entry * * Look through the given quirk list and finds a matching entry * with the same PCI SSID. When subdevice is 0, all subdevice * values may match. * * Returns the matched entry pointer, or NULL if nothing matched. */ const struct snd_pci_quirk * snd_pci_quirk_lookup_id(u16 vendor, u16 device, const struct snd_pci_quirk *list) { const struct snd_pci_quirk *q; for (q = list; q->subvendor || q->subdevice; q++) { if (q->subvendor != vendor) continue; if (!q->subdevice || (device & q->subdevice_mask) == q->subdevice) return q; } return NULL; } EXPORT_SYMBOL(snd_pci_quirk_lookup_id); /** * snd_pci_quirk_lookup - look up a PCI SSID quirk list * @pci: pci_dev handle * @list: quirk list, terminated by a null entry * * Look through the given quirk list and finds a matching entry * with the same PCI SSID. When subdevice is 0, all subdevice * values may match. * * Returns the matched entry pointer, or NULL if nothing matched. */ const struct snd_pci_quirk * snd_pci_quirk_lookup(struct pci_dev *pci, const struct snd_pci_quirk *list) { if (!pci) return NULL; return snd_pci_quirk_lookup_id(pci->subsystem_vendor, pci->subsystem_device, list); } EXPORT_SYMBOL(snd_pci_quirk_lookup); #endif /* * Deferred async signal helpers * * Below are a few helper functions to wrap the async signal handling * in the deferred work. The main purpose is to avoid the messy deadlock * around tasklist_lock and co at the kill_fasync() invocation. * fasync_helper() and kill_fasync() are replaced with snd_fasync_helper() * and snd_kill_fasync(), respectively. In addition, snd_fasync_free() has * to be called at releasing the relevant file object. 
*/ struct snd_fasync { struct fasync_struct *fasync; int signal; int poll; int on; struct list_head list; }; static DEFINE_SPINLOCK(snd_fasync_lock); static LIST_HEAD(snd_fasync_list); static void snd_fasync_work_fn(struct work_struct *work) { struct snd_fasync *fasync; spin_lock_irq(&snd_fasync_lock); while (!list_empty(&snd_fasync_list)) { fasync = list_first_entry(&snd_fasync_list, struct snd_fasync, list); list_del_init(&fasync->list); spin_unlock_irq(&snd_fasync_lock); if (fasync->on) kill_fasync(&fasync->fasync, fasync->signal, fasync->poll); spin_lock_irq(&snd_fasync_lock); } spin_unlock_irq(&snd_fasync_lock); } static DECLARE_WORK(snd_fasync_work, snd_fasync_work_fn); int snd_fasync_helper(int fd, struct file *file, int on, struct snd_fasync **fasyncp) { struct snd_fasync *fasync = NULL; if (on) { fasync = kzalloc(sizeof(*fasync), GFP_KERNEL); if (!fasync) return -ENOMEM; INIT_LIST_HEAD(&fasync->list); } scoped_guard(spinlock_irq, &snd_fasync_lock) { if (*fasyncp) { kfree(fasync); fasync = *fasyncp; } else { if (!fasync) return 0; *fasyncp = fasync; } fasync->on = on; } return fasync_helper(fd, file, on, &fasync->fasync); } EXPORT_SYMBOL_GPL(snd_fasync_helper); void snd_kill_fasync(struct snd_fasync *fasync, int signal, int poll) { if (!fasync || !fasync->on) return; guard(spinlock_irqsave)(&snd_fasync_lock); fasync->signal = signal; fasync->poll = poll; list_move(&fasync->list, &snd_fasync_list); schedule_work(&snd_fasync_work); } EXPORT_SYMBOL_GPL(snd_kill_fasync); void snd_fasync_free(struct snd_fasync *fasync) { if (!fasync) return; fasync->on = 0; flush_work(&snd_fasync_work); kfree(fasync); } EXPORT_SYMBOL_GPL(snd_fasync_free); |
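/*
 * Hedged usage sketch, not part of misc.c above: how a sound core user would
 * switch from fasync_helper()/kill_fasync() to the deferred helpers defined
 * there. The example_* names and structure are illustrative only.
 */
#include <sound/core.h>

struct example_dev {
	struct snd_fasync *fasync;
};

static int example_fasync(int fd, struct file *file, int on)
{
	struct example_dev *dev = file->private_data;

	return snd_fasync_helper(fd, file, on, &dev->fasync);
}

/* safe to call from atomic context, e.g. an interrupt handler */
static void example_notify(struct example_dev *dev)
{
	snd_kill_fasync(dev->fasync, SIGIO, POLL_IN);
}

/* must be called when the associated file object is released */
static void example_release(struct example_dev *dev)
{
	snd_fasync_free(dev->fasync);
}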
/**
 * css_get - obtain a reference on the specified css
 * @css: target css
 *
 * The caller must already have a reference.
 */
CGROUP_REF_FN_ATTRS
void css_get(struct cgroup_subsys_state *css)
{
	if (!(css->flags & CSS_NO_REF))
		percpu_ref_get(&css->refcnt);
}
CGROUP_REF_EXPORT(css_get)

/**
 * css_get_many - obtain references on the specified css
 * @css: target css
 * @n: number of references to get
 *
 * The caller must already have a reference.
 */
CGROUP_REF_FN_ATTRS
void css_get_many(struct cgroup_subsys_state *css, unsigned int n)
{
	if (!(css->flags & CSS_NO_REF))
		percpu_ref_get_many(&css->refcnt, n);
}
CGROUP_REF_EXPORT(css_get_many)

/**
 * css_tryget - try to obtain a reference on the specified css
 * @css: target css
 *
 * Obtain a reference on @css unless it already has reached zero and is
 * being released. This function doesn't care whether @css is on or
 * offline. The caller naturally needs to ensure that @css is accessible
 * but doesn't have to be holding a reference on it - IOW, RCU protected
 * access is good enough for this function. Returns %true if a reference
 * count was successfully obtained; %false otherwise.
 */
CGROUP_REF_FN_ATTRS
bool css_tryget(struct cgroup_subsys_state *css)
{
	if (!(css->flags & CSS_NO_REF))
		return percpu_ref_tryget(&css->refcnt);
	return true;
}
CGROUP_REF_EXPORT(css_tryget)

/**
 * css_tryget_online - try to obtain a reference on the specified css if online
 * @css: target css
 *
 * Obtain a reference on @css if it's online. The caller naturally needs
 * to ensure that @css is accessible but doesn't have to be holding a
 * reference on it - IOW, RCU protected access is good enough for this
 * function. Returns %true if a reference count was successfully obtained;
 * %false otherwise.
 */
CGROUP_REF_FN_ATTRS
bool css_tryget_online(struct cgroup_subsys_state *css)
{
	if (!(css->flags & CSS_NO_REF))
		return percpu_ref_tryget_live(&css->refcnt);
	return true;
}
CGROUP_REF_EXPORT(css_tryget_online)

/**
 * css_put - put a css reference
 * @css: target css
 *
 * Put a reference obtained via css_get() and css_tryget_online().
 */
CGROUP_REF_FN_ATTRS
void css_put(struct cgroup_subsys_state *css)
{
	if (!(css->flags & CSS_NO_REF))
		percpu_ref_put(&css->refcnt);
}
CGROUP_REF_EXPORT(css_put)

/**
 * css_put_many - put css references
 * @css: target css
 * @n: number of references to put
 *
 * Put references obtained via css_get() and css_tryget_online().
 */
CGROUP_REF_FN_ATTRS
void css_put_many(struct cgroup_subsys_state *css, unsigned int n)
{
	if (!(css->flags & CSS_NO_REF))
		percpu_ref_put_many(&css->refcnt, n);
}
CGROUP_REF_EXPORT(css_put_many)
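/*
 * Hedged usage sketch, not part of the file above: the common pattern for
 * taking a css reference found under RCU with the helpers defined there.
 * example_lookup_css() is a hypothetical RCU-protected lookup.
 */
#include <linux/cgroup.h>
#include <linux/rcupdate.h>

static struct cgroup_subsys_state *example_lookup_css(void);

static struct cgroup_subsys_state *example_get_css(void)
{
	struct cgroup_subsys_state *css;

	rcu_read_lock();
	css = example_lookup_css();
	/* the css may be concurrently going offline or being released */
	if (css && !css_tryget_online(css))
		css = NULL;
	rcu_read_unlock();

	return css;		/* caller drops the reference with css_put() */
}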
// SPDX-License-Identifier: GPL-2.0-only
/*
 * IBSS mode implementation
 * Copyright 2003-2008, Jouni Malinen <j@w1.fi>
 * Copyright 2004, Instant802 Networks, Inc.
 * Copyright 2005, Devicescape Software, Inc.
 * Copyright 2006-2007 Jiri Benc <jbenc@suse.cz>
 * Copyright 2007, Michael Wu <flamingice@sourmilk.net>
 * Copyright 2009, Johannes Berg <johannes@sipsolutions.net>
 * Copyright 2013-2014 Intel Mobile Communications GmbH
 * Copyright(c) 2016 Intel Deutschland GmbH
 * Copyright(c) 2018-2025 Intel Corporation
 */

#include <linux/delay.h>
#include <linux/slab.h>
#include <linux/if_ether.h>
#include <linux/skbuff.h>
#include <linux/if_arp.h>
#include <linux/etherdevice.h>
#include <linux/rtnetlink.h>
#include <net/mac80211.h>

#include "ieee80211_i.h"
#include "driver-ops.h"
#include "rate.h"

#define IEEE80211_SCAN_INTERVAL (2 * HZ)
#define IEEE80211_IBSS_JOIN_TIMEOUT (7 * HZ)

#define IEEE80211_IBSS_MERGE_INTERVAL (30 * HZ)
#define IEEE80211_IBSS_INACTIVITY_LIMIT (60 * HZ)
#define IEEE80211_IBSS_RSN_INACTIVITY_LIMIT (10 * HZ)

#define IEEE80211_IBSS_MAX_STA_ENTRIES 128

static struct beacon_data *
ieee80211_ibss_build_presp(struct ieee80211_sub_if_data *sdata,
			   const int beacon_int, const u32 basic_rates,
			   const u16 capability, u64 tsf,
			   struct cfg80211_chan_def *chandef,
			   bool *have_higher_than_11mbit,
			   struct cfg80211_csa_settings *csa_settings)
{
	struct ieee80211_if_ibss *ifibss = &sdata->u.ibss;
	struct ieee80211_local *local = sdata->local;
	int rates_n = 0, i, ri;
	struct ieee80211_mgmt *mgmt;
	u8 *pos;
	struct ieee80211_supported_band *sband;
	u32 rates = 0, rates_added = 0;
	struct beacon_data *presp;
	int frame_len;

	/* Build IBSS probe response */
	frame_len = sizeof(struct ieee80211_hdr_3addr) +
		    12 /* struct ieee80211_mgmt.u.beacon */ +
		    2 + IEEE80211_MAX_SSID_LEN /* max SSID */ +
		    2 + 8 /* max Supported Rates */ +
		    3 /* max DS params */ +
		    4 /* IBSS params */ +
		    5 /* Channel Switch Announcement */ +
		    2 + (IEEE80211_MAX_SUPP_RATES - 8) +
		    2 + sizeof(struct ieee80211_ht_cap) +
		    2 + sizeof(struct ieee80211_ht_operation) +
		    2 + sizeof(struct ieee80211_vht_cap) +
		    2 + sizeof(struct ieee80211_vht_operation) +
		    ifibss->ie_len;
	presp = kzalloc(sizeof(*presp) + frame_len, GFP_KERNEL);
	if (!presp)
		return NULL;

	presp->head = (void *)(presp + 1);

	mgmt = (void *) presp->head;
	mgmt->frame_control = cpu_to_le16(IEEE80211_FTYPE_MGMT |
					  IEEE80211_STYPE_PROBE_RESP);
	eth_broadcast_addr(mgmt->da);
	memcpy(mgmt->sa, sdata->vif.addr, ETH_ALEN);
	memcpy(mgmt->bssid, ifibss->bssid, ETH_ALEN);
	mgmt->u.beacon.beacon_int = cpu_to_le16(beacon_int);
mgmt->u.beacon.timestamp = cpu_to_le64(tsf); mgmt->u.beacon.capab_info = cpu_to_le16(capability); pos = (u8 *)mgmt + offsetof(struct ieee80211_mgmt, u.beacon.variable); *pos++ = WLAN_EID_SSID; *pos++ = ifibss->ssid_len; memcpy(pos, ifibss->ssid, ifibss->ssid_len); pos += ifibss->ssid_len; sband = local->hw.wiphy->bands[chandef->chan->band]; rates_n = 0; if (have_higher_than_11mbit) *have_higher_than_11mbit = false; for (i = 0; i < sband->n_bitrates; i++) { if (sband->bitrates[i].bitrate > 110 && have_higher_than_11mbit) *have_higher_than_11mbit = true; rates |= BIT(i); rates_n++; } *pos++ = WLAN_EID_SUPP_RATES; *pos++ = min_t(int, 8, rates_n); for (ri = 0; ri < sband->n_bitrates; ri++) { int rate = DIV_ROUND_UP(sband->bitrates[ri].bitrate, 5); u8 basic = 0; if (!(rates & BIT(ri))) continue; if (basic_rates & BIT(ri)) basic = 0x80; *pos++ = basic | (u8) rate; if (++rates_added == 8) { ri++; /* continue at next rate for EXT_SUPP_RATES */ break; } } if (sband->band == NL80211_BAND_2GHZ) { *pos++ = WLAN_EID_DS_PARAMS; *pos++ = 1; *pos++ = ieee80211_frequency_to_channel( chandef->chan->center_freq); } *pos++ = WLAN_EID_IBSS_PARAMS; *pos++ = 2; /* FIX: set ATIM window based on scan results */ *pos++ = 0; *pos++ = 0; if (csa_settings) { *pos++ = WLAN_EID_CHANNEL_SWITCH; *pos++ = 3; *pos++ = csa_settings->block_tx ? 1 : 0; *pos++ = ieee80211_frequency_to_channel( csa_settings->chandef.chan->center_freq); presp->cntdwn_counter_offsets[0] = (pos - presp->head); *pos++ = csa_settings->count; presp->cntdwn_current_counter = csa_settings->count; } /* put the remaining rates in WLAN_EID_EXT_SUPP_RATES */ if (rates_n > 8) { *pos++ = WLAN_EID_EXT_SUPP_RATES; *pos++ = rates_n - 8; for (; ri < sband->n_bitrates; ri++) { int rate = DIV_ROUND_UP(sband->bitrates[ri].bitrate, 5); u8 basic = 0; if (!(rates & BIT(ri))) continue; if (basic_rates & BIT(ri)) basic = 0x80; *pos++ = basic | (u8) rate; } } if (ifibss->ie_len) { memcpy(pos, ifibss->ie, ifibss->ie_len); pos += ifibss->ie_len; } /* add HT capability and information IEs */ if (chandef->width != NL80211_CHAN_WIDTH_20_NOHT && chandef->width != NL80211_CHAN_WIDTH_5 && chandef->width != NL80211_CHAN_WIDTH_10 && sband->ht_cap.ht_supported) { struct ieee80211_sta_ht_cap ht_cap; memcpy(&ht_cap, &sband->ht_cap, sizeof(ht_cap)); ieee80211_apply_htcap_overrides(sdata, &ht_cap); pos = ieee80211_ie_build_ht_cap(pos, &ht_cap, ht_cap.cap); /* * Note: According to 802.11n-2009 9.13.3.1, HT Protection * field and RIFS Mode are reserved in IBSS mode, therefore * keep them at 0 */ pos = ieee80211_ie_build_ht_oper(pos, &sband->ht_cap, chandef, 0, false); /* add VHT capability and information IEs */ if (chandef->width != NL80211_CHAN_WIDTH_20 && chandef->width != NL80211_CHAN_WIDTH_40 && sband->vht_cap.vht_supported) { pos = ieee80211_ie_build_vht_cap(pos, &sband->vht_cap, sband->vht_cap.cap); pos = ieee80211_ie_build_vht_oper(pos, &sband->vht_cap, chandef); } } if (local->hw.queues >= IEEE80211_NUM_ACS) pos = ieee80211_add_wmm_info_ie(pos, 0); /* U-APSD not in use */ presp->head_len = pos - presp->head; if (WARN_ON(presp->head_len > frame_len)) goto error; return presp; error: kfree(presp); return NULL; } static void __ieee80211_sta_join_ibss(struct ieee80211_sub_if_data *sdata, const u8 *bssid, const int beacon_int, struct cfg80211_chan_def *req_chandef, const u32 basic_rates, const u16 capability, u64 tsf, bool creator) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; struct ieee80211_local *local = sdata->local; struct ieee80211_mgmt *mgmt; struct cfg80211_bss *bss; 
u64 bss_change; struct ieee80211_chan_req chanreq = {}; struct ieee80211_channel *chan; struct beacon_data *presp; struct cfg80211_inform_bss bss_meta = {}; bool have_higher_than_11mbit; bool radar_required; int err; lockdep_assert_wiphy(local->hw.wiphy); /* Reset own TSF to allow time synchronization work. */ drv_reset_tsf(local, sdata); if (!ether_addr_equal(ifibss->bssid, bssid)) sta_info_flush(sdata, -1); /* if merging, indicate to driver that we leave the old IBSS */ if (sdata->vif.cfg.ibss_joined) { sdata->vif.cfg.ibss_joined = false; sdata->vif.cfg.ibss_creator = false; sdata->vif.bss_conf.enable_beacon = false; netif_carrier_off(sdata->dev); synchronize_net(); ieee80211_bss_info_change_notify(sdata, BSS_CHANGED_IBSS | BSS_CHANGED_BEACON_ENABLED); drv_leave_ibss(local, sdata); } presp = sdata_dereference(ifibss->presp, sdata); RCU_INIT_POINTER(ifibss->presp, NULL); if (presp) kfree_rcu(presp, rcu_head); /* make a copy of the chandef, it could be modified below. */ chanreq.oper = *req_chandef; chan = chanreq.oper.chan; if (!cfg80211_reg_can_beacon(local->hw.wiphy, &chanreq.oper, NL80211_IFTYPE_ADHOC)) { if (chanreq.oper.width == NL80211_CHAN_WIDTH_5 || chanreq.oper.width == NL80211_CHAN_WIDTH_10 || chanreq.oper.width == NL80211_CHAN_WIDTH_20_NOHT || chanreq.oper.width == NL80211_CHAN_WIDTH_20) { sdata_info(sdata, "Failed to join IBSS, beacons forbidden\n"); return; } chanreq.oper.width = NL80211_CHAN_WIDTH_20; chanreq.oper.center_freq1 = chan->center_freq; /* check again for downgraded chandef */ if (!cfg80211_reg_can_beacon(local->hw.wiphy, &chanreq.oper, NL80211_IFTYPE_ADHOC)) { sdata_info(sdata, "Failed to join IBSS, beacons forbidden\n"); return; } } err = cfg80211_chandef_dfs_required(sdata->local->hw.wiphy, &chanreq.oper, NL80211_IFTYPE_ADHOC); if (err < 0) { sdata_info(sdata, "Failed to join IBSS, invalid chandef\n"); return; } if (err > 0 && !ifibss->userspace_handles_dfs) { sdata_info(sdata, "Failed to join IBSS, DFS channel without control program\n"); return; } radar_required = err; if (ieee80211_link_use_channel(&sdata->deflink, &chanreq, ifibss->fixed_channel ? IEEE80211_CHANCTX_SHARED : IEEE80211_CHANCTX_EXCLUSIVE)) { sdata_info(sdata, "Failed to join IBSS, no channel context\n"); return; } sdata->deflink.radar_required = radar_required; memcpy(ifibss->bssid, bssid, ETH_ALEN); presp = ieee80211_ibss_build_presp(sdata, beacon_int, basic_rates, capability, tsf, &chanreq.oper, &have_higher_than_11mbit, NULL); if (!presp) return; rcu_assign_pointer(ifibss->presp, presp); mgmt = (void *)presp->head; sdata->vif.bss_conf.enable_beacon = true; sdata->vif.bss_conf.beacon_int = beacon_int; sdata->vif.bss_conf.basic_rates = basic_rates; sdata->vif.cfg.ssid_len = ifibss->ssid_len; memcpy(sdata->vif.cfg.ssid, ifibss->ssid, ifibss->ssid_len); bss_change = BSS_CHANGED_BEACON_INT; bss_change |= ieee80211_reset_erp_info(sdata); bss_change |= BSS_CHANGED_BSSID; bss_change |= BSS_CHANGED_BEACON; bss_change |= BSS_CHANGED_BEACON_ENABLED; bss_change |= BSS_CHANGED_BASIC_RATES; bss_change |= BSS_CHANGED_HT; bss_change |= BSS_CHANGED_IBSS; bss_change |= BSS_CHANGED_SSID; /* * In 5 GHz/802.11a, we can always use short slot time. * (IEEE 802.11-2012 18.3.8.7) * * In 2.4GHz, we must always use long slots in IBSS for compatibility * reasons. * (IEEE 802.11-2012 19.4.5) * * HT follows these specifications (IEEE 802.11-2012 20.3.18) */ sdata->vif.bss_conf.use_short_slot = chan->band == NL80211_BAND_5GHZ; bss_change |= BSS_CHANGED_ERP_SLOT; /* cf. 
IEEE 802.11 9.2.12 */ sdata->deflink.operating_11g_mode = chan->band == NL80211_BAND_2GHZ && have_higher_than_11mbit; ieee80211_set_wmm_default(&sdata->deflink, true, false); sdata->vif.cfg.ibss_joined = true; sdata->vif.cfg.ibss_creator = creator; err = drv_join_ibss(local, sdata); if (err) { sdata->vif.cfg.ibss_joined = false; sdata->vif.cfg.ibss_creator = false; sdata->vif.bss_conf.enable_beacon = false; sdata->vif.cfg.ssid_len = 0; RCU_INIT_POINTER(ifibss->presp, NULL); kfree_rcu(presp, rcu_head); ieee80211_link_release_channel(&sdata->deflink); sdata_info(sdata, "Failed to join IBSS, driver failure: %d\n", err); return; } ieee80211_bss_info_change_notify(sdata, bss_change); ifibss->state = IEEE80211_IBSS_MLME_JOINED; mod_timer(&ifibss->timer, round_jiffies(jiffies + IEEE80211_IBSS_MERGE_INTERVAL)); bss_meta.chan = chan; bss = cfg80211_inform_bss_frame_data(local->hw.wiphy, &bss_meta, mgmt, presp->head_len, GFP_KERNEL); cfg80211_put_bss(local->hw.wiphy, bss); netif_carrier_on(sdata->dev); cfg80211_ibss_joined(sdata->dev, ifibss->bssid, chan, GFP_KERNEL); } static void ieee80211_sta_join_ibss(struct ieee80211_sub_if_data *sdata, struct ieee80211_bss *bss) { struct cfg80211_bss *cbss = container_of((void *)bss, struct cfg80211_bss, priv); struct ieee80211_supported_band *sband; struct cfg80211_chan_def chandef; u32 basic_rates; int i, j; u16 beacon_int = cbss->beacon_interval; const struct cfg80211_bss_ies *ies; enum nl80211_channel_type chan_type; u64 tsf; lockdep_assert_wiphy(sdata->local->hw.wiphy); if (beacon_int < 10) beacon_int = 10; switch (sdata->u.ibss.chandef.width) { case NL80211_CHAN_WIDTH_20_NOHT: case NL80211_CHAN_WIDTH_20: case NL80211_CHAN_WIDTH_40: chan_type = cfg80211_get_chandef_type(&sdata->u.ibss.chandef); cfg80211_chandef_create(&chandef, cbss->channel, chan_type); break; case NL80211_CHAN_WIDTH_5: case NL80211_CHAN_WIDTH_10: cfg80211_chandef_create(&chandef, cbss->channel, NL80211_CHAN_NO_HT); chandef.width = sdata->u.ibss.chandef.width; break; case NL80211_CHAN_WIDTH_80: case NL80211_CHAN_WIDTH_80P80: case NL80211_CHAN_WIDTH_160: chandef = sdata->u.ibss.chandef; chandef.chan = cbss->channel; break; default: /* fall back to 20 MHz for unsupported modes */ cfg80211_chandef_create(&chandef, cbss->channel, NL80211_CHAN_NO_HT); break; } sband = sdata->local->hw.wiphy->bands[cbss->channel->band]; basic_rates = 0; for (i = 0; i < bss->supp_rates_len; i++) { int rate = bss->supp_rates[i] & 0x7f; bool is_basic = !!(bss->supp_rates[i] & 0x80); for (j = 0; j < sband->n_bitrates; j++) { int brate; brate = DIV_ROUND_UP(sband->bitrates[j].bitrate, 5); if (brate == rate) { if (is_basic) basic_rates |= BIT(j); break; } } } rcu_read_lock(); ies = rcu_dereference(cbss->ies); tsf = ies->tsf; rcu_read_unlock(); __ieee80211_sta_join_ibss(sdata, cbss->bssid, beacon_int, &chandef, basic_rates, cbss->capability, tsf, false); } int ieee80211_ibss_csa_beacon(struct ieee80211_sub_if_data *sdata, struct cfg80211_csa_settings *csa_settings, u64 *changed) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; struct beacon_data *presp, *old_presp; struct cfg80211_bss *cbss; const struct cfg80211_bss_ies *ies; u16 capability = WLAN_CAPABILITY_IBSS; u64 tsf; lockdep_assert_wiphy(sdata->local->hw.wiphy); if (ifibss->privacy) capability |= WLAN_CAPABILITY_PRIVACY; cbss = cfg80211_get_bss(sdata->local->hw.wiphy, ifibss->chandef.chan, ifibss->bssid, ifibss->ssid, ifibss->ssid_len, IEEE80211_BSS_TYPE_IBSS, IEEE80211_PRIVACY(ifibss->privacy)); if (unlikely(!cbss)) return -EINVAL; rcu_read_lock(); ies = 
rcu_dereference(cbss->ies); tsf = ies->tsf; rcu_read_unlock(); cfg80211_put_bss(sdata->local->hw.wiphy, cbss); old_presp = sdata_dereference(ifibss->presp, sdata); presp = ieee80211_ibss_build_presp(sdata, sdata->vif.bss_conf.beacon_int, sdata->vif.bss_conf.basic_rates, capability, tsf, &ifibss->chandef, NULL, csa_settings); if (!presp) return -ENOMEM; rcu_assign_pointer(ifibss->presp, presp); if (old_presp) kfree_rcu(old_presp, rcu_head); *changed |= BSS_CHANGED_BEACON; return 0; } int ieee80211_ibss_finish_csa(struct ieee80211_sub_if_data *sdata, u64 *changed) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; struct cfg80211_bss *cbss; lockdep_assert_wiphy(sdata->local->hw.wiphy); /* When not connected/joined, sending CSA doesn't make sense. */ if (ifibss->state != IEEE80211_IBSS_MLME_JOINED) return -ENOLINK; /* update cfg80211 bss information with the new channel */ if (!is_zero_ether_addr(ifibss->bssid)) { cbss = cfg80211_get_bss(sdata->local->hw.wiphy, ifibss->chandef.chan, ifibss->bssid, ifibss->ssid, ifibss->ssid_len, IEEE80211_BSS_TYPE_IBSS, IEEE80211_PRIVACY(ifibss->privacy)); /* XXX: should not really modify cfg80211 data */ if (cbss) { cbss->channel = sdata->deflink.csa.chanreq.oper.chan; cfg80211_put_bss(sdata->local->hw.wiphy, cbss); } } ifibss->chandef = sdata->deflink.csa.chanreq.oper; /* generate the beacon */ return ieee80211_ibss_csa_beacon(sdata, NULL, changed); } void ieee80211_ibss_stop(struct ieee80211_sub_if_data *sdata) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; wiphy_work_cancel(sdata->local->hw.wiphy, &ifibss->csa_connection_drop_work); } static struct sta_info *ieee80211_ibss_finish_sta(struct sta_info *sta) __acquires(RCU) { struct ieee80211_sub_if_data *sdata = sta->sdata; u8 addr[ETH_ALEN]; memcpy(addr, sta->sta.addr, ETH_ALEN); ibss_dbg(sdata, "Adding new IBSS station %pM\n", addr); sta_info_pre_move_state(sta, IEEE80211_STA_AUTH); sta_info_pre_move_state(sta, IEEE80211_STA_ASSOC); /* authorize the station only if the network is not RSN protected. If * not wait for the userspace to authorize it */ if (!sta->sdata->u.ibss.control_port) sta_info_pre_move_state(sta, IEEE80211_STA_AUTHORIZED); rate_control_rate_init(&sta->deflink); /* If it fails, maybe we raced another insertion? */ if (sta_info_insert_rcu(sta)) return sta_info_get(sdata, addr); return sta; } static struct sta_info * ieee80211_ibss_add_sta(struct ieee80211_sub_if_data *sdata, const u8 *bssid, const u8 *addr, u32 supp_rates) __acquires(RCU) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; struct ieee80211_local *local = sdata->local; struct sta_info *sta; struct ieee80211_chanctx_conf *chanctx_conf; struct ieee80211_supported_band *sband; int band; /* * XXX: Consider removing the least recently used entry and * allow new one to be added. 
*/ if (local->num_sta >= IEEE80211_IBSS_MAX_STA_ENTRIES) { net_info_ratelimited("%s: No room for a new IBSS STA entry %pM\n", sdata->name, addr); rcu_read_lock(); return NULL; } if (ifibss->state == IEEE80211_IBSS_MLME_SEARCH) { rcu_read_lock(); return NULL; } if (!ether_addr_equal(bssid, sdata->u.ibss.bssid)) { rcu_read_lock(); return NULL; } rcu_read_lock(); chanctx_conf = rcu_dereference(sdata->vif.bss_conf.chanctx_conf); if (WARN_ON_ONCE(!chanctx_conf)) return NULL; band = chanctx_conf->def.chan->band; rcu_read_unlock(); sta = sta_info_alloc(sdata, addr, GFP_KERNEL); if (!sta) { rcu_read_lock(); return NULL; } /* make sure mandatory rates are always added */ sband = local->hw.wiphy->bands[band]; sta->sta.deflink.supp_rates[band] = supp_rates | ieee80211_mandatory_rates(sband); return ieee80211_ibss_finish_sta(sta); } static int ieee80211_sta_active_ibss(struct ieee80211_sub_if_data *sdata) { struct ieee80211_local *local = sdata->local; int active = 0; struct sta_info *sta; lockdep_assert_wiphy(sdata->local->hw.wiphy); rcu_read_lock(); list_for_each_entry_rcu(sta, &local->sta_list, list) { unsigned long last_active = ieee80211_sta_last_active(sta, -1); if (sta->sdata == sdata && time_is_after_jiffies(last_active + IEEE80211_IBSS_MERGE_INTERVAL)) { active++; break; } } rcu_read_unlock(); return active; } static void ieee80211_ibss_disconnect(struct ieee80211_sub_if_data *sdata) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; struct ieee80211_local *local = sdata->local; struct cfg80211_bss *cbss; struct beacon_data *presp; struct sta_info *sta; lockdep_assert_wiphy(local->hw.wiphy); if (!is_zero_ether_addr(ifibss->bssid)) { cbss = cfg80211_get_bss(local->hw.wiphy, ifibss->chandef.chan, ifibss->bssid, ifibss->ssid, ifibss->ssid_len, IEEE80211_BSS_TYPE_IBSS, IEEE80211_PRIVACY(ifibss->privacy)); if (cbss) { cfg80211_unlink_bss(local->hw.wiphy, cbss); cfg80211_put_bss(sdata->local->hw.wiphy, cbss); } } ifibss->state = IEEE80211_IBSS_MLME_SEARCH; sta_info_flush(sdata, -1); spin_lock_bh(&ifibss->incomplete_lock); while (!list_empty(&ifibss->incomplete_stations)) { sta = list_first_entry(&ifibss->incomplete_stations, struct sta_info, list); list_del(&sta->list); spin_unlock_bh(&ifibss->incomplete_lock); sta_info_free(local, sta); spin_lock_bh(&ifibss->incomplete_lock); } spin_unlock_bh(&ifibss->incomplete_lock); netif_carrier_off(sdata->dev); sdata->vif.cfg.ibss_joined = false; sdata->vif.cfg.ibss_creator = false; sdata->vif.bss_conf.enable_beacon = false; sdata->vif.cfg.ssid_len = 0; /* remove beacon */ presp = sdata_dereference(ifibss->presp, sdata); RCU_INIT_POINTER(sdata->u.ibss.presp, NULL); if (presp) kfree_rcu(presp, rcu_head); clear_bit(SDATA_STATE_OFFCHANNEL_BEACON_STOPPED, &sdata->state); ieee80211_bss_info_change_notify(sdata, BSS_CHANGED_BEACON_ENABLED | BSS_CHANGED_IBSS); drv_leave_ibss(local, sdata); ieee80211_link_release_channel(&sdata->deflink); } static void ieee80211_csa_connection_drop_work(struct wiphy *wiphy, struct wiphy_work *work) { struct ieee80211_sub_if_data *sdata = container_of(work, struct ieee80211_sub_if_data, u.ibss.csa_connection_drop_work); ieee80211_ibss_disconnect(sdata); synchronize_rcu(); skb_queue_purge(&sdata->skb_queue); /* trigger a scan to find another IBSS network to join */ wiphy_work_queue(sdata->local->hw.wiphy, &sdata->work); } static void ieee80211_ibss_csa_mark_radar(struct ieee80211_sub_if_data *sdata) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; int err; /* if the current channel is a DFS channel, mark the channel as * 
unavailable. */ err = cfg80211_chandef_dfs_required(sdata->local->hw.wiphy, &ifibss->chandef, NL80211_IFTYPE_ADHOC); if (err > 0) cfg80211_radar_event(sdata->local->hw.wiphy, &ifibss->chandef, GFP_ATOMIC); } static bool ieee80211_ibss_process_chanswitch(struct ieee80211_sub_if_data *sdata, struct ieee802_11_elems *elems, bool beacon) { struct cfg80211_csa_settings params; struct ieee80211_csa_ie csa_ie; struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; enum nl80211_channel_type ch_type; int err; struct ieee80211_conn_settings conn = { .mode = IEEE80211_CONN_MODE_HT, .bw_limit = IEEE80211_CONN_BW_LIMIT_40, }; u32 vht_cap_info = 0; lockdep_assert_wiphy(sdata->local->hw.wiphy); switch (ifibss->chandef.width) { case NL80211_CHAN_WIDTH_5: case NL80211_CHAN_WIDTH_10: case NL80211_CHAN_WIDTH_20_NOHT: conn.mode = IEEE80211_CONN_MODE_LEGACY; fallthrough; case NL80211_CHAN_WIDTH_20: conn.bw_limit = IEEE80211_CONN_BW_LIMIT_20; break; default: break; } if (elems->vht_cap_elem) vht_cap_info = le32_to_cpu(elems->vht_cap_elem->vht_cap_info); memset(¶ms, 0, sizeof(params)); err = ieee80211_parse_ch_switch_ie(sdata, elems, ifibss->chandef.chan->band, vht_cap_info, &conn, ifibss->bssid, false, &csa_ie); /* can't switch to destination channel, fail */ if (err < 0) goto disconnect; /* did not contain a CSA */ if (err) return false; /* channel switch is not supported, disconnect */ if (!(sdata->local->hw.wiphy->flags & WIPHY_FLAG_HAS_CHANNEL_SWITCH)) goto disconnect; params.count = csa_ie.count; params.chandef = csa_ie.chanreq.oper; switch (ifibss->chandef.width) { case NL80211_CHAN_WIDTH_20_NOHT: case NL80211_CHAN_WIDTH_20: case NL80211_CHAN_WIDTH_40: /* keep our current HT mode (HT20/HT40+/HT40-), even if * another mode has been announced. The mode is not adopted * within the beacon while doing CSA and we should therefore * keep the mode which we announce. */ ch_type = cfg80211_get_chandef_type(&ifibss->chandef); cfg80211_chandef_create(¶ms.chandef, params.chandef.chan, ch_type); break; case NL80211_CHAN_WIDTH_5: case NL80211_CHAN_WIDTH_10: if (params.chandef.width != ifibss->chandef.width) { sdata_info(sdata, "IBSS %pM received channel switch from incompatible channel width (%d MHz, width:%d, CF1/2: %d/%d MHz), disconnecting\n", ifibss->bssid, params.chandef.chan->center_freq, params.chandef.width, params.chandef.center_freq1, params.chandef.center_freq2); goto disconnect; } break; default: /* should not happen, conn_flags should prevent VHT modes. */ WARN_ON(1); goto disconnect; } if (!cfg80211_reg_can_beacon(sdata->local->hw.wiphy, ¶ms.chandef, NL80211_IFTYPE_ADHOC)) { sdata_info(sdata, "IBSS %pM switches to unsupported channel (%d MHz, width:%d, CF1/2: %d/%d MHz), disconnecting\n", ifibss->bssid, params.chandef.chan->center_freq, params.chandef.width, params.chandef.center_freq1, params.chandef.center_freq2); goto disconnect; } err = cfg80211_chandef_dfs_required(sdata->local->hw.wiphy, ¶ms.chandef, NL80211_IFTYPE_ADHOC); if (err < 0) goto disconnect; if (err > 0 && !ifibss->userspace_handles_dfs) { /* IBSS-DFS only allowed with a control program */ goto disconnect; } params.radar_required = err; if (cfg80211_chandef_identical(¶ms.chandef, &sdata->vif.bss_conf.chanreq.oper)) { ibss_dbg(sdata, "received csa with an identical chandef, ignoring\n"); return true; } /* all checks done, now perform the channel switch. 
*/ ibss_dbg(sdata, "received channel switch announcement to go to channel %d MHz\n", params.chandef.chan->center_freq); params.block_tx = !!csa_ie.mode; if (ieee80211_channel_switch(sdata->local->hw.wiphy, sdata->dev, ¶ms)) goto disconnect; ieee80211_ibss_csa_mark_radar(sdata); return true; disconnect: ibss_dbg(sdata, "Can't handle channel switch, disconnect\n"); wiphy_work_queue(sdata->local->hw.wiphy, &ifibss->csa_connection_drop_work); ieee80211_ibss_csa_mark_radar(sdata); return true; } static void ieee80211_rx_mgmt_spectrum_mgmt(struct ieee80211_sub_if_data *sdata, struct ieee80211_mgmt *mgmt, size_t len, struct ieee80211_rx_status *rx_status, struct ieee802_11_elems *elems) { int required_len; if (len < IEEE80211_MIN_ACTION_SIZE + 1) return; /* CSA is the only action we handle for now */ if (mgmt->u.action.u.measurement.action_code != WLAN_ACTION_SPCT_CHL_SWITCH) return; required_len = IEEE80211_MIN_ACTION_SIZE + sizeof(mgmt->u.action.u.chan_switch); if (len < required_len) return; if (!sdata->vif.bss_conf.csa_active) ieee80211_ibss_process_chanswitch(sdata, elems, false); } static void ieee80211_rx_mgmt_deauth_ibss(struct ieee80211_sub_if_data *sdata, struct ieee80211_mgmt *mgmt, size_t len) { u16 reason = le16_to_cpu(mgmt->u.deauth.reason_code); if (len < IEEE80211_DEAUTH_FRAME_LEN) return; ibss_dbg(sdata, "RX DeAuth SA=%pM DA=%pM\n", mgmt->sa, mgmt->da); ibss_dbg(sdata, "\tBSSID=%pM (reason: %d)\n", mgmt->bssid, reason); sta_info_destroy_addr(sdata, mgmt->sa); } static void ieee80211_rx_mgmt_auth_ibss(struct ieee80211_sub_if_data *sdata, struct ieee80211_mgmt *mgmt, size_t len) { u16 auth_alg, auth_transaction; lockdep_assert_wiphy(sdata->local->hw.wiphy); if (len < 24 + 6) return; auth_alg = le16_to_cpu(mgmt->u.auth.auth_alg); auth_transaction = le16_to_cpu(mgmt->u.auth.auth_transaction); ibss_dbg(sdata, "RX Auth SA=%pM DA=%pM\n", mgmt->sa, mgmt->da); ibss_dbg(sdata, "\tBSSID=%pM (auth_transaction=%d)\n", mgmt->bssid, auth_transaction); if (auth_alg != WLAN_AUTH_OPEN || auth_transaction != 1) return; /* * IEEE 802.11 standard does not require authentication in IBSS * networks and most implementations do not seem to use it. * However, try to reply to authentication attempts if someone * has actually implemented this. 
*/ ieee80211_send_auth(sdata, 2, WLAN_AUTH_OPEN, 0, NULL, 0, mgmt->sa, sdata->u.ibss.bssid, NULL, 0, 0, 0); } static void ieee80211_update_sta_info(struct ieee80211_sub_if_data *sdata, struct ieee80211_mgmt *mgmt, size_t len, struct ieee80211_rx_status *rx_status, struct ieee802_11_elems *elems, struct ieee80211_channel *channel) { struct sta_info *sta; enum nl80211_band band = rx_status->band; struct ieee80211_local *local = sdata->local; struct ieee80211_supported_band *sband; bool rates_updated = false; u32 supp_rates = 0; if (sdata->vif.type != NL80211_IFTYPE_ADHOC) return; if (!ether_addr_equal(mgmt->bssid, sdata->u.ibss.bssid)) return; sband = local->hw.wiphy->bands[band]; if (WARN_ON(!sband)) return; rcu_read_lock(); sta = sta_info_get(sdata, mgmt->sa); if (elems->supp_rates) { supp_rates = ieee80211_sta_get_rates(sdata, elems, band, NULL); if (sta) { u32 prev_rates; prev_rates = sta->sta.deflink.supp_rates[band]; sta->sta.deflink.supp_rates[band] = supp_rates | ieee80211_mandatory_rates(sband); if (sta->sta.deflink.supp_rates[band] != prev_rates) { ibss_dbg(sdata, "updated supp_rates set for %pM based on beacon/probe_resp (0x%x -> 0x%x)\n", sta->sta.addr, prev_rates, sta->sta.deflink.supp_rates[band]); rates_updated = true; } } else { rcu_read_unlock(); sta = ieee80211_ibss_add_sta(sdata, mgmt->bssid, mgmt->sa, supp_rates); } } if (sta && !sta->sta.wme && (elems->wmm_info || elems->s1g_capab) && local->hw.queues >= IEEE80211_NUM_ACS) { sta->sta.wme = true; ieee80211_check_fast_xmit(sta); } if (sta && elems->ht_operation && elems->ht_cap_elem && sdata->u.ibss.chandef.width != NL80211_CHAN_WIDTH_20_NOHT && sdata->u.ibss.chandef.width != NL80211_CHAN_WIDTH_5 && sdata->u.ibss.chandef.width != NL80211_CHAN_WIDTH_10) { /* we both use HT */ struct ieee80211_ht_cap htcap_ie; struct cfg80211_chan_def chandef; enum ieee80211_sta_rx_bandwidth bw = sta->sta.deflink.bandwidth; cfg80211_chandef_create(&chandef, channel, NL80211_CHAN_NO_HT); ieee80211_chandef_ht_oper(elems->ht_operation, &chandef); memcpy(&htcap_ie, elems->ht_cap_elem, sizeof(htcap_ie)); rates_updated |= ieee80211_ht_cap_ie_to_sta_ht_cap(sdata, sband, &htcap_ie, &sta->deflink); if (elems->vht_operation && elems->vht_cap_elem && sdata->u.ibss.chandef.width != NL80211_CHAN_WIDTH_20 && sdata->u.ibss.chandef.width != NL80211_CHAN_WIDTH_40) { /* we both use VHT */ struct ieee80211_vht_cap cap_ie; struct ieee80211_sta_vht_cap cap = sta->sta.deflink.vht_cap; u32 vht_cap_info = le32_to_cpu(elems->vht_cap_elem->vht_cap_info); ieee80211_chandef_vht_oper(&local->hw, vht_cap_info, elems->vht_operation, elems->ht_operation, &chandef); memcpy(&cap_ie, elems->vht_cap_elem, sizeof(cap_ie)); ieee80211_vht_cap_ie_to_sta_vht_cap(sdata, sband, &cap_ie, NULL, &sta->deflink); if (memcmp(&cap, &sta->sta.deflink.vht_cap, sizeof(cap))) rates_updated |= true; } if (bw != sta->sta.deflink.bandwidth) rates_updated |= true; if (!cfg80211_chandef_compatible(&sdata->u.ibss.chandef, &chandef)) WARN_ON_ONCE(1); } if (sta && rates_updated) { u32 changed = IEEE80211_RC_SUPP_RATES_CHANGED; u8 rx_nss = sta->sta.deflink.rx_nss; /* Force rx_nss recalculation */ sta->sta.deflink.rx_nss = 0; rate_control_rate_init(&sta->deflink); if (sta->sta.deflink.rx_nss != rx_nss) changed |= IEEE80211_RC_NSS_CHANGED; drv_link_sta_rc_update(local, sdata, &sta->sta.deflink, changed); } rcu_read_unlock(); } static void ieee80211_rx_bss_info(struct ieee80211_sub_if_data *sdata, struct ieee80211_mgmt *mgmt, size_t len, struct ieee80211_rx_status *rx_status, struct ieee802_11_elems 
*elems) { struct ieee80211_local *local = sdata->local; struct cfg80211_bss *cbss; struct ieee80211_bss *bss; struct ieee80211_channel *channel; u64 beacon_timestamp, rx_timestamp; u32 supp_rates = 0; enum nl80211_band band = rx_status->band; channel = ieee80211_get_channel(local->hw.wiphy, rx_status->freq); if (!channel) return; ieee80211_update_sta_info(sdata, mgmt, len, rx_status, elems, channel); bss = ieee80211_bss_info_update(local, rx_status, mgmt, len, channel); if (!bss) return; cbss = container_of((void *)bss, struct cfg80211_bss, priv); /* same for beacon and probe response */ beacon_timestamp = le64_to_cpu(mgmt->u.beacon.timestamp); /* check if we need to merge IBSS */ /* not an IBSS */ if (!(cbss->capability & WLAN_CAPABILITY_IBSS)) goto put_bss; /* different channel */ if (sdata->u.ibss.fixed_channel && sdata->u.ibss.chandef.chan != cbss->channel) goto put_bss; /* different SSID */ if (elems->ssid_len != sdata->u.ibss.ssid_len || memcmp(elems->ssid, sdata->u.ibss.ssid, sdata->u.ibss.ssid_len)) goto put_bss; /* process channel switch */ if (sdata->vif.bss_conf.csa_active || ieee80211_ibss_process_chanswitch(sdata, elems, true)) goto put_bss; /* same BSSID */ if (ether_addr_equal(cbss->bssid, sdata->u.ibss.bssid)) goto put_bss; /* we use a fixed BSSID */ if (sdata->u.ibss.fixed_bssid) goto put_bss; if (ieee80211_have_rx_timestamp(rx_status)) { /* time when timestamp field was received */ rx_timestamp = ieee80211_calculate_rx_timestamp(local, rx_status, len + FCS_LEN, 24); } else { /* * second best option: get current TSF * (will return -1 if not supported) */ rx_timestamp = drv_get_tsf(local, sdata); } ibss_dbg(sdata, "RX beacon SA=%pM BSSID=%pM TSF=0x%llx\n", mgmt->sa, mgmt->bssid, (unsigned long long)rx_timestamp); ibss_dbg(sdata, "\tBCN=0x%llx diff=%lld @%lu\n", (unsigned long long)beacon_timestamp, (unsigned long long)(rx_timestamp - beacon_timestamp), jiffies); if (beacon_timestamp > rx_timestamp) { ibss_dbg(sdata, "beacon TSF higher than local TSF - IBSS merge with BSSID %pM\n", mgmt->bssid); ieee80211_sta_join_ibss(sdata, bss); supp_rates = ieee80211_sta_get_rates(sdata, elems, band, NULL); ieee80211_ibss_add_sta(sdata, mgmt->bssid, mgmt->sa, supp_rates); rcu_read_unlock(); } put_bss: ieee80211_rx_bss_put(local, bss); } void ieee80211_ibss_rx_no_sta(struct ieee80211_sub_if_data *sdata, const u8 *bssid, const u8 *addr, u32 supp_rates) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; struct ieee80211_local *local = sdata->local; struct sta_info *sta; struct ieee80211_chanctx_conf *chanctx_conf; struct ieee80211_supported_band *sband; int band; /* * XXX: Consider removing the least recently used entry and * allow new one to be added. 
*/ if (local->num_sta >= IEEE80211_IBSS_MAX_STA_ENTRIES) { net_info_ratelimited("%s: No room for a new IBSS STA entry %pM\n", sdata->name, addr); return; } if (ifibss->state == IEEE80211_IBSS_MLME_SEARCH) return; if (!ether_addr_equal(bssid, sdata->u.ibss.bssid)) return; rcu_read_lock(); chanctx_conf = rcu_dereference(sdata->vif.bss_conf.chanctx_conf); if (WARN_ON_ONCE(!chanctx_conf)) { rcu_read_unlock(); return; } band = chanctx_conf->def.chan->band; rcu_read_unlock(); sta = sta_info_alloc(sdata, addr, GFP_ATOMIC); if (!sta) return; /* make sure mandatory rates are always added */ sband = local->hw.wiphy->bands[band]; sta->sta.deflink.supp_rates[band] = supp_rates | ieee80211_mandatory_rates(sband); spin_lock(&ifibss->incomplete_lock); list_add(&sta->list, &ifibss->incomplete_stations); spin_unlock(&ifibss->incomplete_lock); wiphy_work_queue(local->hw.wiphy, &sdata->work); } static void ieee80211_ibss_sta_expire(struct ieee80211_sub_if_data *sdata) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; struct ieee80211_local *local = sdata->local; struct sta_info *sta, *tmp; unsigned long exp_time = IEEE80211_IBSS_INACTIVITY_LIMIT; unsigned long exp_rsn = IEEE80211_IBSS_RSN_INACTIVITY_LIMIT; lockdep_assert_wiphy(local->hw.wiphy); list_for_each_entry_safe(sta, tmp, &local->sta_list, list) { unsigned long last_active = ieee80211_sta_last_active(sta, -1); if (sdata != sta->sdata) continue; if (time_is_before_jiffies(last_active + exp_time) || (time_is_before_jiffies(last_active + exp_rsn) && sta->sta_state != IEEE80211_STA_AUTHORIZED)) { u8 frame_buf[IEEE80211_DEAUTH_FRAME_LEN]; sta_dbg(sta->sdata, "expiring inactive %sSTA %pM\n", sta->sta_state != IEEE80211_STA_AUTHORIZED ? "not authorized " : "", sta->sta.addr); ieee80211_send_deauth_disassoc(sdata, sta->sta.addr, ifibss->bssid, IEEE80211_STYPE_DEAUTH, WLAN_REASON_DEAUTH_LEAVING, true, frame_buf); WARN_ON(__sta_info_destroy(sta)); } } } /* * This function is called with state == IEEE80211_IBSS_MLME_JOINED */ static void ieee80211_sta_merge_ibss(struct ieee80211_sub_if_data *sdata) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; lockdep_assert_wiphy(sdata->local->hw.wiphy); mod_timer(&ifibss->timer, round_jiffies(jiffies + IEEE80211_IBSS_MERGE_INTERVAL)); ieee80211_ibss_sta_expire(sdata); if (time_before(jiffies, ifibss->last_scan_completed + IEEE80211_IBSS_MERGE_INTERVAL)) return; if (ieee80211_sta_active_ibss(sdata)) return; if (ifibss->fixed_channel) return; sdata_info(sdata, "No active IBSS STAs - trying to scan for other IBSS networks with same SSID (merge)\n"); ieee80211_request_ibss_scan(sdata, ifibss->ssid, ifibss->ssid_len, NULL, 0); } static void ieee80211_sta_create_ibss(struct ieee80211_sub_if_data *sdata) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; u8 bssid[ETH_ALEN]; u16 capability; int i; lockdep_assert_wiphy(sdata->local->hw.wiphy); if (ifibss->fixed_bssid) { memcpy(bssid, ifibss->bssid, ETH_ALEN); } else { /* Generate random, not broadcast, locally administered BSSID. Mix in * own MAC address to make sure that devices that do not have proper * random number generator get different BSSID. 
*/ get_random_bytes(bssid, ETH_ALEN); for (i = 0; i < ETH_ALEN; i++) bssid[i] ^= sdata->vif.addr[i]; bssid[0] &= ~0x01; bssid[0] |= 0x02; } sdata_info(sdata, "Creating new IBSS network, BSSID %pM\n", bssid); capability = WLAN_CAPABILITY_IBSS; if (ifibss->privacy) capability |= WLAN_CAPABILITY_PRIVACY; __ieee80211_sta_join_ibss(sdata, bssid, sdata->vif.bss_conf.beacon_int, &ifibss->chandef, ifibss->basic_rates, capability, 0, true); } static unsigned int ibss_setup_channels(struct wiphy *wiphy, struct ieee80211_channel **channels, unsigned int channels_max, u32 center_freq, u32 width) { struct ieee80211_channel *chan = NULL; unsigned int n_chan = 0; u32 start_freq, end_freq, freq; if (width <= 20) { start_freq = center_freq; end_freq = center_freq; } else { start_freq = center_freq - width / 2 + 10; end_freq = center_freq + width / 2 - 10; } for (freq = start_freq; freq <= end_freq; freq += 20) { chan = ieee80211_get_channel(wiphy, freq); if (!chan) continue; if (n_chan >= channels_max) return n_chan; channels[n_chan] = chan; n_chan++; } return n_chan; } static unsigned int ieee80211_ibss_setup_scan_channels(struct wiphy *wiphy, const struct cfg80211_chan_def *chandef, struct ieee80211_channel **channels, unsigned int channels_max) { unsigned int n_chan = 0; u32 width, cf1, cf2 = 0; switch (chandef->width) { case NL80211_CHAN_WIDTH_40: width = 40; break; case NL80211_CHAN_WIDTH_80P80: cf2 = chandef->center_freq2; fallthrough; case NL80211_CHAN_WIDTH_80: width = 80; break; case NL80211_CHAN_WIDTH_160: width = 160; break; default: width = 20; break; } cf1 = chandef->center_freq1; n_chan = ibss_setup_channels(wiphy, channels, channels_max, cf1, width); if (cf2) n_chan += ibss_setup_channels(wiphy, &channels[n_chan], channels_max - n_chan, cf2, width); return n_chan; } /* * This function is called with state == IEEE80211_IBSS_MLME_SEARCH */ static void ieee80211_sta_find_ibss(struct ieee80211_sub_if_data *sdata) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; struct ieee80211_local *local = sdata->local; struct cfg80211_bss *cbss; struct ieee80211_channel *chan = NULL; const u8 *bssid = NULL; int active_ibss; lockdep_assert_wiphy(sdata->local->hw.wiphy); active_ibss = ieee80211_sta_active_ibss(sdata); ibss_dbg(sdata, "sta_find_ibss (active_ibss=%d)\n", active_ibss); if (active_ibss) return; if (ifibss->fixed_bssid) bssid = ifibss->bssid; if (ifibss->fixed_channel) chan = ifibss->chandef.chan; if (!is_zero_ether_addr(ifibss->bssid)) bssid = ifibss->bssid; cbss = cfg80211_get_bss(local->hw.wiphy, chan, bssid, ifibss->ssid, ifibss->ssid_len, IEEE80211_BSS_TYPE_IBSS, IEEE80211_PRIVACY(ifibss->privacy)); if (cbss) { struct ieee80211_bss *bss; bss = (void *)cbss->priv; ibss_dbg(sdata, "sta_find_ibss: selected %pM current %pM\n", cbss->bssid, ifibss->bssid); sdata_info(sdata, "Selected IBSS BSSID %pM based on configured SSID\n", cbss->bssid); ieee80211_sta_join_ibss(sdata, bss); ieee80211_rx_bss_put(local, bss); return; } /* if a fixed bssid and a fixed freq have been provided create the IBSS * directly and do not waste time scanning */ if (ifibss->fixed_bssid && ifibss->fixed_channel) { sdata_info(sdata, "Created IBSS using preconfigured BSSID %pM\n", bssid); ieee80211_sta_create_ibss(sdata); return; } ibss_dbg(sdata, "sta_find_ibss: did not try to join ibss\n"); /* Selected IBSS not found in current scan results - try to scan */ if (time_after(jiffies, ifibss->last_scan_completed + IEEE80211_SCAN_INTERVAL)) { struct ieee80211_channel *channels[8]; unsigned int num; sdata_info(sdata, "Trigger 
new scan to find an IBSS to join\n"); if (ifibss->fixed_channel) { num = ieee80211_ibss_setup_scan_channels(local->hw.wiphy, &ifibss->chandef, channels, ARRAY_SIZE(channels)); ieee80211_request_ibss_scan(sdata, ifibss->ssid, ifibss->ssid_len, channels, num); } else { ieee80211_request_ibss_scan(sdata, ifibss->ssid, ifibss->ssid_len, NULL, 0); } } else { int interval = IEEE80211_SCAN_INTERVAL; if (time_after(jiffies, ifibss->ibss_join_req + IEEE80211_IBSS_JOIN_TIMEOUT)) ieee80211_sta_create_ibss(sdata); mod_timer(&ifibss->timer, round_jiffies(jiffies + interval)); } } static void ieee80211_rx_mgmt_probe_req(struct ieee80211_sub_if_data *sdata, struct sk_buff *req) { struct ieee80211_mgmt *mgmt = (void *)req->data; struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; struct ieee80211_local *local = sdata->local; int tx_last_beacon, len = req->len; struct sk_buff *skb; struct beacon_data *presp; u8 *pos, *end; lockdep_assert_wiphy(sdata->local->hw.wiphy); presp = sdata_dereference(ifibss->presp, sdata); if (ifibss->state != IEEE80211_IBSS_MLME_JOINED || len < 24 + 2 || !presp) return; tx_last_beacon = drv_tx_last_beacon(local); ibss_dbg(sdata, "RX ProbeReq SA=%pM DA=%pM\n", mgmt->sa, mgmt->da); ibss_dbg(sdata, "\tBSSID=%pM (tx_last_beacon=%d)\n", mgmt->bssid, tx_last_beacon); if (!tx_last_beacon && is_multicast_ether_addr(mgmt->da)) return; if (!ether_addr_equal(mgmt->bssid, ifibss->bssid) && !is_broadcast_ether_addr(mgmt->bssid)) return; end = ((u8 *) mgmt) + len; pos = mgmt->u.probe_req.variable; if (pos[0] != WLAN_EID_SSID || pos + 2 + pos[1] > end) { ibss_dbg(sdata, "Invalid SSID IE in ProbeReq from %pM\n", mgmt->sa); return; } if (pos[1] != 0 && (pos[1] != ifibss->ssid_len || memcmp(pos + 2, ifibss->ssid, ifibss->ssid_len))) { /* Ignore ProbeReq for foreign SSID */ return; } /* Reply with ProbeResp */ skb = dev_alloc_skb(local->tx_headroom + presp->head_len); if (!skb) return; skb_reserve(skb, local->tx_headroom); skb_put_data(skb, presp->head, presp->head_len); memcpy(((struct ieee80211_mgmt *) skb->data)->da, mgmt->sa, ETH_ALEN); ibss_dbg(sdata, "Sending ProbeResp to %pM\n", mgmt->sa); IEEE80211_SKB_CB(skb)->flags |= IEEE80211_TX_INTFL_DONT_ENCRYPT; /* avoid excessive retries for probe request to wildcard SSIDs */ if (pos[1] == 0) IEEE80211_SKB_CB(skb)->flags |= IEEE80211_TX_CTL_NO_ACK; ieee80211_tx_skb(sdata, skb); } static void ieee80211_rx_mgmt_probe_beacon(struct ieee80211_sub_if_data *sdata, struct ieee80211_mgmt *mgmt, size_t len, struct ieee80211_rx_status *rx_status) { size_t baselen; struct ieee802_11_elems *elems; u16 type; BUILD_BUG_ON(offsetof(typeof(mgmt->u.probe_resp), variable) != offsetof(typeof(mgmt->u.beacon), variable)); /* * either beacon or probe_resp but the variable field is at the * same offset */ baselen = (u8 *) mgmt->u.probe_resp.variable - (u8 *) mgmt; if (baselen > len) return; type = le16_to_cpu(mgmt->frame_control) & IEEE80211_FCTL_TYPE; elems = ieee802_11_parse_elems(mgmt->u.probe_resp.variable, len - baselen, type, NULL); if (elems) { ieee80211_rx_bss_info(sdata, mgmt, len, rx_status, elems); kfree(elems); } } void ieee80211_ibss_rx_queued_mgmt(struct ieee80211_sub_if_data *sdata, struct sk_buff *skb) { struct ieee80211_rx_status *rx_status; struct ieee80211_mgmt *mgmt; u16 fc; struct ieee802_11_elems *elems; int ies_len; rx_status = IEEE80211_SKB_RXCB(skb); mgmt = (struct ieee80211_mgmt *) skb->data; fc = le16_to_cpu(mgmt->frame_control); if (!sdata->u.ibss.ssid_len) return; /* not ready to merge yet */ switch (fc & IEEE80211_FCTL_STYPE) { case 
IEEE80211_STYPE_PROBE_REQ: ieee80211_rx_mgmt_probe_req(sdata, skb); break; case IEEE80211_STYPE_PROBE_RESP: case IEEE80211_STYPE_BEACON: ieee80211_rx_mgmt_probe_beacon(sdata, mgmt, skb->len, rx_status); break; case IEEE80211_STYPE_AUTH: ieee80211_rx_mgmt_auth_ibss(sdata, mgmt, skb->len); break; case IEEE80211_STYPE_DEAUTH: ieee80211_rx_mgmt_deauth_ibss(sdata, mgmt, skb->len); break; case IEEE80211_STYPE_ACTION: switch (mgmt->u.action.category) { case WLAN_CATEGORY_SPECTRUM_MGMT: ies_len = skb->len - offsetof(struct ieee80211_mgmt, u.action.u.chan_switch.variable); if (ies_len < 0) break; elems = ieee802_11_parse_elems(mgmt->u.action.u.chan_switch.variable, ies_len, IEEE80211_FTYPE_MGMT | IEEE80211_STYPE_ACTION, NULL); if (elems && !elems->parse_error) ieee80211_rx_mgmt_spectrum_mgmt(sdata, mgmt, skb->len, rx_status, elems); kfree(elems); break; } } } void ieee80211_ibss_work(struct ieee80211_sub_if_data *sdata) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; struct sta_info *sta; /* * Work could be scheduled after scan or similar * when we aren't even joined (or trying) with a * network. */ if (!ifibss->ssid_len) return; spin_lock_bh(&ifibss->incomplete_lock); while (!list_empty(&ifibss->incomplete_stations)) { sta = list_first_entry(&ifibss->incomplete_stations, struct sta_info, list); list_del(&sta->list); spin_unlock_bh(&ifibss->incomplete_lock); ieee80211_ibss_finish_sta(sta); rcu_read_unlock(); spin_lock_bh(&ifibss->incomplete_lock); } spin_unlock_bh(&ifibss->incomplete_lock); switch (ifibss->state) { case IEEE80211_IBSS_MLME_SEARCH: ieee80211_sta_find_ibss(sdata); break; case IEEE80211_IBSS_MLME_JOINED: ieee80211_sta_merge_ibss(sdata); break; default: WARN_ON(1); break; } } static void ieee80211_ibss_timer(struct timer_list *t) { struct ieee80211_sub_if_data *sdata = timer_container_of(sdata, t, u.ibss.timer); wiphy_work_queue(sdata->local->hw.wiphy, &sdata->work); } void ieee80211_ibss_setup_sdata(struct ieee80211_sub_if_data *sdata) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; timer_setup(&ifibss->timer, ieee80211_ibss_timer, 0); INIT_LIST_HEAD(&ifibss->incomplete_stations); spin_lock_init(&ifibss->incomplete_lock); wiphy_work_init(&ifibss->csa_connection_drop_work, ieee80211_csa_connection_drop_work); } /* scan finished notification */ void ieee80211_ibss_notify_scan_completed(struct ieee80211_local *local) { struct ieee80211_sub_if_data *sdata; lockdep_assert_wiphy(local->hw.wiphy); list_for_each_entry(sdata, &local->interfaces, list) { if (!ieee80211_sdata_running(sdata)) continue; if (sdata->vif.type != NL80211_IFTYPE_ADHOC) continue; sdata->u.ibss.last_scan_completed = jiffies; } } int ieee80211_ibss_join(struct ieee80211_sub_if_data *sdata, struct cfg80211_ibss_params *params) { u64 changed = 0; enum ieee80211_chanctx_mode chanmode; struct ieee80211_local *local = sdata->local; int radar_detect_width = 0; int ret; lockdep_assert_wiphy(local->hw.wiphy); if (params->chandef.chan->freq_offset) { /* this may work, but is untested */ return -EOPNOTSUPP; } ret = cfg80211_chandef_dfs_required(local->hw.wiphy, ¶ms->chandef, sdata->wdev.iftype); if (ret < 0) return ret; if (ret > 0) { if (!params->userspace_handles_dfs) return -EINVAL; radar_detect_width = BIT(params->chandef.width); } chanmode = (params->channel_fixed && !ret) ? 
IEEE80211_CHANCTX_SHARED : IEEE80211_CHANCTX_EXCLUSIVE; ret = ieee80211_check_combinations(sdata, ¶ms->chandef, chanmode, radar_detect_width, -1); if (ret < 0) return ret; if (params->bssid) { memcpy(sdata->u.ibss.bssid, params->bssid, ETH_ALEN); sdata->u.ibss.fixed_bssid = true; } else sdata->u.ibss.fixed_bssid = false; sdata->u.ibss.privacy = params->privacy; sdata->u.ibss.control_port = params->control_port; sdata->u.ibss.userspace_handles_dfs = params->userspace_handles_dfs; sdata->u.ibss.basic_rates = params->basic_rates; sdata->u.ibss.last_scan_completed = jiffies; /* fix basic_rates if channel does not support these rates */ memcpy(sdata->vif.bss_conf.mcast_rate, params->mcast_rate, sizeof(params->mcast_rate)); sdata->vif.bss_conf.beacon_int = params->beacon_interval; sdata->u.ibss.chandef = params->chandef; sdata->u.ibss.fixed_channel = params->channel_fixed; if (params->ie) { sdata->u.ibss.ie = kmemdup(params->ie, params->ie_len, GFP_KERNEL); if (sdata->u.ibss.ie) sdata->u.ibss.ie_len = params->ie_len; } sdata->u.ibss.state = IEEE80211_IBSS_MLME_SEARCH; sdata->u.ibss.ibss_join_req = jiffies; memcpy(sdata->u.ibss.ssid, params->ssid, params->ssid_len); sdata->u.ibss.ssid_len = params->ssid_len; memcpy(&sdata->u.ibss.ht_capa, ¶ms->ht_capa, sizeof(sdata->u.ibss.ht_capa)); memcpy(&sdata->u.ibss.ht_capa_mask, ¶ms->ht_capa_mask, sizeof(sdata->u.ibss.ht_capa_mask)); /* * 802.11n-2009 9.13.3.1: In an IBSS, the HT Protection field is * reserved, but an HT STA shall protect HT transmissions as though * the HT Protection field were set to non-HT mixed mode. * * In an IBSS, the RIFS Mode field of the HT Operation element is * also reserved, but an HT STA shall operate as though this field * were set to 1. */ sdata->vif.bss_conf.ht_operation_mode |= IEEE80211_HT_OP_MODE_PROTECTION_NONHT_MIXED | IEEE80211_HT_PARAM_RIFS_MODE; changed |= BSS_CHANGED_HT | BSS_CHANGED_MCAST_RATE; ieee80211_link_info_change_notify(sdata, &sdata->deflink, changed); sdata->deflink.smps_mode = IEEE80211_SMPS_OFF; sdata->deflink.needed_rx_chains = local->rx_chains; sdata->control_port_over_nl80211 = params->control_port_over_nl80211; wiphy_work_queue(local->hw.wiphy, &sdata->work); return 0; } int ieee80211_ibss_leave(struct ieee80211_sub_if_data *sdata) { struct ieee80211_if_ibss *ifibss = &sdata->u.ibss; ifibss->ssid_len = 0; ieee80211_ibss_disconnect(sdata); eth_zero_addr(ifibss->bssid); /* remove beacon */ kfree(sdata->u.ibss.ie); sdata->u.ibss.ie = NULL; sdata->u.ibss.ie_len = 0; /* on the next join, re-program HT parameters */ memset(&ifibss->ht_capa, 0, sizeof(ifibss->ht_capa)); memset(&ifibss->ht_capa_mask, 0, sizeof(ifibss->ht_capa_mask)); synchronize_rcu(); skb_queue_purge(&sdata->skb_queue); timer_delete_sync(&sdata->u.ibss.timer); return 0; } |
| 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 | // SPDX-License-Identifier: GPL-2.0 #include <linux/cgroup.h> #include <linux/sched.h> #include <linux/sched/task.h> #include <linux/sched/signal.h> #include "cgroup-internal.h" #include <trace/events/cgroup.h> /* * Update CGRP_FROZEN of cgroup.flag * Return true if flags is updated; false if flags has no change */ static bool cgroup_update_frozen_flag(struct cgroup *cgrp, bool frozen) { lockdep_assert_held(&css_set_lock); /* Already there? */ if (test_bit(CGRP_FROZEN, &cgrp->flags) == frozen) return false; if (frozen) set_bit(CGRP_FROZEN, &cgrp->flags); else clear_bit(CGRP_FROZEN, &cgrp->flags); cgroup_file_notify(&cgrp->events_file); TRACE_CGROUP_PATH(notify_frozen, cgrp, frozen); return true; } /* * Propagate the cgroup frozen state upwards by the cgroup tree. */ static void cgroup_propagate_frozen(struct cgroup *cgrp, bool frozen) { int desc = 1; /* * If the new state is frozen, some freezing ancestor cgroups may change * their state too, depending on if all their descendants are frozen. * * Otherwise, all ancestor cgroups are forced into the non-frozen state. */ while ((cgrp = cgroup_parent(cgrp))) { if (frozen) { cgrp->freezer.nr_frozen_descendants += desc; if (!test_bit(CGRP_FREEZE, &cgrp->flags) || (cgrp->freezer.nr_frozen_descendants != cgrp->nr_descendants)) continue; } else { cgrp->freezer.nr_frozen_descendants -= desc; } if (cgroup_update_frozen_flag(cgrp, frozen)) desc++; } } /* * Revisit the cgroup frozen state. * Checks if the cgroup is really frozen and perform all state transitions. */ void cgroup_update_frozen(struct cgroup *cgrp) { bool frozen; /* * If the cgroup has to be frozen (CGRP_FREEZE bit set), * and all tasks are frozen and/or stopped, let's consider * the cgroup frozen. Otherwise it's not frozen. */ frozen = test_bit(CGRP_FREEZE, &cgrp->flags) && cgrp->freezer.nr_frozen_tasks == __cgroup_task_count(cgrp); /* If flags is updated, update the state of ancestor cgroups. */ if (cgroup_update_frozen_flag(cgrp, frozen)) cgroup_propagate_frozen(cgrp, frozen); } /* * Increment cgroup's nr_frozen_tasks. */ static void cgroup_inc_frozen_cnt(struct cgroup *cgrp) { cgrp->freezer.nr_frozen_tasks++; } /* * Decrement cgroup's nr_frozen_tasks. 
 */
static void cgroup_dec_frozen_cnt(struct cgroup *cgrp)
{
    cgrp->freezer.nr_frozen_tasks--;
    WARN_ON_ONCE(cgrp->freezer.nr_frozen_tasks < 0);
}

/*
 * Enter frozen/stopped state, if not yet there. Update cgroup's counters,
 * and revisit the state of the cgroup, if necessary.
 */
void cgroup_enter_frozen(void)
{
    struct cgroup *cgrp;

    if (current->frozen)
        return;

    spin_lock_irq(&css_set_lock);
    current->frozen = true;
    cgrp = task_dfl_cgroup(current);
    cgroup_inc_frozen_cnt(cgrp);
    cgroup_update_frozen(cgrp);
    spin_unlock_irq(&css_set_lock);
}

/*
 * Conditionally leave frozen/stopped state. Update cgroup's counters,
 * and revisit the state of the cgroup, if necessary.
 *
 * If always_leave is not set, and the cgroup is freezing,
 * we're racing with the cgroup freezing. In this case, we don't
 * drop the frozen counter to avoid a transient switch to
 * the unfrozen state.
 */
void cgroup_leave_frozen(bool always_leave)
{
    struct cgroup *cgrp;

    spin_lock_irq(&css_set_lock);
    cgrp = task_dfl_cgroup(current);
    if (always_leave || !test_bit(CGRP_FREEZE, &cgrp->flags)) {
        cgroup_dec_frozen_cnt(cgrp);
        cgroup_update_frozen(cgrp);
        WARN_ON_ONCE(!current->frozen);
        current->frozen = false;
    } else if (!(current->jobctl & JOBCTL_TRAP_FREEZE)) {
        spin_lock(&current->sighand->siglock);
        current->jobctl |= JOBCTL_TRAP_FREEZE;
        set_thread_flag(TIF_SIGPENDING);
        spin_unlock(&current->sighand->siglock);
    }
    spin_unlock_irq(&css_set_lock);
}

/*
 * Freeze or unfreeze the task by setting or clearing the JOBCTL_TRAP_FREEZE
 * jobctl bit.
 */
static void cgroup_freeze_task(struct task_struct *task, bool freeze)
{
    unsigned long flags;

    /* If the task is about to die, don't bother with freezing it. */
    if (!lock_task_sighand(task, &flags))
        return;

    if (freeze) {
        task->jobctl |= JOBCTL_TRAP_FREEZE;
        signal_wake_up(task, false);
    } else {
        task->jobctl &= ~JOBCTL_TRAP_FREEZE;
        wake_up_process(task);
    }

    unlock_task_sighand(task, &flags);
}

/*
 * Freeze or unfreeze all tasks in the given cgroup.
 */
static void cgroup_do_freeze(struct cgroup *cgrp, bool freeze, u64 ts_nsec)
{
    struct css_task_iter it;
    struct task_struct *task;

    lockdep_assert_held(&cgroup_mutex);

    spin_lock_irq(&css_set_lock);
    write_seqcount_begin(&cgrp->freezer.freeze_seq);
    if (freeze) {
        set_bit(CGRP_FREEZE, &cgrp->flags);
        cgrp->freezer.freeze_start_nsec = ts_nsec;
    } else {
        clear_bit(CGRP_FREEZE, &cgrp->flags);
        cgrp->freezer.frozen_nsec += (ts_nsec -
                                      cgrp->freezer.freeze_start_nsec);
    }
    write_seqcount_end(&cgrp->freezer.freeze_seq);
    spin_unlock_irq(&css_set_lock);

    if (freeze)
        TRACE_CGROUP_PATH(freeze, cgrp);
    else
        TRACE_CGROUP_PATH(unfreeze, cgrp);

    css_task_iter_start(&cgrp->self, 0, &it);
    while ((task = css_task_iter_next(&it))) {
        /*
         * Ignore kernel threads here. Freezing cgroups containing
         * kthreads isn't supported.
         */
        if (task->flags & PF_KTHREAD)
            continue;
        cgroup_freeze_task(task, freeze);
    }
    css_task_iter_end(&it);

    /*
     * Cgroup state should be revisited here to cover empty leaf cgroups
     * and cgroups which descendants are already in the desired state.
     */
    spin_lock_irq(&css_set_lock);
    if (cgrp->nr_descendants == cgrp->freezer.nr_frozen_descendants)
        cgroup_update_frozen(cgrp);
    spin_unlock_irq(&css_set_lock);
}

/*
 * Adjust the task state (freeze or unfreeze) and revisit the state of
 * source and destination cgroups.
 */
void cgroup_freezer_migrate_task(struct task_struct *task,
                                 struct cgroup *src, struct cgroup *dst)
{
    lockdep_assert_held(&css_set_lock);

    /*
     * Kernel threads are not supposed to be frozen at all.
     */
    if (task->flags & PF_KTHREAD)
        return;

    /*
     * It's not necessary to do changes if both of the src and dst cgroups
     * are not freezing and task is not frozen.
     */
    if (!test_bit(CGRP_FREEZE, &src->flags) &&
        !test_bit(CGRP_FREEZE, &dst->flags) &&
        !task->frozen)
        return;

    /*
     * Adjust counters of freezing and frozen tasks.
     * Note, that if the task is frozen, but the destination cgroup is not
     * frozen, we bump both counters to keep them balanced.
     */
    if (task->frozen) {
        cgroup_inc_frozen_cnt(dst);
        cgroup_dec_frozen_cnt(src);
    }
    cgroup_update_frozen(dst);
    cgroup_update_frozen(src);

    /*
     * Force the task to the desired state.
     */
    cgroup_freeze_task(task, test_bit(CGRP_FREEZE, &dst->flags));
}

void cgroup_freeze(struct cgroup *cgrp, bool freeze)
{
    struct cgroup_subsys_state *css;
    struct cgroup *parent;
    struct cgroup *dsct;
    bool applied = false;
    u64 ts_nsec;
    bool old_e;

    lockdep_assert_held(&cgroup_mutex);

    /*
     * Nothing changed? Just exit.
     */
    if (cgrp->freezer.freeze == freeze)
        return;

    cgrp->freezer.freeze = freeze;

    ts_nsec = ktime_get_ns();

    /*
     * Propagate changes downwards the cgroup tree.
     */
    css_for_each_descendant_pre(css, &cgrp->self) {
        dsct = css->cgroup;

        if (cgroup_is_dead(dsct))
            continue;

        /*
         * e_freeze is affected by parent's e_freeze and dst's freeze.
         * If old e_freeze eq new e_freeze, no change, its children
         * will not be affected. So do nothing and skip the subtree
         */
        old_e = dsct->freezer.e_freeze;
        parent = cgroup_parent(dsct);
        dsct->freezer.e_freeze = (dsct->freezer.freeze ||
                                  parent->freezer.e_freeze);
        if (dsct->freezer.e_freeze == old_e) {
            css = css_rightmost_descendant(css);
            continue;
        }

        /*
         * Do change actual state: freeze or unfreeze.
         */
        cgroup_do_freeze(dsct, freeze, ts_nsec);
        applied = true;
    }

    /*
     * Even if the actual state hasn't changed, let's notify a user.
     * The state can be enforced by an ancestor cgroup: the cgroup
     * can already be in the desired state or it can be locked in the
     * opposite state, so that the transition will never happen.
     * In both cases it's better to notify a user, that there is
     * nothing to wait for.
     */
    if (!applied) {
        TRACE_CGROUP_PATH(notify_frozen, cgrp,
                          test_bit(CGRP_FROZEN, &cgrp->flags));
        cgroup_file_notify(&cgrp->events_file);
    }
}
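/*
 * Illustrative sketch (not part of the kernel source above): a minimal
 * userspace program exercising the cgroup v2 freezer implemented by
 * cgroup_freeze()/cgroup_update_frozen(). Writing "1" to cgroup.freeze
 * requests the freeze; the "frozen" key in cgroup.events flips to 1 once
 * every task in the cgroup has been frozen. The path /sys/fs/cgroup/demo is
 * an assumed example cgroup.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if cgroup.events reports "frozen 1", 0 if not, -1 on error. */
static int events_frozen(const char *path, char *buf, size_t len)
{
    ssize_t n;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, buf, len - 1);
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return strstr(buf, "frozen 1") ? 1 : 0;
}

int main(void)
{
    char buf[256];
    int fd, frozen;

    /* Request the freeze: echoes "1" into cgroup.freeze. */
    fd = open("/sys/fs/cgroup/demo/cgroup.freeze", O_WRONLY);
    if (fd < 0) {
        perror("open cgroup.freeze");
        return 1;
    }
    if (write(fd, "1", 1) != 1) {
        perror("write cgroup.freeze");
        close(fd);
        return 1;
    }
    close(fd);

    /* Wait for the kernel to report the cgroup as frozen. */
    do {
        frozen = events_frozen("/sys/fs/cgroup/demo/cgroup.events",
                               buf, sizeof(buf));
        if (frozen < 0) {
            perror("read cgroup.events");
            return 1;
        }
        usleep(10000);
    } while (!frozen);

    puts("cgroup is frozen");
    return 0;
}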
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * INET         An implementation of the TCP/IP protocol suite for the LINUX
 *              operating system. INET is implemented using the BSD Socket
 *              interface as the means of communication with the user level.
 *
 *              PF_INET protocol family socket handler.
 *
 * Authors:     Ross Biro
 *              Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 *              Florian La Roche, <flla@stud.uni-sb.de>
 *              Alan Cox, <A.Cox@swansea.ac.uk>
 *
 * Changes (see also sock.c)
 *
 *              piggy,
 *              Karl Knutson    :       Socket protocol table
 *              A.N.Kuznetsov   :       Socket death error in accept().
 *              John Richardson :       Fix non blocking error in connect()
 *                                      so sockets that fail to connect
 *                                      don't return -EINPROGRESS.
 *              Alan Cox        :       Asynchronous I/O support
 *              Alan Cox        :       Keep correct socket pointer on sock
 *                                      structures
 *                                      when accept() ed
 *              Alan Cox        :       Semantics of SO_LINGER aren't state
 *                                      moved to close when you look carefully.
 *                                      With this fixed and the accept bug fixed
 *                                      some RPC stuff seems happier.
 *              Niibe Yutaka    :       4.4BSD style write async I/O
 *              Alan Cox,
 *              Tony Gale       :       Fixed reuse semantics.
 *              Alan Cox        :       bind() shouldn't abort existing but dead
 *                                      sockets. Stops FTP netin:.. I hope.
 *              Alan Cox        :       bind() works correctly for RAW sockets.
 *                                      Note that FreeBSD at least was broken
 *                                      in this respect so be careful with
 *                                      compatibility tests...
 *              Alan Cox        :       routing cache support
 *              Alan Cox        :       memzero the socket structure for
 *                                      compactness.
* Matt Day : nonblock connect error handler * Alan Cox : Allow large numbers of pending sockets * (eg for big web sites), but only if * specifically application requested. * Alan Cox : New buffering throughout IP. Used * dumbly. * Alan Cox : New buffering now used smartly. * Alan Cox : BSD rather than common sense * interpretation of listen. * Germano Caronni : Assorted small races. * Alan Cox : sendmsg/recvmsg basic support. * Alan Cox : Only sendmsg/recvmsg now supported. * Alan Cox : Locked down bind (see security list). * Alan Cox : Loosened bind a little. * Mike McLagan : ADD/DEL DLCI Ioctls * Willy Konynenberg : Transparent proxying support. * David S. Miller : New socket lookup architecture. * Some other random speedups. * Cyrus Durgin : Cleaned up file for kmod hacks. * Andi Kleen : Fix inet_stream_connect TCP race. */ #define pr_fmt(fmt) "IPv4: " fmt #include <linux/err.h> #include <linux/errno.h> #include <linux/types.h> #include <linux/socket.h> #include <linux/in.h> #include <linux/kernel.h> #include <linux/kmod.h> #include <linux/sched.h> #include <linux/timer.h> #include <linux/string.h> #include <linux/sockios.h> #include <linux/net.h> #include <linux/capability.h> #include <linux/fcntl.h> #include <linux/mm.h> #include <linux/interrupt.h> #include <linux/stat.h> #include <linux/init.h> #include <linux/poll.h> #include <linux/netfilter_ipv4.h> #include <linux/random.h> #include <linux/slab.h> #include <linux/uaccess.h> #include <linux/inet.h> #include <linux/igmp.h> #include <linux/inetdevice.h> #include <linux/netdevice.h> #include <net/checksum.h> #include <net/ip.h> #include <net/protocol.h> #include <net/arp.h> #include <net/route.h> #include <net/ip_fib.h> #include <net/inet_connection_sock.h> #include <net/gro.h> #include <net/gso.h> #include <net/tcp.h> #include <net/psp.h> #include <net/udp.h> #include <net/udplite.h> #include <net/ping.h> #include <linux/skbuff.h> #include <net/sock.h> #include <net/raw.h> #include <net/icmp.h> #include <net/inet_common.h> #include <net/ip_tunnels.h> #include <net/xfrm.h> #include <net/net_namespace.h> #include <net/secure_seq.h> #ifdef CONFIG_IP_MROUTE #include <linux/mroute.h> #endif #include <net/l3mdev.h> #include <net/compat.h> #include <net/rps.h> #include <trace/events/sock.h> /* The inetsw table contains everything that inet_create needs to * build a new socket. */ static struct list_head inetsw[SOCK_MAX]; static DEFINE_SPINLOCK(inetsw_lock); /* New destruction routine */ void inet_sock_destruct(struct sock *sk) { struct inet_sock *inet = inet_sk(sk); __skb_queue_purge(&sk->sk_receive_queue); __skb_queue_purge(&sk->sk_error_queue); sk_mem_reclaim_final(sk); if (sk->sk_type == SOCK_STREAM && sk->sk_state != TCP_CLOSE) { pr_err("Attempt to release TCP socket in state %d %p\n", sk->sk_state, sk); return; } if (!sock_flag(sk, SOCK_DEAD)) { pr_err("Attempt to release alive inet socket %p\n", sk); return; } WARN_ON_ONCE(atomic_read(&sk->sk_rmem_alloc)); WARN_ON_ONCE(refcount_read(&sk->sk_wmem_alloc)); WARN_ON_ONCE(sk->sk_wmem_queued); WARN_ON_ONCE(sk->sk_forward_alloc); kfree(rcu_dereference_protected(inet->inet_opt, 1)); dst_release(rcu_dereference_protected(sk->sk_dst_cache, 1)); dst_release(rcu_dereference_protected(sk->sk_rx_dst, 1)); psp_sk_assoc_free(sk); } EXPORT_SYMBOL(inet_sock_destruct); /* * The routines beyond this point handle the behaviour of an AF_INET * socket object. Mostly it punts to the subprotocols of IP to do * the work. */ /* * Automatically bind an unbound socket. 
*/ static int inet_autobind(struct sock *sk) { struct inet_sock *inet; /* We may need to bind the socket. */ lock_sock(sk); inet = inet_sk(sk); if (!inet->inet_num) { if (sk->sk_prot->get_port(sk, 0)) { release_sock(sk); return -EAGAIN; } inet->inet_sport = htons(inet->inet_num); } release_sock(sk); return 0; } int __inet_listen_sk(struct sock *sk, int backlog) { unsigned char old_state = sk->sk_state; int err, tcp_fastopen; if (!((1 << old_state) & (TCPF_CLOSE | TCPF_LISTEN))) return -EINVAL; WRITE_ONCE(sk->sk_max_ack_backlog, backlog); /* Really, if the socket is already in listen state * we can only allow the backlog to be adjusted. */ if (old_state != TCP_LISTEN) { /* Enable TFO w/o requiring TCP_FASTOPEN socket option. * Note that only TCP sockets (SOCK_STREAM) will reach here. * Also fastopen backlog may already been set via the option * because the socket was in TCP_LISTEN state previously but * was shutdown() rather than close(). */ tcp_fastopen = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_fastopen); if ((tcp_fastopen & TFO_SERVER_WO_SOCKOPT1) && (tcp_fastopen & TFO_SERVER_ENABLE) && !inet_csk(sk)->icsk_accept_queue.fastopenq.max_qlen) { fastopen_queue_tune(sk, backlog); tcp_fastopen_init_key_once(sock_net(sk)); } err = inet_csk_listen_start(sk); if (err) return err; tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_LISTEN_CB, 0, NULL); } return 0; } /* * Move a socket into listening state. */ int inet_listen(struct socket *sock, int backlog) { struct sock *sk = sock->sk; int err = -EINVAL; lock_sock(sk); if (sock->state != SS_UNCONNECTED || sock->type != SOCK_STREAM) goto out; err = __inet_listen_sk(sk, backlog); out: release_sock(sk); return err; } EXPORT_SYMBOL(inet_listen); /* * Create an inet socket. */ static int inet_create(struct net *net, struct socket *sock, int protocol, int kern) { struct sock *sk; struct inet_protosw *answer; struct inet_sock *inet; struct proto *answer_prot; unsigned char answer_flags; int try_loading_module = 0; int err; if (protocol < 0 || protocol >= IPPROTO_MAX) return -EINVAL; sock->state = SS_UNCONNECTED; /* Look for the requested type/protocol pair. */ lookup_protocol: err = -ESOCKTNOSUPPORT; rcu_read_lock(); list_for_each_entry_rcu(answer, &inetsw[sock->type], list) { err = 0; /* Check the non-wild match. */ if (protocol == answer->protocol) { if (protocol != IPPROTO_IP) break; } else { /* Check for the two wild cases. */ if (IPPROTO_IP == protocol) { protocol = answer->protocol; break; } if (IPPROTO_IP == answer->protocol) break; } err = -EPROTONOSUPPORT; } if (unlikely(err)) { if (try_loading_module < 2) { rcu_read_unlock(); /* * Be more specific, e.g. net-pf-2-proto-132-type-1 * (net-pf-PF_INET-proto-IPPROTO_SCTP-type-SOCK_STREAM) */ if (++try_loading_module == 1) request_module("net-pf-%d-proto-%d-type-%d", PF_INET, protocol, sock->type); /* * Fall back to generic, e.g. 
net-pf-2-proto-132 * (net-pf-PF_INET-proto-IPPROTO_SCTP) */ else request_module("net-pf-%d-proto-%d", PF_INET, protocol); goto lookup_protocol; } else goto out_rcu_unlock; } err = -EPERM; if (sock->type == SOCK_RAW && !kern && !ns_capable(net->user_ns, CAP_NET_RAW)) goto out_rcu_unlock; sock->ops = answer->ops; answer_prot = answer->prot; answer_flags = answer->flags; rcu_read_unlock(); WARN_ON(!answer_prot->slab); err = -ENOMEM; sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot, kern); if (!sk) goto out; err = 0; if (INET_PROTOSW_REUSE & answer_flags) sk->sk_reuse = SK_CAN_REUSE; if (INET_PROTOSW_ICSK & answer_flags) inet_init_csk_locks(sk); inet = inet_sk(sk); inet_assign_bit(IS_ICSK, sk, INET_PROTOSW_ICSK & answer_flags); inet_clear_bit(NODEFRAG, sk); if (SOCK_RAW == sock->type) { inet->inet_num = protocol; if (IPPROTO_RAW == protocol) inet_set_bit(HDRINCL, sk); } if (READ_ONCE(net->ipv4.sysctl_ip_no_pmtu_disc)) inet->pmtudisc = IP_PMTUDISC_DONT; else inet->pmtudisc = IP_PMTUDISC_WANT; atomic_set(&inet->inet_id, 0); sock_init_data(sock, sk); sk->sk_destruct = inet_sock_destruct; sk->sk_protocol = protocol; sk->sk_backlog_rcv = sk->sk_prot->backlog_rcv; sk->sk_txrehash = READ_ONCE(net->core.sysctl_txrehash); inet->uc_ttl = -1; inet_set_bit(MC_LOOP, sk); inet->mc_ttl = 1; inet_set_bit(MC_ALL, sk); inet->mc_index = 0; inet->mc_list = NULL; inet->rcv_tos = 0; if (inet->inet_num) { /* It assumes that any protocol which allows * the user to assign a number at socket * creation time automatically * shares. */ inet->inet_sport = htons(inet->inet_num); /* Add to protocol hash chains. */ err = sk->sk_prot->hash(sk); if (err) goto out_sk_release; } if (sk->sk_prot->init) { err = sk->sk_prot->init(sk); if (err) goto out_sk_release; } if (!kern) { err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk); if (err) goto out_sk_release; } out: return err; out_rcu_unlock: rcu_read_unlock(); goto out; out_sk_release: sk_common_release(sk); sock->sk = NULL; goto out; } /* * The peer socket should always be NULL (or else). When we call this * function we are destroying the object and from then on nobody * should refer to it. */ int inet_release(struct socket *sock) { struct sock *sk = sock->sk; if (sk) { long timeout; if (!sk->sk_kern_sock) BPF_CGROUP_RUN_PROG_INET_SOCK_RELEASE(sk); /* Applications forget to leave groups before exiting */ ip_mc_drop_socket(sk); /* If linger is set, we don't return until the close * is complete. Otherwise we return immediately. The * actually closing is done the same either way. * * If the close is due to the process exiting, we never * linger.. */ timeout = 0; if (sock_flag(sk, SOCK_LINGER) && !(current->flags & PF_EXITING)) timeout = sk->sk_lingertime; sk->sk_prot->close(sk, timeout); sock->sk = NULL; } return 0; } EXPORT_SYMBOL(inet_release); int inet_bind_sk(struct sock *sk, struct sockaddr_unsized *uaddr, int addr_len) { u32 flags = BIND_WITH_LOCK; int err; /* If the socket has its own bind function then use it. (RAW) */ if (sk->sk_prot->bind) { return sk->sk_prot->bind(sk, uaddr, addr_len); } if (addr_len < sizeof(struct sockaddr_in)) return -EINVAL; /* BPF prog is run before any checks are done so that if the prog * changes context in a wrong way it will be caught. 
*/ err = BPF_CGROUP_RUN_PROG_INET_BIND_LOCK(sk, uaddr, &addr_len, CGROUP_INET4_BIND, &flags); if (err) return err; return __inet_bind(sk, uaddr, addr_len, flags); } int inet_bind(struct socket *sock, struct sockaddr_unsized *uaddr, int addr_len) { return inet_bind_sk(sock->sk, uaddr, addr_len); } EXPORT_SYMBOL(inet_bind); int __inet_bind(struct sock *sk, struct sockaddr_unsized *uaddr, int addr_len, u32 flags) { struct sockaddr_in *addr = (struct sockaddr_in *)uaddr; struct inet_sock *inet = inet_sk(sk); struct net *net = sock_net(sk); unsigned short snum; int chk_addr_ret; u32 tb_id = RT_TABLE_LOCAL; int err; if (addr->sin_family != AF_INET) { /* Compatibility games : accept AF_UNSPEC (mapped to AF_INET) * only if s_addr is INADDR_ANY. */ err = -EAFNOSUPPORT; if (addr->sin_family != AF_UNSPEC || addr->sin_addr.s_addr != htonl(INADDR_ANY)) goto out; } tb_id = l3mdev_fib_table_by_index(net, sk->sk_bound_dev_if) ? : tb_id; chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id); /* Not specified by any standard per-se, however it breaks too * many applications when removed. It is unfortunate since * allowing applications to make a non-local bind solves * several problems with systems using dynamic addressing. * (ie. your servers still start up even if your ISDN link * is temporarily down) */ err = -EADDRNOTAVAIL; if (!inet_addr_valid_or_nonlocal(net, inet, addr->sin_addr.s_addr, chk_addr_ret)) goto out; snum = ntohs(addr->sin_port); err = -EACCES; if (!(flags & BIND_NO_CAP_NET_BIND_SERVICE) && snum && inet_port_requires_bind_service(net, snum) && !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE)) goto out; /* We keep a pair of addresses. rcv_saddr is the one * used by hash lookups, and saddr is used for transmit. * * In the BSD API these are the same except where it * would be illegal to use them (multicast/broadcast) in * which case the sending device address is used. */ if (flags & BIND_WITH_LOCK) lock_sock(sk); /* Check these errors (active socket, double bind). */ err = -EINVAL; if (sk->sk_state != TCP_CLOSE || inet->inet_num) goto out_release_sock; inet->inet_rcv_saddr = inet->inet_saddr = addr->sin_addr.s_addr; if (chk_addr_ret == RTN_MULTICAST || chk_addr_ret == RTN_BROADCAST) inet->inet_saddr = 0; /* Use device */ /* Make sure we are allowed to bind here. */ if (snum || !(inet_test_bit(BIND_ADDRESS_NO_PORT, sk) || (flags & BIND_FORCE_ADDRESS_NO_PORT))) { err = sk->sk_prot->get_port(sk, snum); if (err) { inet->inet_saddr = inet->inet_rcv_saddr = 0; goto out_release_sock; } if (!(flags & BIND_FROM_BPF)) { err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk); if (err) { inet->inet_saddr = inet->inet_rcv_saddr = 0; if (sk->sk_prot->put_port) sk->sk_prot->put_port(sk); goto out_release_sock; } } } if (inet->inet_rcv_saddr) sk->sk_userlocks |= SOCK_BINDADDR_LOCK; if (snum) sk->sk_userlocks |= SOCK_BINDPORT_LOCK; inet->inet_sport = htons(inet->inet_num); inet->inet_daddr = 0; inet->inet_dport = 0; sk_dst_reset(sk); err = 0; out_release_sock: if (flags & BIND_WITH_LOCK) release_sock(sk); out: return err; } int inet_dgram_connect(struct socket *sock, struct sockaddr_unsized *uaddr, int addr_len, int flags) { struct sock *sk = sock->sk; const struct proto *prot; int err; if (addr_len < sizeof(uaddr->sa_family)) return -EINVAL; /* IPV6_ADDRFORM can change sk->sk_prot under us. 
*/ prot = READ_ONCE(sk->sk_prot); if (uaddr->sa_family == AF_UNSPEC) return prot->disconnect(sk, flags); if (BPF_CGROUP_PRE_CONNECT_ENABLED(sk)) { err = prot->pre_connect(sk, uaddr, addr_len); if (err) return err; } if (data_race(!inet_sk(sk)->inet_num) && inet_autobind(sk)) return -EAGAIN; return prot->connect(sk, uaddr, addr_len); } EXPORT_SYMBOL(inet_dgram_connect); static long inet_wait_for_connect(struct sock *sk, long timeo, int writebias) { DEFINE_WAIT_FUNC(wait, woken_wake_function); add_wait_queue(sk_sleep(sk), &wait); sk->sk_write_pending += writebias; /* Basic assumption: if someone sets sk->sk_err, he _must_ * change state of the socket from TCP_SYN_*. * Connect() does not allow to get error notifications * without closing the socket. */ while ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) { release_sock(sk); timeo = wait_woken(&wait, TASK_INTERRUPTIBLE, timeo); lock_sock(sk); if (signal_pending(current) || !timeo) break; } remove_wait_queue(sk_sleep(sk), &wait); sk->sk_write_pending -= writebias; return timeo; } /* * Connect to a remote host. There is regrettably still a little * TCP 'magic' in here. */ int __inet_stream_connect(struct socket *sock, struct sockaddr_unsized *uaddr, int addr_len, int flags, int is_sendmsg) { struct sock *sk = sock->sk; int err; long timeo; /* * uaddr can be NULL and addr_len can be 0 if: * sk is a TCP fastopen active socket and * TCP_FASTOPEN_CONNECT sockopt is set and * we already have a valid cookie for this socket. * In this case, user can call write() after connect(). * write() will invoke tcp_sendmsg_fastopen() which calls * __inet_stream_connect(). */ if (uaddr) { if (addr_len < sizeof(uaddr->sa_family)) return -EINVAL; if (uaddr->sa_family == AF_UNSPEC) { sk->sk_disconnects++; err = sk->sk_prot->disconnect(sk, flags); sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED; goto out; } } switch (sock->state) { default: err = -EINVAL; goto out; case SS_CONNECTED: err = -EISCONN; goto out; case SS_CONNECTING: if (inet_test_bit(DEFER_CONNECT, sk)) err = is_sendmsg ? -EINPROGRESS : -EISCONN; else err = -EALREADY; /* Fall out of switch with err, set for this state */ break; case SS_UNCONNECTED: err = -EISCONN; if (sk->sk_state != TCP_CLOSE) goto out; if (BPF_CGROUP_PRE_CONNECT_ENABLED(sk)) { err = sk->sk_prot->pre_connect(sk, uaddr, addr_len); if (err) goto out; } err = sk->sk_prot->connect(sk, uaddr, addr_len); if (err < 0) goto out; sock->state = SS_CONNECTING; if (!err && inet_test_bit(DEFER_CONNECT, sk)) goto out; /* Just entered SS_CONNECTING state; the only * difference is that return value in non-blocking * case is EINPROGRESS, rather than EALREADY. */ err = -EINPROGRESS; break; } timeo = sock_sndtimeo(sk, flags & O_NONBLOCK); if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) { int writebias = (sk->sk_protocol == IPPROTO_TCP) && tcp_sk(sk)->fastopen_req && tcp_sk(sk)->fastopen_req->data ? 1 : 0; int dis = sk->sk_disconnects; /* Error code is set above */ if (!timeo || !inet_wait_for_connect(sk, timeo, writebias)) goto out; err = sock_intr_errno(timeo); if (signal_pending(current)) goto out; if (dis != sk->sk_disconnects) { err = -EPIPE; goto out; } } /* Connection was closed by RST, timeout, ICMP error * or another process disconnected us. */ if (sk->sk_state == TCP_CLOSE) goto sock_error; /* sk->sk_err may be not zero now, if RECVERR was ordered by user * and error was received after socket entered established state. * Hence, it is handled normally after connect() return successfully. 
*/ sock->state = SS_CONNECTED; err = 0; out: return err; sock_error: err = sock_error(sk) ? : -ECONNABORTED; sock->state = SS_UNCONNECTED; sk->sk_disconnects++; if (sk->sk_prot->disconnect(sk, flags)) sock->state = SS_DISCONNECTING; goto out; } EXPORT_SYMBOL(__inet_stream_connect); int inet_stream_connect(struct socket *sock, struct sockaddr_unsized *uaddr, int addr_len, int flags) { int err; lock_sock(sock->sk); err = __inet_stream_connect(sock, uaddr, addr_len, flags, 0); release_sock(sock->sk); return err; } EXPORT_SYMBOL(inet_stream_connect); void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *newsk) { if (mem_cgroup_sockets_enabled) { mem_cgroup_sk_alloc(newsk); __sk_charge(newsk, GFP_KERNEL); } sock_rps_record_flow(newsk); WARN_ON(!((1 << newsk->sk_state) & (TCPF_ESTABLISHED | TCPF_SYN_RECV | TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2 | TCPF_CLOSING | TCPF_CLOSE_WAIT | TCPF_CLOSE))); if (test_bit(SOCK_SUPPORT_ZC, &sock->flags)) set_bit(SOCK_SUPPORT_ZC, &newsock->flags); sock_graft(newsk, newsock); newsock->state = SS_CONNECTED; } EXPORT_SYMBOL_GPL(__inet_accept); /* * Accept a pending connection. The TCP layer now gives BSD semantics. */ int inet_accept(struct socket *sock, struct socket *newsock, struct proto_accept_arg *arg) { struct sock *sk1 = sock->sk, *sk2; /* IPV6_ADDRFORM can change sk->sk_prot under us. */ arg->err = -EINVAL; sk2 = READ_ONCE(sk1->sk_prot)->accept(sk1, arg); if (!sk2) return arg->err; lock_sock(sk2); __inet_accept(sock, newsock, sk2); release_sock(sk2); return 0; } EXPORT_SYMBOL(inet_accept); /* * This does both peername and sockname. */ int inet_getname(struct socket *sock, struct sockaddr *uaddr, int peer) { struct sock *sk = sock->sk; struct inet_sock *inet = inet_sk(sk); DECLARE_SOCKADDR(struct sockaddr_in *, sin, uaddr); int sin_addr_len = sizeof(*sin); sin->sin_family = AF_INET; lock_sock(sk); if (peer) { if (!inet->inet_dport || (((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_SYN_SENT)) && peer == 1)) { release_sock(sk); return -ENOTCONN; } sin->sin_port = inet->inet_dport; sin->sin_addr.s_addr = inet->inet_daddr; BPF_CGROUP_RUN_SA_PROG(sk, sin, &sin_addr_len, CGROUP_INET4_GETPEERNAME); } else { __be32 addr = inet->inet_rcv_saddr; if (!addr) addr = inet->inet_saddr; sin->sin_port = inet->inet_sport; sin->sin_addr.s_addr = addr; BPF_CGROUP_RUN_SA_PROG(sk, sin, &sin_addr_len, CGROUP_INET4_GETSOCKNAME); } release_sock(sk); memset(sin->sin_zero, 0, sizeof(sin->sin_zero)); return sin_addr_len; } EXPORT_SYMBOL(inet_getname); int inet_send_prepare(struct sock *sk) { sock_rps_record_flow(sk); /* We may need to bind the socket. */ if (data_race(!inet_sk(sk)->inet_num) && !sk->sk_prot->no_autobind && inet_autobind(sk)) return -EAGAIN; return 0; } EXPORT_SYMBOL_GPL(inet_send_prepare); int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) { struct sock *sk = sock->sk; if (unlikely(inet_send_prepare(sk))) return -EAGAIN; return INDIRECT_CALL_2(sk->sk_prot->sendmsg, tcp_sendmsg, udp_sendmsg, sk, msg, size); } EXPORT_SYMBOL(inet_sendmsg); void inet_splice_eof(struct socket *sock) { const struct proto *prot; struct sock *sk = sock->sk; if (unlikely(inet_send_prepare(sk))) return; /* IPV6_ADDRFORM can change sk->sk_prot under us. 
*/ prot = READ_ONCE(sk->sk_prot); if (prot->splice_eof) prot->splice_eof(sock); } EXPORT_SYMBOL_GPL(inet_splice_eof); INDIRECT_CALLABLE_DECLARE(int udp_recvmsg(struct sock *, struct msghdr *, size_t, int, int *)); int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int flags) { struct sock *sk = sock->sk; int addr_len = 0; int err; if (likely(!(flags & MSG_ERRQUEUE))) sock_rps_record_flow(sk); err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg, sk, msg, size, flags, &addr_len); if (err >= 0) msg->msg_namelen = addr_len; return err; } EXPORT_SYMBOL(inet_recvmsg); int inet_shutdown(struct socket *sock, int how) { struct sock *sk = sock->sk; int err = 0; /* This should really check to make sure * the socket is a TCP socket. (WHY AC...) */ how++; /* maps 0->1 has the advantage of making bit 1 rcvs and 1->2 bit 2 snds. 2->3 */ if ((how & ~SHUTDOWN_MASK) || !how) /* MAXINT->0 */ return -EINVAL; lock_sock(sk); if (sock->state == SS_CONNECTING) { if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV | TCPF_CLOSE)) sock->state = SS_DISCONNECTING; else sock->state = SS_CONNECTED; } switch (sk->sk_state) { case TCP_CLOSE: err = -ENOTCONN; /* Hack to wake up other listeners, who can poll for EPOLLHUP, even on eg. unconnected UDP sockets -- RR */ fallthrough; default: WRITE_ONCE(sk->sk_shutdown, sk->sk_shutdown | how); if (sk->sk_prot->shutdown) sk->sk_prot->shutdown(sk, how); break; /* Remaining two branches are temporary solution for missing * close() in multithreaded environment. It is _not_ a good idea, * but we have no choice until close() is repaired at VFS level. */ case TCP_LISTEN: if (!(how & RCV_SHUTDOWN)) break; fallthrough; case TCP_SYN_SENT: err = sk->sk_prot->disconnect(sk, O_NONBLOCK); sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED; break; } /* Wake up anyone sleeping in poll. */ sk->sk_state_change(sk); release_sock(sk); return err; } EXPORT_SYMBOL(inet_shutdown); /* * ioctl() calls you can issue on an INET socket. Most of these are * device configuration and stuff and very rarely used. Some ioctls * pass on to the socket itself. * * NOTE: I like the idea of a module for the config stuff. ie ifconfig * loads the devconfigure module does its configuring and unloads it. * There's a good 20K of config code hanging around the kernel. 
*/ int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) { struct sock *sk = sock->sk; int err = 0; struct net *net = sock_net(sk); void __user *p = (void __user *)arg; struct ifreq ifr; struct rtentry rt; switch (cmd) { case SIOCADDRT: case SIOCDELRT: if (copy_from_user(&rt, p, sizeof(struct rtentry))) return -EFAULT; err = ip_rt_ioctl(net, cmd, &rt); break; case SIOCRTMSG: err = -EINVAL; break; case SIOCDARP: case SIOCGARP: case SIOCSARP: err = arp_ioctl(net, cmd, (void __user *)arg); break; case SIOCGIFADDR: case SIOCGIFBRDADDR: case SIOCGIFNETMASK: case SIOCGIFDSTADDR: case SIOCGIFPFLAGS: if (get_user_ifreq(&ifr, NULL, p)) return -EFAULT; err = devinet_ioctl(net, cmd, &ifr); if (!err && put_user_ifreq(&ifr, p)) err = -EFAULT; break; case SIOCSIFADDR: case SIOCSIFBRDADDR: case SIOCSIFNETMASK: case SIOCSIFDSTADDR: case SIOCSIFPFLAGS: case SIOCSIFFLAGS: if (get_user_ifreq(&ifr, NULL, p)) return -EFAULT; err = devinet_ioctl(net, cmd, &ifr); break; default: if (sk->sk_prot->ioctl) err = sk_ioctl(sk, cmd, (void __user *)arg); else err = -ENOIOCTLCMD; break; } return err; } EXPORT_SYMBOL(inet_ioctl); #ifdef CONFIG_COMPAT static int inet_compat_routing_ioctl(struct sock *sk, unsigned int cmd, struct compat_rtentry __user *ur) { compat_uptr_t rtdev; struct rtentry rt; if (copy_from_user(&rt.rt_dst, &ur->rt_dst, 3 * sizeof(struct sockaddr)) || get_user(rt.rt_flags, &ur->rt_flags) || get_user(rt.rt_metric, &ur->rt_metric) || get_user(rt.rt_mtu, &ur->rt_mtu) || get_user(rt.rt_window, &ur->rt_window) || get_user(rt.rt_irtt, &ur->rt_irtt) || get_user(rtdev, &ur->rt_dev)) return -EFAULT; rt.rt_dev = compat_ptr(rtdev); return ip_rt_ioctl(sock_net(sk), cmd, &rt); } static int inet_compat_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) { void __user *argp = compat_ptr(arg); struct sock *sk = sock->sk; switch (cmd) { case SIOCADDRT: case SIOCDELRT: return inet_compat_routing_ioctl(sk, cmd, argp); default: if (!sk->sk_prot->compat_ioctl) return -ENOIOCTLCMD; return sk->sk_prot->compat_ioctl(sk, cmd, arg); } } #endif /* CONFIG_COMPAT */ const struct proto_ops inet_stream_ops = { .family = PF_INET, .owner = THIS_MODULE, .release = inet_release, .bind = inet_bind, .connect = inet_stream_connect, .socketpair = sock_no_socketpair, .accept = inet_accept, .getname = inet_getname, .poll = tcp_poll, .ioctl = inet_ioctl, .gettstamp = sock_gettstamp, .listen = inet_listen, .shutdown = inet_shutdown, .setsockopt = sock_common_setsockopt, .getsockopt = sock_common_getsockopt, .sendmsg = inet_sendmsg, .recvmsg = inet_recvmsg, #ifdef CONFIG_MMU .mmap = tcp_mmap, #endif .splice_eof = inet_splice_eof, .splice_read = tcp_splice_read, .set_peek_off = sk_set_peek_off, .read_sock = tcp_read_sock, .read_skb = tcp_read_skb, .sendmsg_locked = tcp_sendmsg_locked, .peek_len = tcp_peek_len, #ifdef CONFIG_COMPAT .compat_ioctl = inet_compat_ioctl, #endif .set_rcvlowat = tcp_set_rcvlowat, }; EXPORT_SYMBOL(inet_stream_ops); const struct proto_ops inet_dgram_ops = { .family = PF_INET, .owner = THIS_MODULE, .release = inet_release, .bind = inet_bind, .connect = inet_dgram_connect, .socketpair = sock_no_socketpair, .accept = sock_no_accept, .getname = inet_getname, .poll = udp_poll, .ioctl = inet_ioctl, .gettstamp = sock_gettstamp, .listen = sock_no_listen, .shutdown = inet_shutdown, .setsockopt = sock_common_setsockopt, .getsockopt = sock_common_getsockopt, .sendmsg = inet_sendmsg, .read_skb = udp_read_skb, .recvmsg = inet_recvmsg, .mmap = sock_no_mmap, .splice_eof = inet_splice_eof, .set_peek_off = 
udp_set_peek_off, #ifdef CONFIG_COMPAT .compat_ioctl = inet_compat_ioctl, #endif }; EXPORT_SYMBOL(inet_dgram_ops); /* * For SOCK_RAW sockets; should be the same as inet_dgram_ops but without * udp_poll */ static const struct proto_ops inet_sockraw_ops = { .family = PF_INET, .owner = THIS_MODULE, .release = inet_release, .bind = inet_bind, .connect = inet_dgram_connect, .socketpair = sock_no_socketpair, .accept = sock_no_accept, .getname = inet_getname, .poll = datagram_poll, .ioctl = inet_ioctl, .gettstamp = sock_gettstamp, .listen = sock_no_listen, .shutdown = inet_shutdown, .setsockopt = sock_common_setsockopt, .getsockopt = sock_common_getsockopt, .sendmsg = inet_sendmsg, .recvmsg = inet_recvmsg, .mmap = sock_no_mmap, .splice_eof = inet_splice_eof, #ifdef CONFIG_COMPAT .compat_ioctl = inet_compat_ioctl, #endif }; static const struct net_proto_family inet_family_ops = { .family = PF_INET, .create = inet_create, .owner = THIS_MODULE, }; /* Upon startup we insert all the elements in inetsw_array[] into * the linked list inetsw. */ static struct inet_protosw inetsw_array[] = { { .type = SOCK_STREAM, .protocol = IPPROTO_TCP, .prot = &tcp_prot, .ops = &inet_stream_ops, .flags = INET_PROTOSW_PERMANENT | INET_PROTOSW_ICSK, }, { .type = SOCK_DGRAM, .protocol = IPPROTO_UDP, .prot = &udp_prot, .ops = &inet_dgram_ops, .flags = INET_PROTOSW_PERMANENT, }, { .type = SOCK_DGRAM, .protocol = IPPROTO_ICMP, .prot = &ping_prot, .ops = &inet_sockraw_ops, .flags = INET_PROTOSW_REUSE, }, { .type = SOCK_RAW, .protocol = IPPROTO_IP, /* wild card */ .prot = &raw_prot, .ops = &inet_sockraw_ops, .flags = INET_PROTOSW_REUSE, } }; #define INETSW_ARRAY_LEN ARRAY_SIZE(inetsw_array) void inet_register_protosw(struct inet_protosw *p) { struct list_head *lh; struct inet_protosw *answer; int protocol = p->protocol; struct list_head *last_perm; spin_lock_bh(&inetsw_lock); if (p->type >= SOCK_MAX) goto out_illegal; /* If we are trying to override a permanent protocol, bail. */ last_perm = &inetsw[p->type]; list_for_each(lh, &inetsw[p->type]) { answer = list_entry(lh, struct inet_protosw, list); /* Check only the non-wild match. */ if ((INET_PROTOSW_PERMANENT & answer->flags) == 0) break; if (protocol == answer->protocol) goto out_permanent; last_perm = lh; } /* Add the new entry after the last permanent entry if any, so that * the new entry does not override a permanent entry when matched with * a wild-card protocol. But it is allowed to override any existing * non-permanent entry. This means that when we remove this entry, the * system automatically returns to the old behavior. 
*/ list_add_rcu(&p->list, last_perm); out: spin_unlock_bh(&inetsw_lock); return; out_permanent: pr_err("Attempt to override permanent protocol %d\n", protocol); goto out; out_illegal: pr_err("Ignoring attempt to register invalid socket type %d\n", p->type); goto out; } EXPORT_SYMBOL(inet_register_protosw); void inet_unregister_protosw(struct inet_protosw *p) { if (INET_PROTOSW_PERMANENT & p->flags) { pr_err("Attempt to unregister permanent protocol %d\n", p->protocol); } else { spin_lock_bh(&inetsw_lock); list_del_rcu(&p->list); spin_unlock_bh(&inetsw_lock); synchronize_net(); } } EXPORT_SYMBOL(inet_unregister_protosw); static int inet_sk_reselect_saddr(struct sock *sk) { struct inet_sock *inet = inet_sk(sk); __be32 old_saddr = inet->inet_saddr; __be32 daddr = inet->inet_daddr; struct flowi4 *fl4; struct rtable *rt; __be32 new_saddr; struct ip_options_rcu *inet_opt; int err; inet_opt = rcu_dereference_protected(inet->inet_opt, lockdep_sock_is_held(sk)); if (inet_opt && inet_opt->opt.srr) daddr = inet_opt->opt.faddr; /* Query new route. */ fl4 = &inet->cork.fl.u.ip4; rt = ip_route_connect(fl4, daddr, 0, sk->sk_bound_dev_if, sk->sk_protocol, inet->inet_sport, inet->inet_dport, sk); if (IS_ERR(rt)) return PTR_ERR(rt); new_saddr = fl4->saddr; if (new_saddr == old_saddr) { sk_setup_caps(sk, &rt->dst); return 0; } err = inet_bhash2_update_saddr(sk, &new_saddr, AF_INET); if (err) { ip_rt_put(rt); return err; } sk_setup_caps(sk, &rt->dst); if (READ_ONCE(sock_net(sk)->ipv4.sysctl_ip_dynaddr) > 1) { pr_info("%s(): shifting inet->saddr from %pI4 to %pI4\n", __func__, &old_saddr, &new_saddr); } /* * XXX The only one ugly spot where we need to * XXX really change the sockets identity after * XXX it has entered the hashes. -DaveM * * Besides that, it does not check for connection * uniqueness. Wait for troubles. */ return __sk_prot_rehash(sk); } int inet_sk_rebuild_header(struct sock *sk) { struct rtable *rt = dst_rtable(__sk_dst_check(sk, 0)); struct inet_sock *inet = inet_sk(sk); struct flowi4 *fl4; int err; /* Route is OK, nothing to do. */ if (rt) return 0; /* Reroute. */ fl4 = &inet->cork.fl.u.ip4; inet_sk_init_flowi4(inet, fl4); rt = ip_route_output_flow(sock_net(sk), fl4, sk); if (!IS_ERR(rt)) { err = 0; sk_setup_caps(sk, &rt->dst); } else { err = PTR_ERR(rt); /* Routing failed... 
*/ sk->sk_route_caps = 0; if (!READ_ONCE(sock_net(sk)->ipv4.sysctl_ip_dynaddr) || sk->sk_state != TCP_SYN_SENT || (sk->sk_userlocks & SOCK_BINDADDR_LOCK) || (err = inet_sk_reselect_saddr(sk)) != 0) WRITE_ONCE(sk->sk_err_soft, -err); } return err; } EXPORT_SYMBOL(inet_sk_rebuild_header); void inet_sk_set_state(struct sock *sk, int state) { trace_inet_sock_set_state(sk, sk->sk_state, state); sk->sk_state = state; } EXPORT_SYMBOL(inet_sk_set_state); void inet_sk_state_store(struct sock *sk, int newstate) { trace_inet_sock_set_state(sk, sk->sk_state, newstate); smp_store_release(&sk->sk_state, newstate); } struct sk_buff *inet_gso_segment(struct sk_buff *skb, netdev_features_t features) { bool udpfrag = false, fixedid = false, gso_partial, encap; struct sk_buff *segs = ERR_PTR(-EINVAL); const struct net_offload *ops; unsigned int offset = 0; struct iphdr *iph; int proto, tot_len; int nhoff; int ihl; int id; skb_reset_network_header(skb); nhoff = skb_network_header(skb) - skb_mac_header(skb); if (unlikely(!pskb_may_pull(skb, sizeof(*iph)))) goto out; iph = ip_hdr(skb); ihl = iph->ihl * 4; if (ihl < sizeof(*iph)) goto out; id = ntohs(iph->id); proto = iph->protocol; /* Warning: after this point, iph might be no longer valid */ if (unlikely(!pskb_may_pull(skb, ihl))) goto out; __skb_pull(skb, ihl); encap = SKB_GSO_CB(skb)->encap_level > 0; if (encap) features &= skb->dev->hw_enc_features; SKB_GSO_CB(skb)->encap_level += ihl; skb_reset_transport_header(skb); segs = ERR_PTR(-EPROTONOSUPPORT); fixedid = !!(skb_shinfo(skb)->gso_type & (SKB_GSO_TCP_FIXEDID << encap)); if (!skb->encapsulation || encap) udpfrag = !!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP); ops = rcu_dereference(inet_offloads[proto]); if (likely(ops && ops->callbacks.gso_segment)) { segs = ops->callbacks.gso_segment(skb, features); if (!segs) skb->network_header = skb_mac_header(skb) + nhoff - skb->head; } if (IS_ERR_OR_NULL(segs)) goto out; gso_partial = !!(skb_shinfo(segs)->gso_type & SKB_GSO_PARTIAL); skb = segs; do { iph = (struct iphdr *)(skb_mac_header(skb) + nhoff); if (udpfrag) { iph->frag_off = htons(offset >> 3); if (skb->next) iph->frag_off |= htons(IP_MF); offset += skb->len - nhoff - ihl; tot_len = skb->len - nhoff; } else if (skb_is_gso(skb)) { if (!fixedid) { iph->id = htons(id); id += skb_shinfo(skb)->gso_segs; } if (gso_partial) tot_len = skb_shinfo(skb)->gso_size + SKB_GSO_CB(skb)->data_offset + skb->head - (unsigned char *)iph; else tot_len = skb->len - nhoff; } else { if (!fixedid) iph->id = htons(id++); tot_len = skb->len - nhoff; } iph->tot_len = htons(tot_len); ip_send_check(iph); if (encap) skb_reset_inner_headers(skb); skb->network_header = (u8 *)iph - skb->head; skb_reset_mac_len(skb); } while ((skb = skb->next)); out: return segs; } static struct sk_buff *ipip_gso_segment(struct sk_buff *skb, netdev_features_t features) { if (!(skb_shinfo(skb)->gso_type & SKB_GSO_IPXIP4)) return ERR_PTR(-EINVAL); return inet_gso_segment(skb, features); } struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb) { const struct net_offload *ops; struct sk_buff *pp = NULL; const struct iphdr *iph; struct sk_buff *p; unsigned int hlen; unsigned int off; int flush = 1; int proto; off = skb_gro_offset(skb); hlen = off + sizeof(*iph); iph = skb_gro_header(skb, hlen, off); if (unlikely(!iph)) goto out; proto = iph->protocol; ops = rcu_dereference(inet_offloads[proto]); if (!ops || !ops->callbacks.gro_receive) goto out; if (*(u8 *)iph != 0x45) goto out; if (ip_is_fragment(iph)) goto out; if 
(unlikely(ip_fast_csum((u8 *)iph, 5))) goto out; NAPI_GRO_CB(skb)->proto = proto; flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (ntohl(*(__be32 *)&iph->id) & ~IP_DF)); list_for_each_entry(p, head, list) { struct iphdr *iph2; if (!NAPI_GRO_CB(p)->same_flow) continue; iph2 = (struct iphdr *)(p->data + off); /* The above works because, with the exception of the top * (inner most) layer, we only aggregate pkts with the same * hdr length so all the hdrs we'll need to verify will start * at the same offset. */ if ((iph->protocol ^ iph2->protocol) | ((__force u32)iph->saddr ^ (__force u32)iph2->saddr) | ((__force u32)iph->daddr ^ (__force u32)iph2->daddr)) { NAPI_GRO_CB(p)->same_flow = 0; continue; } } NAPI_GRO_CB(skb)->flush |= flush; NAPI_GRO_CB(skb)->network_offsets[NAPI_GRO_CB(skb)->encap_mark] = off; /* Note : No need to call skb_gro_postpull_rcsum() here, * as we already checked checksum over ipv4 header was 0 */ skb_gro_pull(skb, sizeof(*iph)); skb_set_transport_header(skb, skb_gro_offset(skb)); pp = indirect_call_gro_receive(tcp4_gro_receive, udp4_gro_receive, ops->callbacks.gro_receive, head, skb); out: skb_gro_flush_final(skb, pp, flush); return pp; } static struct sk_buff *ipip_gro_receive(struct list_head *head, struct sk_buff *skb) { if (NAPI_GRO_CB(skb)->encap_mark) { NAPI_GRO_CB(skb)->flush = 1; return NULL; } NAPI_GRO_CB(skb)->encap_mark = 1; return inet_gro_receive(head, skb); } #define SECONDS_PER_DAY 86400 /* inet_current_timestamp - Return IP network timestamp * * Return milliseconds since midnight in network byte order. */ __be32 inet_current_timestamp(void) { u32 secs; u32 msecs; struct timespec64 ts; ktime_get_real_ts64(&ts); /* Get secs since midnight. */ (void)div_u64_rem(ts.tv_sec, SECONDS_PER_DAY, &secs); /* Convert to msecs. */ msecs = secs * MSEC_PER_SEC; /* Convert nsec to msec. */ msecs += (u32)ts.tv_nsec / NSEC_PER_MSEC; /* Convert to network byte order. */ return htonl(msecs); } EXPORT_SYMBOL(inet_current_timestamp); int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len) { unsigned int family = READ_ONCE(sk->sk_family); if (family == AF_INET) return ip_recv_error(sk, msg, len, addr_len); #if IS_ENABLED(CONFIG_IPV6) if (family == AF_INET6) return pingv6_ops.ipv6_recv_error(sk, msg, len, addr_len); #endif return -EINVAL; } EXPORT_SYMBOL(inet_recv_error); int inet_gro_complete(struct sk_buff *skb, int nhoff) { struct iphdr *iph = (struct iphdr *)(skb->data + nhoff); const struct net_offload *ops; __be16 totlen = iph->tot_len; int proto = iph->protocol; int err = -ENOSYS; if (skb->encapsulation) { skb_set_inner_protocol(skb, cpu_to_be16(ETH_P_IP)); skb_set_inner_network_header(skb, nhoff); } iph_set_totlen(iph, skb->len - nhoff); csum_replace2(&iph->check, totlen, iph->tot_len); ops = rcu_dereference(inet_offloads[proto]); if (WARN_ON(!ops || !ops->callbacks.gro_complete)) goto out; /* Only need to add sizeof(*iph) to get to the next hdr below * because any hdr with option will have been flushed in * inet_gro_receive(). 
*/ err = INDIRECT_CALL_2(ops->callbacks.gro_complete, tcp4_gro_complete, udp4_gro_complete, skb, nhoff + sizeof(*iph)); out: return err; } static int ipip_gro_complete(struct sk_buff *skb, int nhoff) { skb->encapsulation = 1; skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP4; return inet_gro_complete(skb, nhoff); } int inet_ctl_sock_create(struct sock **sk, unsigned short family, unsigned short type, unsigned char protocol, struct net *net) { struct socket *sock; int rc = sock_create_kern(net, family, type, protocol, &sock); if (rc == 0) { *sk = sock->sk; (*sk)->sk_allocation = GFP_ATOMIC; (*sk)->sk_use_task_frag = false; /* * Unhash it so that IP input processing does not even see it, * we do not wish this socket to see incoming packets. */ (*sk)->sk_prot->unhash(*sk); } return rc; } EXPORT_SYMBOL_GPL(inet_ctl_sock_create); unsigned long snmp_fold_field(void __percpu *mib, int offt) { unsigned long res = 0; int i; for_each_possible_cpu(i) res += snmp_get_cpu_field(mib, i, offt); return res; } EXPORT_SYMBOL_GPL(snmp_fold_field); #if BITS_PER_LONG==32 u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offt, size_t syncp_offset) { void *bhptr; struct u64_stats_sync *syncp; u64 v; unsigned int start; bhptr = per_cpu_ptr(mib, cpu); syncp = (struct u64_stats_sync *)(bhptr + syncp_offset); do { start = u64_stats_fetch_begin(syncp); v = *(((u64 *)bhptr) + offt); } while (u64_stats_fetch_retry(syncp, start)); return v; } EXPORT_SYMBOL_GPL(snmp_get_cpu_field64); u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_offset) { u64 res = 0; int cpu; for_each_possible_cpu(cpu) { res += snmp_get_cpu_field64(mib, cpu, offt, syncp_offset); } return res; } EXPORT_SYMBOL_GPL(snmp_fold_field64); #endif #ifdef CONFIG_IP_MULTICAST static const struct net_protocol igmp_protocol = { .handler = igmp_rcv, }; #endif static const struct net_protocol icmp_protocol = { .handler = icmp_rcv, .err_handler = icmp_err, .no_policy = 1, }; static __net_init int ipv4_mib_init_net(struct net *net) { int i; net->mib.tcp_statistics = alloc_percpu(struct tcp_mib); if (!net->mib.tcp_statistics) goto err_tcp_mib; net->mib.ip_statistics = alloc_percpu(struct ipstats_mib); if (!net->mib.ip_statistics) goto err_ip_mib; for_each_possible_cpu(i) { struct ipstats_mib *af_inet_stats; af_inet_stats = per_cpu_ptr(net->mib.ip_statistics, i); u64_stats_init(&af_inet_stats->syncp); } net->mib.net_statistics = alloc_percpu(struct linux_mib); if (!net->mib.net_statistics) goto err_net_mib; net->mib.udp_statistics = alloc_percpu(struct udp_mib); if (!net->mib.udp_statistics) goto err_udp_mib; net->mib.udplite_statistics = alloc_percpu(struct udp_mib); if (!net->mib.udplite_statistics) goto err_udplite_mib; net->mib.icmp_statistics = alloc_percpu(struct icmp_mib); if (!net->mib.icmp_statistics) goto err_icmp_mib; net->mib.icmpmsg_statistics = kzalloc(sizeof(struct icmpmsg_mib), GFP_KERNEL); if (!net->mib.icmpmsg_statistics) goto err_icmpmsg_mib; tcp_mib_init(net); return 0; err_icmpmsg_mib: free_percpu(net->mib.icmp_statistics); err_icmp_mib: free_percpu(net->mib.udplite_statistics); err_udplite_mib: free_percpu(net->mib.udp_statistics); err_udp_mib: free_percpu(net->mib.net_statistics); err_net_mib: free_percpu(net->mib.ip_statistics); err_ip_mib: free_percpu(net->mib.tcp_statistics); err_tcp_mib: return -ENOMEM; } static __net_exit void ipv4_mib_exit_net(struct net *net) { kfree(net->mib.icmpmsg_statistics); free_percpu(net->mib.icmp_statistics); free_percpu(net->mib.udplite_statistics); free_percpu(net->mib.udp_statistics); 
free_percpu(net->mib.net_statistics); free_percpu(net->mib.ip_statistics); free_percpu(net->mib.tcp_statistics); #ifdef CONFIG_MPTCP /* allocated on demand, see mptcp_init_sock() */ free_percpu(net->mib.mptcp_statistics); #endif } static __net_initdata struct pernet_operations ipv4_mib_ops = { .init = ipv4_mib_init_net, .exit = ipv4_mib_exit_net, }; static int __init init_ipv4_mibs(void) { return register_pernet_subsys(&ipv4_mib_ops); } static __net_init int inet_init_net(struct net *net) { /* * Set defaults for local port range */ net->ipv4.ip_local_ports.range = 60999u << 16 | 32768u; seqlock_init(&net->ipv4.ping_group_range.lock); /* * Sane defaults - nobody may create ping sockets. * Boot scripts should set this to distro-specific group. */ net->ipv4.ping_group_range.range[0] = make_kgid(&init_user_ns, 1); net->ipv4.ping_group_range.range[1] = make_kgid(&init_user_ns, 0); /* Default values for sysctl-controlled parameters. * We set them here, in case sysctl is not compiled. */ net->ipv4.sysctl_ip_default_ttl = IPDEFTTL; net->ipv4.sysctl_ip_fwd_update_priority = 1; net->ipv4.sysctl_ip_dynaddr = 0; net->ipv4.sysctl_ip_early_demux = 1; net->ipv4.sysctl_udp_early_demux = 1; net->ipv4.sysctl_tcp_early_demux = 1; net->ipv4.sysctl_nexthop_compat_mode = 1; #ifdef CONFIG_SYSCTL net->ipv4.sysctl_ip_prot_sock = PROT_SOCK; #endif /* Some igmp sysctl, whose values are always used */ net->ipv4.sysctl_igmp_max_memberships = 20; net->ipv4.sysctl_igmp_max_msf = 10; /* IGMP reports for link-local multicast groups are enabled by default */ net->ipv4.sysctl_igmp_llm_reports = 1; net->ipv4.sysctl_igmp_qrv = 2; net->ipv4.sysctl_fib_notify_on_flag_change = 0; return 0; } static __net_initdata struct pernet_operations af_inet_ops = { .init = inet_init_net, }; static int __init init_inet_pernet_ops(void) { return register_pernet_subsys(&af_inet_ops); } static int ipv4_proc_init(void); /* * IP protocol layer initialiser */ static const struct net_offload ipip_offload = { .callbacks = { .gso_segment = ipip_gso_segment, .gro_receive = ipip_gro_receive, .gro_complete = ipip_gro_complete, }, }; static int __init ipip_offload_init(void) { return inet_add_offload(&ipip_offload, IPPROTO_IPIP); } static int __init ipv4_offload_init(void) { /* * Add offloads */ if (udpv4_offload_init() < 0) pr_crit("%s: Cannot add UDP protocol offload\n", __func__); if (tcpv4_offload_init() < 0) pr_crit("%s: Cannot add TCP protocol offload\n", __func__); if (ipip_offload_init() < 0) pr_crit("%s: Cannot add IPIP protocol offload\n", __func__); net_hotdata.ip_packet_offload = (struct packet_offload) { .type = cpu_to_be16(ETH_P_IP), .callbacks = { .gso_segment = inet_gso_segment, .gro_receive = inet_gro_receive, .gro_complete = inet_gro_complete, }, }; dev_add_offload(&net_hotdata.ip_packet_offload); return 0; } fs_initcall(ipv4_offload_init); static struct packet_type ip_packet_type __read_mostly = { .type = cpu_to_be16(ETH_P_IP), .func = ip_rcv, .list_func = ip_list_rcv, }; static int __init inet_init(void) { struct inet_protosw *q; struct list_head *r; int rc; sock_skb_cb_check_size(sizeof(struct inet_skb_parm)); raw_hashinfo_init(&raw_v4_hashinfo); rc = proto_register(&tcp_prot, 1); if (rc) goto out; rc = proto_register(&udp_prot, 1); if (rc) goto out_unregister_tcp_proto; rc = proto_register(&raw_prot, 1); if (rc) goto out_unregister_udp_proto; rc = proto_register(&ping_prot, 1); if (rc) goto out_unregister_raw_proto; /* * Tell SOCKET that we are alive... 
*/ (void)sock_register(&inet_family_ops); #ifdef CONFIG_SYSCTL ip_static_sysctl_init(); #endif /* * Add all the base protocols. */ if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0) pr_crit("%s: Cannot add ICMP protocol\n", __func__); net_hotdata.udp_protocol = (struct net_protocol) { .handler = udp_rcv, .err_handler = udp_err, .no_policy = 1, }; if (inet_add_protocol(&net_hotdata.udp_protocol, IPPROTO_UDP) < 0) pr_crit("%s: Cannot add UDP protocol\n", __func__); net_hotdata.tcp_protocol = (struct net_protocol) { .handler = tcp_v4_rcv, .err_handler = tcp_v4_err, .no_policy = 1, .icmp_strict_tag_validation = 1, }; if (inet_add_protocol(&net_hotdata.tcp_protocol, IPPROTO_TCP) < 0) pr_crit("%s: Cannot add TCP protocol\n", __func__); #ifdef CONFIG_IP_MULTICAST if (inet_add_protocol(&igmp_protocol, IPPROTO_IGMP) < 0) pr_crit("%s: Cannot add IGMP protocol\n", __func__); #endif /* Register the socket-side information for inet_create. */ for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r) INIT_LIST_HEAD(r); for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q) inet_register_protosw(q); /* * Set the ARP module up */ arp_init(); /* * Set the IP module up */ ip_init(); /* Initialise per-cpu ipv4 mibs */ if (init_ipv4_mibs()) panic("%s: Cannot init ipv4 mibs\n", __func__); /* Setup TCP slab cache for open requests. */ tcp_init(); /* Setup UDP memory threshold */ udp_init(); /* Add UDP-Lite (RFC 3828) */ udplite4_register(); raw_init(); ping_init(); /* * Set the ICMP layer up */ if (icmp_init() < 0) panic("Failed to create the ICMP control socket.\n"); /* * Initialise the multicast router */ #if defined(CONFIG_IP_MROUTE) if (ip_mr_init()) pr_crit("%s: Cannot init ipv4 mroute\n", __func__); #endif if (init_inet_pernet_ops()) pr_crit("%s: Cannot init ipv4 inet pernet ops\n", __func__); ipv4_proc_init(); ipfrag_init(); dev_add_pack(&ip_packet_type); ip_tunnel_core_init(); rc = 0; out: return rc; out_unregister_raw_proto: proto_unregister(&raw_prot); out_unregister_udp_proto: proto_unregister(&udp_prot); out_unregister_tcp_proto: proto_unregister(&tcp_prot); goto out; } fs_initcall(inet_init); /* ------------------------------------------------------------------------ */ #ifdef CONFIG_PROC_FS static int __init ipv4_proc_init(void) { int rc = 0; if (raw_proc_init()) goto out_raw; if (tcp4_proc_init()) goto out_tcp; if (udp4_proc_init()) goto out_udp; if (ping_proc_init()) goto out_ping; if (ip_misc_proc_init()) goto out_misc; out: return rc; out_misc: ping_proc_exit(); out_ping: udp4_proc_exit(); out_udp: tcp4_proc_exit(); out_tcp: raw_proc_exit(); out_raw: rc = -ENOMEM; goto out; } #else /* CONFIG_PROC_FS */ static int __init ipv4_proc_init(void) { return 0; } #endif /* CONFIG_PROC_FS */ |
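inet_current_timestamp() reduces the wall clock to milliseconds since midnight and returns the value in network byte order, which is what the IP timestamp option expects. The same arithmetic can be checked from userspace; the sketch below assumes nothing beyond standard libc, and its names are purely illustrative:

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <arpa/inet.h>

int main(void)
{
	struct timespec ts;
	uint32_t secs, msecs;

	clock_gettime(CLOCK_REALTIME, &ts);
	/* Seconds since midnight (UTC), mirroring div_u64_rem(ts.tv_sec, SECONDS_PER_DAY, &secs). */
	secs = (uint32_t)(ts.tv_sec % 86400);
	/* Scale to milliseconds and add the millisecond part of tv_nsec. */
	msecs = secs * 1000u + (uint32_t)(ts.tv_nsec / 1000000L);
	/* Like the kernel routine, hand the result over in network byte order. */
	printf("ms since midnight: %u (network order: 0x%08x)\n",
	       (unsigned)msecs, (unsigned)htonl(msecs));
	return 0;
}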
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_RMAP_H
#define _LINUX_RMAP_H
/*
 * Declarations for Reverse Mapping functions in mm/rmap.c
 */

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/rwsem.h>
#include <linux/memcontrol.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/memremap.h>
#include <linux/bit_spinlock.h>

/*
 * The anon_vma heads a list of private "related" vmas, to scan if
 * an anonymous page pointing to this anon_vma needs to be unmapped:
 * the vmas on the list will be related by forking, or by splitting.
 *
 * Since vmas come and go as they are split and merged (particularly
 * in mprotect), the mapping field of an anonymous page cannot point
 * directly to a vma: instead it points to an anon_vma, on whose list
 * the related vmas can be easily linked or unlinked.
 *
 * After unlinking the last vma on the list, we must garbage collect
 * the anon_vma object itself: we're guaranteed no page can be
 * pointing to this anon_vma once its vma list is empty.
 */
struct anon_vma {
	struct anon_vma *root;		/* Root of this anon_vma tree */
	struct rw_semaphore rwsem;	/* W: modification, R: walking the list */
	/*
	 * The refcount is taken on an anon_vma when there is no
	 * guarantee that the vma of page tables will exist for
	 * the duration of the operation. A caller that takes
	 * the reference is responsible for clearing up the
	 * anon_vma if they are the last user on release
	 */
	atomic_t refcount;

	/*
	 * Count of child anon_vmas. Equal to the count of all anon_vmas that
	 * have ->parent pointing to this one, including itself.
	 *
	 * This counter is used for making a decision about reusing an anon_vma
	 * instead of forking a new one. See comments in function anon_vma_clone.
	 */
	unsigned long num_children;
	/* Count of VMAs whose ->anon_vma pointer points to this object. */
	unsigned long num_active_vmas;

	struct anon_vma *parent;	/* Parent of this anon_vma */

	/*
	 * NOTE: the LSB of the rb_root.rb_node is set by
	 * mm_take_all_locks() _after_ taking the above lock. So the
	 * rb_root must only be read/written after taking the above lock
	 * to be sure to see a valid next pointer. The LSB bit itself
	 * is serialized by a system wide lock only visible to
	 * mm_take_all_locks() (mm_all_locks_mutex).
	 */

	/* Interval tree of private "related" vmas */
	struct rb_root_cached rb_root;
};

/*
 * The copy-on-write semantics of fork mean that an anon_vma
 * can become associated with multiple processes. Furthermore,
 * each child process will have its own anon_vma, where new
 * pages for that process are instantiated.
 *
 * This structure allows us to find the anon_vmas associated
 * with a VMA, or the VMAs associated with an anon_vma.
 * The "same_vma" list contains the anon_vma_chains linking
 * all the anon_vmas associated with this VMA.
* The "rb" field indexes on an interval tree the anon_vma_chains * which link all the VMAs associated with this anon_vma. */ struct anon_vma_chain { struct vm_area_struct *vma; struct anon_vma *anon_vma; struct list_head same_vma; /* locked by mmap_lock & page_table_lock */ struct rb_node rb; /* locked by anon_vma->rwsem */ unsigned long rb_subtree_last; #ifdef CONFIG_DEBUG_VM_RB unsigned long cached_vma_start, cached_vma_last; #endif }; enum ttu_flags { TTU_SPLIT_HUGE_PMD = 0x4, /* split huge PMD if any */ TTU_IGNORE_MLOCK = 0x8, /* ignore mlock */ TTU_SYNC = 0x10, /* avoid racy checks with PVMW_SYNC */ TTU_HWPOISON = 0x20, /* do convert pte to hwpoison entry */ TTU_BATCH_FLUSH = 0x40, /* Batch TLB flushes where possible * and caller guarantees they will * do a final flush if necessary */ TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock: * caller holds it */ }; #ifdef CONFIG_MMU static inline void get_anon_vma(struct anon_vma *anon_vma) { atomic_inc(&anon_vma->refcount); } void __put_anon_vma(struct anon_vma *anon_vma); static inline void put_anon_vma(struct anon_vma *anon_vma) { if (atomic_dec_and_test(&anon_vma->refcount)) __put_anon_vma(anon_vma); } static inline void anon_vma_lock_write(struct anon_vma *anon_vma) { down_write(&anon_vma->root->rwsem); } static inline int anon_vma_trylock_write(struct anon_vma *anon_vma) { return down_write_trylock(&anon_vma->root->rwsem); } static inline void anon_vma_unlock_write(struct anon_vma *anon_vma) { up_write(&anon_vma->root->rwsem); } static inline void anon_vma_lock_read(struct anon_vma *anon_vma) { down_read(&anon_vma->root->rwsem); } static inline int anon_vma_trylock_read(struct anon_vma *anon_vma) { return down_read_trylock(&anon_vma->root->rwsem); } static inline void anon_vma_unlock_read(struct anon_vma *anon_vma) { up_read(&anon_vma->root->rwsem); } /* * anon_vma helper functions. 
*/ void anon_vma_init(void); /* create anon_vma_cachep */ int __anon_vma_prepare(struct vm_area_struct *); void unlink_anon_vmas(struct vm_area_struct *); int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *); int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *); static inline int anon_vma_prepare(struct vm_area_struct *vma) { if (likely(vma->anon_vma)) return 0; return __anon_vma_prepare(vma); } static inline void anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next) { VM_BUG_ON_VMA(vma->anon_vma != next->anon_vma, vma); unlink_anon_vmas(next); } struct anon_vma *folio_get_anon_vma(const struct folio *folio); #ifdef CONFIG_MM_ID static __always_inline void folio_lock_large_mapcount(struct folio *folio) { bit_spin_lock(FOLIO_MM_IDS_LOCK_BITNUM, &folio->_mm_ids); } static __always_inline void folio_unlock_large_mapcount(struct folio *folio) { __bit_spin_unlock(FOLIO_MM_IDS_LOCK_BITNUM, &folio->_mm_ids); } static inline unsigned int folio_mm_id(const struct folio *folio, int idx) { VM_WARN_ON_ONCE(idx != 0 && idx != 1); return folio->_mm_id[idx] & MM_ID_MASK; } static inline void folio_set_mm_id(struct folio *folio, int idx, mm_id_t id) { VM_WARN_ON_ONCE(idx != 0 && idx != 1); folio->_mm_id[idx] &= ~MM_ID_MASK; folio->_mm_id[idx] |= id; } static inline void __folio_large_mapcount_sanity_checks(const struct folio *folio, int diff, mm_id_t mm_id) { VM_WARN_ON_ONCE(!folio_test_large(folio) || folio_test_hugetlb(folio)); VM_WARN_ON_ONCE(diff <= 0); VM_WARN_ON_ONCE(mm_id < MM_ID_MIN || mm_id > MM_ID_MAX); /* * Make sure we can detect at least one complete PTE mapping of the * folio in a single MM as "exclusively mapped". This is primarily * a check on 32bit, where we currently reduce the size of the per-MM * mapcount to a short. */ VM_WARN_ON_ONCE(diff > folio_large_nr_pages(folio)); VM_WARN_ON_ONCE(folio_large_nr_pages(folio) - 1 > MM_ID_MAPCOUNT_MAX); VM_WARN_ON_ONCE(folio_mm_id(folio, 0) == MM_ID_DUMMY && folio->_mm_id_mapcount[0] != -1); VM_WARN_ON_ONCE(folio_mm_id(folio, 0) != MM_ID_DUMMY && folio->_mm_id_mapcount[0] < 0); VM_WARN_ON_ONCE(folio_mm_id(folio, 1) == MM_ID_DUMMY && folio->_mm_id_mapcount[1] != -1); VM_WARN_ON_ONCE(folio_mm_id(folio, 1) != MM_ID_DUMMY && folio->_mm_id_mapcount[1] < 0); VM_WARN_ON_ONCE(!folio_mapped(folio) && test_bit(FOLIO_MM_IDS_SHARED_BITNUM, &folio->_mm_ids)); } static __always_inline void folio_set_large_mapcount(struct folio *folio, int mapcount, struct vm_area_struct *vma) { __folio_large_mapcount_sanity_checks(folio, mapcount, vma->vm_mm->mm_id); VM_WARN_ON_ONCE(folio_mm_id(folio, 0) != MM_ID_DUMMY); VM_WARN_ON_ONCE(folio_mm_id(folio, 1) != MM_ID_DUMMY); /* Note: mapcounts start at -1. */ atomic_set(&folio->_large_mapcount, mapcount - 1); folio->_mm_id_mapcount[0] = mapcount - 1; folio_set_mm_id(folio, 0, vma->vm_mm->mm_id); } static __always_inline int folio_add_return_large_mapcount(struct folio *folio, int diff, struct vm_area_struct *vma) { const mm_id_t mm_id = vma->vm_mm->mm_id; int new_mapcount_val; folio_lock_large_mapcount(folio); __folio_large_mapcount_sanity_checks(folio, diff, mm_id); new_mapcount_val = atomic_read(&folio->_large_mapcount) + diff; atomic_set(&folio->_large_mapcount, new_mapcount_val); /* * If a folio is mapped more than once into an MM on 32bit, we * can in theory overflow the per-MM mapcount (although only for * fairly large folios), turning it negative. 
In that case, just * free up the slot and mark the folio "mapped shared", otherwise * we might be in trouble when unmapping pages later. */ if (folio_mm_id(folio, 0) == mm_id) { folio->_mm_id_mapcount[0] += diff; if (!IS_ENABLED(CONFIG_64BIT) && unlikely(folio->_mm_id_mapcount[0] < 0)) { folio->_mm_id_mapcount[0] = -1; folio_set_mm_id(folio, 0, MM_ID_DUMMY); folio->_mm_ids |= FOLIO_MM_IDS_SHARED_BIT; } } else if (folio_mm_id(folio, 1) == mm_id) { folio->_mm_id_mapcount[1] += diff; if (!IS_ENABLED(CONFIG_64BIT) && unlikely(folio->_mm_id_mapcount[1] < 0)) { folio->_mm_id_mapcount[1] = -1; folio_set_mm_id(folio, 1, MM_ID_DUMMY); folio->_mm_ids |= FOLIO_MM_IDS_SHARED_BIT; } } else if (folio_mm_id(folio, 0) == MM_ID_DUMMY) { folio_set_mm_id(folio, 0, mm_id); folio->_mm_id_mapcount[0] = diff - 1; /* We might have other mappings already. */ if (new_mapcount_val != diff - 1) folio->_mm_ids |= FOLIO_MM_IDS_SHARED_BIT; } else if (folio_mm_id(folio, 1) == MM_ID_DUMMY) { folio_set_mm_id(folio, 1, mm_id); folio->_mm_id_mapcount[1] = diff - 1; /* Slot 0 certainly has mappings as well. */ folio->_mm_ids |= FOLIO_MM_IDS_SHARED_BIT; } folio_unlock_large_mapcount(folio); return new_mapcount_val + 1; } #define folio_add_large_mapcount folio_add_return_large_mapcount static __always_inline int folio_sub_return_large_mapcount(struct folio *folio, int diff, struct vm_area_struct *vma) { const mm_id_t mm_id = vma->vm_mm->mm_id; int new_mapcount_val; folio_lock_large_mapcount(folio); __folio_large_mapcount_sanity_checks(folio, diff, mm_id); new_mapcount_val = atomic_read(&folio->_large_mapcount) - diff; atomic_set(&folio->_large_mapcount, new_mapcount_val); /* * There are valid corner cases where we might underflow a per-MM * mapcount (some mappings added when no slot was free, some mappings * added once a slot was free), so we always set it to -1 once we go * negative. */ if (folio_mm_id(folio, 0) == mm_id) { folio->_mm_id_mapcount[0] -= diff; if (folio->_mm_id_mapcount[0] >= 0) goto out; folio->_mm_id_mapcount[0] = -1; folio_set_mm_id(folio, 0, MM_ID_DUMMY); } else if (folio_mm_id(folio, 1) == mm_id) { folio->_mm_id_mapcount[1] -= diff; if (folio->_mm_id_mapcount[1] >= 0) goto out; folio->_mm_id_mapcount[1] = -1; folio_set_mm_id(folio, 1, MM_ID_DUMMY); } /* * If one MM slot owns all mappings, the folio is mapped exclusively. * Note that if the folio is now unmapped (new_mapcount_val == -1), both * slots must be free (mapcount == -1), and we'll also mark it as * exclusive. */ if (folio->_mm_id_mapcount[0] == new_mapcount_val || folio->_mm_id_mapcount[1] == new_mapcount_val) folio->_mm_ids &= ~FOLIO_MM_IDS_SHARED_BIT; out: folio_unlock_large_mapcount(folio); return new_mapcount_val + 1; } #define folio_sub_large_mapcount folio_sub_return_large_mapcount #else /* !CONFIG_MM_ID */ /* * See __folio_rmap_sanity_checks(), we might map large folios even without * CONFIG_TRANSPARENT_HUGEPAGE. We'll keep that working for now. */ static inline void folio_set_large_mapcount(struct folio *folio, int mapcount, struct vm_area_struct *vma) { /* Note: mapcounts start at -1. 
*/ atomic_set(&folio->_large_mapcount, mapcount - 1); } static inline void folio_add_large_mapcount(struct folio *folio, int diff, struct vm_area_struct *vma) { atomic_add(diff, &folio->_large_mapcount); } static inline int folio_add_return_large_mapcount(struct folio *folio, int diff, struct vm_area_struct *vma) { BUILD_BUG(); } static inline void folio_sub_large_mapcount(struct folio *folio, int diff, struct vm_area_struct *vma) { atomic_sub(diff, &folio->_large_mapcount); } static inline int folio_sub_return_large_mapcount(struct folio *folio, int diff, struct vm_area_struct *vma) { BUILD_BUG(); } #endif /* CONFIG_MM_ID */ #define folio_inc_large_mapcount(folio, vma) \ folio_add_large_mapcount(folio, 1, vma) #define folio_inc_return_large_mapcount(folio, vma) \ folio_add_return_large_mapcount(folio, 1, vma) #define folio_dec_large_mapcount(folio, vma) \ folio_sub_large_mapcount(folio, 1, vma) #define folio_dec_return_large_mapcount(folio, vma) \ folio_sub_return_large_mapcount(folio, 1, vma) /* RMAP flags, currently only relevant for some anon rmap operations. */ typedef int __bitwise rmap_t; /* * No special request: A mapped anonymous (sub)page is possibly shared between * processes. */ #define RMAP_NONE ((__force rmap_t)0) /* The anonymous (sub)page is exclusive to a single process. */ #define RMAP_EXCLUSIVE ((__force rmap_t)BIT(0)) static __always_inline void __folio_rmap_sanity_checks(const struct folio *folio, const struct page *page, int nr_pages, enum pgtable_level level) { /* hugetlb folios are handled separately. */ VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); /* When (un)mapping zeropages, we should never touch ref+mapcount. */ VM_WARN_ON_FOLIO(is_zero_folio(folio), folio); /* * TODO: we get driver-allocated folios that have nothing to do with * the rmap using vm_insert_page(); therefore, we cannot assume that * folio_test_large_rmappable() holds for large folios. We should * handle any desired mapcount+stats accounting for these folios in * VM_MIXEDMAP VMAs separately, and then sanity-check here that * we really only get rmappable folios. */ VM_WARN_ON_ONCE(nr_pages <= 0); VM_WARN_ON_FOLIO(page_folio(page) != folio, folio); VM_WARN_ON_FOLIO(page_folio(page + nr_pages - 1) != folio, folio); switch (level) { case PGTABLE_LEVEL_PTE: break; case PGTABLE_LEVEL_PMD: /* * We don't support folios larger than a single PMD yet. So * when PGTABLE_LEVEL_PMD is set, we assume that we are creating * a single "entire" mapping of the folio. */ VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PMD_NR, folio); VM_WARN_ON_FOLIO(nr_pages != HPAGE_PMD_NR, folio); break; case PGTABLE_LEVEL_PUD: /* * Assume that we are creating a single "entire" mapping of the * folio. */ VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PUD_NR, folio); VM_WARN_ON_FOLIO(nr_pages != HPAGE_PUD_NR, folio); break; default: BUILD_BUG(); } /* * Anon folios must have an associated live anon_vma as long as they're * mapped into userspace. * Note that the atomic_read() mainly does two things: * * 1. In KASAN builds with CONFIG_SLUB_RCU_DEBUG, it causes KASAN to * check that the associated anon_vma has not yet been freed (subject * to KASAN's usual limitations). This check will pass if the * anon_vma's refcount has already dropped to 0 but an RCU grace * period hasn't passed since then. * 2. If the anon_vma has not yet been freed, it checks that the * anon_vma still has a nonzero refcount (as opposed to being in the * middle of an RCU delay for getting freed). 
*/ if (folio_test_anon(folio) && !folio_test_ksm(folio)) { unsigned long mapping = (unsigned long)folio->mapping; struct anon_vma *anon_vma; anon_vma = (void *)(mapping - FOLIO_MAPPING_ANON); VM_WARN_ON_FOLIO(atomic_read(&anon_vma->refcount) == 0, folio); } } /* * rmap interfaces called when adding or removing pte of page */ void folio_move_anon_rmap(struct folio *, struct vm_area_struct *); void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages, struct vm_area_struct *, unsigned long address, rmap_t flags); #define folio_add_anon_rmap_pte(folio, page, vma, address, flags) \ folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags) void folio_add_anon_rmap_pmd(struct folio *, struct page *, struct vm_area_struct *, unsigned long address, rmap_t flags); void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *, unsigned long address, rmap_t flags); void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages, struct vm_area_struct *); #define folio_add_file_rmap_pte(folio, page, vma) \ folio_add_file_rmap_ptes(folio, page, 1, vma) void folio_add_file_rmap_pmd(struct folio *, struct page *, struct vm_area_struct *); void folio_add_file_rmap_pud(struct folio *, struct page *, struct vm_area_struct *); void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages, struct vm_area_struct *); #define folio_remove_rmap_pte(folio, page, vma) \ folio_remove_rmap_ptes(folio, page, 1, vma) void folio_remove_rmap_pmd(struct folio *, struct page *, struct vm_area_struct *); void folio_remove_rmap_pud(struct folio *, struct page *, struct vm_area_struct *); void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *, unsigned long address, rmap_t flags); void hugetlb_add_new_anon_rmap(struct folio *, struct vm_area_struct *, unsigned long address); /* See folio_try_dup_anon_rmap_*() */ static inline int hugetlb_try_dup_anon_rmap(struct folio *folio, struct vm_area_struct *vma) { VM_WARN_ON_FOLIO(!folio_test_hugetlb(folio), folio); VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio); if (PageAnonExclusive(&folio->page)) { if (unlikely(folio_needs_cow_for_dma(vma, folio))) return -EBUSY; ClearPageAnonExclusive(&folio->page); } atomic_inc(&folio->_entire_mapcount); atomic_inc(&folio->_large_mapcount); return 0; } /* See folio_try_share_anon_rmap_*() */ static inline int hugetlb_try_share_anon_rmap(struct folio *folio) { VM_WARN_ON_FOLIO(!folio_test_hugetlb(folio), folio); VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio); VM_WARN_ON_FOLIO(!PageAnonExclusive(&folio->page), folio); /* Paired with the memory barrier in try_grab_folio(). */ if (IS_ENABLED(CONFIG_HAVE_GUP_FAST)) smp_mb(); if (unlikely(folio_maybe_dma_pinned(folio))) return -EBUSY; ClearPageAnonExclusive(&folio->page); /* * This is conceptually a smp_wmb() paired with the smp_rmb() in * gup_must_unshare(). 
*/ if (IS_ENABLED(CONFIG_HAVE_GUP_FAST)) smp_mb__after_atomic(); return 0; } static inline void hugetlb_add_file_rmap(struct folio *folio) { VM_WARN_ON_FOLIO(!folio_test_hugetlb(folio), folio); VM_WARN_ON_FOLIO(folio_test_anon(folio), folio); atomic_inc(&folio->_entire_mapcount); atomic_inc(&folio->_large_mapcount); } static inline void hugetlb_remove_rmap(struct folio *folio) { VM_WARN_ON_FOLIO(!folio_test_hugetlb(folio), folio); atomic_dec(&folio->_entire_mapcount); atomic_dec(&folio->_large_mapcount); } static __always_inline void __folio_dup_file_rmap(struct folio *folio, struct page *page, int nr_pages, struct vm_area_struct *dst_vma, enum pgtable_level level) { const int orig_nr_pages = nr_pages; __folio_rmap_sanity_checks(folio, page, nr_pages, level); switch (level) { case PGTABLE_LEVEL_PTE: if (!folio_test_large(folio)) { atomic_inc(&folio->_mapcount); break; } if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) { do { atomic_inc(&page->_mapcount); } while (page++, --nr_pages > 0); } folio_add_large_mapcount(folio, orig_nr_pages, dst_vma); break; case PGTABLE_LEVEL_PMD: case PGTABLE_LEVEL_PUD: atomic_inc(&folio->_entire_mapcount); folio_inc_large_mapcount(folio, dst_vma); break; default: BUILD_BUG(); } } /** * folio_dup_file_rmap_ptes - duplicate PTE mappings of a page range of a folio * @folio: The folio to duplicate the mappings of * @page: The first page to duplicate the mappings of * @nr_pages: The number of pages of which the mapping will be duplicated * @dst_vma: The destination vm area * * The page range of the folio is defined by [page, page + nr_pages) * * The caller needs to hold the page table lock. */ static inline void folio_dup_file_rmap_ptes(struct folio *folio, struct page *page, int nr_pages, struct vm_area_struct *dst_vma) { __folio_dup_file_rmap(folio, page, nr_pages, dst_vma, PGTABLE_LEVEL_PTE); } static __always_inline void folio_dup_file_rmap_pte(struct folio *folio, struct page *page, struct vm_area_struct *dst_vma) { __folio_dup_file_rmap(folio, page, 1, dst_vma, PGTABLE_LEVEL_PTE); } /** * folio_dup_file_rmap_pmd - duplicate a PMD mapping of a page range of a folio * @folio: The folio to duplicate the mapping of * @page: The first page to duplicate the mapping of * @dst_vma: The destination vm area * * The page range of the folio is defined by [page, page + HPAGE_PMD_NR) * * The caller needs to hold the page table lock. */ static inline void folio_dup_file_rmap_pmd(struct folio *folio, struct page *page, struct vm_area_struct *dst_vma) { #ifdef CONFIG_TRANSPARENT_HUGEPAGE __folio_dup_file_rmap(folio, page, HPAGE_PMD_NR, dst_vma, PGTABLE_LEVEL_PTE); #else WARN_ON_ONCE(true); #endif } static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio, struct page *page, int nr_pages, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, enum pgtable_level level) { const int orig_nr_pages = nr_pages; bool maybe_pinned; int i; VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio); __folio_rmap_sanity_checks(folio, page, nr_pages, level); /* * If this folio may have been pinned by the parent process, * don't allow to duplicate the mappings but instead require to e.g., * copy the subpage immediately for the child so that we'll always * guarantee the pinned folio won't be randomly replaced in the * future on write faults. */ maybe_pinned = likely(!folio_is_device_private(folio)) && unlikely(folio_needs_cow_for_dma(src_vma, folio)); /* * No need to check+clear for already shared PTEs/PMDs of the * folio. 
But if any page is PageAnonExclusive, we must fallback to * copying if the folio maybe pinned. */ switch (level) { case PGTABLE_LEVEL_PTE: if (unlikely(maybe_pinned)) { for (i = 0; i < nr_pages; i++) if (PageAnonExclusive(page + i)) return -EBUSY; } if (!folio_test_large(folio)) { if (PageAnonExclusive(page)) ClearPageAnonExclusive(page); atomic_inc(&folio->_mapcount); break; } do { if (PageAnonExclusive(page)) ClearPageAnonExclusive(page); if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) atomic_inc(&page->_mapcount); } while (page++, --nr_pages > 0); folio_add_large_mapcount(folio, orig_nr_pages, dst_vma); break; case PGTABLE_LEVEL_PMD: case PGTABLE_LEVEL_PUD: if (PageAnonExclusive(page)) { if (unlikely(maybe_pinned)) return -EBUSY; ClearPageAnonExclusive(page); } atomic_inc(&folio->_entire_mapcount); folio_inc_large_mapcount(folio, dst_vma); break; default: BUILD_BUG(); } return 0; } /** * folio_try_dup_anon_rmap_ptes - try duplicating PTE mappings of a page range * of a folio * @folio: The folio to duplicate the mappings of * @page: The first page to duplicate the mappings of * @nr_pages: The number of pages of which the mapping will be duplicated * @dst_vma: The destination vm area * @src_vma: The vm area from which the mappings are duplicated * * The page range of the folio is defined by [page, page + nr_pages) * * The caller needs to hold the page table lock and the * vma->vma_mm->write_protect_seq. * * Duplicating the mappings can only fail if the folio may be pinned; device * private folios cannot get pinned and consequently this function cannot fail * for them. * * If duplicating the mappings succeeded, the duplicated PTEs have to be R/O in * the parent and the child. They must *not* be writable after this call * succeeded. * * Returns 0 if duplicating the mappings succeeded. Returns -EBUSY otherwise. */ static inline int folio_try_dup_anon_rmap_ptes(struct folio *folio, struct page *page, int nr_pages, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) { return __folio_try_dup_anon_rmap(folio, page, nr_pages, dst_vma, src_vma, PGTABLE_LEVEL_PTE); } static __always_inline int folio_try_dup_anon_rmap_pte(struct folio *folio, struct page *page, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) { return __folio_try_dup_anon_rmap(folio, page, 1, dst_vma, src_vma, PGTABLE_LEVEL_PTE); } /** * folio_try_dup_anon_rmap_pmd - try duplicating a PMD mapping of a page range * of a folio * @folio: The folio to duplicate the mapping of * @page: The first page to duplicate the mapping of * @dst_vma: The destination vm area * @src_vma: The vm area from which the mapping is duplicated * * The page range of the folio is defined by [page, page + HPAGE_PMD_NR) * * The caller needs to hold the page table lock and the * vma->vma_mm->write_protect_seq. * * Duplicating the mapping can only fail if the folio may be pinned; device * private folios cannot get pinned and consequently this function cannot fail * for them. * * If duplicating the mapping succeeds, the duplicated PMD has to be R/O in * the parent and the child. They must *not* be writable after this call * succeeded. * * Returns 0 if duplicating the mapping succeeded. Returns -EBUSY otherwise. 
*/ static inline int folio_try_dup_anon_rmap_pmd(struct folio *folio, struct page *page, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) { #ifdef CONFIG_TRANSPARENT_HUGEPAGE return __folio_try_dup_anon_rmap(folio, page, HPAGE_PMD_NR, dst_vma, src_vma, PGTABLE_LEVEL_PMD); #else WARN_ON_ONCE(true); return -EBUSY; #endif } static __always_inline int __folio_try_share_anon_rmap(struct folio *folio, struct page *page, int nr_pages, enum pgtable_level level) { VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio); VM_WARN_ON_FOLIO(!PageAnonExclusive(page), folio); __folio_rmap_sanity_checks(folio, page, nr_pages, level); /* device private folios cannot get pinned via GUP. */ if (unlikely(folio_is_device_private(folio))) { ClearPageAnonExclusive(page); return 0; } /* * We have to make sure that when we clear PageAnonExclusive, that * the page is not pinned and that concurrent GUP-fast won't succeed in * concurrently pinning the page. * * Conceptually, PageAnonExclusive clearing consists of: * (A1) Clear PTE * (A2) Check if the page is pinned; back off if so. * (A3) Clear PageAnonExclusive * (A4) Restore PTE (optional, but certainly not writable) * * When clearing PageAnonExclusive, we cannot possibly map the page * writable again, because anon pages that may be shared must never * be writable. So in any case, if the PTE was writable it cannot * be writable anymore afterwards and there would be a PTE change. Only * if the PTE wasn't writable, there might not be a PTE change. * * Conceptually, GUP-fast pinning of an anon page consists of: * (B1) Read the PTE * (B2) FOLL_WRITE: check if the PTE is not writable; back off if so. * (B3) Pin the mapped page * (B4) Check if the PTE changed by re-reading it; back off if so. * (B5) If the original PTE is not writable, check if * PageAnonExclusive is not set; back off if so. * * If the PTE was writable, we only have to make sure that GUP-fast * observes a PTE change and properly backs off. * * If the PTE was not writable, we have to make sure that GUP-fast either * detects a (temporary) PTE change or that PageAnonExclusive is cleared * and properly backs off. * * Consequently, when clearing PageAnonExclusive(), we have to make * sure that (A1), (A2)/(A3) and (A4) happen in the right memory * order. In GUP-fast pinning code, we have to make sure that (B3),(B4) * and (B5) happen in the right memory order. * * We assume that there might not be a memory barrier after * clearing/invalidating the PTE (A1) and before restoring the PTE (A4), * so we use explicit ones here. */ /* Paired with the memory barrier in try_grab_folio(). */ if (IS_ENABLED(CONFIG_HAVE_GUP_FAST)) smp_mb(); if (unlikely(folio_maybe_dma_pinned(folio))) return -EBUSY; ClearPageAnonExclusive(page); /* * This is conceptually a smp_wmb() paired with the smp_rmb() in * gup_must_unshare(). */ if (IS_ENABLED(CONFIG_HAVE_GUP_FAST)) smp_mb__after_atomic(); return 0; } /** * folio_try_share_anon_rmap_pte - try marking an exclusive anonymous page * mapped by a PTE possibly shared to prepare * for KSM or temporary unmapping * @folio: The folio to share a mapping of * @page: The mapped exclusive page * * The caller needs to hold the page table lock and has to have the page table * entries cleared/invalidated. * * This is similar to folio_try_dup_anon_rmap_pte(), however, not used during * fork() to duplicate mappings, but instead to prepare for KSM or temporarily * unmapping parts of a folio (swap, migration) via folio_remove_rmap_pte(). 
* * Marking the mapped page shared can only fail if the folio maybe pinned; * device private folios cannot get pinned and consequently this function cannot * fail. * * Returns 0 if marking the mapped page possibly shared succeeded. Returns * -EBUSY otherwise. */ static inline int folio_try_share_anon_rmap_pte(struct folio *folio, struct page *page) { return __folio_try_share_anon_rmap(folio, page, 1, PGTABLE_LEVEL_PTE); } /** * folio_try_share_anon_rmap_pmd - try marking an exclusive anonymous page * range mapped by a PMD possibly shared to * prepare for temporary unmapping * @folio: The folio to share the mapping of * @page: The first page to share the mapping of * * The page range of the folio is defined by [page, page + HPAGE_PMD_NR) * * The caller needs to hold the page table lock and has to have the page table * entries cleared/invalidated. * * This is similar to folio_try_dup_anon_rmap_pmd(), however, not used during * fork() to duplicate a mapping, but instead to prepare for temporarily * unmapping parts of a folio (swap, migration) via folio_remove_rmap_pmd(). * * Marking the mapped pages shared can only fail if the folio maybe pinned; * device private folios cannot get pinned and consequently this function cannot * fail. * * Returns 0 if marking the mapped pages possibly shared succeeded. Returns * -EBUSY otherwise. */ static inline int folio_try_share_anon_rmap_pmd(struct folio *folio, struct page *page) { #ifdef CONFIG_TRANSPARENT_HUGEPAGE return __folio_try_share_anon_rmap(folio, page, HPAGE_PMD_NR, PGTABLE_LEVEL_PMD); #else WARN_ON_ONCE(true); return -EBUSY; #endif } /* * Called from mm/vmscan.c to handle paging out */ int folio_referenced(struct folio *, int is_locked, struct mem_cgroup *memcg, vm_flags_t *vm_flags); void try_to_migrate(struct folio *folio, enum ttu_flags flags); void try_to_unmap(struct folio *, enum ttu_flags flags); struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr, void *owner, struct folio **foliop); /* Avoid racy checks */ #define PVMW_SYNC (1 << 0) /* Look for migration entries rather than present PTEs */ #define PVMW_MIGRATION (1 << 1) /* Result flags */ /* The page is mapped across page table boundary */ #define PVMW_PGTABLE_CROSSED (1 << 16) struct page_vma_mapped_walk { unsigned long pfn; unsigned long nr_pages; pgoff_t pgoff; struct vm_area_struct *vma; unsigned long address; pmd_t *pmd; pte_t *pte; spinlock_t *ptl; unsigned int flags; }; #define DEFINE_FOLIO_VMA_WALK(name, _folio, _vma, _address, _flags) \ struct page_vma_mapped_walk name = { \ .pfn = folio_pfn(_folio), \ .nr_pages = folio_nr_pages(_folio), \ .pgoff = folio_pgoff(_folio), \ .vma = _vma, \ .address = _address, \ .flags = _flags, \ } static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw) { /* HugeTLB pte is set to the relevant page table entry without pte_mapped. */ if (pvmw->pte && !is_vm_hugetlb_page(pvmw->vma)) pte_unmap(pvmw->pte); if (pvmw->ptl) spin_unlock(pvmw->ptl); } /** * page_vma_mapped_walk_restart - Restart the page table walk. * @pvmw: Pointer to struct page_vma_mapped_walk. * * It restarts the page table walk when changes occur in the page * table, such as splitting a PMD. Ensures that the PTL held during * the previous walk is released and resets the state to allow for * a new walk starting at the current address stored in pvmw->address. 
*/ static inline void page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw) { WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte); if (likely(pvmw->ptl)) spin_unlock(pvmw->ptl); else WARN_ON_ONCE(1); pvmw->ptl = NULL; pvmw->pmd = NULL; pvmw->pte = NULL; } bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw); unsigned long page_address_in_vma(const struct folio *folio, const struct page *, const struct vm_area_struct *); /* * Cleans the PTEs of shared mappings. * (and since clean PTEs should also be readonly, write protects them too) * * returns the number of cleaned PTEs. */ int folio_mkclean(struct folio *); int mapping_wrprotect_range(struct address_space *mapping, pgoff_t pgoff, unsigned long pfn, unsigned long nr_pages); int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, struct vm_area_struct *vma); enum rmp_flags { RMP_LOCKED = 1 << 0, RMP_USE_SHARED_ZEROPAGE = 1 << 1, }; void remove_migration_ptes(struct folio *src, struct folio *dst, int flags); /* * rmap_walk_control: To control rmap traversing for specific needs * * arg: passed to rmap_one() and invalid_vma() * try_lock: bail out if the rmap lock is contended * contended: indicate the rmap traversal bailed out due to lock contention * rmap_one: executed on each vma where page is mapped * done: for checking traversing termination condition * anon_lock: for getting anon_lock by optimized way rather than default * invalid_vma: for skipping uninterested vma */ struct rmap_walk_control { void *arg; bool try_lock; bool contended; /* * Return false if page table scanning in rmap_walk should be stopped. * Otherwise, return true. */ bool (*rmap_one)(struct folio *folio, struct vm_area_struct *vma, unsigned long addr, void *arg); int (*done)(struct folio *folio); struct anon_vma *(*anon_lock)(const struct folio *folio, struct rmap_walk_control *rwc); bool (*invalid_vma)(struct vm_area_struct *vma, void *arg); }; void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc); void rmap_walk_locked(struct folio *folio, struct rmap_walk_control *rwc); struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio, struct rmap_walk_control *rwc); #else /* !CONFIG_MMU */ #define anon_vma_init() do {} while (0) #define anon_vma_prepare(vma) (0) static inline int folio_referenced(struct folio *folio, int is_locked, struct mem_cgroup *memcg, vm_flags_t *vm_flags) { *vm_flags = 0; return 0; } static inline void try_to_unmap(struct folio *folio, enum ttu_flags flags) { } static inline int folio_mkclean(struct folio *folio) { return 0; } #endif /* CONFIG_MMU */ #endif /* _LINUX_RMAP_H */ |
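The large-folio mapcount scheme declared above keeps one total mapcount biased by -1, two per-MM slots keyed by mm_id, and a "shared" marker once more MMs map the folio than the slots can track; a folio counts as exclusively mapped when a single slot accounts for every mapping. The standalone toy model below illustrates only that bookkeeping; every name in it is invented, and the locking, sanity checks and 32-bit overflow handling of the real folio_add_return_large_mapcount() are deliberately left out:

#include <stdio.h>
#include <stdbool.h>

#define TOY_MM_ID_DUMMY 0	/* plays the role of MM_ID_DUMMY: "slot is free" */

struct toy_folio {
	int large_mapcount;	/* biased by -1: -1 means "not mapped at all" */
	int mm_id[2];		/* owning MM per slot, TOY_MM_ID_DUMMY if free */
	int mm_mapcount[2];	/* per-MM mapcount, biased by -1 as well */
	bool shared;		/* set once mappings exist beyond the two slots */
};

static void toy_init(struct toy_folio *f)
{
	f->large_mapcount = -1;
	f->mm_id[0] = f->mm_id[1] = TOY_MM_ID_DUMMY;
	f->mm_mapcount[0] = f->mm_mapcount[1] = -1;
	f->shared = false;
}

/* Add "diff" mappings belonging to MM "mm_id"; returns the un-biased total. */
static int toy_add(struct toy_folio *f, int mm_id, int diff)
{
	int new_val = f->large_mapcount + diff;

	f->large_mapcount = new_val;
	if (f->mm_id[0] == mm_id) {
		f->mm_mapcount[0] += diff;
	} else if (f->mm_id[1] == mm_id) {
		f->mm_mapcount[1] += diff;
	} else if (f->mm_id[0] == TOY_MM_ID_DUMMY) {
		f->mm_id[0] = mm_id;
		f->mm_mapcount[0] = diff - 1;
		if (new_val != diff - 1)	/* other mappings already existed */
			f->shared = true;
	} else if (f->mm_id[1] == TOY_MM_ID_DUMMY) {
		f->mm_id[1] = mm_id;
		f->mm_mapcount[1] = diff - 1;
		f->shared = true;		/* slot 0 certainly has mappings too */
	} else {
		f->shared = true;		/* no slot free to track this MM */
	}
	return new_val + 1;
}

/* "Mapped exclusively" means a single MM slot accounts for every mapping. */
static bool toy_mapped_exclusively(const struct toy_folio *f)
{
	return !f->shared &&
	       (f->mm_mapcount[0] == f->large_mapcount ||
		f->mm_mapcount[1] == f->large_mapcount);
}

int main(void)
{
	struct toy_folio f;
	int total;

	toy_init(&f);
	total = toy_add(&f, 1, 4);
	printf("MM 1 maps 4 pages: total=%d exclusive=%d\n",
	       total, (int)toy_mapped_exclusively(&f));
	total = toy_add(&f, 2, 4);
	printf("MM 2 maps 4 pages: total=%d exclusive=%d\n",
	       total, (int)toy_mapped_exclusively(&f));
	return 0;
}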
/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
 * ALSA sequencer Queue handling
 * Copyright (c) 1998-1999 by Frank van de Pol <fvdpol@coil.demon.nl>
 */
#ifndef __SND_SEQ_QUEUE_H
#define __SND_SEQ_QUEUE_H

#include "seq_memory.h"
#include "seq_prioq.h"
#include "seq_timer.h"
#include "seq_lock.h"
#include <linux/interrupt.h>
#include <linux/list.h>
#include <linux/bitops.h>

#define SEQ_QUEUE_NO_OWNER (-1)

struct snd_seq_queue {
	int queue;		/* queue number */
	char name[64];		/* name of this queue */

	struct snd_seq_prioq *tickq;	/* midi tick event queue */
	struct snd_seq_prioq *timeq;	/* real-time event queue */

	struct snd_seq_timer *timer;	/* time keeper for this queue */
	int owner;		/* client that 'owns' the timer */
	bool locked;		/* timer is only accessible by owner if set */
	bool klocked;		/* kernel lock (after START) */
	bool check_again;	/* concurrent access happened during check */
	bool check_blocked;	/* queue being checked */

	unsigned int flags;		/* status flags */
	unsigned int info_flags;	/* info for sync */

	spinlock_t owner_lock;
	spinlock_t check_lock;

	/* clients which use this queue (bitmap) */
	DECLARE_BITMAP(clients_bitmap, SNDRV_SEQ_MAX_CLIENTS);
	unsigned int clients;	/* users of this queue */
	struct mutex timer_mutex;

	snd_use_lock_t use_lock;
};

/* get the number of current queues */
int snd_seq_queue_get_cur_queues(void);

/* delete queues */
void snd_seq_queues_delete(void);

/* create new queue (constructor) */
struct snd_seq_queue *snd_seq_queue_alloc(int client, int locked, unsigned int flags);

/* delete queue (destructor) */
int snd_seq_queue_delete(int client, int queueid);

/* final stage */
void snd_seq_queue_client_leave(int client);

/* enqueue an event received from one of the clients */
int snd_seq_enqueue_event(struct snd_seq_event_cell *cell, int atomic, int hop);

/* Remove events */
void snd_seq_queue_remove_cells(int client, struct snd_seq_remove_events *info);

/* return pointer to queue structure for specified id */
struct snd_seq_queue *queueptr(int queueid);
/* unlock */
#define queuefree(q) snd_use_lock_free(&(q)->use_lock)

DEFINE_FREE(snd_seq_queue, struct snd_seq_queue *, if (!IS_ERR_OR_NULL(_T)) queuefree(_T))

/* return the (first) queue matching the specified name */
struct snd_seq_queue *snd_seq_queue_find_name(char *name);

/* check single queue and dispatch events */
void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop);

/* access to queue's parameters */
int snd_seq_queue_check_access(int queueid, int client);
int snd_seq_queue_timer_set_tempo(int queueid, int client, struct snd_seq_queue_tempo *info);
int snd_seq_queue_set_owner(int queueid, int client, int locked);
int snd_seq_queue_timer_open(int queueid);
int snd_seq_queue_timer_close(int queueid);
int snd_seq_queue_use(int queueid, int client, int use);
int snd_seq_queue_is_used(int queueid, int client);

int snd_seq_control_queue(struct snd_seq_event *ev, int atomic, int hop);

#endif
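The DEFINE_FREE() line above ties queueptr()'s use-lock to scope-based cleanup, so a caller can let queuefree() run automatically instead of pairing it by hand. A minimal sketch of that usage, assuming the __free() annotation from <linux/cleanup.h>; the function name is invented and real callers also take the appropriate queue locks:

static int example_queue_owner(int queueid)
{
	/* queueptr() grabs the queue's use-lock; __free(snd_seq_queue) runs
	 * queuefree() automatically once q goes out of scope. */
	struct snd_seq_queue *q __free(snd_seq_queue) = queueptr(queueid);

	if (!q)
		return -EINVAL;
	return q->owner;
}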
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * Copyright (C) 2001 Momchil Velikov
 * Portions Copyright (C) 2001 Christoph Hellwig
 * Copyright (C) 2005 SGI, Christoph Lameter
 * Copyright (C) 2006 Nick Piggin
 * Copyright (C) 2012 Konstantin Khlebnikov
 * Copyright (C) 2016 Intel, Matthew Wilcox
 * Copyright (C) 2016 Intel, Ross Zwisler
 */

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/bug.h>
#include <linux/cpu.h>
#include <linux/errno.h>
#include <linux/export.h>
#include <linux/idr.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/kmemleak.h>
#include <linux/percpu.h>
#include <linux/preempt.h>		/* in_interrupt() */
#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/xarray.h>

#include "radix-tree.h"

/*
 * Radix tree node cache.
 */
struct kmem_cache *radix_tree_node_cachep;

/*
 * The radix tree is variable-height, so an insert operation not only has
 * to build the branch to its corresponding item, it also has to build the
 * branch to existing items if the size has to be increased (by
 * radix_tree_extend).
 *
 * The worst case is a zero height tree with just a single item at index 0,
 * and then inserting an item at index ULONG_MAX. This requires 2 new branches
 * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
 * Hence:
 */
#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)

/*
 * The IDR does not have to be as high as the radix tree since it uses
 * signed integers, not unsigned longs.
 */
#define IDR_INDEX_BITS		(8 /* CHAR_BIT */ * sizeof(int) - 1)
#define IDR_MAX_PATH		(DIV_ROUND_UP(IDR_INDEX_BITS, \
						RADIX_TREE_MAP_SHIFT))
#define IDR_PRELOAD_SIZE	(IDR_MAX_PATH * 2 - 1)

/*
 * Per-cpu pool of preloaded nodes
 */
DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = {
	.lock = INIT_LOCAL_LOCK(lock),
};
EXPORT_PER_CPU_SYMBOL_GPL(radix_tree_preloads);

static inline struct radix_tree_node *entry_to_node(void *ptr)
{
	return (void *)((unsigned long)ptr & ~RADIX_TREE_INTERNAL_NODE);
}

static inline void *node_to_entry(void *ptr)
{
	return (void *)((unsigned long)ptr | RADIX_TREE_INTERNAL_NODE);
}

#define RADIX_TREE_RETRY	XA_RETRY_ENTRY

static inline unsigned long
get_slot_offset(const struct radix_tree_node *parent, void __rcu **slot)
{
	return parent ?
slot - parent->slots : 0; } static unsigned int radix_tree_descend(const struct radix_tree_node *parent, struct radix_tree_node **nodep, unsigned long index) { unsigned int offset = (index >> parent->shift) & RADIX_TREE_MAP_MASK; void __rcu **entry = rcu_dereference_raw(parent->slots[offset]); *nodep = (void *)entry; return offset; } static inline gfp_t root_gfp_mask(const struct radix_tree_root *root) { return root->xa_flags & (__GFP_BITS_MASK & ~GFP_ZONEMASK); } static inline void tag_set(struct radix_tree_node *node, unsigned int tag, int offset) { __set_bit(offset, node->tags[tag]); } static inline void tag_clear(struct radix_tree_node *node, unsigned int tag, int offset) { __clear_bit(offset, node->tags[tag]); } static inline int tag_get(const struct radix_tree_node *node, unsigned int tag, int offset) { return test_bit(offset, node->tags[tag]); } static inline void root_tag_set(struct radix_tree_root *root, unsigned tag) { root->xa_flags |= (__force gfp_t)(1 << (tag + ROOT_TAG_SHIFT)); } static inline void root_tag_clear(struct radix_tree_root *root, unsigned tag) { root->xa_flags &= (__force gfp_t)~(1 << (tag + ROOT_TAG_SHIFT)); } static inline void root_tag_clear_all(struct radix_tree_root *root) { root->xa_flags &= (__force gfp_t)((1 << ROOT_TAG_SHIFT) - 1); } static inline int root_tag_get(const struct radix_tree_root *root, unsigned tag) { return (__force int)root->xa_flags & (1 << (tag + ROOT_TAG_SHIFT)); } static inline unsigned root_tags_get(const struct radix_tree_root *root) { return (__force unsigned)root->xa_flags >> ROOT_TAG_SHIFT; } static inline bool is_idr(const struct radix_tree_root *root) { return !!(root->xa_flags & ROOT_IS_IDR); } /* * Returns 1 if any slot in the node has this tag set. * Otherwise returns 0. */ static inline int any_tag_set(const struct radix_tree_node *node, unsigned int tag) { unsigned idx; for (idx = 0; idx < RADIX_TREE_TAG_LONGS; idx++) { if (node->tags[tag][idx]) return 1; } return 0; } static inline void all_tag_set(struct radix_tree_node *node, unsigned int tag) { bitmap_fill(node->tags[tag], RADIX_TREE_MAP_SIZE); } /** * radix_tree_find_next_bit - find the next set bit in a memory region * * @node: where to begin the search * @tag: the tag index * @offset: the bitnumber to start searching at * * Unrollable variant of find_next_bit() for constant size arrays. * Tail bits starting from size to roundup(size, BITS_PER_LONG) must be zero. * Returns next bit offset, or size if nothing found. 
*/ static __always_inline unsigned long radix_tree_find_next_bit(struct radix_tree_node *node, unsigned int tag, unsigned long offset) { const unsigned long *addr = node->tags[tag]; if (offset < RADIX_TREE_MAP_SIZE) { unsigned long tmp; addr += offset / BITS_PER_LONG; tmp = *addr >> (offset % BITS_PER_LONG); if (tmp) return __ffs(tmp) + offset; offset = (offset + BITS_PER_LONG) & ~(BITS_PER_LONG - 1); while (offset < RADIX_TREE_MAP_SIZE) { tmp = *++addr; if (tmp) return __ffs(tmp) + offset; offset += BITS_PER_LONG; } } return RADIX_TREE_MAP_SIZE; } static unsigned int iter_offset(const struct radix_tree_iter *iter) { return iter->index & RADIX_TREE_MAP_MASK; } /* * The maximum index which can be stored in a radix tree */ static inline unsigned long shift_maxindex(unsigned int shift) { return (RADIX_TREE_MAP_SIZE << shift) - 1; } static inline unsigned long node_maxindex(const struct radix_tree_node *node) { return shift_maxindex(node->shift); } static unsigned long next_index(unsigned long index, const struct radix_tree_node *node, unsigned long offset) { return (index & ~node_maxindex(node)) + (offset << node->shift); } /* * This assumes that the caller has performed appropriate preallocation, and * that the caller has pinned this thread of control to the current CPU. */ static struct radix_tree_node * radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent, struct radix_tree_root *root, unsigned int shift, unsigned int offset, unsigned int count, unsigned int nr_values) { struct radix_tree_node *ret = NULL; /* * Preload code isn't irq safe and it doesn't make sense to use * preloading during an interrupt anyway as all the allocations have * to be atomic. So just do normal allocation when in interrupt. */ if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) { struct radix_tree_preload *rtp; /* * Even if the caller has preloaded, try to allocate from the * cache first for the new node to get accounted to the memory * cgroup. */ ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask | __GFP_NOWARN); if (ret) goto out; /* * Provided the caller has preloaded here, we will always * succeed in getting a node here (and never reach * kmem_cache_alloc) */ rtp = this_cpu_ptr(&radix_tree_preloads); if (rtp->nr) { ret = rtp->nodes; rtp->nodes = ret->parent; rtp->nr--; } /* * Update the allocation stack trace as this is more useful * for debugging. */ kmemleak_update_trace(ret); goto out; } ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask); out: BUG_ON(radix_tree_is_internal_node(ret)); if (ret) { ret->shift = shift; ret->offset = offset; ret->count = count; ret->nr_values = nr_values; ret->parent = parent; ret->array = root; } return ret; } void radix_tree_node_rcu_free(struct rcu_head *head) { struct radix_tree_node *node = container_of(head, struct radix_tree_node, rcu_head); /* * Must only free zeroed nodes into the slab. We can be left with * non-NULL entries by radix_tree_free_nodes, so clear the entries * and tags here. */ memset(node->slots, 0, sizeof(node->slots)); memset(node->tags, 0, sizeof(node->tags)); INIT_LIST_HEAD(&node->private_list); kmem_cache_free(radix_tree_node_cachep, node); } static inline void radix_tree_node_free(struct radix_tree_node *node) { call_rcu(&node->rcu_head, radix_tree_node_rcu_free); } /* * Load up this CPU's radix_tree_node buffer with sufficient objects to * ensure that the addition of a single element in the tree cannot fail. On * success, return zero, with preemption disabled. 
On error, return -ENOMEM * with preemption not disabled. * * To make use of this facility, the radix tree must be initialised without * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE(). */ static __must_check int __radix_tree_preload(gfp_t gfp_mask, unsigned nr) { struct radix_tree_preload *rtp; struct radix_tree_node *node; int ret = -ENOMEM; /* * Nodes preloaded by one cgroup can be used by another cgroup, so * they should never be accounted to any particular memory cgroup. */ gfp_mask &= ~__GFP_ACCOUNT; local_lock(&radix_tree_preloads.lock); rtp = this_cpu_ptr(&radix_tree_preloads); while (rtp->nr < nr) { local_unlock(&radix_tree_preloads.lock); node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask); if (node == NULL) goto out; local_lock(&radix_tree_preloads.lock); rtp = this_cpu_ptr(&radix_tree_preloads); if (rtp->nr < nr) { node->parent = rtp->nodes; rtp->nodes = node; rtp->nr++; } else { kmem_cache_free(radix_tree_node_cachep, node); } } ret = 0; out: return ret; } /* * Load up this CPU's radix_tree_node buffer with sufficient objects to * ensure that the addition of a single element in the tree cannot fail. On * success, return zero, with preemption disabled. On error, return -ENOMEM * with preemption not disabled. * * To make use of this facility, the radix tree must be initialised without * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE(). */ int radix_tree_preload(gfp_t gfp_mask) { /* Warn on non-sensical use... */ WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask)); return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE); } EXPORT_SYMBOL(radix_tree_preload); /* * The same as above function, except we don't guarantee preloading happens. * We do it, if we decide it helps. On success, return zero with preemption * disabled. On error, return -ENOMEM with preemption not disabled. */ int radix_tree_maybe_preload(gfp_t gfp_mask) { if (gfpflags_allow_blocking(gfp_mask)) return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE); /* Preloading doesn't help anything with this gfp mask, skip it */ local_lock(&radix_tree_preloads.lock); return 0; } EXPORT_SYMBOL(radix_tree_maybe_preload); static unsigned radix_tree_load_root(const struct radix_tree_root *root, struct radix_tree_node **nodep, unsigned long *maxindex) { struct radix_tree_node *node = rcu_dereference_raw(root->xa_head); *nodep = node; if (likely(radix_tree_is_internal_node(node))) { node = entry_to_node(node); *maxindex = node_maxindex(node); return node->shift + RADIX_TREE_MAP_SHIFT; } *maxindex = 0; return 0; } /* * Extend a radix tree so it can store key @index. */ static int radix_tree_extend(struct radix_tree_root *root, gfp_t gfp, unsigned long index, unsigned int shift) { void *entry; unsigned int maxshift; int tag; /* Figure out what the shift should be. 
*/ maxshift = shift; while (index > shift_maxindex(maxshift)) maxshift += RADIX_TREE_MAP_SHIFT; entry = rcu_dereference_raw(root->xa_head); if (!entry && (!is_idr(root) || root_tag_get(root, IDR_FREE))) goto out; do { struct radix_tree_node *node = radix_tree_node_alloc(gfp, NULL, root, shift, 0, 1, 0); if (!node) return -ENOMEM; if (is_idr(root)) { all_tag_set(node, IDR_FREE); if (!root_tag_get(root, IDR_FREE)) { tag_clear(node, IDR_FREE, 0); root_tag_set(root, IDR_FREE); } } else { /* Propagate the aggregated tag info to the new child */ for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) { if (root_tag_get(root, tag)) tag_set(node, tag, 0); } } BUG_ON(shift > BITS_PER_LONG); if (radix_tree_is_internal_node(entry)) { entry_to_node(entry)->parent = node; } else if (xa_is_value(entry)) { /* Moving a value entry root->xa_head to a node */ node->nr_values = 1; } /* * entry was already in the radix tree, so we do not need * rcu_assign_pointer here */ node->slots[0] = (void __rcu *)entry; entry = node_to_entry(node); rcu_assign_pointer(root->xa_head, entry); shift += RADIX_TREE_MAP_SHIFT; } while (shift <= maxshift); out: return maxshift + RADIX_TREE_MAP_SHIFT; } /** * radix_tree_shrink - shrink radix tree to minimum height * @root: radix tree root */ static inline bool radix_tree_shrink(struct radix_tree_root *root) { bool shrunk = false; for (;;) { struct radix_tree_node *node = rcu_dereference_raw(root->xa_head); struct radix_tree_node *child; if (!radix_tree_is_internal_node(node)) break; node = entry_to_node(node); /* * The candidate node has more than one child, or its child * is not at the leftmost slot, we cannot shrink. */ if (node->count != 1) break; child = rcu_dereference_raw(node->slots[0]); if (!child) break; /* * For an IDR, we must not shrink entry 0 into the root in * case somebody calls idr_replace() with a pointer that * appears to be an internal entry */ if (!node->shift && is_idr(root)) break; if (radix_tree_is_internal_node(child)) entry_to_node(child)->parent = NULL; /* * We don't need rcu_assign_pointer(), since we are simply * moving the node from one part of the tree to another: if it * was safe to dereference the old pointer to it * (node->slots[0]), it will be safe to dereference the new * one (root->xa_head) as far as dependent read barriers go. */ root->xa_head = (void __rcu *)child; if (is_idr(root) && !tag_get(node, IDR_FREE, 0)) root_tag_clear(root, IDR_FREE); /* * We have a dilemma here. The node's slot[0] must not be * NULLed in case there are concurrent lookups expecting to * find the item. However if this was a bottom-level node, * then it may be subject to the slot pointer being visible * to callers dereferencing it. If item corresponding to * slot[0] is subsequently deleted, these callers would expect * their slot to become empty sooner or later. * * For example, lockless pagecache will look up a slot, deref * the page pointer, and if the page has 0 refcount it means it * was concurrently deleted from pagecache so try the deref * again. Fortunately there is already a requirement for logic * to retry the entire slot lookup -- the indirect pointer * problem (replacing direct root node with an indirect pointer * also results in a stale slot). So tag the slot as indirect * to force callers to retry. 
*/ node->count = 0; if (!radix_tree_is_internal_node(child)) { node->slots[0] = (void __rcu *)RADIX_TREE_RETRY; } WARN_ON_ONCE(!list_empty(&node->private_list)); radix_tree_node_free(node); shrunk = true; } return shrunk; } static bool delete_node(struct radix_tree_root *root, struct radix_tree_node *node) { bool deleted = false; do { struct radix_tree_node *parent; if (node->count) { if (node_to_entry(node) == rcu_dereference_raw(root->xa_head)) deleted |= radix_tree_shrink(root); return deleted; } parent = node->parent; if (parent) { parent->slots[node->offset] = NULL; parent->count--; } else { /* * Shouldn't the tags already have all been cleared * by the caller? */ if (!is_idr(root)) root_tag_clear_all(root); root->xa_head = NULL; } WARN_ON_ONCE(!list_empty(&node->private_list)); radix_tree_node_free(node); deleted = true; node = parent; } while (node); return deleted; } /** * __radix_tree_create - create a slot in a radix tree * @root: radix tree root * @index: index key * @nodep: returns node * @slotp: returns slot * * Create, if necessary, and return the node and slot for an item * at position @index in the radix tree @root. * * Until there is more than one item in the tree, no nodes are * allocated and @root->xa_head is used as a direct slot instead of * pointing to a node, in which case *@nodep will be NULL. * * Returns -ENOMEM, or 0 for success. */ static int __radix_tree_create(struct radix_tree_root *root, unsigned long index, struct radix_tree_node **nodep, void __rcu ***slotp) { struct radix_tree_node *node = NULL, *child; void __rcu **slot = (void __rcu **)&root->xa_head; unsigned long maxindex; unsigned int shift, offset = 0; unsigned long max = index; gfp_t gfp = root_gfp_mask(root); shift = radix_tree_load_root(root, &child, &maxindex); /* Make sure the tree is high enough. */ if (max > maxindex) { int error = radix_tree_extend(root, gfp, max, shift); if (error < 0) return error; shift = error; child = rcu_dereference_raw(root->xa_head); } while (shift > 0) { shift -= RADIX_TREE_MAP_SHIFT; if (child == NULL) { /* Have to add a child node. */ child = radix_tree_node_alloc(gfp, node, root, shift, offset, 0, 0); if (!child) return -ENOMEM; rcu_assign_pointer(*slot, node_to_entry(child)); if (node) node->count++; } else if (!radix_tree_is_internal_node(child)) break; /* Go a level down */ node = entry_to_node(child); offset = radix_tree_descend(node, &child, index); slot = &node->slots[offset]; } if (nodep) *nodep = node; if (slotp) *slotp = slot; return 0; } /* * Free any nodes below this node. The tree is presumed to not need * shrinking, and any user data in the tree is presumed to not need a * destructor called on it. If we need to add a destructor, we can * add that functionality later. Note that we may not clear tags or * slots from the tree as an RCU walker may still have a pointer into * this subtree. We could replace the entries with RADIX_TREE_RETRY, * but we'll still have to clear those in rcu_free. 
*/ static void radix_tree_free_nodes(struct radix_tree_node *node) { unsigned offset = 0; struct radix_tree_node *child = entry_to_node(node); for (;;) { void *entry = rcu_dereference_raw(child->slots[offset]); if (xa_is_node(entry) && child->shift) { child = entry_to_node(entry); offset = 0; continue; } offset++; while (offset == RADIX_TREE_MAP_SIZE) { struct radix_tree_node *old = child; offset = child->offset + 1; child = child->parent; WARN_ON_ONCE(!list_empty(&old->private_list)); radix_tree_node_free(old); if (old == entry_to_node(node)) return; } } } static inline int insert_entries(struct radix_tree_node *node, void __rcu **slot, void *item) { if (*slot) return -EEXIST; rcu_assign_pointer(*slot, item); if (node) { node->count++; if (xa_is_value(item)) node->nr_values++; } return 1; } /** * radix_tree_insert - insert into a radix tree * @root: radix tree root * @index: index key * @item: item to insert * * Insert an item into the radix tree at position @index. */ int radix_tree_insert(struct radix_tree_root *root, unsigned long index, void *item) { struct radix_tree_node *node; void __rcu **slot; int error; BUG_ON(radix_tree_is_internal_node(item)); error = __radix_tree_create(root, index, &node, &slot); if (error) return error; error = insert_entries(node, slot, item); if (error < 0) return error; if (node) { unsigned offset = get_slot_offset(node, slot); BUG_ON(tag_get(node, 0, offset)); BUG_ON(tag_get(node, 1, offset)); BUG_ON(tag_get(node, 2, offset)); } else { BUG_ON(root_tags_get(root)); } return 0; } EXPORT_SYMBOL(radix_tree_insert); /** * __radix_tree_lookup - lookup an item in a radix tree * @root: radix tree root * @index: index key * @nodep: returns node * @slotp: returns slot * * Lookup and return the item at position @index in the radix * tree @root. * * Until there is more than one item in the tree, no nodes are * allocated and @root->xa_head is used as a direct slot instead of * pointing to a node, in which case *@nodep will be NULL. */ void *__radix_tree_lookup(const struct radix_tree_root *root, unsigned long index, struct radix_tree_node **nodep, void __rcu ***slotp) { struct radix_tree_node *node, *parent; unsigned long maxindex; void __rcu **slot; restart: parent = NULL; slot = (void __rcu **)&root->xa_head; radix_tree_load_root(root, &node, &maxindex); if (index > maxindex) return NULL; while (radix_tree_is_internal_node(node)) { unsigned offset; parent = entry_to_node(node); offset = radix_tree_descend(parent, &node, index); slot = parent->slots + offset; if (node == RADIX_TREE_RETRY) goto restart; if (parent->shift == 0) break; } if (nodep) *nodep = parent; if (slotp) *slotp = slot; return node; } /** * radix_tree_lookup_slot - lookup a slot in a radix tree * @root: radix tree root * @index: index key * * Returns: the slot corresponding to the position @index in the * radix tree @root. This is useful for update-if-exists operations. * * This function can be called under rcu_read_lock iff the slot is not * modified by radix_tree_replace_slot, otherwise it must be called * exclusive from other writers. Any dereference of the slot must be done * using radix_tree_deref_slot. 
*/ void __rcu **radix_tree_lookup_slot(const struct radix_tree_root *root, unsigned long index) { void __rcu **slot; if (!__radix_tree_lookup(root, index, NULL, &slot)) return NULL; return slot; } EXPORT_SYMBOL(radix_tree_lookup_slot); /** * radix_tree_lookup - perform lookup operation on a radix tree * @root: radix tree root * @index: index key * * Lookup the item at the position @index in the radix tree @root. * * This function can be called under rcu_read_lock, however the caller * must manage lifetimes of leaf nodes (eg. RCU may also be used to free * them safely). No RCU barriers are required to access or modify the * returned item, however. */ void *radix_tree_lookup(const struct radix_tree_root *root, unsigned long index) { return __radix_tree_lookup(root, index, NULL, NULL); } EXPORT_SYMBOL(radix_tree_lookup); static void replace_slot(void __rcu **slot, void *item, struct radix_tree_node *node, int count, int values) { if (node && (count || values)) { node->count += count; node->nr_values += values; } rcu_assign_pointer(*slot, item); } static bool node_tag_get(const struct radix_tree_root *root, const struct radix_tree_node *node, unsigned int tag, unsigned int offset) { if (node) return tag_get(node, tag, offset); return root_tag_get(root, tag); } /* * IDR users want to be able to store NULL in the tree, so if the slot isn't * free, don't adjust the count, even if it's transitioning between NULL and * non-NULL. For the IDA, we mark slots as being IDR_FREE while they still * have empty bits, but it only stores NULL in slots when they're being * deleted. */ static int calculate_count(struct radix_tree_root *root, struct radix_tree_node *node, void __rcu **slot, void *item, void *old) { if (is_idr(root)) { unsigned offset = get_slot_offset(node, slot); bool free = node_tag_get(root, node, IDR_FREE, offset); if (!free) return 0; if (!old) return 1; } return !!item - !!old; } /** * __radix_tree_replace - replace item in a slot * @root: radix tree root * @node: pointer to tree node * @slot: pointer to slot in @node * @item: new item to store in the slot. * * For use with __radix_tree_lookup(). Caller must hold tree write locked * across slot lookup and replacement. */ void __radix_tree_replace(struct radix_tree_root *root, struct radix_tree_node *node, void __rcu **slot, void *item) { void *old = rcu_dereference_raw(*slot); int values = !!xa_is_value(item) - !!xa_is_value(old); int count = calculate_count(root, node, slot, item, old); /* * This function supports replacing value entries and * deleting entries, but that needs accounting against the * node unless the slot is root->xa_head. */ WARN_ON_ONCE(!node && (slot != (void __rcu **)&root->xa_head) && (count || values)); replace_slot(slot, item, node, count, values); if (!node) return; delete_node(root, node); } /** * radix_tree_replace_slot - replace item in a slot * @root: radix tree root * @slot: pointer to slot * @item: new item to store in the slot. * * For use with radix_tree_lookup_slot() and * radix_tree_gang_lookup_tag_slot(). Caller must hold tree write locked * across slot lookup and replacement. * * NOTE: This cannot be used to switch between non-entries (empty slots), * regular entries, and value entries, as that requires accounting * inside the radix tree node. When switching from one type of entry or * deleting, use __radix_tree_lookup() and __radix_tree_replace() or * radix_tree_iter_replace(). 
*/ void radix_tree_replace_slot(struct radix_tree_root *root, void __rcu **slot, void *item) { __radix_tree_replace(root, NULL, slot, item); } EXPORT_SYMBOL(radix_tree_replace_slot); /** * radix_tree_iter_replace - replace item in a slot * @root: radix tree root * @iter: iterator state * @slot: pointer to slot * @item: new item to store in the slot. * * For use with radix_tree_for_each_slot(). * Caller must hold tree write locked. */ void radix_tree_iter_replace(struct radix_tree_root *root, const struct radix_tree_iter *iter, void __rcu **slot, void *item) { __radix_tree_replace(root, iter->node, slot, item); } static void node_tag_set(struct radix_tree_root *root, struct radix_tree_node *node, unsigned int tag, unsigned int offset) { while (node) { if (tag_get(node, tag, offset)) return; tag_set(node, tag, offset); offset = node->offset; node = node->parent; } if (!root_tag_get(root, tag)) root_tag_set(root, tag); } /** * radix_tree_tag_set - set a tag on a radix tree node * @root: radix tree root * @index: index key * @tag: tag index * * Set the search tag (which must be < RADIX_TREE_MAX_TAGS) * corresponding to @index in the radix tree. From * the root all the way down to the leaf node. * * Returns the address of the tagged item. Setting a tag on a not-present * item is a bug. */ void *radix_tree_tag_set(struct radix_tree_root *root, unsigned long index, unsigned int tag) { struct radix_tree_node *node, *parent; unsigned long maxindex; radix_tree_load_root(root, &node, &maxindex); BUG_ON(index > maxindex); while (radix_tree_is_internal_node(node)) { unsigned offset; parent = entry_to_node(node); offset = radix_tree_descend(parent, &node, index); BUG_ON(!node); if (!tag_get(parent, tag, offset)) tag_set(parent, tag, offset); } /* set the root's tag bit */ if (!root_tag_get(root, tag)) root_tag_set(root, tag); return node; } EXPORT_SYMBOL(radix_tree_tag_set); static void node_tag_clear(struct radix_tree_root *root, struct radix_tree_node *node, unsigned int tag, unsigned int offset) { while (node) { if (!tag_get(node, tag, offset)) return; tag_clear(node, tag, offset); if (any_tag_set(node, tag)) return; offset = node->offset; node = node->parent; } /* clear the root's tag bit */ if (root_tag_get(root, tag)) root_tag_clear(root, tag); } /** * radix_tree_tag_clear - clear a tag on a radix tree node * @root: radix tree root * @index: index key * @tag: tag index * * Clear the search tag (which must be < RADIX_TREE_MAX_TAGS) * corresponding to @index in the radix tree. If this causes * the leaf node to have no tags set then clear the tag in the * next-to-leaf node, etc. * * Returns the address of the tagged item on success, else NULL. ie: * has the same return value and semantics as radix_tree_lookup(). 
*/ void *radix_tree_tag_clear(struct radix_tree_root *root, unsigned long index, unsigned int tag) { struct radix_tree_node *node, *parent; unsigned long maxindex; int offset = 0; radix_tree_load_root(root, &node, &maxindex); if (index > maxindex) return NULL; parent = NULL; while (radix_tree_is_internal_node(node)) { parent = entry_to_node(node); offset = radix_tree_descend(parent, &node, index); } if (node) node_tag_clear(root, parent, tag, offset); return node; } EXPORT_SYMBOL(radix_tree_tag_clear); /** * radix_tree_iter_tag_clear - clear a tag on the current iterator entry * @root: radix tree root * @iter: iterator state * @tag: tag to clear */ void radix_tree_iter_tag_clear(struct radix_tree_root *root, const struct radix_tree_iter *iter, unsigned int tag) { node_tag_clear(root, iter->node, tag, iter_offset(iter)); } /** * radix_tree_tag_get - get a tag on a radix tree node * @root: radix tree root * @index: index key * @tag: tag index (< RADIX_TREE_MAX_TAGS) * * Return values: * * 0: tag not present or not set * 1: tag set * * Note that the return value of this function may not be relied on, even if * the RCU lock is held, unless tag modification and node deletion are excluded * from concurrency. */ int radix_tree_tag_get(const struct radix_tree_root *root, unsigned long index, unsigned int tag) { struct radix_tree_node *node, *parent; unsigned long maxindex; if (!root_tag_get(root, tag)) return 0; radix_tree_load_root(root, &node, &maxindex); if (index > maxindex) return 0; while (radix_tree_is_internal_node(node)) { unsigned offset; parent = entry_to_node(node); offset = radix_tree_descend(parent, &node, index); if (!tag_get(parent, tag, offset)) return 0; if (node == RADIX_TREE_RETRY) break; } return 1; } EXPORT_SYMBOL(radix_tree_tag_get); /* Construct iter->tags bit-mask from node->tags[tag] array */ static void set_iter_tags(struct radix_tree_iter *iter, struct radix_tree_node *node, unsigned offset, unsigned tag) { unsigned tag_long = offset / BITS_PER_LONG; unsigned tag_bit = offset % BITS_PER_LONG; if (!node) { iter->tags = 1; return; } iter->tags = node->tags[tag][tag_long] >> tag_bit; /* This never happens if RADIX_TREE_TAG_LONGS == 1 */ if (tag_long < RADIX_TREE_TAG_LONGS - 1) { /* Pick tags from next element */ if (tag_bit) iter->tags |= node->tags[tag][tag_long + 1] << (BITS_PER_LONG - tag_bit); /* Clip chunk size, here only BITS_PER_LONG tags */ iter->next_index = __radix_tree_iter_add(iter, BITS_PER_LONG); } } void __rcu **radix_tree_iter_resume(void __rcu **slot, struct radix_tree_iter *iter) { iter->index = __radix_tree_iter_add(iter, 1); iter->next_index = iter->index; iter->tags = 0; return NULL; } EXPORT_SYMBOL(radix_tree_iter_resume); /** * radix_tree_next_chunk - find next chunk of slots for iteration * * @root: radix tree root * @iter: iterator state * @flags: RADIX_TREE_ITER_* flags and tag index * Returns: pointer to chunk first slot, or NULL if iteration is over */ void __rcu **radix_tree_next_chunk(const struct radix_tree_root *root, struct radix_tree_iter *iter, unsigned flags) { unsigned tag = flags & RADIX_TREE_ITER_TAG_MASK; struct radix_tree_node *node, *child; unsigned long index, offset, maxindex; if ((flags & RADIX_TREE_ITER_TAGGED) && !root_tag_get(root, tag)) return NULL; /* * Catch next_index overflow after ~0UL. iter->index never overflows * during iterating; it can be zero only at the beginning. * And we cannot overflow iter->next_index in a single step, * because RADIX_TREE_MAP_SHIFT < BITS_PER_LONG. 
* * This condition also used by radix_tree_next_slot() to stop * contiguous iterating, and forbid switching to the next chunk. */ index = iter->next_index; if (!index && iter->index) return NULL; restart: radix_tree_load_root(root, &child, &maxindex); if (index > maxindex) return NULL; if (!child) return NULL; if (!radix_tree_is_internal_node(child)) { /* Single-slot tree */ iter->index = index; iter->next_index = maxindex + 1; iter->tags = 1; iter->node = NULL; return (void __rcu **)&root->xa_head; } do { node = entry_to_node(child); offset = radix_tree_descend(node, &child, index); if ((flags & RADIX_TREE_ITER_TAGGED) ? !tag_get(node, tag, offset) : !child) { /* Hole detected */ if (flags & RADIX_TREE_ITER_CONTIG) return NULL; if (flags & RADIX_TREE_ITER_TAGGED) offset = radix_tree_find_next_bit(node, tag, offset + 1); else while (++offset < RADIX_TREE_MAP_SIZE) { void *slot = rcu_dereference_raw( node->slots[offset]); if (slot) break; } index &= ~node_maxindex(node); index += offset << node->shift; /* Overflow after ~0UL */ if (!index) return NULL; if (offset == RADIX_TREE_MAP_SIZE) goto restart; child = rcu_dereference_raw(node->slots[offset]); } if (!child) goto restart; if (child == RADIX_TREE_RETRY) break; } while (node->shift && radix_tree_is_internal_node(child)); /* Update the iterator state */ iter->index = (index &~ node_maxindex(node)) | offset; iter->next_index = (index | node_maxindex(node)) + 1; iter->node = node; if (flags & RADIX_TREE_ITER_TAGGED) set_iter_tags(iter, node, offset, tag); return node->slots + offset; } EXPORT_SYMBOL(radix_tree_next_chunk); /** * radix_tree_gang_lookup - perform multiple lookup on a radix tree * @root: radix tree root * @results: where the results of the lookup are placed * @first_index: start the lookup from this key * @max_items: place up to this many items at *results * * Performs an index-ascending scan of the tree for present items. Places * them at *@results and returns the number of items which were placed at * *@results. * * The implementation is naive. * * Like radix_tree_lookup, radix_tree_gang_lookup may be called under * rcu_read_lock. In this case, rather than the returned results being * an atomic snapshot of the tree at a single point in time, the * semantics of an RCU protected gang lookup are as though multiple * radix_tree_lookups have been issued in individual locks, and results * stored in 'results'. */ unsigned int radix_tree_gang_lookup(const struct radix_tree_root *root, void **results, unsigned long first_index, unsigned int max_items) { struct radix_tree_iter iter; void __rcu **slot; unsigned int ret = 0; if (unlikely(!max_items)) return 0; radix_tree_for_each_slot(slot, root, &iter, first_index) { results[ret] = rcu_dereference_raw(*slot); if (!results[ret]) continue; if (radix_tree_is_internal_node(results[ret])) { slot = radix_tree_iter_retry(&iter); continue; } if (++ret == max_items) break; } return ret; } EXPORT_SYMBOL(radix_tree_gang_lookup); /** * radix_tree_gang_lookup_tag - perform multiple lookup on a radix tree * based on a tag * @root: radix tree root * @results: where the results of the lookup are placed * @first_index: start the lookup from this key * @max_items: place up to this many items at *results * @tag: the tag index (< RADIX_TREE_MAX_TAGS) * * Performs an index-ascending scan of the tree for present items which * have the tag indexed by @tag set. Places the items at *@results and * returns the number of items which were placed at *@results. 
*/ unsigned int radix_tree_gang_lookup_tag(const struct radix_tree_root *root, void **results, unsigned long first_index, unsigned int max_items, unsigned int tag) { struct radix_tree_iter iter; void __rcu **slot; unsigned int ret = 0; if (unlikely(!max_items)) return 0; radix_tree_for_each_tagged(slot, root, &iter, first_index, tag) { results[ret] = rcu_dereference_raw(*slot); if (!results[ret]) continue; if (radix_tree_is_internal_node(results[ret])) { slot = radix_tree_iter_retry(&iter); continue; } if (++ret == max_items) break; } return ret; } EXPORT_SYMBOL(radix_tree_gang_lookup_tag); /** * radix_tree_gang_lookup_tag_slot - perform multiple slot lookup on a * radix tree based on a tag * @root: radix tree root * @results: where the results of the lookup are placed * @first_index: start the lookup from this key * @max_items: place up to this many items at *results * @tag: the tag index (< RADIX_TREE_MAX_TAGS) * * Performs an index-ascending scan of the tree for present items which * have the tag indexed by @tag set. Places the slots at *@results and * returns the number of slots which were placed at *@results. */ unsigned int radix_tree_gang_lookup_tag_slot(const struct radix_tree_root *root, void __rcu ***results, unsigned long first_index, unsigned int max_items, unsigned int tag) { struct radix_tree_iter iter; void __rcu **slot; unsigned int ret = 0; if (unlikely(!max_items)) return 0; radix_tree_for_each_tagged(slot, root, &iter, first_index, tag) { results[ret] = slot; if (++ret == max_items) break; } return ret; } EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot); static bool __radix_tree_delete(struct radix_tree_root *root, struct radix_tree_node *node, void __rcu **slot) { void *old = rcu_dereference_raw(*slot); int values = xa_is_value(old) ? -1 : 0; unsigned offset = get_slot_offset(node, slot); int tag; if (is_idr(root)) node_tag_set(root, node, IDR_FREE, offset); else for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) node_tag_clear(root, node, tag, offset); replace_slot(slot, NULL, node, -1, values); return node && delete_node(root, node); } /** * radix_tree_iter_delete - delete the entry at this iterator position * @root: radix tree root * @iter: iterator state * @slot: pointer to slot * * Delete the entry at the position currently pointed to by the iterator. * This may result in the current node being freed; if it is, the iterator * is advanced so that it will not reference the freed memory. This * function may be called without any locking if there are no other threads * which can access this tree. */ void radix_tree_iter_delete(struct radix_tree_root *root, struct radix_tree_iter *iter, void __rcu **slot) { if (__radix_tree_delete(root, iter->node, slot)) iter->index = iter->next_index; } EXPORT_SYMBOL(radix_tree_iter_delete); /** * radix_tree_delete_item - delete an item from a radix tree * @root: radix tree root * @index: index key * @item: expected item * * Remove @item at @index from the radix tree rooted at @root. * * Return: the deleted entry, or %NULL if it was not present * or the entry at the given @index was not @item. 
*/ void *radix_tree_delete_item(struct radix_tree_root *root, unsigned long index, void *item) { struct radix_tree_node *node = NULL; void __rcu **slot = NULL; void *entry; entry = __radix_tree_lookup(root, index, &node, &slot); if (!slot) return NULL; if (!entry && (!is_idr(root) || node_tag_get(root, node, IDR_FREE, get_slot_offset(node, slot)))) return NULL; if (item && entry != item) return NULL; __radix_tree_delete(root, node, slot); return entry; } EXPORT_SYMBOL(radix_tree_delete_item); /** * radix_tree_delete - delete an entry from a radix tree * @root: radix tree root * @index: index key * * Remove the entry at @index from the radix tree rooted at @root. * * Return: The deleted entry, or %NULL if it was not present. */ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index) { return radix_tree_delete_item(root, index, NULL); } EXPORT_SYMBOL(radix_tree_delete); /** * radix_tree_tagged - test whether any items in the tree are tagged * @root: radix tree root * @tag: tag to test */ int radix_tree_tagged(const struct radix_tree_root *root, unsigned int tag) { return root_tag_get(root, tag); } EXPORT_SYMBOL(radix_tree_tagged); /** * idr_preload - preload for idr_alloc() * @gfp_mask: allocation mask to use for preloading * * Preallocate memory to use for the next call to idr_alloc(). This function * returns with preemption disabled. It will be enabled by idr_preload_end(). */ void idr_preload(gfp_t gfp_mask) { if (__radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE)) local_lock(&radix_tree_preloads.lock); } EXPORT_SYMBOL(idr_preload); void __rcu **idr_get_free(struct radix_tree_root *root, struct radix_tree_iter *iter, gfp_t gfp, unsigned long max) { struct radix_tree_node *node = NULL, *child; void __rcu **slot = (void __rcu **)&root->xa_head; unsigned long maxindex, start = iter->next_index; unsigned int shift, offset = 0; grow: shift = radix_tree_load_root(root, &child, &maxindex); if (!radix_tree_tagged(root, IDR_FREE)) start = max(start, maxindex + 1); if (start > max) return ERR_PTR(-ENOSPC); if (start > maxindex) { int error = radix_tree_extend(root, gfp, start, shift); if (error < 0) return ERR_PTR(error); shift = error; child = rcu_dereference_raw(root->xa_head); } if (start == 0 && shift == 0) shift = RADIX_TREE_MAP_SHIFT; while (shift) { shift -= RADIX_TREE_MAP_SHIFT; if (child == NULL) { /* Have to add a child node. */ child = radix_tree_node_alloc(gfp, node, root, shift, offset, 0, 0); if (!child) return ERR_PTR(-ENOMEM); all_tag_set(child, IDR_FREE); rcu_assign_pointer(*slot, node_to_entry(child)); if (node) node->count++; } else if (!radix_tree_is_internal_node(child)) break; node = entry_to_node(child); offset = radix_tree_descend(node, &child, start); if (!tag_get(node, IDR_FREE, offset)) { offset = radix_tree_find_next_bit(node, IDR_FREE, offset + 1); start = next_index(start, node, offset); if (start > max || start == 0) return ERR_PTR(-ENOSPC); while (offset == RADIX_TREE_MAP_SIZE) { offset = node->offset + 1; node = node->parent; if (!node) goto grow; shift = node->shift; } child = rcu_dereference_raw(node->slots[offset]); } slot = &node->slots[offset]; } iter->index = start; if (node) iter->next_index = 1 + min(max, (start | node_maxindex(node))); else iter->next_index = 1; iter->node = node; set_iter_tags(iter, node, offset, IDR_FREE); return slot; } /** * idr_destroy - release all internal memory from an IDR * @idr: idr handle * * After this function is called, the IDR is empty, and may be reused or * the data structure containing it may be freed. 
* * A typical clean-up sequence for objects stored in an idr tree will use * idr_for_each() to free all objects, if necessary, then idr_destroy() to * free the memory used to keep track of those objects. */ void idr_destroy(struct idr *idr) { struct radix_tree_node *node = rcu_dereference_raw(idr->idr_rt.xa_head); if (radix_tree_is_internal_node(node)) radix_tree_free_nodes(node); idr->idr_rt.xa_head = NULL; root_tag_set(&idr->idr_rt, IDR_FREE); } EXPORT_SYMBOL(idr_destroy); static void radix_tree_node_ctor(void *arg) { struct radix_tree_node *node = arg; memset(node, 0, sizeof(*node)); INIT_LIST_HEAD(&node->private_list); } static int radix_tree_cpu_dead(unsigned int cpu) { struct radix_tree_preload *rtp; struct radix_tree_node *node; /* Free per-cpu pool of preloaded nodes */ rtp = &per_cpu(radix_tree_preloads, cpu); while (rtp->nr) { node = rtp->nodes; rtp->nodes = node->parent; kmem_cache_free(radix_tree_node_cachep, node); rtp->nr--; } return 0; } void __init radix_tree_init(void) { int ret; BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32); BUILD_BUG_ON(ROOT_IS_IDR & ~GFP_ZONEMASK); BUILD_BUG_ON(XA_CHUNK_SIZE > 255); radix_tree_node_cachep = kmem_cache_create("radix_tree_node", sizeof(struct radix_tree_node), 0, SLAB_PANIC | SLAB_RECLAIM_ACCOUNT, radix_tree_node_ctor); ret = cpuhp_setup_state_nocalls(CPUHP_RADIX_DEAD, "lib/radix:dead", NULL, radix_tree_cpu_dead); WARN_ON(ret < 0); } |
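/*
 * A minimal usage sketch of the preload contract documented above
 * (radix_tree_preload() paired with radix_tree_insert()).  This block is
 * not part of radix-tree.c; the tree, lock and function names below are
 * hypothetical, chosen only to illustrate the calling convention.
 */
#include <linux/radix-tree.h>
#include <linux/spinlock.h>
#include <linux/gfp.h>

/* Tree must be initialised without __GFP_DIRECT_RECLAIM for preloading. */
static RADIX_TREE(example_tree, GFP_ATOMIC);
static DEFINE_SPINLOCK(example_lock);

static int example_store(unsigned long index, void *item)
{
	int err;

	/* Preallocate nodes while sleeping is still allowed. */
	err = radix_tree_preload(GFP_KERNEL);
	if (err)
		return err;

	/*
	 * Preemption is now disabled; after a successful preload the
	 * insertion itself cannot fail with -ENOMEM (it may still
	 * return -EEXIST if the slot is occupied).
	 */
	spin_lock(&example_lock);
	err = radix_tree_insert(&example_tree, index, item);
	spin_unlock(&example_lock);

	radix_tree_preload_end();	/* re-enables preemption */
	return err;
}

static void *example_remove(unsigned long index)
{
	void *item;

	spin_lock(&example_lock);
	item = radix_tree_delete(&example_tree, index);
	spin_unlock(&example_lock);
	return item;
}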
/* SPDX-License-Identifier: GPL-2.0+ */
#ifndef _LINUX_XARRAY_H
#define _LINUX_XARRAY_H
/*
 * eXtensible Arrays
 * Copyright (c) 2017 Microsoft Corporation
 * Author: Matthew Wilcox <willy@infradead.org>
 *
 * See Documentation/core-api/xarray.rst for how to use the XArray.
 */

#include <linux/bitmap.h>
#include <linux/bug.h>
#include <linux/compiler.h>
#include <linux/err.h>
#include <linux/gfp.h>
#include <linux/kconfig.h>
#include <linux/limits.h>
#include <linux/lockdep.h>
#include <linux/rcupdate.h>
#include <linux/sched/mm.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct list_lru;

/*
 * The bottom two bits of the entry determine how the XArray interprets
 * the contents:
 *
 * 00: Pointer entry
 * 10: Internal entry
 * x1: Value entry or tagged pointer
 *
 * Attempting to store internal entries in the XArray is a bug.
 *
 * Most internal entries are pointers to the next node in the tree.
 * The following internal entries have a special meaning:
 *
 * 0-62: Sibling entries
 * 256: Retry entry
 * 257: Zero entry
 *
 * Errors are also represented as internal entries, but use the negative
 * space (-4094 to -2). They're never stored in the slots array; only
 * returned by the normal API.
 */

#define BITS_PER_XA_VALUE	(BITS_PER_LONG - 1)

/**
 * xa_mk_value() - Create an XArray entry from an integer.
 * @v: Value to store in XArray.
 *
 * Context: Any context.
 * Return: An entry suitable for storing in the XArray.
 */
static inline void *xa_mk_value(unsigned long v)
{
	WARN_ON((long)v < 0);
	return (void *)((v << 1) | 1);
}

/**
 * xa_to_value() - Get value stored in an XArray entry.
 * @entry: XArray entry.
 *
 * Context: Any context.
 * Return: The value stored in the XArray entry.
 */
static inline unsigned long xa_to_value(const void *entry)
{
	return (unsigned long)entry >> 1;
}

/**
 * xa_is_value() - Determine if an entry is a value.
 * @entry: XArray entry.
 *
 * Context: Any context.
 * Return: True if the entry is a value, false if it is a pointer.
*/ static inline bool xa_is_value(const void *entry) { return (unsigned long)entry & 1; } /** * xa_tag_pointer() - Create an XArray entry for a tagged pointer. * @p: Plain pointer. * @tag: Tag value (0, 1 or 3). * * If the user of the XArray prefers, they can tag their pointers instead * of storing value entries. Three tags are available (0, 1 and 3). * These are distinct from the xa_mark_t as they are not replicated up * through the array and cannot be searched for. * * Context: Any context. * Return: An XArray entry. */ static inline void *xa_tag_pointer(void *p, unsigned long tag) { return (void *)((unsigned long)p | tag); } /** * xa_untag_pointer() - Turn an XArray entry into a plain pointer. * @entry: XArray entry. * * If you have stored a tagged pointer in the XArray, call this function * to get the untagged version of the pointer. * * Context: Any context. * Return: A pointer. */ static inline void *xa_untag_pointer(void *entry) { return (void *)((unsigned long)entry & ~3UL); } /** * xa_pointer_tag() - Get the tag stored in an XArray entry. * @entry: XArray entry. * * If you have stored a tagged pointer in the XArray, call this function * to get the tag of that pointer. * * Context: Any context. * Return: A tag. */ static inline unsigned int xa_pointer_tag(void *entry) { return (unsigned long)entry & 3UL; } /* * xa_mk_internal() - Create an internal entry. * @v: Value to turn into an internal entry. * * Internal entries are used for a number of purposes. Entries 0-255 are * used for sibling entries (only 0-62 are used by the current code). 256 * is used for the retry entry. 257 is used for the reserved / zero entry. * Negative internal entries are used to represent errnos. Node pointers * are also tagged as internal entries in some situations. * * Context: Any context. * Return: An XArray internal entry corresponding to this value. */ static inline void *xa_mk_internal(unsigned long v) { return (void *)((v << 2) | 2); } /* * xa_to_internal() - Extract the value from an internal entry. * @entry: XArray entry. * * Context: Any context. * Return: The value which was stored in the internal entry. */ static inline unsigned long xa_to_internal(const void *entry) { return (unsigned long)entry >> 2; } /* * xa_is_internal() - Is the entry an internal entry? * @entry: XArray entry. * * Context: Any context. * Return: %true if the entry is an internal entry. */ static inline bool xa_is_internal(const void *entry) { return ((unsigned long)entry & 3) == 2; } #define XA_ZERO_ENTRY xa_mk_internal(257) /** * xa_is_zero() - Is the entry a zero entry? * @entry: Entry retrieved from the XArray * * The normal API will return NULL as the contents of a slot containing * a zero entry. You can only see zero entries by using the advanced API. * * Return: %true if the entry is a zero entry. */ static inline bool xa_is_zero(const void *entry) { return unlikely(entry == XA_ZERO_ENTRY); } /** * xa_is_err() - Report whether an XArray operation returned an error * @entry: Result from calling an XArray function * * If an XArray operation cannot complete an operation, it will return * a special value indicating an error. This function tells you * whether an error occurred; xa_err() tells you which error occurred. * * Context: Any context. * Return: %true if the entry indicates an error. */ static inline bool xa_is_err(const void *entry) { return unlikely(xa_is_internal(entry) && entry >= xa_mk_internal(-MAX_ERRNO)); } /** * xa_err() - Turn an XArray result into an errno. 
* @entry: Result from calling an XArray function. * * If an XArray operation cannot complete an operation, it will return * a special pointer value which encodes an errno. This function extracts * the errno from the pointer value, or returns 0 if the pointer does not * represent an errno. * * Context: Any context. * Return: A negative errno or 0. */ static inline int xa_err(void *entry) { /* xa_to_internal() would not do sign extension. */ if (xa_is_err(entry)) return (long)entry >> 2; return 0; } /** * struct xa_limit - Represents a range of IDs. * @min: The lowest ID to allocate (inclusive). * @max: The maximum ID to allocate (inclusive). * * This structure is used either directly or via the XA_LIMIT() macro * to communicate the range of IDs that are valid for allocation. * Three common ranges are predefined for you: * * xa_limit_32b - [0 - UINT_MAX] * * xa_limit_31b - [0 - INT_MAX] * * xa_limit_16b - [0 - USHRT_MAX] */ struct xa_limit { u32 max; u32 min; }; #define XA_LIMIT(_min, _max) (struct xa_limit) { .min = _min, .max = _max } #define xa_limit_32b XA_LIMIT(0, UINT_MAX) #define xa_limit_31b XA_LIMIT(0, INT_MAX) #define xa_limit_16b XA_LIMIT(0, USHRT_MAX) typedef unsigned __bitwise xa_mark_t; #define XA_MARK_0 ((__force xa_mark_t)0U) #define XA_MARK_1 ((__force xa_mark_t)1U) #define XA_MARK_2 ((__force xa_mark_t)2U) #define XA_PRESENT ((__force xa_mark_t)8U) #define XA_MARK_MAX XA_MARK_2 #define XA_FREE_MARK XA_MARK_0 enum xa_lock_type { XA_LOCK_IRQ = 1, XA_LOCK_BH = 2, }; /* * Values for xa_flags. The radix tree stores its GFP flags in the xa_flags, * and we remain compatible with that. */ #define XA_FLAGS_LOCK_IRQ ((__force gfp_t)XA_LOCK_IRQ) #define XA_FLAGS_LOCK_BH ((__force gfp_t)XA_LOCK_BH) #define XA_FLAGS_TRACK_FREE ((__force gfp_t)4U) #define XA_FLAGS_ZERO_BUSY ((__force gfp_t)8U) #define XA_FLAGS_ALLOC_WRAPPED ((__force gfp_t)16U) #define XA_FLAGS_ACCOUNT ((__force gfp_t)32U) #define XA_FLAGS_MARK(mark) ((__force gfp_t)((1U << __GFP_BITS_SHIFT) << \ (__force unsigned)(mark))) /* ALLOC is for a normal 0-based alloc. ALLOC1 is for an 1-based alloc */ #define XA_FLAGS_ALLOC (XA_FLAGS_TRACK_FREE | XA_FLAGS_MARK(XA_FREE_MARK)) #define XA_FLAGS_ALLOC1 (XA_FLAGS_TRACK_FREE | XA_FLAGS_ZERO_BUSY) /** * struct xarray - The anchor of the XArray. * @xa_lock: Lock that protects the contents of the XArray. * * To use the xarray, define it statically or embed it in your data structure. * It is a very small data structure, so it does not usually make sense to * allocate it separately and keep a pointer to it in your data structure. * * You may use the xa_lock to protect your own data structures as well. */ /* * If all of the entries in the array are NULL, @xa_head is a NULL pointer. * If the only non-NULL entry in the array is at index 0, @xa_head is that * entry. If any other entry in the array is non-NULL, @xa_head points * to an @xa_node. */ struct xarray { spinlock_t xa_lock; /* private: The rest of the data structure is not to be used directly. */ gfp_t xa_flags; void __rcu * xa_head; }; #define XARRAY_INIT(name, flags) { \ .xa_lock = __SPIN_LOCK_UNLOCKED(name.xa_lock), \ .xa_flags = flags, \ .xa_head = NULL, \ } /** * DEFINE_XARRAY_FLAGS() - Define an XArray with custom flags. * @name: A string that names your XArray. * @flags: XA_FLAG values. * * This is intended for file scope definitions of XArrays. It declares * and initialises an empty XArray with the chosen name and flags. 
It is * equivalent to calling xa_init_flags() on the array, but it does the * initialisation at compiletime instead of runtime. */ #define DEFINE_XARRAY_FLAGS(name, flags) \ struct xarray name = XARRAY_INIT(name, flags) /** * DEFINE_XARRAY() - Define an XArray. * @name: A string that names your XArray. * * This is intended for file scope definitions of XArrays. It declares * and initialises an empty XArray with the chosen name. It is equivalent * to calling xa_init() on the array, but it does the initialisation at * compiletime instead of runtime. */ #define DEFINE_XARRAY(name) DEFINE_XARRAY_FLAGS(name, 0) /** * DEFINE_XARRAY_ALLOC() - Define an XArray which allocates IDs starting at 0. * @name: A string that names your XArray. * * This is intended for file scope definitions of allocating XArrays. * See also DEFINE_XARRAY(). */ #define DEFINE_XARRAY_ALLOC(name) DEFINE_XARRAY_FLAGS(name, XA_FLAGS_ALLOC) /** * DEFINE_XARRAY_ALLOC1() - Define an XArray which allocates IDs starting at 1. * @name: A string that names your XArray. * * This is intended for file scope definitions of allocating XArrays. * See also DEFINE_XARRAY(). */ #define DEFINE_XARRAY_ALLOC1(name) DEFINE_XARRAY_FLAGS(name, XA_FLAGS_ALLOC1) void *xa_load(struct xarray *, unsigned long index); void *xa_store(struct xarray *, unsigned long index, void *entry, gfp_t); void *xa_erase(struct xarray *, unsigned long index); void *xa_store_range(struct xarray *, unsigned long first, unsigned long last, void *entry, gfp_t); bool xa_get_mark(struct xarray *, unsigned long index, xa_mark_t); void xa_set_mark(struct xarray *, unsigned long index, xa_mark_t); void xa_clear_mark(struct xarray *, unsigned long index, xa_mark_t); void *xa_find(struct xarray *xa, unsigned long *index, unsigned long max, xa_mark_t) __attribute__((nonnull(2))); void *xa_find_after(struct xarray *xa, unsigned long *index, unsigned long max, xa_mark_t) __attribute__((nonnull(2))); unsigned int xa_extract(struct xarray *, void **dst, unsigned long start, unsigned long max, unsigned int n, xa_mark_t); void xa_destroy(struct xarray *); /** * xa_init_flags() - Initialise an empty XArray with flags. * @xa: XArray. * @flags: XA_FLAG values. * * If you need to initialise an XArray with special flags (eg you need * to take the lock from interrupt context), use this function instead * of xa_init(). * * Context: Any context. */ static inline void xa_init_flags(struct xarray *xa, gfp_t flags) { spin_lock_init(&xa->xa_lock); xa->xa_flags = flags; xa->xa_head = NULL; } /** * xa_init() - Initialise an empty XArray. * @xa: XArray. * * An empty XArray is full of NULL entries. * * Context: Any context. */ static inline void xa_init(struct xarray *xa) { xa_init_flags(xa, 0); } /** * xa_empty() - Determine if an array has any present entries. * @xa: XArray. * * Context: Any context. * Return: %true if the array contains only NULL pointers. */ static inline bool xa_empty(const struct xarray *xa) { return xa->xa_head == NULL; } /** * xa_marked() - Inquire whether any entry in this array has a mark set * @xa: Array * @mark: Mark value * * Context: Any context. * Return: %true if any entry has this mark set. */ static inline bool xa_marked(const struct xarray *xa, xa_mark_t mark) { return xa->xa_flags & XA_FLAGS_MARK(mark); } /** * xa_for_each_range() - Iterate over a portion of an XArray. * @xa: XArray. * @index: Index of @entry. * @entry: Entry retrieved from array. * @start: First index to retrieve from array. * @last: Last index to retrieve from array. 
* * During the iteration, @entry will have the value of the entry stored * in @xa at @index. You may modify @index during the iteration if you * want to skip or reprocess indices. It is safe to modify the array * during the iteration. At the end of the iteration, @entry will be set * to NULL and @index will have a value less than or equal to max. * * xa_for_each_range() is O(n.log(n)) while xas_for_each() is O(n). You have * to handle your own locking with xas_for_each(), and if you have to unlock * after each iteration, it will also end up being O(n.log(n)). * xa_for_each_range() will spin if it hits a retry entry; if you intend to * see retry entries, you should use the xas_for_each() iterator instead. * The xas_for_each() iterator will expand into more inline code than * xa_for_each_range(). * * Context: Any context. Takes and releases the RCU lock. */ #define xa_for_each_range(xa, index, entry, start, last) \ for (index = start, \ entry = xa_find(xa, &index, last, XA_PRESENT); \ entry; \ entry = xa_find_after(xa, &index, last, XA_PRESENT)) /** * xa_for_each_start() - Iterate over a portion of an XArray. * @xa: XArray. * @index: Index of @entry. * @entry: Entry retrieved from array. * @start: First index to retrieve from array. * * During the iteration, @entry will have the value of the entry stored * in @xa at @index. You may modify @index during the iteration if you * want to skip or reprocess indices. It is safe to modify the array * during the iteration. At the end of the iteration, @entry will be set * to NULL and @index will have a value less than or equal to max. * * xa_for_each_start() is O(n.log(n)) while xas_for_each() is O(n). You have * to handle your own locking with xas_for_each(), and if you have to unlock * after each iteration, it will also end up being O(n.log(n)). * xa_for_each_start() will spin if it hits a retry entry; if you intend to * see retry entries, you should use the xas_for_each() iterator instead. * The xas_for_each() iterator will expand into more inline code than * xa_for_each_start(). * * Context: Any context. Takes and releases the RCU lock. */ #define xa_for_each_start(xa, index, entry, start) \ xa_for_each_range(xa, index, entry, start, ULONG_MAX) /** * xa_for_each() - Iterate over present entries in an XArray. * @xa: XArray. * @index: Index of @entry. * @entry: Entry retrieved from array. * * During the iteration, @entry will have the value of the entry stored * in @xa at @index. You may modify @index during the iteration if you want * to skip or reprocess indices. It is safe to modify the array during the * iteration. At the end of the iteration, @entry will be set to NULL and * @index will have a value less than or equal to max. * * xa_for_each() is O(n.log(n)) while xas_for_each() is O(n). You have * to handle your own locking with xas_for_each(), and if you have to unlock * after each iteration, it will also end up being O(n.log(n)). xa_for_each() * will spin if it hits a retry entry; if you intend to see retry entries, * you should use the xas_for_each() iterator instead. The xas_for_each() * iterator will expand into more inline code than xa_for_each(). * * Context: Any context. Takes and releases the RCU lock. */ #define xa_for_each(xa, index, entry) \ xa_for_each_start(xa, index, entry, 0) /** * xa_for_each_marked() - Iterate over marked entries in an XArray. * @xa: XArray. * @index: Index of @entry. * @entry: Entry retrieved from array. * @filter: Selection criterion. 
* * During the iteration, @entry will have the value of the entry stored * in @xa at @index. The iteration will skip all entries in the array * which do not match @filter. You may modify @index during the iteration * if you want to skip or reprocess indices. It is safe to modify the array * during the iteration. At the end of the iteration, @entry will be set to * NULL and @index will have a value less than or equal to max. * * xa_for_each_marked() is O(n.log(n)) while xas_for_each_marked() is O(n). * You have to handle your own locking with xas_for_each(), and if you have * to unlock after each iteration, it will also end up being O(n.log(n)). * xa_for_each_marked() will spin if it hits a retry entry; if you intend to * see retry entries, you should use the xas_for_each_marked() iterator * instead. The xas_for_each_marked() iterator will expand into more inline * code than xa_for_each_marked(). * * Context: Any context. Takes and releases the RCU lock. */ #define xa_for_each_marked(xa, index, entry, filter) \ for (index = 0, entry = xa_find(xa, &index, ULONG_MAX, filter); \ entry; entry = xa_find_after(xa, &index, ULONG_MAX, filter)) #define xa_trylock(xa) spin_trylock(&(xa)->xa_lock) #define xa_lock(xa) spin_lock(&(xa)->xa_lock) #define xa_unlock(xa) spin_unlock(&(xa)->xa_lock) #define xa_lock_bh(xa) spin_lock_bh(&(xa)->xa_lock) #define xa_unlock_bh(xa) spin_unlock_bh(&(xa)->xa_lock) #define xa_lock_irq(xa) spin_lock_irq(&(xa)->xa_lock) #define xa_unlock_irq(xa) spin_unlock_irq(&(xa)->xa_lock) #define xa_lock_irqsave(xa, flags) \ spin_lock_irqsave(&(xa)->xa_lock, flags) #define xa_unlock_irqrestore(xa, flags) \ spin_unlock_irqrestore(&(xa)->xa_lock, flags) #define xa_lock_nested(xa, subclass) \ spin_lock_nested(&(xa)->xa_lock, subclass) #define xa_lock_bh_nested(xa, subclass) \ spin_lock_bh_nested(&(xa)->xa_lock, subclass) #define xa_lock_irq_nested(xa, subclass) \ spin_lock_irq_nested(&(xa)->xa_lock, subclass) #define xa_lock_irqsave_nested(xa, flags, subclass) \ spin_lock_irqsave_nested(&(xa)->xa_lock, flags, subclass) /* * Versions of the normal API which require the caller to hold the * xa_lock. If the GFP flags allow it, they will drop the lock to * allocate memory, then reacquire it afterwards. These functions * may also re-enable interrupts if the XArray flags indicate the * locking should be interrupt safe. */ void *__xa_erase(struct xarray *, unsigned long index); void *__xa_store(struct xarray *, unsigned long index, void *entry, gfp_t); void *__xa_cmpxchg(struct xarray *, unsigned long index, void *old, void *entry, gfp_t); int __must_check __xa_insert(struct xarray *, unsigned long index, void *entry, gfp_t); int __must_check __xa_alloc(struct xarray *, u32 *id, void *entry, struct xa_limit, gfp_t); int __must_check __xa_alloc_cyclic(struct xarray *, u32 *id, void *entry, struct xa_limit, u32 *next, gfp_t); void __xa_set_mark(struct xarray *, unsigned long index, xa_mark_t); void __xa_clear_mark(struct xarray *, unsigned long index, xa_mark_t); /** * xa_store_bh() - Store this entry in the XArray. * @xa: XArray. * @index: Index into array. * @entry: New entry. * @gfp: Memory allocation flags. * * This function is like calling xa_store() except it disables softirqs * while holding the array lock. * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. * Return: The old entry at this index or xa_err() if an error happened. 
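 *
 * A minimal usage sketch (editorial; "softirq_cache", @index and @obj
 * are hypothetical):
 *
 *	void *old = xa_store_bh(&softirq_cache, index, obj, GFP_KERNEL);
 *
 *	if (xa_is_err(old))
 *		return xa_err(old);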
*/ static inline void *xa_store_bh(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) { void *curr; might_alloc(gfp); xa_lock_bh(xa); curr = __xa_store(xa, index, entry, gfp); xa_unlock_bh(xa); return curr; } /** * xa_store_irq() - Store this entry in the XArray. * @xa: XArray. * @index: Index into array. * @entry: New entry. * @gfp: Memory allocation flags. * * This function is like calling xa_store() except it disables interrupts * while holding the array lock. * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. * Return: The old entry at this index or xa_err() if an error happened. */ static inline void *xa_store_irq(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) { void *curr; might_alloc(gfp); xa_lock_irq(xa); curr = __xa_store(xa, index, entry, gfp); xa_unlock_irq(xa); return curr; } /** * xa_erase_bh() - Erase this entry from the XArray. * @xa: XArray. * @index: Index of entry. * * After this function returns, loading from @index will return %NULL. * If the index is part of a multi-index entry, all indices will be erased * and none of the entries will be part of a multi-index entry. * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. * Return: The entry which used to be at this index. */ static inline void *xa_erase_bh(struct xarray *xa, unsigned long index) { void *entry; xa_lock_bh(xa); entry = __xa_erase(xa, index); xa_unlock_bh(xa); return entry; } /** * xa_erase_irq() - Erase this entry from the XArray. * @xa: XArray. * @index: Index of entry. * * After this function returns, loading from @index will return %NULL. * If the index is part of a multi-index entry, all indices will be erased * and none of the entries will be part of a multi-index entry. * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. * Return: The entry which used to be at this index. */ static inline void *xa_erase_irq(struct xarray *xa, unsigned long index) { void *entry; xa_lock_irq(xa); entry = __xa_erase(xa, index); xa_unlock_irq(xa); return entry; } /** * xa_cmpxchg() - Conditionally replace an entry in the XArray. * @xa: XArray. * @index: Index into array. * @old: Old value to test against. * @entry: New value to place in array. * @gfp: Memory allocation flags. * * If the entry at @index is the same as @old, replace it with @entry. * If the return value is equal to @old, then the exchange was successful. * * Context: Any context. Takes and releases the xa_lock. May sleep * if the @gfp flags permit. * Return: The old value at this index or xa_err() if an error happened. */ static inline void *xa_cmpxchg(struct xarray *xa, unsigned long index, void *old, void *entry, gfp_t gfp) { void *curr; might_alloc(gfp); xa_lock(xa); curr = __xa_cmpxchg(xa, index, old, entry, gfp); xa_unlock(xa); return curr; } /** * xa_cmpxchg_bh() - Conditionally replace an entry in the XArray. * @xa: XArray. * @index: Index into array. * @old: Old value to test against. * @entry: New value to place in array. * @gfp: Memory allocation flags. * * This function is like calling xa_cmpxchg() except it disables softirqs * while holding the array lock. * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. May sleep if the @gfp flags permit. * Return: The old value at this index or xa_err() if an error happened. 
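 *
 * Editorial sketch: install @entry only if the slot at @index is still
 * empty ("my_xa" is hypothetical). A non-%NULL, non-error return means
 * some other entry was already present:
 *
 *	void *curr = xa_cmpxchg_bh(&my_xa, index, NULL, entry, GFP_KERNEL);
 *
 *	if (xa_is_err(curr))
 *		return xa_err(curr);
 *	if (curr)
 *		return -EBUSY;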
*/ static inline void *xa_cmpxchg_bh(struct xarray *xa, unsigned long index, void *old, void *entry, gfp_t gfp) { void *curr; might_alloc(gfp); xa_lock_bh(xa); curr = __xa_cmpxchg(xa, index, old, entry, gfp); xa_unlock_bh(xa); return curr; } /** * xa_cmpxchg_irq() - Conditionally replace an entry in the XArray. * @xa: XArray. * @index: Index into array. * @old: Old value to test against. * @entry: New value to place in array. * @gfp: Memory allocation flags. * * This function is like calling xa_cmpxchg() except it disables interrupts * while holding the array lock. * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. May sleep if the @gfp flags permit. * Return: The old value at this index or xa_err() if an error happened. */ static inline void *xa_cmpxchg_irq(struct xarray *xa, unsigned long index, void *old, void *entry, gfp_t gfp) { void *curr; might_alloc(gfp); xa_lock_irq(xa); curr = __xa_cmpxchg(xa, index, old, entry, gfp); xa_unlock_irq(xa); return curr; } /** * xa_insert() - Store this entry in the XArray unless another entry is * already present. * @xa: XArray. * @index: Index into array. * @entry: New entry. * @gfp: Memory allocation flags. * * Inserting a NULL entry will store a reserved entry (like xa_reserve()) * if no entry is present. Inserting will fail if a reserved entry is * present, even though loading from this index will return NULL. * * Context: Any context. Takes and releases the xa_lock. May sleep if * the @gfp flags permit. * Return: 0 if the store succeeded. -EBUSY if another entry was present. * -ENOMEM if memory could not be allocated. */ static inline int __must_check xa_insert(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) { int err; might_alloc(gfp); xa_lock(xa); err = __xa_insert(xa, index, entry, gfp); xa_unlock(xa); return err; } /** * xa_insert_bh() - Store this entry in the XArray unless another entry is * already present. * @xa: XArray. * @index: Index into array. * @entry: New entry. * @gfp: Memory allocation flags. * * Inserting a NULL entry will store a reserved entry (like xa_reserve()) * if no entry is present. Inserting will fail if a reserved entry is * present, even though loading from this index will return NULL. * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. May sleep if the @gfp flags permit. * Return: 0 if the store succeeded. -EBUSY if another entry was present. * -ENOMEM if memory could not be allocated. */ static inline int __must_check xa_insert_bh(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) { int err; might_alloc(gfp); xa_lock_bh(xa); err = __xa_insert(xa, index, entry, gfp); xa_unlock_bh(xa); return err; } /** * xa_insert_irq() - Store this entry in the XArray unless another entry is * already present. * @xa: XArray. * @index: Index into array. * @entry: New entry. * @gfp: Memory allocation flags. * * Inserting a NULL entry will store a reserved entry (like xa_reserve()) * if no entry is present. Inserting will fail if a reserved entry is * present, even though loading from this index will return NULL. * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. May sleep if the @gfp flags permit. * Return: 0 if the store succeeded. -EBUSY if another entry was present. * -ENOMEM if memory could not be allocated. 
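 *
 * Unlike xa_store_irq(), the return value is a plain errno rather than the
 * old entry. A minimal editorial sketch ("objects", @id and @obj are
 * hypothetical):
 *
 *	int err = xa_insert_irq(&objects, id, obj, GFP_KERNEL);
 *
 *	if (err)
 *		return err;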
*/ static inline int __must_check xa_insert_irq(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) { int err; might_alloc(gfp); xa_lock_irq(xa); err = __xa_insert(xa, index, entry, gfp); xa_unlock_irq(xa); return err; } /** * xa_alloc() - Find somewhere to store this entry in the XArray. * @xa: XArray. * @id: Pointer to ID. * @entry: New entry. * @limit: Range of ID to allocate. * @gfp: Memory allocation flags. * * Finds an empty entry in @xa between @limit.min and @limit.max, * stores the index into the @id pointer, then stores the entry at * that index. A concurrent lookup will not see an uninitialised @id. * * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * * Context: Any context. Takes and releases the xa_lock. May sleep if * the @gfp flags permit. * Return: 0 on success, -ENOMEM if memory could not be allocated or * -EBUSY if there are no free entries in @limit. */ static inline __must_check int xa_alloc(struct xarray *xa, u32 *id, void *entry, struct xa_limit limit, gfp_t gfp) { int err; might_alloc(gfp); xa_lock(xa); err = __xa_alloc(xa, id, entry, limit, gfp); xa_unlock(xa); return err; } /** * xa_alloc_bh() - Find somewhere to store this entry in the XArray. * @xa: XArray. * @id: Pointer to ID. * @entry: New entry. * @limit: Range of ID to allocate. * @gfp: Memory allocation flags. * * Finds an empty entry in @xa between @limit.min and @limit.max, * stores the index into the @id pointer, then stores the entry at * that index. A concurrent lookup will not see an uninitialised @id. * * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. May sleep if the @gfp flags permit. * Return: 0 on success, -ENOMEM if memory could not be allocated or * -EBUSY if there are no free entries in @limit. */ static inline int __must_check xa_alloc_bh(struct xarray *xa, u32 *id, void *entry, struct xa_limit limit, gfp_t gfp) { int err; might_alloc(gfp); xa_lock_bh(xa); err = __xa_alloc(xa, id, entry, limit, gfp); xa_unlock_bh(xa); return err; } /** * xa_alloc_irq() - Find somewhere to store this entry in the XArray. * @xa: XArray. * @id: Pointer to ID. * @entry: New entry. * @limit: Range of ID to allocate. * @gfp: Memory allocation flags. * * Finds an empty entry in @xa between @limit.min and @limit.max, * stores the index into the @id pointer, then stores the entry at * that index. A concurrent lookup will not see an uninitialised @id. * * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. May sleep if the @gfp flags permit. * Return: 0 on success, -ENOMEM if memory could not be allocated or * -EBUSY if there are no free entries in @limit. */ static inline int __must_check xa_alloc_irq(struct xarray *xa, u32 *id, void *entry, struct xa_limit limit, gfp_t gfp) { int err; might_alloc(gfp); xa_lock_irq(xa); err = __xa_alloc(xa, id, entry, limit, gfp); xa_unlock_irq(xa); return err; } /** * xa_alloc_cyclic() - Find somewhere to store this entry in the XArray. * @xa: XArray. * @id: Pointer to ID. * @entry: New entry. * @limit: Range of allocated ID. * @next: Pointer to next ID to allocate. * @gfp: Memory allocation flags. * * Finds an empty entry in @xa between @limit.min and @limit.max, * stores the index into the @id pointer, then stores the entry at * that index. 
A concurrent lookup will not see an uninitialised @id. * The search for an empty entry will start at @next and will wrap * around if necessary. * * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * * Note that callers interested in whether wrapping has occurred should * use __xa_alloc_cyclic() instead. * * Context: Any context. Takes and releases the xa_lock. May sleep if * the @gfp flags permit. * Return: 0 if the allocation succeeded, -ENOMEM if memory could not be * allocated or -EBUSY if there are no free entries in @limit. */ static inline int xa_alloc_cyclic(struct xarray *xa, u32 *id, void *entry, struct xa_limit limit, u32 *next, gfp_t gfp) { int err; might_alloc(gfp); xa_lock(xa); err = __xa_alloc_cyclic(xa, id, entry, limit, next, gfp); xa_unlock(xa); return err < 0 ? err : 0; } /** * xa_alloc_cyclic_bh() - Find somewhere to store this entry in the XArray. * @xa: XArray. * @id: Pointer to ID. * @entry: New entry. * @limit: Range of allocated ID. * @next: Pointer to next ID to allocate. * @gfp: Memory allocation flags. * * Finds an empty entry in @xa between @limit.min and @limit.max, * stores the index into the @id pointer, then stores the entry at * that index. A concurrent lookup will not see an uninitialised @id. * The search for an empty entry will start at @next and will wrap * around if necessary. * * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * * Note that callers interested in whether wrapping has occurred should * use __xa_alloc_cyclic() instead. * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. May sleep if the @gfp flags permit. * Return: 0 if the allocation succeeded, -ENOMEM if memory could not be * allocated or -EBUSY if there are no free entries in @limit. */ static inline int xa_alloc_cyclic_bh(struct xarray *xa, u32 *id, void *entry, struct xa_limit limit, u32 *next, gfp_t gfp) { int err; might_alloc(gfp); xa_lock_bh(xa); err = __xa_alloc_cyclic(xa, id, entry, limit, next, gfp); xa_unlock_bh(xa); return err < 0 ? err : 0; } /** * xa_alloc_cyclic_irq() - Find somewhere to store this entry in the XArray. * @xa: XArray. * @id: Pointer to ID. * @entry: New entry. * @limit: Range of allocated ID. * @next: Pointer to next ID to allocate. * @gfp: Memory allocation flags. * * Finds an empty entry in @xa between @limit.min and @limit.max, * stores the index into the @id pointer, then stores the entry at * that index. A concurrent lookup will not see an uninitialised @id. * The search for an empty entry will start at @next and will wrap * around if necessary. * * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * * Note that callers interested in whether wrapping has occurred should * use __xa_alloc_cyclic() instead. * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. May sleep if the @gfp flags permit. * Return: 0 if the allocation succeeded, -ENOMEM if memory could not be * allocated or -EBUSY if there are no free entries in @limit. */ static inline int xa_alloc_cyclic_irq(struct xarray *xa, u32 *id, void *entry, struct xa_limit limit, u32 *next, gfp_t gfp) { int err; might_alloc(gfp); xa_lock_irq(xa); err = __xa_alloc_cyclic(xa, id, entry, limit, next, gfp); xa_unlock_irq(xa); return err < 0 ? err : 0; } /** * xa_reserve() - Reserve this index in the XArray. * @xa: XArray. * @index: Index into array. * @gfp: Memory allocation flags. 
* * Ensures there is somewhere to store an entry at @index in the array. * If there is already something stored at @index, this function does * nothing. If there was nothing there, the entry is marked as reserved. * Loading from a reserved entry returns a %NULL pointer. * * If you do not use the entry that you have reserved, call xa_release() * or xa_erase() to free any unnecessary memory. * * Context: Any context. Takes and releases the xa_lock. * May sleep if the @gfp flags permit. * Return: 0 if the reservation succeeded or -ENOMEM if it failed. */ static inline __must_check int xa_reserve(struct xarray *xa, unsigned long index, gfp_t gfp) { return xa_err(xa_cmpxchg(xa, index, NULL, XA_ZERO_ENTRY, gfp)); } /** * xa_reserve_bh() - Reserve this index in the XArray. * @xa: XArray. * @index: Index into array. * @gfp: Memory allocation flags. * * A softirq-disabling version of xa_reserve(). * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. * Return: 0 if the reservation succeeded or -ENOMEM if it failed. */ static inline __must_check int xa_reserve_bh(struct xarray *xa, unsigned long index, gfp_t gfp) { return xa_err(xa_cmpxchg_bh(xa, index, NULL, XA_ZERO_ENTRY, gfp)); } /** * xa_reserve_irq() - Reserve this index in the XArray. * @xa: XArray. * @index: Index into array. * @gfp: Memory allocation flags. * * An interrupt-disabling version of xa_reserve(). * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. * Return: 0 if the reservation succeeded or -ENOMEM if it failed. */ static inline __must_check int xa_reserve_irq(struct xarray *xa, unsigned long index, gfp_t gfp) { return xa_err(xa_cmpxchg_irq(xa, index, NULL, XA_ZERO_ENTRY, gfp)); } /** * xa_release() - Release a reserved entry. * @xa: XArray. * @index: Index of entry. * * After calling xa_reserve(), you can call this function to release the * reservation. If the entry at @index has been stored to, this function * will do nothing. */ static inline void xa_release(struct xarray *xa, unsigned long index) { xa_cmpxchg(xa, index, XA_ZERO_ENTRY, NULL, 0); } /* Everything below here is the Advanced API. Proceed with caution. */ /* * The xarray is constructed out of a set of 'chunks' of pointers. Choosing * the best chunk size requires some tradeoffs. A power of two recommends * itself so that we can walk the tree based purely on shifts and masks. * Generally, the larger the better; as the number of slots per level of the * tree increases, the less tall the tree needs to be. But that needs to be * balanced against the memory consumption of each node. On a 64-bit system, * xa_node is currently 576 bytes, and we get 7 of them per 4kB page. If we * doubled the number of slots per node, we'd get only 3 nodes per 4kB page. */ #ifndef XA_CHUNK_SHIFT #define XA_CHUNK_SHIFT (IS_ENABLED(CONFIG_BASE_SMALL) ? 4 : 6) #endif #define XA_CHUNK_SIZE (1UL << XA_CHUNK_SHIFT) #define XA_CHUNK_MASK (XA_CHUNK_SIZE - 1) #define XA_MAX_MARKS 3 #define XA_MARK_LONGS BITS_TO_LONGS(XA_CHUNK_SIZE) /* * @count is the count of every non-NULL element in the ->slots array * whether that is a value entry, a retry entry, a user pointer, * a sibling entry or a pointer to the next level of the tree. * @nr_values is the count of every element in ->slots which is * either a value entry or a sibling of a value entry. 
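 *
 * Editorial example: a node whose ->slots hold three value entries and two
 * pointers to child nodes has @count == 5 and @nr_values == 3.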
*/ struct xa_node { unsigned char shift; /* Bits remaining in each slot */ unsigned char offset; /* Slot offset in parent */ unsigned char count; /* Total entry count */ unsigned char nr_values; /* Value entry count */ struct xa_node __rcu *parent; /* NULL at top of tree */ struct xarray *array; /* The array we belong to */ union { struct list_head private_list; /* For tree user */ struct rcu_head rcu_head; /* Used when freeing node */ }; void __rcu *slots[XA_CHUNK_SIZE]; union { unsigned long tags[XA_MAX_MARKS][XA_MARK_LONGS]; unsigned long marks[XA_MAX_MARKS][XA_MARK_LONGS]; }; }; void xa_dump(const struct xarray *); void xa_dump_node(const struct xa_node *); #ifdef XA_DEBUG #define XA_BUG_ON(xa, x) do { \ if (x) { \ xa_dump(xa); \ BUG(); \ } \ } while (0) #define XA_NODE_BUG_ON(node, x) do { \ if (x) { \ if (node) xa_dump_node(node); \ BUG(); \ } \ } while (0) #else #define XA_BUG_ON(xa, x) do { } while (0) #define XA_NODE_BUG_ON(node, x) do { } while (0) #endif /* Private */ static inline void *xa_head(const struct xarray *xa) { return rcu_dereference_check(xa->xa_head, lockdep_is_held(&xa->xa_lock)); } /* Private */ static inline void *xa_head_locked(const struct xarray *xa) { return rcu_dereference_protected(xa->xa_head, lockdep_is_held(&xa->xa_lock)); } /* Private */ static inline void *xa_entry(const struct xarray *xa, const struct xa_node *node, unsigned int offset) { XA_NODE_BUG_ON(node, offset >= XA_CHUNK_SIZE); return rcu_dereference_check(node->slots[offset], lockdep_is_held(&xa->xa_lock)); } /* Private */ static inline void *xa_entry_locked(const struct xarray *xa, const struct xa_node *node, unsigned int offset) { XA_NODE_BUG_ON(node, offset >= XA_CHUNK_SIZE); return rcu_dereference_protected(node->slots[offset], lockdep_is_held(&xa->xa_lock)); } /* Private */ static inline struct xa_node *xa_parent(const struct xarray *xa, const struct xa_node *node) { return rcu_dereference_check(node->parent, lockdep_is_held(&xa->xa_lock)); } /* Private */ static inline struct xa_node *xa_parent_locked(const struct xarray *xa, const struct xa_node *node) { return rcu_dereference_protected(node->parent, lockdep_is_held(&xa->xa_lock)); } /* Private */ static inline void *xa_mk_node(const struct xa_node *node) { return (void *)((unsigned long)node | 2); } /* Private */ static inline struct xa_node *xa_to_node(const void *entry) { return (struct xa_node *)((unsigned long)entry - 2); } /* Private */ static inline bool xa_is_node(const void *entry) { return xa_is_internal(entry) && (unsigned long)entry > 4096; } /* Private */ static inline void *xa_mk_sibling(unsigned int offset) { return xa_mk_internal(offset); } /* Private */ static inline unsigned long xa_to_sibling(const void *entry) { return xa_to_internal(entry); } /** * xa_is_sibling() - Is the entry a sibling entry? * @entry: Entry retrieved from the XArray * * Return: %true if the entry is a sibling entry. */ static inline bool xa_is_sibling(const void *entry) { return IS_ENABLED(CONFIG_XARRAY_MULTI) && xa_is_internal(entry) && (entry < xa_mk_sibling(XA_CHUNK_SIZE - 1)); } #define XA_RETRY_ENTRY xa_mk_internal(256) /** * xa_is_retry() - Is the entry a retry entry? * @entry: Entry retrieved from the XArray * * Return: %true if the entry is a retry entry. */ static inline bool xa_is_retry(const void *entry) { return unlikely(entry == XA_RETRY_ENTRY); } /** * xa_is_advanced() - Is the entry only permitted for the advanced API? * @entry: Entry to be stored in the XArray. * * Return: %true if the entry cannot be stored by the normal API. 
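 *
 * For example, xa_is_advanced(XA_RETRY_ENTRY) is %true, so a retry entry
 * may only be stored through the advanced API (editorial note).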
*/ static inline bool xa_is_advanced(const void *entry) { return xa_is_internal(entry) && (entry <= XA_RETRY_ENTRY); } /** * typedef xa_update_node_t - A callback function from the XArray. * @node: The node which is being processed * * This function is called every time the XArray updates the count of * present and value entries in a node. It allows advanced users to * maintain the private_list in the node. * * Context: The xa_lock is held and interrupts may be disabled. * Implementations should not drop the xa_lock, nor re-enable * interrupts. */ typedef void (*xa_update_node_t)(struct xa_node *node); void xa_delete_node(struct xa_node *, xa_update_node_t); /* * The xa_state is opaque to its users. It contains various different pieces * of state involved in the current operation on the XArray. It should be * declared on the stack and passed between the various internal routines. * The various elements in it should not be accessed directly, but only * through the provided accessor functions. The below documentation is for * the benefit of those working on the code, not for users of the XArray. * * @xa_node usually points to the xa_node containing the slot we're operating * on (and @xa_offset is the offset in the slots array). If there is a * single entry in the array at index 0, there are no allocated xa_nodes to * point to, and so we store %NULL in @xa_node. @xa_node is set to * the value %XAS_RESTART if the xa_state is not walked to the correct * position in the tree of nodes for this operation. If an error occurs * during an operation, it is set to an %XAS_ERROR value. If we run off the * end of the allocated nodes, it is set to %XAS_BOUNDS. */ struct xa_state { struct xarray *xa; unsigned long xa_index; unsigned char xa_shift; unsigned char xa_sibs; unsigned char xa_offset; unsigned char xa_pad; /* Helps gcc generate better code */ struct xa_node *xa_node; struct xa_node *xa_alloc; xa_update_node_t xa_update; struct list_lru *xa_lru; }; /* * We encode errnos in the xas->xa_node. If an error has happened, we need to * drop the lock to fix it, and once we've done so the xa_state is invalid. */ #define XA_ERROR(errno) ((struct xa_node *)(((unsigned long)errno << 2) | 2UL)) #define XAS_BOUNDS ((struct xa_node *)1UL) #define XAS_RESTART ((struct xa_node *)3UL) #define __XA_STATE(array, index, shift, sibs) { \ .xa = array, \ .xa_index = index, \ .xa_shift = shift, \ .xa_sibs = sibs, \ .xa_offset = 0, \ .xa_pad = 0, \ .xa_node = XAS_RESTART, \ .xa_alloc = NULL, \ .xa_update = NULL, \ .xa_lru = NULL, \ } /** * XA_STATE() - Declare an XArray operation state. * @name: Name of this operation state (usually xas). * @array: Array to operate on. * @index: Initial index of interest. * * Declare and initialise an xa_state on the stack. */ #define XA_STATE(name, array, index) \ struct xa_state name = __XA_STATE(array, index, 0, 0) /** * XA_STATE_ORDER() - Declare an XArray operation state. * @name: Name of this operation state (usually xas). * @array: Array to operate on. * @index: Initial index of interest. * @order: Order of entry. * * Declare and initialise an xa_state on the stack. 
This variant of * XA_STATE() allows you to specify the 'order' of the element you * want to operate on.` */ #define XA_STATE_ORDER(name, array, index, order) \ struct xa_state name = __XA_STATE(array, \ (index >> order) << order, \ order - (order % XA_CHUNK_SHIFT), \ (1U << (order % XA_CHUNK_SHIFT)) - 1) #define xas_marked(xas, mark) xa_marked((xas)->xa, (mark)) #define xas_trylock(xas) xa_trylock((xas)->xa) #define xas_lock(xas) xa_lock((xas)->xa) #define xas_unlock(xas) xa_unlock((xas)->xa) #define xas_lock_bh(xas) xa_lock_bh((xas)->xa) #define xas_unlock_bh(xas) xa_unlock_bh((xas)->xa) #define xas_lock_irq(xas) xa_lock_irq((xas)->xa) #define xas_unlock_irq(xas) xa_unlock_irq((xas)->xa) #define xas_lock_irqsave(xas, flags) \ xa_lock_irqsave((xas)->xa, flags) #define xas_unlock_irqrestore(xas, flags) \ xa_unlock_irqrestore((xas)->xa, flags) /** * xas_error() - Return an errno stored in the xa_state. * @xas: XArray operation state. * * Return: 0 if no error has been noted. A negative errno if one has. */ static inline int xas_error(const struct xa_state *xas) { return xa_err(xas->xa_node); } /** * xas_set_err() - Note an error in the xa_state. * @xas: XArray operation state. * @err: Negative error number. * * Only call this function with a negative @err; zero or positive errors * will probably not behave the way you think they should. If you want * to clear the error from an xa_state, use xas_reset(). */ static inline void xas_set_err(struct xa_state *xas, long err) { xas->xa_node = XA_ERROR(err); } /** * xas_invalid() - Is the xas in a retry or error state? * @xas: XArray operation state. * * Return: %true if the xas cannot be used for operations. */ static inline bool xas_invalid(const struct xa_state *xas) { return (unsigned long)xas->xa_node & 3; } /** * xas_valid() - Is the xas a valid cursor into the array? * @xas: XArray operation state. * * Return: %true if the xas can be used for operations. */ static inline bool xas_valid(const struct xa_state *xas) { return !xas_invalid(xas); } /** * xas_is_node() - Does the xas point to a node? * @xas: XArray operation state. * * Return: %true if the xas currently references a node. */ static inline bool xas_is_node(const struct xa_state *xas) { return xas_valid(xas) && xas->xa_node; } /* True if the pointer is something other than a node */ static inline bool xas_not_node(struct xa_node *node) { return ((unsigned long)node & 3) || !node; } /* True if the node represents RESTART or an error */ static inline bool xas_frozen(struct xa_node *node) { return (unsigned long)node & 2; } /* True if the node represents head-of-tree, RESTART or BOUNDS */ static inline bool xas_top(struct xa_node *node) { return node <= XAS_RESTART; } /** * xas_reset() - Reset an XArray operation state. * @xas: XArray operation state. * * Resets the error or walk state of the @xas so future walks of the * array will start from the root. Use this if you have dropped the * xarray lock and want to reuse the xa_state. * * Context: Any context. */ static inline void xas_reset(struct xa_state *xas) { xas->xa_node = XAS_RESTART; } /** * xas_retry() - Retry the operation if appropriate. * @xas: XArray operation state. * @entry: Entry from xarray. * * The advanced functions may sometimes return an internal entry, such as * a retry entry or a zero entry. This function sets up the @xas to restart * the walk from the head of the array if needed. * * Context: Any context. * Return: true if the operation needs to be retried. 
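 *
 * A common lookup pattern under the RCU lock (editorial sketch, modelled
 * on similar loops elsewhere in the kernel; "my_xa" and @index are
 * hypothetical):
 *
 *	XA_STATE(xas, &my_xa, index);
 *	void *entry;
 *
 *	rcu_read_lock();
 *	do {
 *		entry = xas_load(&xas);
 *	} while (xas_retry(&xas, entry));
 *	rcu_read_unlock();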
*/ static inline bool xas_retry(struct xa_state *xas, const void *entry) { if (xa_is_zero(entry)) return true; if (!xa_is_retry(entry)) return false; xas_reset(xas); return true; } void *xas_load(struct xa_state *); void *xas_store(struct xa_state *, void *entry); void *xas_find(struct xa_state *, unsigned long max); void *xas_find_conflict(struct xa_state *); bool xas_get_mark(const struct xa_state *, xa_mark_t); void xas_set_mark(const struct xa_state *, xa_mark_t); void xas_clear_mark(const struct xa_state *, xa_mark_t); void *xas_find_marked(struct xa_state *, unsigned long max, xa_mark_t); void xas_init_marks(const struct xa_state *); bool xas_nomem(struct xa_state *, gfp_t); void xas_destroy(struct xa_state *); void xas_pause(struct xa_state *); void xas_create_range(struct xa_state *); #ifdef CONFIG_XARRAY_MULTI int xa_get_order(struct xarray *, unsigned long index); int xas_get_order(struct xa_state *xas); void xas_split(struct xa_state *, void *entry, unsigned int order); void xas_split_alloc(struct xa_state *, void *entry, unsigned int order, gfp_t); void xas_try_split(struct xa_state *xas, void *entry, unsigned int order); unsigned int xas_try_split_min_order(unsigned int order); #else static inline int xa_get_order(struct xarray *xa, unsigned long index) { return 0; } static inline int xas_get_order(struct xa_state *xas) { return 0; } static inline void xas_split(struct xa_state *xas, void *entry, unsigned int order) { xas_store(xas, entry); } static inline void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order, gfp_t gfp) { } static inline void xas_try_split(struct xa_state *xas, void *entry, unsigned int order) { } static inline unsigned int xas_try_split_min_order(unsigned int order) { return 0; } #endif /** * xas_reload() - Refetch an entry from the xarray. * @xas: XArray operation state. * * Use this function to check that a previously loaded entry still has * the same value. This is useful for the lockless pagecache lookup where * we walk the array with only the RCU lock to protect us, lock the page, * then check that the page hasn't moved since we looked it up. * * The caller guarantees that @xas is still valid. If it may be in an * error or restart state, call xas_load() instead. * * Return: The entry at this location in the xarray. */ static inline void *xas_reload(struct xa_state *xas) { struct xa_node *node = xas->xa_node; void *entry; char offset; if (!node) return xa_head(xas->xa); if (IS_ENABLED(CONFIG_XARRAY_MULTI)) { offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK; entry = xa_entry(xas->xa, node, offset); if (!xa_is_sibling(entry)) return entry; offset = xa_to_sibling(entry); } else { offset = xas->xa_offset; } return xa_entry(xas->xa, node, offset); } /** * xas_set() - Set up XArray operation state for a different index. * @xas: XArray operation state. * @index: New index into the XArray. * * Move the operation state to refer to a different index. This will * have the effect of starting a walk from the top; see xas_next() * to move to an adjacent index. */ static inline void xas_set(struct xa_state *xas, unsigned long index) { xas->xa_index = index; xas->xa_node = XAS_RESTART; } /** * xas_advance() - Skip over sibling entries. * @xas: XArray operation state. * @index: Index of last sibling entry. * * Move the operation state to refer to the last sibling entry. * This is useful for loops that normally want to see sibling * entries but sometimes want to skip them. 
Use xas_set() if you * want to move to an index which is not part of this entry. */ static inline void xas_advance(struct xa_state *xas, unsigned long index) { unsigned char shift = xas_is_node(xas) ? xas->xa_node->shift : 0; xas->xa_index = index; xas->xa_offset = (index >> shift) & XA_CHUNK_MASK; } /** * xas_set_order() - Set up XArray operation state for a multislot entry. * @xas: XArray operation state. * @index: Target of the operation. * @order: Entry occupies 2^@order indices. */ static inline void xas_set_order(struct xa_state *xas, unsigned long index, unsigned int order) { #ifdef CONFIG_XARRAY_MULTI xas->xa_index = order < BITS_PER_LONG ? (index >> order) << order : 0; xas->xa_shift = order - (order % XA_CHUNK_SHIFT); xas->xa_sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1; xas->xa_node = XAS_RESTART; #else BUG_ON(order > 0); xas_set(xas, index); #endif } /** * xas_set_update() - Set up XArray operation state for a callback. * @xas: XArray operation state. * @update: Function to call when updating a node. * * The XArray can notify a caller after it has updated an xa_node. * This is advanced functionality and is only needed by the page * cache and swap cache. */ static inline void xas_set_update(struct xa_state *xas, xa_update_node_t update) { xas->xa_update = update; } static inline void xas_set_lru(struct xa_state *xas, struct list_lru *lru) { xas->xa_lru = lru; } /** * xas_next_entry() - Advance iterator to next present entry. * @xas: XArray operation state. * @max: Highest index to return. * * xas_next_entry() is an inline function to optimise xarray traversal for * speed. It is equivalent to calling xas_find(), and will call xas_find() * for all the hard cases. * * Return: The next present entry after the one currently referred to by @xas. */ static inline void *xas_next_entry(struct xa_state *xas, unsigned long max) { struct xa_node *node = xas->xa_node; void *entry; if (unlikely(xas_not_node(node) || node->shift || xas->xa_offset != (xas->xa_index & XA_CHUNK_MASK))) return xas_find(xas, max); do { if (unlikely(xas->xa_index >= max)) return xas_find(xas, max); if (unlikely(xas->xa_offset == XA_CHUNK_MASK)) return xas_find(xas, max); entry = xa_entry(xas->xa, node, xas->xa_offset + 1); if (unlikely(xa_is_internal(entry))) return xas_find(xas, max); xas->xa_offset++; xas->xa_index++; } while (!entry); return entry; } /* Private */ static inline unsigned int xas_find_chunk(struct xa_state *xas, bool advance, xa_mark_t mark) { unsigned long *addr = xas->xa_node->marks[(__force unsigned)mark]; unsigned int offset = xas->xa_offset; if (advance) offset++; if (XA_CHUNK_SIZE == BITS_PER_LONG) { if (offset < XA_CHUNK_SIZE) { unsigned long data = *addr & (~0UL << offset); if (data) return __ffs(data); } return XA_CHUNK_SIZE; } return find_next_bit(addr, XA_CHUNK_SIZE, offset); } /** * xas_next_marked() - Advance iterator to next marked entry. * @xas: XArray operation state. * @max: Highest index to return. * @mark: Mark to search for. * * xas_next_marked() is an inline function to optimise xarray traversal for * speed. It is equivalent to calling xas_find_marked(), and will call * xas_find_marked() for all the hard cases. * * Return: The next marked entry after the one currently referred to by @xas. 
*/ static inline void *xas_next_marked(struct xa_state *xas, unsigned long max, xa_mark_t mark) { struct xa_node *node = xas->xa_node; void *entry; unsigned int offset; if (unlikely(xas_not_node(node) || node->shift)) return xas_find_marked(xas, max, mark); offset = xas_find_chunk(xas, true, mark); xas->xa_offset = offset; xas->xa_index = (xas->xa_index & ~XA_CHUNK_MASK) + offset; if (xas->xa_index > max) return NULL; if (offset == XA_CHUNK_SIZE) return xas_find_marked(xas, max, mark); entry = xa_entry(xas->xa, node, offset); if (!entry) return xas_find_marked(xas, max, mark); return entry; } /* * If iterating while holding a lock, drop the lock and reschedule * every %XA_CHECK_SCHED loops. */ enum { XA_CHECK_SCHED = 4096, }; /** * xas_for_each() - Iterate over a range of an XArray. * @xas: XArray operation state. * @entry: Entry retrieved from the array. * @max: Maximum index to retrieve from array. * * The loop body will be executed for each entry present in the xarray * between the current xas position and @max. @entry will be set to * the entry retrieved from the xarray. It is safe to delete entries * from the array in the loop body. You should hold either the RCU lock * or the xa_lock while iterating. If you need to drop the lock, call * xas_pause() first. */ #define xas_for_each(xas, entry, max) \ for (entry = xas_find(xas, max); entry; \ entry = xas_next_entry(xas, max)) /** * xas_for_each_marked() - Iterate over a range of an XArray. * @xas: XArray operation state. * @entry: Entry retrieved from the array. * @max: Maximum index to retrieve from array. * @mark: Mark to search for. * * The loop body will be executed for each marked entry in the xarray * between the current xas position and @max. @entry will be set to * the entry retrieved from the xarray. It is safe to delete entries * from the array in the loop body. You should hold either the RCU lock * or the xa_lock while iterating. If you need to drop the lock, call * xas_pause() first. */ #define xas_for_each_marked(xas, entry, max, mark) \ for (entry = xas_find_marked(xas, max, mark); entry; \ entry = xas_next_marked(xas, max, mark)) /** * xas_for_each_conflict() - Iterate over a range of an XArray. * @xas: XArray operation state. * @entry: Entry retrieved from the array. * * The loop body will be executed for each entry in the XArray that * lies within the range specified by @xas. If the loop terminates * normally, @entry will be %NULL. The user may break out of the loop, * which will leave @entry set to the conflicting entry. The caller * may also call xa_set_err() to exit the loop while setting an error * to record the reason. */ #define xas_for_each_conflict(xas, entry) \ while ((entry = xas_find_conflict(xas))) void *__xas_next(struct xa_state *); void *__xas_prev(struct xa_state *); /** * xas_prev() - Move iterator to previous index. * @xas: XArray operation state. * * If the @xas was in an error state, it will remain in an error state * and this function will return %NULL. If the @xas has never been walked, * it will have the effect of calling xas_load(). Otherwise one will be * subtracted from the index and the state will be walked to the correct * location in the array for the next operation. * * If the iterator was referencing index 0, this function wraps * around to %ULONG_MAX. * * Return: The entry at the new index. This may be %NULL or an internal * entry. 
 */
static inline void *xas_prev(struct xa_state *xas)
{
	struct xa_node *node = xas->xa_node;

	if (unlikely(xas_not_node(node) || node->shift ||
				xas->xa_offset == 0))
		return __xas_prev(xas);

	xas->xa_index--;
	xas->xa_offset--;
	return xa_entry(xas->xa, node, xas->xa_offset);
}

/**
 * xas_next() - Move state to next index.
 * @xas: XArray operation state.
 *
 * If the @xas was in an error state, it will remain in an error state
 * and this function will return %NULL. If the @xas has never been walked,
 * it will have the effect of calling xas_load(). Otherwise one will be
 * added to the index and the state will be walked to the correct
 * location in the array for the next operation.
 *
 * If the iterator was referencing index %ULONG_MAX, this function wraps
 * around to 0.
 *
 * Return: The entry at the new index. This may be %NULL or an internal
 * entry.
 */
static inline void *xas_next(struct xa_state *xas)
{
	struct xa_node *node = xas->xa_node;

	if (unlikely(xas_not_node(node) || node->shift ||
				xas->xa_offset == XA_CHUNK_MASK))
		return __xas_next(xas);

	xas->xa_index++;
	xas->xa_offset++;
	return xa_entry(xas->xa, node, xas->xa_offset);
}
#endif /* _LINUX_XARRAY_H */
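/*
 * Editorial usage sketch: a minimal allocating XArray, pulling together
 * DEFINE_XARRAY_ALLOC(), xa_alloc(), xa_load(), xa_erase() and
 * xa_for_each() as documented above. The "session" structure and all
 * names are hypothetical; a real user would also need reference counting
 * or locking to keep looked-up objects alive while they are used.
 */
#include <linux/slab.h>
#include <linux/xarray.h>

struct session {
	u32 id;
	/* ... user data ... */
};

static DEFINE_XARRAY_ALLOC(sessions);	/* IDs are allocated from 0 upwards */

static int session_create(struct session **out)
{
	struct session *s = kzalloc(sizeof(*s), GFP_KERNEL);
	int err;

	if (!s)
		return -ENOMEM;

	/* Find a free index in [0, UINT_MAX] and publish @s there. */
	err = xa_alloc(&sessions, &s->id, s, xa_limit_32b, GFP_KERNEL);
	if (err) {	/* -EBUSY if the range is full, -ENOMEM otherwise */
		kfree(s);
		return err;
	}
	*out = s;
	return 0;
}

static struct session *session_lookup(u32 id)
{
	return xa_load(&sessions, id);	/* NULL if nothing is stored at @id */
}

static void session_destroy(struct session *s)
{
	xa_erase(&sessions, s->id);
	kfree(s);
}

static void session_destroy_all(void)
{
	struct session *s;
	unsigned long id;

	/* It is safe to erase entries while iterating. */
	xa_for_each(&sessions, id, s)
		session_destroy(s);
}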
/*
 * Copyright (c) 2005-2006 Intel Corporation. All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses. You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main directory of this source tree, or the
 * OpenIB.org BSD license below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 * - Redistributions of source code must retain the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer.
 *
 * - Redistributions in binary form must reproduce the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer in the documentation and/or other materials
 *   provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
*/ #include <linux/completion.h> #include <linux/file.h> #include <linux/mutex.h> #include <linux/poll.h> #include <linux/sched.h> #include <linux/idr.h> #include <linux/in.h> #include <linux/in6.h> #include <linux/miscdevice.h> #include <linux/slab.h> #include <linux/sysctl.h> #include <linux/module.h> #include <linux/nsproxy.h> #include <linux/nospec.h> #include <rdma/rdma_user_cm.h> #include <rdma/ib_marshall.h> #include <rdma/rdma_cm.h> #include <rdma/rdma_cm_ib.h> #include <rdma/ib_addr.h> #include <rdma/ib.h> #include <rdma/ib_cm.h> #include <rdma/rdma_netlink.h> #include "core_priv.h" MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access"); MODULE_LICENSE("Dual BSD/GPL"); static unsigned int max_backlog = 1024; static struct ctl_table_header *ucma_ctl_table_hdr; static struct ctl_table ucma_ctl_table[] = { { .procname = "max_backlog", .data = &max_backlog, .maxlen = sizeof max_backlog, .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_INT_MAX, }, }; struct ucma_file { struct mutex mut; struct file *filp; struct list_head ctx_list; struct list_head event_list; wait_queue_head_t poll_wait; }; struct ucma_context { u32 id; struct completion comp; refcount_t ref; int events_reported; atomic_t backlog; struct ucma_file *file; struct rdma_cm_id *cm_id; struct mutex mutex; u64 uid; struct list_head list; struct list_head mc_list; struct work_struct close_work; }; struct ucma_multicast { struct ucma_context *ctx; u32 id; int events_reported; u64 uid; u8 join_state; struct list_head list; struct sockaddr_storage addr; }; struct ucma_event { struct ucma_context *ctx; struct ucma_context *conn_req_ctx; struct ucma_multicast *mc; struct list_head list; struct rdma_ucm_event_resp resp; }; static DEFINE_XARRAY_ALLOC(ctx_table); static DEFINE_XARRAY_ALLOC(multicast_table); static const struct file_operations ucma_fops; static int ucma_destroy_private_ctx(struct ucma_context *ctx); static inline struct ucma_context *_ucma_find_context(int id, struct ucma_file *file) { struct ucma_context *ctx; ctx = xa_load(&ctx_table, id); if (!ctx) ctx = ERR_PTR(-ENOENT); else if (ctx->file != file) ctx = ERR_PTR(-EINVAL); return ctx; } static struct ucma_context *ucma_get_ctx(struct ucma_file *file, int id) { struct ucma_context *ctx; xa_lock(&ctx_table); ctx = _ucma_find_context(id, file); if (!IS_ERR(ctx)) if (!refcount_inc_not_zero(&ctx->ref)) ctx = ERR_PTR(-ENXIO); xa_unlock(&ctx_table); return ctx; } static void ucma_put_ctx(struct ucma_context *ctx) { if (refcount_dec_and_test(&ctx->ref)) complete(&ctx->comp); } /* * Same as ucm_get_ctx but requires that ->cm_id->device is valid, eg that the * CM_ID is bound. */ static struct ucma_context *ucma_get_ctx_dev(struct ucma_file *file, int id) { struct ucma_context *ctx = ucma_get_ctx(file, id); if (IS_ERR(ctx)) return ctx; if (!ctx->cm_id->device) { ucma_put_ctx(ctx); return ERR_PTR(-EINVAL); } return ctx; } static void ucma_close_id(struct work_struct *work) { struct ucma_context *ctx = container_of(work, struct ucma_context, close_work); /* once all inflight tasks are finished, we close all underlying * resources. The context is still alive till its explicit destryoing * by its creator. This puts back the xarray's reference. */ ucma_put_ctx(ctx); wait_for_completion(&ctx->comp); /* No new events will be generated after destroying the id. 
*/ rdma_destroy_id(ctx->cm_id); /* Reading the cm_id without holding a positive ref is not allowed */ ctx->cm_id = NULL; } static struct ucma_context *ucma_alloc_ctx(struct ucma_file *file) { struct ucma_context *ctx; ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); if (!ctx) return NULL; INIT_WORK(&ctx->close_work, ucma_close_id); init_completion(&ctx->comp); INIT_LIST_HEAD(&ctx->mc_list); /* So list_del() will work if we don't do ucma_finish_ctx() */ INIT_LIST_HEAD(&ctx->list); ctx->file = file; mutex_init(&ctx->mutex); if (xa_alloc(&ctx_table, &ctx->id, NULL, xa_limit_32b, GFP_KERNEL)) { kfree(ctx); return NULL; } return ctx; } static void ucma_set_ctx_cm_id(struct ucma_context *ctx, struct rdma_cm_id *cm_id) { refcount_set(&ctx->ref, 1); ctx->cm_id = cm_id; } static void ucma_finish_ctx(struct ucma_context *ctx) { lockdep_assert_held(&ctx->file->mut); list_add_tail(&ctx->list, &ctx->file->ctx_list); xa_store(&ctx_table, ctx->id, ctx, GFP_KERNEL); } static void ucma_copy_conn_event(struct rdma_ucm_conn_param *dst, struct rdma_conn_param *src) { if (src->private_data_len) memcpy(dst->private_data, src->private_data, src->private_data_len); dst->private_data_len = src->private_data_len; dst->responder_resources = src->responder_resources; dst->initiator_depth = src->initiator_depth; dst->flow_control = src->flow_control; dst->retry_count = src->retry_count; dst->rnr_retry_count = src->rnr_retry_count; dst->srq = src->srq; dst->qp_num = src->qp_num; } static void ucma_copy_ud_event(struct ib_device *device, struct rdma_ucm_ud_param *dst, struct rdma_ud_param *src) { if (src->private_data_len) memcpy(dst->private_data, src->private_data, src->private_data_len); dst->private_data_len = src->private_data_len; ib_copy_ah_attr_to_user(device, &dst->ah_attr, &src->ah_attr); dst->qp_num = src->qp_num; dst->qkey = src->qkey; } static struct ucma_event *ucma_create_uevent(struct ucma_context *ctx, struct rdma_cm_event *event) { struct ucma_event *uevent; uevent = kzalloc(sizeof(*uevent), GFP_KERNEL); if (!uevent) return NULL; uevent->ctx = ctx; switch (event->event) { case RDMA_CM_EVENT_MULTICAST_JOIN: case RDMA_CM_EVENT_MULTICAST_ERROR: uevent->mc = (struct ucma_multicast *) event->param.ud.private_data; uevent->resp.uid = uevent->mc->uid; uevent->resp.id = uevent->mc->id; break; default: uevent->resp.uid = ctx->uid; uevent->resp.id = ctx->id; break; } uevent->resp.event = event->event; uevent->resp.status = event->status; if (event->event == RDMA_CM_EVENT_ADDRINFO_RESOLVED) goto out; if (ctx->cm_id->qp_type == IB_QPT_UD) ucma_copy_ud_event(ctx->cm_id->device, &uevent->resp.param.ud, &event->param.ud); else ucma_copy_conn_event(&uevent->resp.param.conn, &event->param.conn); out: uevent->resp.ece.vendor_id = event->ece.vendor_id; uevent->resp.ece.attr_mod = event->ece.attr_mod; return uevent; } static int ucma_connect_event_handler(struct rdma_cm_id *cm_id, struct rdma_cm_event *event) { struct ucma_context *listen_ctx = cm_id->context; struct ucma_context *ctx; struct ucma_event *uevent; if (!atomic_add_unless(&listen_ctx->backlog, -1, 0)) return -ENOMEM; ctx = ucma_alloc_ctx(listen_ctx->file); if (!ctx) goto err_backlog; ucma_set_ctx_cm_id(ctx, cm_id); uevent = ucma_create_uevent(listen_ctx, event); if (!uevent) goto err_alloc; uevent->conn_req_ctx = ctx; uevent->resp.id = ctx->id; ctx->cm_id->context = ctx; mutex_lock(&ctx->file->mut); ucma_finish_ctx(ctx); list_add_tail(&uevent->list, &ctx->file->event_list); mutex_unlock(&ctx->file->mut); wake_up_interruptible(&ctx->file->poll_wait); return 0; 
err_alloc: ucma_destroy_private_ctx(ctx); err_backlog: atomic_inc(&listen_ctx->backlog); /* Returning error causes the new ID to be destroyed */ return -ENOMEM; } static int ucma_event_handler(struct rdma_cm_id *cm_id, struct rdma_cm_event *event) { struct ucma_event *uevent; struct ucma_context *ctx = cm_id->context; if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) return ucma_connect_event_handler(cm_id, event); /* * We ignore events for new connections until userspace has set their * context. This can only happen if an error occurs on a new connection * before the user accepts it. This is okay, since the accept will just * fail later. However, we do need to release the underlying HW * resources in case of a device removal event. */ if (ctx->uid) { uevent = ucma_create_uevent(ctx, event); if (!uevent) return 0; mutex_lock(&ctx->file->mut); list_add_tail(&uevent->list, &ctx->file->event_list); mutex_unlock(&ctx->file->mut); wake_up_interruptible(&ctx->file->poll_wait); } if (event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) { xa_lock(&ctx_table); if (xa_load(&ctx_table, ctx->id) == ctx) queue_work(system_dfl_wq, &ctx->close_work); xa_unlock(&ctx_table); } return 0; } static ssize_t ucma_get_event(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_get_event cmd; struct ucma_event *uevent; /* * Old 32 bit user space does not send the 4 byte padding in the * reserved field. We don't care, allow it to keep working. */ if (out_len < sizeof(uevent->resp) - sizeof(uevent->resp.reserved) - sizeof(uevent->resp.ece)) return -ENOSPC; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; mutex_lock(&file->mut); while (list_empty(&file->event_list)) { mutex_unlock(&file->mut); if (file->filp->f_flags & O_NONBLOCK) return -EAGAIN; if (wait_event_interruptible(file->poll_wait, !list_empty(&file->event_list))) return -ERESTARTSYS; mutex_lock(&file->mut); } uevent = list_first_entry(&file->event_list, struct ucma_event, list); if (copy_to_user(u64_to_user_ptr(cmd.response), &uevent->resp, min_t(size_t, out_len, sizeof(uevent->resp)))) { mutex_unlock(&file->mut); return -EFAULT; } list_del(&uevent->list); uevent->ctx->events_reported++; if (uevent->mc) uevent->mc->events_reported++; if (uevent->resp.event == RDMA_CM_EVENT_CONNECT_REQUEST) atomic_inc(&uevent->ctx->backlog); mutex_unlock(&file->mut); kfree(uevent); return 0; } static int ucma_get_qp_type(struct rdma_ucm_create_id *cmd, enum ib_qp_type *qp_type) { switch (cmd->ps) { case RDMA_PS_TCP: *qp_type = IB_QPT_RC; return 0; case RDMA_PS_UDP: case RDMA_PS_IPOIB: *qp_type = IB_QPT_UD; return 0; case RDMA_PS_IB: *qp_type = cmd->qp_type; return 0; default: return -EINVAL; } } static ssize_t ucma_create_id(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_create_id cmd; struct rdma_ucm_create_id_resp resp; struct ucma_context *ctx; struct rdma_cm_id *cm_id; enum ib_qp_type qp_type; int ret; if (out_len < sizeof(resp)) return -ENOSPC; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; ret = ucma_get_qp_type(&cmd, &qp_type); if (ret) return ret; ctx = ucma_alloc_ctx(file); if (!ctx) return -ENOMEM; ctx->uid = cmd.uid; cm_id = rdma_create_user_id(ucma_event_handler, ctx, cmd.ps, qp_type); if (IS_ERR(cm_id)) { ret = PTR_ERR(cm_id); goto err1; } ucma_set_ctx_cm_id(ctx, cm_id); resp.id = ctx->id; if (copy_to_user(u64_to_user_ptr(cmd.response), &resp, sizeof(resp))) { ret = -EFAULT; goto err1; } mutex_lock(&file->mut); ucma_finish_ctx(ctx); 
mutex_unlock(&file->mut); return 0; err1: ucma_destroy_private_ctx(ctx); return ret; } static void ucma_cleanup_multicast(struct ucma_context *ctx) { struct ucma_multicast *mc, *tmp; xa_lock(&multicast_table); list_for_each_entry_safe(mc, tmp, &ctx->mc_list, list) { list_del(&mc->list); /* * At this point mc->ctx->ref is 0 so the mc cannot leave the * lock on the reader and this is enough serialization */ __xa_erase(&multicast_table, mc->id); kfree(mc); } xa_unlock(&multicast_table); } static void ucma_cleanup_mc_events(struct ucma_multicast *mc) { struct ucma_event *uevent, *tmp; rdma_lock_handler(mc->ctx->cm_id); mutex_lock(&mc->ctx->file->mut); list_for_each_entry_safe(uevent, tmp, &mc->ctx->file->event_list, list) { if (uevent->mc != mc) continue; list_del(&uevent->list); kfree(uevent); } mutex_unlock(&mc->ctx->file->mut); rdma_unlock_handler(mc->ctx->cm_id); } static int ucma_cleanup_ctx_events(struct ucma_context *ctx) { int events_reported; struct ucma_event *uevent, *tmp; LIST_HEAD(list); /* Cleanup events not yet reported to the user.*/ mutex_lock(&ctx->file->mut); list_for_each_entry_safe(uevent, tmp, &ctx->file->event_list, list) { if (uevent->ctx != ctx) continue; if (uevent->resp.event == RDMA_CM_EVENT_CONNECT_REQUEST && xa_cmpxchg(&ctx_table, uevent->conn_req_ctx->id, uevent->conn_req_ctx, XA_ZERO_ENTRY, GFP_KERNEL) == uevent->conn_req_ctx) { list_move_tail(&uevent->list, &list); continue; } list_del(&uevent->list); kfree(uevent); } list_del(&ctx->list); events_reported = ctx->events_reported; mutex_unlock(&ctx->file->mut); /* * If this was a listening ID then any connections spawned from it that * have not been delivered to userspace are cleaned up too. Must be done * outside any locks. */ list_for_each_entry_safe(uevent, tmp, &list, list) { ucma_destroy_private_ctx(uevent->conn_req_ctx); kfree(uevent); } return events_reported; } /* * When this is called the xarray must have a XA_ZERO_ENTRY in the ctx->id (ie * the ctx is not public to the user). This either because: * - ucma_finish_ctx() hasn't been called * - xa_cmpxchg() succeed to remove the entry (only one thread can succeed) */ static int ucma_destroy_private_ctx(struct ucma_context *ctx) { int events_reported; /* * Destroy the underlying cm_id. New work queuing is prevented now by * the removal from the xarray. Once the work is cancled ref will either * be 0 because the work ran to completion and consumed the ref from the * xarray, or it will be positive because we still have the ref from the * xarray. 
This can also be 0 in cases where cm_id was never set */ cancel_work_sync(&ctx->close_work); if (refcount_read(&ctx->ref)) ucma_close_id(&ctx->close_work); events_reported = ucma_cleanup_ctx_events(ctx); ucma_cleanup_multicast(ctx); WARN_ON(xa_cmpxchg(&ctx_table, ctx->id, XA_ZERO_ENTRY, NULL, GFP_KERNEL) != NULL); mutex_destroy(&ctx->mutex); kfree(ctx); return events_reported; } static ssize_t ucma_destroy_id(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_destroy_id cmd; struct rdma_ucm_destroy_id_resp resp; struct ucma_context *ctx; int ret = 0; if (out_len < sizeof(resp)) return -ENOSPC; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; xa_lock(&ctx_table); ctx = _ucma_find_context(cmd.id, file); if (!IS_ERR(ctx)) { if (__xa_cmpxchg(&ctx_table, ctx->id, ctx, XA_ZERO_ENTRY, GFP_KERNEL) != ctx) ctx = ERR_PTR(-ENOENT); } xa_unlock(&ctx_table); if (IS_ERR(ctx)) return PTR_ERR(ctx); resp.events_reported = ucma_destroy_private_ctx(ctx); if (copy_to_user(u64_to_user_ptr(cmd.response), &resp, sizeof(resp))) ret = -EFAULT; return ret; } static ssize_t ucma_bind_ip(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_bind_ip cmd; struct ucma_context *ctx; int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; if (!rdma_addr_size_in6(&cmd.addr)) return -EINVAL; ctx = ucma_get_ctx(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); mutex_lock(&ctx->mutex); ret = rdma_bind_addr(ctx->cm_id, (struct sockaddr *) &cmd.addr); mutex_unlock(&ctx->mutex); ucma_put_ctx(ctx); return ret; } static ssize_t ucma_bind(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_bind cmd; struct ucma_context *ctx; int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; if (cmd.reserved || !cmd.addr_size || cmd.addr_size != rdma_addr_size_kss(&cmd.addr)) return -EINVAL; ctx = ucma_get_ctx(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); mutex_lock(&ctx->mutex); ret = rdma_bind_addr(ctx->cm_id, (struct sockaddr *) &cmd.addr); mutex_unlock(&ctx->mutex); ucma_put_ctx(ctx); return ret; } static ssize_t ucma_resolve_ip(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_resolve_ip cmd; struct ucma_context *ctx; int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; if ((cmd.src_addr.sin6_family && !rdma_addr_size_in6(&cmd.src_addr)) || !rdma_addr_size_in6(&cmd.dst_addr)) return -EINVAL; ctx = ucma_get_ctx(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); mutex_lock(&ctx->mutex); ret = rdma_resolve_addr(ctx->cm_id, (struct sockaddr *) &cmd.src_addr, (struct sockaddr *) &cmd.dst_addr, cmd.timeout_ms); mutex_unlock(&ctx->mutex); ucma_put_ctx(ctx); return ret; } static ssize_t ucma_resolve_addr(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_resolve_addr cmd; struct ucma_context *ctx; int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; if (cmd.reserved || (cmd.src_size && (cmd.src_size != rdma_addr_size_kss(&cmd.src_addr))) || !cmd.dst_size || (cmd.dst_size != rdma_addr_size_kss(&cmd.dst_addr))) return -EINVAL; ctx = ucma_get_ctx(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); mutex_lock(&ctx->mutex); ret = rdma_resolve_addr(ctx->cm_id, (struct sockaddr *) &cmd.src_addr, (struct sockaddr *) &cmd.dst_addr, cmd.timeout_ms); mutex_unlock(&ctx->mutex); ucma_put_ctx(ctx); return ret; } static ssize_t ucma_resolve_ib_service(struct ucma_file 
*file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_resolve_ib_service cmd; struct ucma_context *ctx; int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; ctx = ucma_get_ctx(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); mutex_lock(&ctx->mutex); ret = rdma_resolve_ib_service(ctx->cm_id, &cmd.ibs); mutex_unlock(&ctx->mutex); ucma_put_ctx(ctx); return ret; } static ssize_t ucma_resolve_route(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_resolve_route cmd; struct ucma_context *ctx; int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; ctx = ucma_get_ctx_dev(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); mutex_lock(&ctx->mutex); ret = rdma_resolve_route(ctx->cm_id, cmd.timeout_ms); mutex_unlock(&ctx->mutex); ucma_put_ctx(ctx); return ret; } static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp, struct rdma_route *route) { struct rdma_dev_addr *dev_addr; resp->num_paths = route->num_pri_alt_paths; switch (route->num_pri_alt_paths) { case 0: dev_addr = &route->addr.dev_addr; rdma_addr_get_dgid(dev_addr, (union ib_gid *) &resp->ib_route[0].dgid); rdma_addr_get_sgid(dev_addr, (union ib_gid *) &resp->ib_route[0].sgid); resp->ib_route[0].pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); break; case 2: ib_copy_path_rec_to_user(&resp->ib_route[1], &route->path_rec[1]); fallthrough; case 1: ib_copy_path_rec_to_user(&resp->ib_route[0], &route->path_rec[0]); break; default: break; } } static void ucma_copy_iboe_route(struct rdma_ucm_query_route_resp *resp, struct rdma_route *route) { resp->num_paths = route->num_pri_alt_paths; switch (route->num_pri_alt_paths) { case 0: rdma_ip2gid((struct sockaddr *)&route->addr.dst_addr, (union ib_gid *)&resp->ib_route[0].dgid); rdma_ip2gid((struct sockaddr *)&route->addr.src_addr, (union ib_gid *)&resp->ib_route[0].sgid); resp->ib_route[0].pkey = cpu_to_be16(0xffff); break; case 2: ib_copy_path_rec_to_user(&resp->ib_route[1], &route->path_rec[1]); fallthrough; case 1: ib_copy_path_rec_to_user(&resp->ib_route[0], &route->path_rec[0]); break; default: break; } } static void ucma_copy_iw_route(struct rdma_ucm_query_route_resp *resp, struct rdma_route *route) { struct rdma_dev_addr *dev_addr; dev_addr = &route->addr.dev_addr; rdma_addr_get_dgid(dev_addr, (union ib_gid *) &resp->ib_route[0].dgid); rdma_addr_get_sgid(dev_addr, (union ib_gid *) &resp->ib_route[0].sgid); } static ssize_t ucma_query_route(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_query cmd; struct rdma_ucm_query_route_resp resp; struct ucma_context *ctx; struct sockaddr *addr; int ret = 0; if (out_len < offsetof(struct rdma_ucm_query_route_resp, ibdev_index)) return -ENOSPC; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; ctx = ucma_get_ctx(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); mutex_lock(&ctx->mutex); memset(&resp, 0, sizeof resp); addr = (struct sockaddr *) &ctx->cm_id->route.addr.src_addr; memcpy(&resp.src_addr, addr, addr->sa_family == AF_INET ? sizeof(struct sockaddr_in) : sizeof(struct sockaddr_in6)); addr = (struct sockaddr *) &ctx->cm_id->route.addr.dst_addr; memcpy(&resp.dst_addr, addr, addr->sa_family == AF_INET ? 
sizeof(struct sockaddr_in) : sizeof(struct sockaddr_in6)); if (!ctx->cm_id->device) goto out; resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid; resp.ibdev_index = ctx->cm_id->device->index; resp.port_num = ctx->cm_id->port_num; if (rdma_cap_ib_sa(ctx->cm_id->device, ctx->cm_id->port_num)) ucma_copy_ib_route(&resp, &ctx->cm_id->route); else if (rdma_protocol_roce(ctx->cm_id->device, ctx->cm_id->port_num)) ucma_copy_iboe_route(&resp, &ctx->cm_id->route); else if (rdma_protocol_iwarp(ctx->cm_id->device, ctx->cm_id->port_num)) ucma_copy_iw_route(&resp, &ctx->cm_id->route); out: mutex_unlock(&ctx->mutex); if (copy_to_user(u64_to_user_ptr(cmd.response), &resp, min_t(size_t, out_len, sizeof(resp)))) ret = -EFAULT; ucma_put_ctx(ctx); return ret; } static void ucma_query_device_addr(struct rdma_cm_id *cm_id, struct rdma_ucm_query_addr_resp *resp) { if (!cm_id->device) return; resp->node_guid = (__force __u64) cm_id->device->node_guid; resp->ibdev_index = cm_id->device->index; resp->port_num = cm_id->port_num; resp->pkey = (__force __u16) cpu_to_be16( ib_addr_get_pkey(&cm_id->route.addr.dev_addr)); } static ssize_t ucma_query_addr(struct ucma_context *ctx, void __user *response, int out_len) { struct rdma_ucm_query_addr_resp resp; struct sockaddr *addr; int ret = 0; if (out_len < offsetof(struct rdma_ucm_query_addr_resp, ibdev_index)) return -ENOSPC; memset(&resp, 0, sizeof resp); addr = (struct sockaddr *) &ctx->cm_id->route.addr.src_addr; resp.src_size = rdma_addr_size(addr); memcpy(&resp.src_addr, addr, resp.src_size); addr = (struct sockaddr *) &ctx->cm_id->route.addr.dst_addr; resp.dst_size = rdma_addr_size(addr); memcpy(&resp.dst_addr, addr, resp.dst_size); ucma_query_device_addr(ctx->cm_id, &resp); if (copy_to_user(response, &resp, min_t(size_t, out_len, sizeof(resp)))) ret = -EFAULT; return ret; } static ssize_t ucma_query_path(struct ucma_context *ctx, void __user *response, int out_len) { struct rdma_ucm_query_path_resp *resp; int i, ret = 0; if (out_len < sizeof(*resp)) return -ENOSPC; resp = kzalloc(out_len, GFP_KERNEL); if (!resp) return -ENOMEM; resp->num_paths = ctx->cm_id->route.num_pri_alt_paths; for (i = 0, out_len -= sizeof(*resp); i < resp->num_paths && out_len > sizeof(struct ib_path_rec_data); i++, out_len -= sizeof(struct ib_path_rec_data)) { struct sa_path_rec *rec = &ctx->cm_id->route.path_rec[i]; resp->path_data[i].flags = IB_PATH_GMP | IB_PATH_PRIMARY | IB_PATH_BIDIRECTIONAL; if (rec->rec_type == SA_PATH_REC_TYPE_OPA) { struct sa_path_rec ib; sa_convert_path_opa_to_ib(&ib, rec); ib_sa_pack_path(&ib, &resp->path_data[i].path_rec); } else { ib_sa_pack_path(rec, &resp->path_data[i].path_rec); } } if (copy_to_user(response, resp, struct_size(resp, path_data, i))) ret = -EFAULT; kfree(resp); return ret; } static ssize_t ucma_query_gid(struct ucma_context *ctx, void __user *response, int out_len) { struct rdma_ucm_query_addr_resp resp; struct sockaddr_ib *addr; int ret = 0; if (out_len < offsetof(struct rdma_ucm_query_addr_resp, ibdev_index)) return -ENOSPC; memset(&resp, 0, sizeof resp); ucma_query_device_addr(ctx->cm_id, &resp); addr = (struct sockaddr_ib *) &resp.src_addr; resp.src_size = sizeof(*addr); if (ctx->cm_id->route.addr.src_addr.ss_family == AF_IB) { memcpy(addr, &ctx->cm_id->route.addr.src_addr, resp.src_size); } else { addr->sib_family = AF_IB; addr->sib_pkey = (__force __be16) resp.pkey; rdma_read_gids(ctx->cm_id, (union ib_gid *)&addr->sib_addr, NULL); addr->sib_sid = rdma_get_service_id(ctx->cm_id, (struct sockaddr *) 
&ctx->cm_id->route.addr.src_addr); } addr = (struct sockaddr_ib *) &resp.dst_addr; resp.dst_size = sizeof(*addr); if (ctx->cm_id->route.addr.dst_addr.ss_family == AF_IB) { memcpy(addr, &ctx->cm_id->route.addr.dst_addr, resp.dst_size); } else { addr->sib_family = AF_IB; addr->sib_pkey = (__force __be16) resp.pkey; rdma_read_gids(ctx->cm_id, NULL, (union ib_gid *)&addr->sib_addr); addr->sib_sid = rdma_get_service_id(ctx->cm_id, (struct sockaddr *) &ctx->cm_id->route.addr.dst_addr); } if (copy_to_user(response, &resp, min_t(size_t, out_len, sizeof(resp)))) ret = -EFAULT; return ret; } static ssize_t ucma_query_ib_service(struct ucma_context *ctx, void __user *response, int out_len) { struct rdma_ucm_query_ib_service_resp *resp; int n, ret = 0; if (out_len < sizeof(struct rdma_ucm_query_ib_service_resp)) return -ENOSPC; if (!ctx->cm_id->route.service_recs) return -ENODATA; resp = kzalloc(out_len, GFP_KERNEL); if (!resp) return -ENOMEM; resp->num_service_recs = ctx->cm_id->route.num_service_recs; n = (out_len - sizeof(struct rdma_ucm_query_ib_service_resp)) / sizeof(struct ib_user_service_rec); if (!n) goto out; if (n > ctx->cm_id->route.num_service_recs) n = ctx->cm_id->route.num_service_recs; memcpy(resp->recs, ctx->cm_id->route.service_recs, sizeof(*resp->recs) * n); if (copy_to_user(response, resp, struct_size(resp, recs, n))) ret = -EFAULT; out: kfree(resp); return ret; } static ssize_t ucma_query(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_query cmd; struct ucma_context *ctx; void __user *response; int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; response = u64_to_user_ptr(cmd.response); ctx = ucma_get_ctx(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); mutex_lock(&ctx->mutex); switch (cmd.option) { case RDMA_USER_CM_QUERY_ADDR: ret = ucma_query_addr(ctx, response, out_len); break; case RDMA_USER_CM_QUERY_PATH: ret = ucma_query_path(ctx, response, out_len); break; case RDMA_USER_CM_QUERY_GID: ret = ucma_query_gid(ctx, response, out_len); break; case RDMA_USER_CM_QUERY_IB_SERVICE: ret = ucma_query_ib_service(ctx, response, out_len); break; default: ret = -ENOSYS; break; } mutex_unlock(&ctx->mutex); ucma_put_ctx(ctx); return ret; } static void ucma_copy_conn_param(struct rdma_cm_id *id, struct rdma_conn_param *dst, struct rdma_ucm_conn_param *src) { dst->private_data = src->private_data; dst->private_data_len = src->private_data_len; dst->responder_resources = src->responder_resources; dst->initiator_depth = src->initiator_depth; dst->flow_control = src->flow_control; dst->retry_count = src->retry_count; dst->rnr_retry_count = src->rnr_retry_count; dst->srq = src->srq; dst->qp_num = src->qp_num & 0xFFFFFF; dst->qkey = (id->route.addr.src_addr.ss_family == AF_IB) ? 
src->qkey : 0; } static ssize_t ucma_connect(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_conn_param conn_param; struct rdma_ucm_ece ece = {}; struct rdma_ucm_connect cmd; struct ucma_context *ctx; size_t in_size; int ret; if (in_len < offsetofend(typeof(cmd), reserved)) return -EINVAL; in_size = min_t(size_t, in_len, sizeof(cmd)); if (copy_from_user(&cmd, inbuf, in_size)) return -EFAULT; if (!cmd.conn_param.valid) return -EINVAL; ctx = ucma_get_ctx_dev(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); ucma_copy_conn_param(ctx->cm_id, &conn_param, &cmd.conn_param); if (offsetofend(typeof(cmd), ece) <= in_size) { ece.vendor_id = cmd.ece.vendor_id; ece.attr_mod = cmd.ece.attr_mod; } mutex_lock(&ctx->mutex); ret = rdma_connect_ece(ctx->cm_id, &conn_param, &ece); mutex_unlock(&ctx->mutex); ucma_put_ctx(ctx); return ret; } static ssize_t ucma_listen(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_listen cmd; struct ucma_context *ctx; int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; ctx = ucma_get_ctx(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); if (cmd.backlog <= 0 || cmd.backlog > max_backlog) cmd.backlog = max_backlog; atomic_set(&ctx->backlog, cmd.backlog); mutex_lock(&ctx->mutex); ret = rdma_listen(ctx->cm_id, cmd.backlog); mutex_unlock(&ctx->mutex); ucma_put_ctx(ctx); return ret; } static ssize_t ucma_accept(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_accept cmd; struct rdma_conn_param conn_param; struct rdma_ucm_ece ece = {}; struct ucma_context *ctx; size_t in_size; int ret; if (in_len < offsetofend(typeof(cmd), reserved)) return -EINVAL; in_size = min_t(size_t, in_len, sizeof(cmd)); if (copy_from_user(&cmd, inbuf, in_size)) return -EFAULT; ctx = ucma_get_ctx_dev(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); if (offsetofend(typeof(cmd), ece) <= in_size) { ece.vendor_id = cmd.ece.vendor_id; ece.attr_mod = cmd.ece.attr_mod; } if (cmd.conn_param.valid) { ucma_copy_conn_param(ctx->cm_id, &conn_param, &cmd.conn_param); mutex_lock(&ctx->mutex); rdma_lock_handler(ctx->cm_id); ret = rdma_accept_ece(ctx->cm_id, &conn_param, &ece); if (!ret) { /* The uid must be set atomically with the handler */ ctx->uid = cmd.uid; } rdma_unlock_handler(ctx->cm_id); mutex_unlock(&ctx->mutex); } else { mutex_lock(&ctx->mutex); rdma_lock_handler(ctx->cm_id); ret = rdma_accept_ece(ctx->cm_id, NULL, &ece); rdma_unlock_handler(ctx->cm_id); mutex_unlock(&ctx->mutex); } ucma_put_ctx(ctx); return ret; } static ssize_t ucma_reject(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_reject cmd; struct ucma_context *ctx; int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; if (!cmd.reason) cmd.reason = IB_CM_REJ_CONSUMER_DEFINED; switch (cmd.reason) { case IB_CM_REJ_CONSUMER_DEFINED: case IB_CM_REJ_VENDOR_OPTION_NOT_SUPPORTED: break; default: return -EINVAL; } ctx = ucma_get_ctx_dev(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); mutex_lock(&ctx->mutex); ret = rdma_reject(ctx->cm_id, cmd.private_data, cmd.private_data_len, cmd.reason); mutex_unlock(&ctx->mutex); ucma_put_ctx(ctx); return ret; } static ssize_t ucma_disconnect(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_disconnect cmd; struct ucma_context *ctx; int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; ctx = ucma_get_ctx_dev(file, cmd.id); if (IS_ERR(ctx)) 
return PTR_ERR(ctx); mutex_lock(&ctx->mutex); ret = rdma_disconnect(ctx->cm_id); mutex_unlock(&ctx->mutex); ucma_put_ctx(ctx); return ret; } static ssize_t ucma_init_qp_attr(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_init_qp_attr cmd; struct ib_uverbs_qp_attr resp; struct ucma_context *ctx; struct ib_qp_attr qp_attr; int ret; if (out_len < sizeof(resp)) return -ENOSPC; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; if (cmd.qp_state > IB_QPS_ERR) return -EINVAL; ctx = ucma_get_ctx_dev(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); resp.qp_attr_mask = 0; memset(&qp_attr, 0, sizeof qp_attr); qp_attr.qp_state = cmd.qp_state; mutex_lock(&ctx->mutex); ret = rdma_init_qp_attr(ctx->cm_id, &qp_attr, &resp.qp_attr_mask); mutex_unlock(&ctx->mutex); if (ret) goto out; ib_copy_qp_attr_to_user(ctx->cm_id->device, &resp, &qp_attr); if (copy_to_user(u64_to_user_ptr(cmd.response), &resp, sizeof(resp))) ret = -EFAULT; out: ucma_put_ctx(ctx); return ret; } static int ucma_set_option_id(struct ucma_context *ctx, int optname, void *optval, size_t optlen) { int ret = 0; switch (optname) { case RDMA_OPTION_ID_TOS: if (optlen != sizeof(u8)) { ret = -EINVAL; break; } rdma_set_service_type(ctx->cm_id, *((u8 *) optval)); break; case RDMA_OPTION_ID_REUSEADDR: if (optlen != sizeof(int)) { ret = -EINVAL; break; } ret = rdma_set_reuseaddr(ctx->cm_id, *((int *) optval) ? 1 : 0); break; case RDMA_OPTION_ID_AFONLY: if (optlen != sizeof(int)) { ret = -EINVAL; break; } ret = rdma_set_afonly(ctx->cm_id, *((int *) optval) ? 1 : 0); break; case RDMA_OPTION_ID_ACK_TIMEOUT: if (optlen != sizeof(u8)) { ret = -EINVAL; break; } ret = rdma_set_ack_timeout(ctx->cm_id, *((u8 *)optval)); break; default: ret = -ENOSYS; } return ret; } static int ucma_set_ib_path(struct ucma_context *ctx, struct ib_path_rec_data *path_data, size_t optlen) { struct sa_path_rec sa_path; struct rdma_cm_event event; int ret; if (optlen % sizeof(*path_data)) return -EINVAL; for (; optlen; optlen -= sizeof(*path_data), path_data++) { if (path_data->flags == (IB_PATH_GMP | IB_PATH_PRIMARY | IB_PATH_BIDIRECTIONAL)) break; } if (!optlen) return -EINVAL; if (!ctx->cm_id->device) return -EINVAL; memset(&sa_path, 0, sizeof(sa_path)); sa_path.rec_type = SA_PATH_REC_TYPE_IB; ib_sa_unpack_path(path_data->path_rec, &sa_path); if (rdma_cap_opa_ah(ctx->cm_id->device, ctx->cm_id->port_num)) { struct sa_path_rec opa; sa_convert_path_ib_to_opa(&opa, &sa_path); mutex_lock(&ctx->mutex); ret = rdma_set_ib_path(ctx->cm_id, &opa); mutex_unlock(&ctx->mutex); } else { mutex_lock(&ctx->mutex); ret = rdma_set_ib_path(ctx->cm_id, &sa_path); mutex_unlock(&ctx->mutex); } if (ret) return ret; memset(&event, 0, sizeof event); event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; return ucma_event_handler(ctx->cm_id, &event); } static int ucma_set_option_ib(struct ucma_context *ctx, int optname, void *optval, size_t optlen) { int ret; switch (optname) { case RDMA_OPTION_IB_PATH: ret = ucma_set_ib_path(ctx, optval, optlen); break; default: ret = -ENOSYS; } return ret; } static int ucma_set_option_level(struct ucma_context *ctx, int level, int optname, void *optval, size_t optlen) { int ret; switch (level) { case RDMA_OPTION_ID: mutex_lock(&ctx->mutex); ret = ucma_set_option_id(ctx, optname, optval, optlen); mutex_unlock(&ctx->mutex); break; case RDMA_OPTION_IB: ret = ucma_set_option_ib(ctx, optname, optval, optlen); break; default: ret = -ENOSYS; } return ret; } static ssize_t ucma_set_option(struct ucma_file *file, const char 
__user *inbuf, int in_len, int out_len) { struct rdma_ucm_set_option cmd; struct ucma_context *ctx; void *optval; int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; if (unlikely(cmd.optlen > KMALLOC_MAX_SIZE)) return -EINVAL; ctx = ucma_get_ctx(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); optval = memdup_user(u64_to_user_ptr(cmd.optval), cmd.optlen); if (IS_ERR(optval)) { ret = PTR_ERR(optval); goto out; } ret = ucma_set_option_level(ctx, cmd.level, cmd.optname, optval, cmd.optlen); kfree(optval); out: ucma_put_ctx(ctx); return ret; } static ssize_t ucma_notify(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_notify cmd; struct ucma_context *ctx; int ret = -EINVAL; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; ctx = ucma_get_ctx(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); mutex_lock(&ctx->mutex); if (ctx->cm_id->device) ret = rdma_notify(ctx->cm_id, (enum ib_event_type)cmd.event); mutex_unlock(&ctx->mutex); ucma_put_ctx(ctx); return ret; } static ssize_t ucma_process_join(struct ucma_file *file, struct rdma_ucm_join_mcast *cmd, int out_len) { struct rdma_ucm_create_id_resp resp; struct ucma_context *ctx; struct ucma_multicast *mc; struct sockaddr *addr; int ret; u8 join_state; if (out_len < sizeof(resp)) return -ENOSPC; addr = (struct sockaddr *) &cmd->addr; if (cmd->addr_size != rdma_addr_size(addr)) return -EINVAL; if (cmd->join_flags == RDMA_MC_JOIN_FLAG_FULLMEMBER) join_state = BIT(FULLMEMBER_JOIN); else if (cmd->join_flags == RDMA_MC_JOIN_FLAG_SENDONLY_FULLMEMBER) join_state = BIT(SENDONLY_FULLMEMBER_JOIN); else return -EINVAL; ctx = ucma_get_ctx_dev(file, cmd->id); if (IS_ERR(ctx)) return PTR_ERR(ctx); mc = kzalloc(sizeof(*mc), GFP_KERNEL); if (!mc) { ret = -ENOMEM; goto err_put_ctx; } mc->ctx = ctx; mc->join_state = join_state; mc->uid = cmd->uid; memcpy(&mc->addr, addr, cmd->addr_size); xa_lock(&multicast_table); if (__xa_alloc(&multicast_table, &mc->id, NULL, xa_limit_32b, GFP_KERNEL)) { ret = -ENOMEM; goto err_free_mc; } list_add_tail(&mc->list, &ctx->mc_list); xa_unlock(&multicast_table); mutex_lock(&ctx->mutex); ret = rdma_join_multicast(ctx->cm_id, (struct sockaddr *)&mc->addr, join_state, mc); mutex_unlock(&ctx->mutex); if (ret) goto err_xa_erase; resp.id = mc->id; if (copy_to_user(u64_to_user_ptr(cmd->response), &resp, sizeof(resp))) { ret = -EFAULT; goto err_leave_multicast; } xa_store(&multicast_table, mc->id, mc, 0); ucma_put_ctx(ctx); return 0; err_leave_multicast: mutex_lock(&ctx->mutex); rdma_leave_multicast(ctx->cm_id, (struct sockaddr *) &mc->addr); mutex_unlock(&ctx->mutex); ucma_cleanup_mc_events(mc); err_xa_erase: xa_lock(&multicast_table); list_del(&mc->list); __xa_erase(&multicast_table, mc->id); err_free_mc: xa_unlock(&multicast_table); kfree(mc); err_put_ctx: ucma_put_ctx(ctx); return ret; } static ssize_t ucma_join_ip_multicast(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_join_ip_mcast cmd; struct rdma_ucm_join_mcast join_cmd; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; join_cmd.response = cmd.response; join_cmd.uid = cmd.uid; join_cmd.id = cmd.id; join_cmd.addr_size = rdma_addr_size_in6(&cmd.addr); if (!join_cmd.addr_size) return -EINVAL; join_cmd.join_flags = RDMA_MC_JOIN_FLAG_FULLMEMBER; memcpy(&join_cmd.addr, &cmd.addr, join_cmd.addr_size); return ucma_process_join(file, &join_cmd, out_len); } static ssize_t ucma_join_multicast(struct ucma_file *file, const char __user *inbuf, int in_len, int 
out_len) { struct rdma_ucm_join_mcast cmd; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; if (!rdma_addr_size_kss(&cmd.addr)) return -EINVAL; return ucma_process_join(file, &cmd, out_len); } static ssize_t ucma_leave_multicast(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_destroy_id cmd; struct rdma_ucm_destroy_id_resp resp; struct ucma_multicast *mc; int ret = 0; if (out_len < sizeof(resp)) return -ENOSPC; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; xa_lock(&multicast_table); mc = xa_load(&multicast_table, cmd.id); if (!mc) mc = ERR_PTR(-ENOENT); else if (READ_ONCE(mc->ctx->file) != file) mc = ERR_PTR(-EINVAL); else if (!refcount_inc_not_zero(&mc->ctx->ref)) mc = ERR_PTR(-ENXIO); if (IS_ERR(mc)) { xa_unlock(&multicast_table); ret = PTR_ERR(mc); goto out; } list_del(&mc->list); __xa_erase(&multicast_table, mc->id); xa_unlock(&multicast_table); mutex_lock(&mc->ctx->mutex); rdma_leave_multicast(mc->ctx->cm_id, (struct sockaddr *) &mc->addr); mutex_unlock(&mc->ctx->mutex); ucma_cleanup_mc_events(mc); ucma_put_ctx(mc->ctx); resp.events_reported = mc->events_reported; kfree(mc); if (copy_to_user(u64_to_user_ptr(cmd.response), &resp, sizeof(resp))) ret = -EFAULT; out: return ret; } static ssize_t ucma_migrate_id(struct ucma_file *new_file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_migrate_id cmd; struct rdma_ucm_migrate_resp resp; struct ucma_event *uevent, *tmp; struct ucma_context *ctx; LIST_HEAD(event_list); struct ucma_file *cur_file; int ret = 0; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; /* Get current fd to protect against it being closed */ CLASS(fd, f)(cmd.fd); if (fd_empty(f)) return -ENOENT; if (fd_file(f)->f_op != &ucma_fops) return -EINVAL; cur_file = fd_file(f)->private_data; /* Validate current fd and prevent destruction of id. */ ctx = ucma_get_ctx(cur_file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); rdma_lock_handler(ctx->cm_id); /* * ctx->file can only be changed under the handler & xa_lock. xa_load() * must be checked again to ensure the ctx hasn't begun destruction * since the ucma_get_ctx(). */ xa_lock(&ctx_table); if (_ucma_find_context(cmd.id, cur_file) != ctx) { xa_unlock(&ctx_table); ret = -ENOENT; goto err_unlock; } ctx->file = new_file; xa_unlock(&ctx_table); mutex_lock(&cur_file->mut); list_del(&ctx->list); /* * At this point lock_handler() prevents addition of new uevents for * this ctx. 
*/ list_for_each_entry_safe(uevent, tmp, &cur_file->event_list, list) if (uevent->ctx == ctx) list_move_tail(&uevent->list, &event_list); resp.events_reported = ctx->events_reported; mutex_unlock(&cur_file->mut); mutex_lock(&new_file->mut); list_add_tail(&ctx->list, &new_file->ctx_list); list_splice_tail(&event_list, &new_file->event_list); mutex_unlock(&new_file->mut); if (copy_to_user(u64_to_user_ptr(cmd.response), &resp, sizeof(resp))) ret = -EFAULT; err_unlock: rdma_unlock_handler(ctx->cm_id); ucma_put_ctx(ctx); return ret; } static ssize_t ucma_write_cm_event(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { struct rdma_ucm_write_cm_event cmd; struct rdma_cm_event event = {}; struct ucma_event *uevent; struct ucma_context *ctx; int ret = 0; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; if ((cmd.event != RDMA_CM_EVENT_USER) && (cmd.event != RDMA_CM_EVENT_INTERNAL)) return -EINVAL; ctx = ucma_get_ctx(file, cmd.id); if (IS_ERR(ctx)) return PTR_ERR(ctx); event.event = cmd.event; event.status = cmd.status; event.param.arg = cmd.param.arg; uevent = kzalloc(sizeof(*uevent), GFP_KERNEL); if (!uevent) { ret = -ENOMEM; goto out; } uevent->ctx = ctx; uevent->resp.uid = ctx->uid; uevent->resp.id = ctx->id; uevent->resp.event = event.event; uevent->resp.status = event.status; memcpy(uevent->resp.param.arg32, &event.param.arg, sizeof(event.param.arg)); mutex_lock(&ctx->file->mut); list_add_tail(&uevent->list, &ctx->file->event_list); mutex_unlock(&ctx->file->mut); wake_up_interruptible(&ctx->file->poll_wait); out: ucma_put_ctx(ctx); return ret; } static ssize_t (*ucma_cmd_table[])(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) = { [RDMA_USER_CM_CMD_CREATE_ID] = ucma_create_id, [RDMA_USER_CM_CMD_DESTROY_ID] = ucma_destroy_id, [RDMA_USER_CM_CMD_BIND_IP] = ucma_bind_ip, [RDMA_USER_CM_CMD_RESOLVE_IP] = ucma_resolve_ip, [RDMA_USER_CM_CMD_RESOLVE_ROUTE] = ucma_resolve_route, [RDMA_USER_CM_CMD_QUERY_ROUTE] = ucma_query_route, [RDMA_USER_CM_CMD_CONNECT] = ucma_connect, [RDMA_USER_CM_CMD_LISTEN] = ucma_listen, [RDMA_USER_CM_CMD_ACCEPT] = ucma_accept, [RDMA_USER_CM_CMD_REJECT] = ucma_reject, [RDMA_USER_CM_CMD_DISCONNECT] = ucma_disconnect, [RDMA_USER_CM_CMD_INIT_QP_ATTR] = ucma_init_qp_attr, [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event, [RDMA_USER_CM_CMD_GET_OPTION] = NULL, [RDMA_USER_CM_CMD_SET_OPTION] = ucma_set_option, [RDMA_USER_CM_CMD_NOTIFY] = ucma_notify, [RDMA_USER_CM_CMD_JOIN_IP_MCAST] = ucma_join_ip_multicast, [RDMA_USER_CM_CMD_LEAVE_MCAST] = ucma_leave_multicast, [RDMA_USER_CM_CMD_MIGRATE_ID] = ucma_migrate_id, [RDMA_USER_CM_CMD_QUERY] = ucma_query, [RDMA_USER_CM_CMD_BIND] = ucma_bind, [RDMA_USER_CM_CMD_RESOLVE_ADDR] = ucma_resolve_addr, [RDMA_USER_CM_CMD_JOIN_MCAST] = ucma_join_multicast, [RDMA_USER_CM_CMD_RESOLVE_IB_SERVICE] = ucma_resolve_ib_service, [RDMA_USER_CM_CMD_WRITE_CM_EVENT] = ucma_write_cm_event, }; static ssize_t ucma_write(struct file *filp, const char __user *buf, size_t len, loff_t *pos) { struct ucma_file *file = filp->private_data; struct rdma_ucm_cmd_hdr hdr; ssize_t ret; if (!ib_safe_file_access(filp)) { pr_err_once("%s: process %d (%s) changed security contexts after opening file descriptor, this is not allowed.\n", __func__, task_tgid_vnr(current), current->comm); return -EACCES; } if (len < sizeof(hdr)) return -EINVAL; if (copy_from_user(&hdr, buf, sizeof(hdr))) return -EFAULT; if (hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) return -EINVAL; hdr.cmd = array_index_nospec(hdr.cmd, ARRAY_SIZE(ucma_cmd_table)); 
if (hdr.in + sizeof(hdr) > len) return -EINVAL; if (!ucma_cmd_table[hdr.cmd]) return -ENOSYS; ret = ucma_cmd_table[hdr.cmd](file, buf + sizeof(hdr), hdr.in, hdr.out); if (!ret) ret = len; return ret; } static __poll_t ucma_poll(struct file *filp, struct poll_table_struct *wait) { struct ucma_file *file = filp->private_data; __poll_t mask = 0; poll_wait(filp, &file->poll_wait, wait); if (!list_empty(&file->event_list)) mask = EPOLLIN | EPOLLRDNORM; return mask; } /* * ucma_open() does not need the BKL: * * - no global state is referred to; * - there is no ioctl method to race against; * - no further module initialization is required for open to work * after the device is registered. */ static int ucma_open(struct inode *inode, struct file *filp) { struct ucma_file *file; file = kmalloc(sizeof *file, GFP_KERNEL); if (!file) return -ENOMEM; INIT_LIST_HEAD(&file->event_list); INIT_LIST_HEAD(&file->ctx_list); init_waitqueue_head(&file->poll_wait); mutex_init(&file->mut); filp->private_data = file; file->filp = filp; return stream_open(inode, filp); } static int ucma_close(struct inode *inode, struct file *filp) { struct ucma_file *file = filp->private_data; /* * All paths that touch ctx_list or ctx_list starting from write() are * prevented by this being a FD release function. The list_add_tail() in * ucma_connect_event_handler() can run concurrently, however it only * adds to the list *after* a listening ID. By only reading the first of * the list, and relying on ucma_destroy_private_ctx() to block * ucma_connect_event_handler(), no additional locking is needed. */ while (!list_empty(&file->ctx_list)) { struct ucma_context *ctx = list_first_entry( &file->ctx_list, struct ucma_context, list); WARN_ON(xa_cmpxchg(&ctx_table, ctx->id, ctx, XA_ZERO_ENTRY, GFP_KERNEL) != ctx); ucma_destroy_private_ctx(ctx); } kfree(file); return 0; } static const struct file_operations ucma_fops = { .owner = THIS_MODULE, .open = ucma_open, .release = ucma_close, .write = ucma_write, .poll = ucma_poll, }; static struct miscdevice ucma_misc = { .minor = MISC_DYNAMIC_MINOR, .name = "rdma_cm", .nodename = "infiniband/rdma_cm", .mode = 0666, .fops = &ucma_fops, }; static int ucma_get_global_nl_info(struct ib_client_nl_info *res) { res->abi = RDMA_USER_CM_ABI_VERSION; res->cdev = ucma_misc.this_device; return 0; } static struct ib_client rdma_cma_client = { .name = "rdma_cm", .get_global_nl_info = ucma_get_global_nl_info, }; MODULE_ALIAS_RDMA_CLIENT("rdma_cm"); static ssize_t abi_version_show(struct device *dev, struct device_attribute *attr, char *buf) { return sysfs_emit(buf, "%d\n", RDMA_USER_CM_ABI_VERSION); } static DEVICE_ATTR_RO(abi_version); static int __init ucma_init(void) { int ret; ret = misc_register(&ucma_misc); if (ret) return ret; ret = device_create_file(ucma_misc.this_device, &dev_attr_abi_version); if (ret) { pr_err("rdma_ucm: couldn't create abi_version attr\n"); goto err1; } ucma_ctl_table_hdr = register_net_sysctl(&init_net, "net/rdma_ucm", ucma_ctl_table); if (!ucma_ctl_table_hdr) { pr_err("rdma_ucm: couldn't register sysctl paths\n"); ret = -ENOMEM; goto err2; } ret = ib_register_client(&rdma_cma_client); if (ret) goto err3; return 0; err3: unregister_net_sysctl_table(ucma_ctl_table_hdr); err2: device_remove_file(ucma_misc.this_device, &dev_attr_abi_version); err1: misc_deregister(&ucma_misc); return ret; } static void __exit ucma_cleanup(void) { ib_unregister_client(&rdma_cma_client); unregister_net_sysctl_table(ucma_ctl_table_hdr); device_remove_file(ucma_misc.this_device, &dev_attr_abi_version); 
	misc_deregister(&ucma_misc);
}

module_init(ucma_init);
module_exit(ucma_cleanup);
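The ucma_write() dispatcher above frames every userspace request as a struct rdma_ucm_cmd_hdr (cmd, in, out) followed by a command-specific payload, written to the misc device registered as "infiniband/rdma_cm". As a rough, hedged illustration of that framing only: the sketch below issues a single RDMA_USER_CM_CMD_CREATE_ID request against /dev/infiniband/rdma_cm. It assumes the rdma_ucm_cmd_hdr, rdma_ucm_create_id, rdma_ucm_create_id_resp and RDMA_PS_TCP definitions from the <rdma/rdma_user_cm.h> UAPI header as commonly shipped; verify against your installed kernel headers. Real applications should use librdmacm (rdma_create_id() and friends) rather than this raw ABI.

/*
 * Hedged userspace sketch of the write() framing parsed by ucma_write().
 * Not a supported interface; shown only to illustrate the header layout
 * handled by this driver. Prefer librdmacm in real code.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <rdma/rdma_user_cm.h>	/* assumed UAPI header; check your system */

int main(void)
{
	struct {
		struct rdma_ucm_cmd_hdr hdr;	/* cmd index, in/out payload sizes */
		struct rdma_ucm_create_id cmd;	/* command-specific payload */
	} req;
	struct rdma_ucm_create_id_resp resp;	/* filled in by ucma_create_id() */
	int fd = open("/dev/infiniband/rdma_cm", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&req, 0, sizeof(req));
	memset(&resp, 0, sizeof(resp));
	req.hdr.cmd = RDMA_USER_CM_CMD_CREATE_ID;
	req.hdr.in = sizeof(req.cmd);		/* bytes following the header */
	req.hdr.out = sizeof(resp);		/* response buffer size */
	req.cmd.uid = 1;			/* opaque cookie echoed back in events */
	req.cmd.response = (__u64)(uintptr_t)&resp;
	req.cmd.ps = RDMA_PS_TCP;		/* mapped to IB_QPT_RC by ucma_get_qp_type() */

	/* ucma_write() copies the header, then dispatches to ucma_create_id() */
	if (write(fd, &req, sizeof(req)) != (ssize_t)sizeof(req))
		perror("write");
	else
		printf("created cm_id handle %u\n", resp.id);

	close(fd);
	return 0;
}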
3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3773 3774 
3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 3841 3842 3843 3844 3845 3846 3847 3848 3849 3850 3851 3852 3853 3854 3855 3856 3857 3858 3859 3860 3861 3862 3863 3864 3865 3866 3867 3868 3869 3870 3871 3872 3873 3874 3875 3876 3877 3878 3879 3880 3881 3882 3883 3884 3885 3886 3887 3888 3889 3890 3891 3892 3893 3894 3895 3896 3897 3898 3899 3900 3901 3902 3903 3904 3905 3906 3907 3908 3909 3910 3911 3912 3913 3914 3915 3916 3917 3918 3919 3920 3921 3922 3923 3924 3925 3926 3927 3928 3929 3930 3931 3932 3933 3934 3935 3936 3937 3938 3939 3940 3941 3942 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 3953 3954 3955 3956 3957 3958 3959 3960 3961 3962 3963 3964 3965 3966 3967 3968 3969 3970 3971 3972 3973 3974 3975 3976 3977 3978 3979 3980 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009 4010 4011 4012 4013 4014 4015 4016 4017 4018 4019 4020 4021 4022 4023 4024 4025 4026 4027 4028 4029 4030 4031 4032 4033 4034 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046 4047 4048 4049 4050 4051 4052 4053 4054 4055 4056 4057 4058 4059 4060 4061 4062 4063 4064 4065 4066 4067 4068 4069 4070 4071 4072 4073 4074 4075 4076 4077 4078 4079 4080 4081 4082 4083 4084 4085 4086 4087 4088 4089 4090 4091 4092 4093 4094 4095 4096 4097 4098 4099 4100 4101 4102 4103 4104 4105 4106 4107 4108 4109 4110 4111 4112 4113 4114 4115 4116 4117 4118 4119 4120 4121 4122 4123 4124 4125 4126 4127 4128 4129 4130 4131 4132 4133 4134 4135 4136 4137 4138 4139 4140 4141 4142 4143 4144 4145 4146 4147 4148 4149 4150 4151 4152 4153 4154 4155 4156 4157 4158 4159 4160 4161 4162 4163 4164 4165 4166 4167 4168 4169 4170 4171 4172 4173 4174 4175 4176 4177 4178 4179 4180 4181 4182 4183 4184 4185 4186 4187 4188 4189 4190 4191 4192 4193 4194 4195 4196 4197 4198 4199 4200 4201 4202 4203 4204 4205 4206 4207 4208 4209 4210 4211 4212 4213 4214 4215 4216 4217 4218 4219 4220 4221 4222 4223 4224 4225 4226 4227 4228 4229 4230 4231 4232 4233 4234 4235 4236 4237 4238 4239 4240 4241 4242 4243 4244 4245 4246 4247 4248 4249 4250 4251 4252 4253 4254 4255 4256 4257 4258 4259 4260 4261 4262 4263 4264 4265 4266 4267 4268 4269 4270 4271 4272 4273 4274 4275 4276 4277 4278 4279 4280 4281 4282 4283 4284 4285 4286 4287 4288 4289 4290 4291 4292 4293 4294 4295 4296 4297 4298 4299 4300 4301 4302 4303 4304 4305 4306 4307 4308 4309 4310 4311 4312 4313 4314 4315 4316 4317 4318 4319 4320 4321 4322 4323 4324 4325 4326 4327 4328 4329 4330 4331 4332 4333 4334 4335 4336 4337 4338 4339 4340 4341 4342 4343 4344 4345 4346 4347 4348 4349 4350 4351 4352 4353 4354 4355 4356 4357 4358 4359 4360 4361 4362 4363 4364 4365 4366 4367 4368 4369 4370 4371 4372 4373 4374 4375 4376 4377 4378 4379 4380 4381 4382 4383 4384 4385 4386 4387 4388 4389 4390 4391 4392 4393 4394 4395 4396 4397 4398 4399 4400 4401 4402 4403 4404 4405 4406 4407 4408 4409 4410 4411 4412 4413 4414 4415 4416 4417 4418 4419 4420 4421 4422 4423 4424 4425 4426 4427 4428 4429 4430 4431 4432 4433 4434 4435 4436 4437 4438 4439 4440 4441 4442 4443 4444 4445 4446 4447 4448 4449 4450 4451 4452 4453 4454 4455 4456 4457 4458 4459 4460 4461 4462 4463 4464 4465 4466 4467 4468 4469 4470 4471 4472 4473 4474 4475 4476 4477 4478 4479 4480 4481 4482 4483 4484 4485 
4486 4487 4488 4489 4490 4491 4492 4493 4494 4495 4496 4497 4498 4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 4509 4510 4511 4512 4513 4514 4515 4516 4517 4518 4519 4520 4521 4522 4523 4524 4525 4526 4527 4528 4529 4530 4531 4532 4533 4534 4535 4536 4537 4538 4539 4540 4541 4542 4543 4544 4545 4546 4547 4548 4549 4550 4551 4552 4553 4554 4555 4556 4557 4558 4559 4560 4561 4562 4563 4564 4565 4566 4567 4568 4569 4570 4571 4572 4573 4574 4575 4576 4577 4578 4579 4580 4581 4582 4583 4584 4585 4586 4587 4588 4589 4590 4591 4592 4593 4594 4595 4596 4597 4598 4599 4600 4601 4602 4603 4604 4605 4606 4607 4608 4609 4610 4611 4612 4613 4614 4615 4616 4617 4618 4619 4620 4621 4622 4623 4624 4625 4626 4627 4628 4629 4630 4631 4632 4633 4634 4635 4636 4637 4638 4639 4640 4641 4642 4643 4644 4645 4646 4647 4648 4649 4650 4651 4652 4653 4654 4655 4656 4657 4658 4659 4660 4661 4662 4663 4664 4665 4666 4667 4668 4669 4670 4671 4672 4673 4674 4675 4676 4677 4678 4679 4680 4681 4682 4683 4684 4685 4686 4687 4688 4689 4690 4691 4692 4693 4694 4695 4696 4697 4698 4699 4700 4701 4702 4703 4704 4705 4706 4707 4708 4709 4710 4711 4712 4713 4714 4715 4716 4717 4718 4719 4720 4721 4722 4723 4724 4725 4726 4727 4728 4729 4730 4731 4732 4733 4734 4735 4736 4737 4738 4739 4740 4741 4742 4743 4744 4745 4746 4747 4748 4749 4750 4751 4752 4753 4754 4755 4756 4757 4758 4759 4760 4761 4762 4763 4764 4765 4766 4767 4768 4769 4770 4771 4772 4773 4774 4775 4776 4777 4778 4779 4780 4781 4782 4783 4784 4785 4786 4787 4788 4789 4790 4791 4792 4793 4794 4795 4796 4797 4798 4799 4800 4801 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816 4817 4818 4819 4820 4821 4822 4823 4824 4825 4826 4827 4828 4829 4830 4831 4832 4833 4834 4835 4836 4837 4838 4839 4840 4841 4842 4843 4844 4845 4846 4847 4848 4849 4850 4851 4852 4853 4854 4855 4856 4857 4858 4859 4860 4861 4862 4863 4864 4865 4866 4867 4868 4869 4870 4871 4872 4873 4874 4875 4876 4877 4878 4879 4880 4881 4882 4883 4884 4885 4886 4887 4888 4889 4890 4891 4892 4893 4894 4895 4896 4897 4898 4899 4900 4901 4902 4903 4904 4905 4906 4907 4908 4909 4910 4911 4912 4913 4914 4915 4916 4917 4918 4919 4920 4921 4922 4923 4924 4925 4926 4927 4928 4929 4930 4931 4932 4933 4934 4935 4936 4937 4938 4939 4940 4941 4942 4943 4944 4945 4946 4947 4948 4949 4950 4951 4952 4953 4954 4955 4956 4957 4958 4959 4960 4961 4962 4963 4964 4965 4966 4967 4968 4969 4970 4971 4972 4973 4974 4975 4976 4977 4978 4979 4980 4981 4982 4983 4984 4985 4986 4987 4988 4989 4990 4991 4992 4993 4994 4995 4996 4997 4998 4999 5000 5001 5002 5003 5004 5005 5006 5007 5008 5009 5010 5011 5012 5013 5014 5015 5016 5017 5018 5019 5020 5021 5022 5023 5024 5025 5026 5027 5028 5029 5030 5031 5032 5033 5034 5035 5036 5037 5038 5039 5040 5041 5042 5043 5044 5045 5046 5047 5048 5049 5050 5051 5052 5053 5054 5055 5056 5057 5058 5059 5060 5061 5062 5063 5064 5065 5066 5067 5068 5069 5070 5071 5072 5073 5074 5075 5076 5077 5078 5079 5080 5081 5082 5083 5084 5085 5086 5087 5088 5089 5090 5091 5092 5093 5094 5095 5096 5097 5098 5099 5100 5101 5102 5103 5104 5105 5106 5107 5108 5109 5110 5111 5112 5113 5114 5115 5116 5117 5118 5119 5120 5121 5122 5123 5124 5125 5126 5127 5128 5129 5130 5131 5132 5133 5134 5135 5136 5137 5138 5139 5140 5141 5142 5143 5144 5145 5146 5147 5148 5149 5150 5151 5152 5153 5154 5155 5156 5157 5158 5159 5160 5161 5162 5163 5164 5165 5166 5167 5168 5169 5170 5171 5172 5173 5174 5175 5176 5177 5178 5179 5180 5181 5182 5183 5184 5185 5186 5187 5188 5189 5190 5191 5192 5193 5194 5195 5196 
5197 5198 5199 5200 5201 5202 5203 5204 5205 5206 5207 5208 5209 5210 5211 5212 5213 5214 5215 5216 5217 5218 5219 5220 5221 5222 5223 5224 5225 5226 5227 5228 5229 5230 5231 5232 5233 5234 5235 5236 5237 5238 5239 5240 5241 5242 5243 5244 5245 5246 5247 5248 5249 5250 5251 5252 5253 5254 5255 5256 5257 5258 5259 5260 5261 5262 5263 5264 5265 5266 5267 5268 5269 5270 5271 5272 5273 5274 5275 5276 5277 5278 5279 5280 5281 5282 5283 5284 5285 5286 5287 5288 5289 5290 5291 5292 5293 5294 5295 5296 5297 5298 5299 5300 5301 5302 5303 5304 5305 5306 5307 5308 5309 5310 5311 5312 5313 5314 5315 5316 5317 5318 5319 5320 5321 5322 5323 5324 5325 5326 5327 5328 5329 5330 5331 5332 5333 5334 5335 5336 5337 5338 5339 5340 5341 5342 5343 5344 5345 5346 5347 5348 5349 5350 5351 5352 5353 5354 5355 5356 5357 5358 5359 5360 5361 5362 5363 5364 5365 5366 5367 5368 5369 5370 5371 5372 5373 5374 5375 5376 5377 5378 5379 5380 5381 5382 5383 5384 5385 5386 5387 5388 5389 5390 5391 5392 5393 5394 5395 5396 5397 5398 5399 5400 5401 5402 5403 5404 5405 5406 5407 5408 5409 5410 5411 5412 5413 5414 5415 5416 5417 5418 5419 5420 5421 5422 5423 5424 5425 5426 5427 5428 5429 5430 5431 5432 5433 5434 5435 5436 5437 5438 5439 5440 5441 5442 5443 5444 5445 5446 5447 5448 5449 5450 5451 5452 5453 5454 5455 5456 5457 5458 5459 5460 5461 5462 5463 5464 5465 5466 5467 5468 5469 5470 5471 5472 5473 5474 5475 5476 5477 5478 5479 5480 5481 5482 5483 5484 5485 5486 5487 5488 5489 5490 5491 5492 5493 5494 5495 5496 5497 5498 5499 5500 5501 5502 5503 5504 5505 5506 5507 5508 5509 5510 5511 5512 5513 5514 5515 5516 5517 5518 5519 5520 5521 5522 5523 5524 5525 5526 5527 5528 5529 5530 5531 5532 5533 5534 5535 5536 5537 5538 5539 5540 5541 5542 5543 5544 5545 5546 5547 5548 5549 5550 5551 5552 5553 5554 5555 5556 5557 5558 5559 5560 5561 5562 5563 5564 5565 5566 5567 5568 5569 5570 5571 5572 5573 5574 5575 5576 5577 5578 5579 5580 5581 5582 5583 5584 5585 5586 5587 5588 5589 5590 5591 5592 5593 5594 5595 5596 5597 5598 5599 5600 5601 5602 5603 5604 5605 5606 5607 5608 5609 5610 5611 5612 5613 5614 5615 5616 5617 5618 5619 5620 5621 5622 5623 5624 5625 5626 5627 5628 5629 5630 5631 5632 5633 5634 5635 5636 5637 5638 5639 5640 5641 5642 5643 5644 5645 5646 5647 5648 5649 5650 5651 5652 5653 5654 5655 5656 5657 5658 5659 5660 5661 5662 5663 5664 5665 5666 5667 5668 5669 5670 5671 5672 5673 5674 5675 5676 5677 5678 5679 5680 5681 5682 5683 5684 5685 5686 5687 5688 5689 5690 5691 5692 5693 5694 5695 5696 5697 5698 5699 5700 5701 5702 5703 5704 5705 5706 5707 5708 5709 5710 5711 5712 5713 5714 5715 5716 5717 5718 5719 5720 5721 5722 5723 5724 5725 5726 5727 5728 5729 5730 5731 5732 5733 5734 5735 5736 5737 5738 5739 5740 5741 5742 5743 5744 5745 5746 5747 5748 5749 5750 5751 5752 5753 5754 5755 5756 5757 5758 5759 5760 5761 5762 5763 5764 5765 5766 5767 5768 5769 5770 5771 5772 5773 5774 5775 5776 5777 5778 5779 5780 5781 5782 5783 5784 5785 5786 5787 5788 5789 5790 5791 5792 5793 5794 5795 5796 5797 5798 5799 5800 5801 5802 5803 5804 5805 5806 5807 5808 5809 5810 5811 5812 5813 5814 5815 5816 5817 5818 5819 5820 5821 5822 5823 5824 5825 5826 5827 5828 5829 5830 5831 5832 5833 5834 5835 5836 5837 5838 5839 5840 5841 5842 5843 5844 5845 5846 5847 5848 5849 5850 5851 5852 5853 5854 5855 5856 5857 5858 5859 5860 5861 5862 5863 5864 5865 5866 5867 5868 5869 5870 5871 5872 5873 5874 5875 5876 5877 5878 5879 5880 5881 5882 5883 5884 5885 5886 5887 5888 5889 5890 5891 5892 5893 5894 5895 5896 5897 5898 5899 5900 5901 5902 5903 5904 5905 5906 5907 
5908 5909 5910 5911 5912 5913 5914 5915 5916 5917 5918 5919 5920 5921 5922 5923 5924 5925 5926 5927 5928 5929 5930 5931 5932 5933 5934 5935 5936 5937 5938 5939 5940 5941 5942 5943 5944 5945 5946 5947 5948 5949 5950 5951 5952 5953 5954 5955 5956 5957 5958 5959 5960 5961 5962 5963 5964 5965 5966 5967 5968 5969 5970 5971 5972 5973 5974 5975 5976 5977 5978 5979 5980 5981 5982 5983 5984 5985 5986 5987 5988 5989 5990 5991 5992 5993 5994 5995 5996 5997 5998 5999 6000 6001 6002 6003 6004 6005 6006 6007 6008 6009 6010 6011 6012 6013 6014 6015 6016 6017 6018 6019 6020 6021 6022 6023 6024 6025 6026 6027 6028 6029 6030 6031 6032 6033 6034 6035 6036 6037 6038 6039 6040 6041 6042 6043 6044 6045 6046 6047 6048 6049 6050 6051 6052 6053 6054 6055 6056 6057 6058 6059 6060 6061 6062 6063 6064 6065 6066 6067 6068 6069 6070 6071 6072 6073 6074 6075 6076 6077 6078 6079 6080 6081 6082 6083 6084 6085 6086 6087 6088 6089 6090 6091 6092 6093 6094 6095 6096 6097 6098 6099 6100 6101 6102 6103 6104 6105 6106 6107 6108 6109 6110 6111 6112 6113 6114 6115 6116 6117 6118 6119 6120 6121 6122 6123 6124 6125 6126 6127 6128 6129 6130 6131 6132 6133 6134 6135 6136 6137 6138 6139 6140 6141 6142 6143 6144 6145 6146 6147 6148 6149 6150 6151 6152 6153 6154 6155 6156 6157 6158 6159 6160 6161 6162 6163 6164 6165 6166 6167 6168 6169 6170 6171 6172 6173 6174 6175 6176 6177 6178 6179 6180 6181 6182 6183 6184 6185 6186 6187 6188 6189 6190 6191 6192 6193 6194 6195 6196 6197 6198 6199 6200 6201 6202 6203 6204 6205 6206 6207 6208 6209 6210 6211 6212 6213 6214 6215 6216 6217 6218 6219 6220 6221 6222 6223 6224 6225 6226 6227 6228 6229 6230 6231 6232 6233 6234 6235 6236 6237 6238 6239 6240 6241 6242 6243 6244 6245 6246 6247 6248 6249 6250 6251 6252 6253 6254 6255 6256 6257 6258 6259 6260 6261 6262 6263 6264 6265 6266 6267 6268 6269 6270 6271 6272 6273 6274 6275 6276 6277 6278 6279 6280 6281 6282 6283 6284 6285 6286 6287 6288 6289 6290 6291 6292 6293 6294 6295 6296 6297 6298 6299 6300 6301 6302 6303 6304 6305 6306 6307 6308 6309 6310 6311 6312 6313 6314 6315 6316 6317 6318 6319 6320 6321 6322 6323 6324 6325 6326 6327 6328 6329 6330 6331 6332 6333 6334 6335 6336 6337 6338 6339 6340 6341 6342 6343 6344 6345 6346 6347 6348 6349 6350 6351 6352 6353 6354 6355 6356 6357 6358 6359 6360 6361 6362 6363 6364 6365 6366 6367 6368 6369 6370 6371 6372 6373 6374 6375 6376 6377 6378 6379 6380 6381 6382 6383 6384 6385 6386 6387 6388 6389 6390 6391 6392 6393 6394 6395 6396 6397 6398 6399 6400 6401 6402 6403 6404 6405 6406 6407 6408 6409 6410 6411 6412 6413 6414 6415 6416 6417 6418 6419 6420 6421 6422 6423 6424 6425 6426 6427 6428 6429 6430 6431 6432 6433 6434 6435 6436 6437 6438 6439 6440 6441 6442 6443 6444 6445 6446 6447 6448 6449 6450 6451 6452 6453 6454 6455 6456 6457 6458 6459 6460 6461 6462 6463 6464 6465 6466 6467 6468 6469 6470 6471 6472 6473 6474 6475 6476 6477 6478 6479 6480 6481 6482 6483 6484 6485 6486 6487 6488 6489 6490 6491 6492 6493 6494 6495 6496 6497 6498 6499 6500 6501 6502 6503 6504 6505 6506 6507 6508 6509 6510 6511 6512 6513 6514 6515 6516 6517 6518 6519 6520 6521 6522 6523 6524 6525 6526 6527 6528 6529 6530 6531 6532 6533 6534 6535 6536 6537 6538 6539 6540 6541 6542 6543 6544 6545 6546 6547 6548 6549 6550 6551 6552 6553 6554 6555 6556 6557 6558 6559 6560 6561 6562 6563 6564 6565 6566 6567 6568 6569 6570 6571 6572 6573 6574 6575 6576 6577 6578 6579 6580 6581 6582 6583 6584 6585 6586 6587 6588 6589 6590 6591 6592 6593 6594 6595 6596 6597 6598 6599 6600 6601 6602 6603 6604 6605 6606 6607 6608 6609 6610 6611 6612 6613 6614 6615 6616 6617 6618 
6619 6620 6621 6622 6623 6624 6625 6626 6627 6628 6629 6630 6631 6632 6633 6634 6635 6636 6637 6638 6639 6640 6641 6642 6643 6644 6645 6646 6647 6648 6649 6650 6651 6652 6653 6654 6655 6656 6657 6658 6659 6660 6661 6662 6663 6664 6665 6666 6667 6668 6669 6670 6671 6672 6673 6674 6675 6676 6677 6678 6679 6680 6681 6682 6683 6684 6685 6686 6687 6688 6689 6690 6691 6692 6693 6694 6695 6696 6697 6698 6699 6700 6701 6702 6703 6704 6705 6706 6707 6708 6709 6710 6711 6712 6713 6714 6715 6716 6717 6718 6719 6720 6721 6722 6723 6724 6725 6726 6727 6728 6729 6730 6731 6732 6733 6734 6735 6736 6737 6738 6739 6740 6741 6742 6743 6744 6745 6746 6747 6748 6749 6750 6751 6752 6753 6754 6755 6756 6757 6758 6759 6760 6761 6762 6763 6764 6765 6766 6767 6768 6769 6770 6771 6772 6773 6774 6775 6776 6777 6778 6779 6780 6781 6782 6783 6784 6785 6786 6787 6788 6789 6790 6791 6792 6793 6794 6795 6796 6797 6798 6799 6800 6801 6802 6803 6804 6805 6806 6807 6808 6809 6810 6811 6812 6813 6814 6815 6816 6817 6818 6819 6820 6821 6822 6823 6824 6825 6826 6827 6828 6829 6830 6831 6832 6833 6834 6835 6836 6837 6838 6839 6840 6841 6842 6843 6844 6845 6846 6847 6848 6849 6850 6851 6852 6853 6854 6855 6856 6857 6858 6859 6860 6861 6862 6863 6864 6865 6866 6867 6868 6869 6870 6871 6872 6873 6874 6875 6876 6877 6878 6879 6880 6881 6882 6883 6884 6885 6886 6887 6888 6889 6890 6891 6892 6893 6894 6895 6896 6897 6898 6899 6900 6901 6902 6903 6904 6905 6906 6907 6908 6909 6910 6911 6912 6913 6914 6915 6916 6917 6918 6919 6920 6921 6922 6923 6924 6925 6926 6927 6928 6929 6930 6931 6932 6933 6934 6935 6936 6937 6938 6939 6940 6941 6942 6943 6944 6945 6946 6947 6948 6949 6950 6951 6952 6953 6954 6955 6956 6957 6958 6959 6960 6961 6962 6963 6964 6965 6966 6967 6968 6969 6970 6971 6972 6973 6974 6975 6976 6977 6978 6979 6980 6981 6982 6983 6984 6985 6986 6987 6988 6989 6990 6991 6992 6993 6994 6995 6996 6997 6998 6999 7000 7001 7002 7003 7004 7005 7006 7007 7008 7009 7010 7011 7012 7013 7014 7015 7016 7017 7018 7019 7020 7021 7022 7023 7024 7025 7026 7027 7028 7029 7030 7031 7032 7033 7034 7035 7036 7037 7038 7039 7040 7041 7042 7043 7044 7045 7046 7047 7048 7049 7050 7051 7052 7053 7054 7055 7056 7057 7058 7059 7060 7061 7062 7063 7064 7065 7066 7067 7068 7069 7070 7071 7072 7073 7074 7075 7076 7077 7078 7079 7080 7081 7082 7083 7084 7085 7086 7087 7088 7089 7090 7091 7092 7093 7094 7095 7096 7097 7098 7099 7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 7116 7117 7118 7119 7120 7121 7122 7123 7124 7125 7126 7127 7128 7129 7130 7131 7132 7133 7134 7135 7136 7137 7138 7139 7140 7141 7142 7143 7144 7145 7146 7147 7148 7149 7150 7151 7152 7153 7154 7155 7156 7157 7158 7159 7160 7161 7162 7163 7164 7165 7166 7167 7168 7169 7170 7171 7172 7173 7174 7175 7176 7177 7178 7179 7180 7181 7182 7183 7184 7185 7186 7187 7188 7189 7190 7191 7192 7193 7194 7195 7196 7197 7198 7199 7200 7201 7202 7203 7204 7205 7206 7207 7208 7209 7210 7211 7212 7213 7214 7215 7216 7217 7218 7219 7220 7221 7222 7223 7224 7225 7226 7227 7228 7229 7230 7231 7232 7233 7234 7235 7236 7237 7238 7239 7240 7241 7242 7243 7244 7245 7246 7247 7248 7249 7250 7251 7252 7253 7254 7255 7256 7257 7258 7259 7260 7261 7262 7263 7264 7265 7266 7267 7268 7269 7270 7271 7272 7273 7274 7275 7276 7277 7278 7279 7280 7281 7282 7283 7284 7285 7286 7287 7288 7289 7290 7291 7292 7293 7294 7295 7296 7297 7298 7299 7300 7301 7302 7303 7304 7305 7306 7307 7308 7309 7310 7311 7312 7313 7314 7315 7316 7317 7318 7319 7320 7321 7322 7323 7324 7325 7326 7327 7328 7329 
7330 7331 7332 7333 7334 7335 7336 7337 7338 7339 7340 7341 7342 7343 7344 7345 7346 7347 7348 7349 7350 7351 7352 7353 7354 7355 7356 7357 7358 7359 7360 7361 7362 7363 7364 7365 7366 7367 7368 7369 7370 7371 7372 7373 7374 7375 7376 7377 7378 7379 7380 7381 7382 7383 7384 7385 7386 7387 7388 7389 7390 7391 7392 7393 7394 7395 7396 7397 7398 7399 7400 7401 7402 7403 7404 7405 7406 7407 7408 7409 7410 7411 7412 7413 7414 7415 7416 7417 7418 7419 7420 7421 7422 7423 7424 7425 7426 7427 7428 7429 7430 7431 7432 7433 7434 7435 7436 7437 7438 7439 7440 7441 7442 7443 7444 7445 7446 7447 7448 7449 7450 7451 7452 7453 7454 7455 7456 7457 7458 7459 7460 7461 7462 7463 7464 7465 7466 7467 7468 7469 7470 7471 7472 7473 7474 7475 7476 7477 7478 7479 7480 7481 7482 7483 7484 7485 7486 7487 7488 7489 7490 7491 7492 7493 7494 7495 7496 7497 7498 7499 7500 7501 7502 7503 7504 7505 7506 7507 7508 7509 7510 7511 7512 7513 7514 7515 7516 7517 7518 7519 7520 7521 7522 7523 7524 7525 7526 7527 7528 7529 7530 7531 7532 7533 7534 7535 7536 7537 7538 7539 7540 7541 7542 7543 7544 7545 7546 7547 7548 7549 7550 7551 7552 7553 7554 7555 7556 7557 7558 7559 7560 7561 7562 7563 7564 7565 7566 7567 7568 7569 7570 7571 7572 7573 7574 7575 7576 7577 7578 7579 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 7590 7591 7592 7593 7594 7595 7596 7597 7598 7599 7600 7601 7602 7603 7604 7605 7606 7607 7608 7609 7610 7611 7612 7613 7614 7615 7616 7617 7618 7619 7620 7621 7622 7623 7624 7625 7626 7627 7628 7629 7630 7631 7632 7633 7634 7635 7636 7637 7638 7639 7640 7641 7642 7643 7644 7645 7646 7647 7648 7649 7650 7651 7652 7653 7654 7655 7656 7657 7658 7659 7660 7661 7662 7663 7664 7665 7666 7667 7668 7669 7670 7671 7672 7673 7674 7675 7676 7677 7678 7679 7680 7681 7682 7683 7684 7685 7686 7687 7688 7689 7690 7691 7692 7693 7694 7695 7696 7697 7698 7699 7700 7701 7702 7703 7704 7705 7706 7707 7708 7709 7710 7711 7712 7713 7714 7715 7716 7717 7718 7719 7720 7721 7722 7723 7724 7725 7726 7727 7728 7729 7730 7731 7732 7733 7734 7735 7736 7737 7738 7739 7740 7741 7742 7743 7744 7745 7746 7747 7748 7749 7750 7751 7752 7753 7754 7755 7756 7757 7758 7759 7760 7761 7762 7763 7764 7765 7766 7767 7768 7769 7770 7771 7772 7773 7774 7775 7776 7777 7778 7779 7780 7781 7782 7783 7784 7785 7786 7787 7788 7789 7790 7791 7792 7793 7794 7795 7796 7797 7798 7799 7800 7801 7802 7803 7804 7805 7806 7807 7808 7809 7810 7811 7812 7813 7814 7815 7816 7817 7818 7819 7820 7821 7822 7823 7824 7825 7826 7827 7828 7829 7830 7831 7832 7833 7834 7835 7836 7837 7838 7839 7840 7841 7842 7843 7844 7845 7846 7847 7848 7849 7850 7851 7852 7853 7854 7855 7856 7857 7858 7859 7860 7861 7862 7863 7864 7865 7866 7867 7868 7869 7870 7871 7872 7873 7874 7875 7876 7877 7878 7879 7880 7881 7882 7883 7884 7885 7886 7887 7888 7889 7890 7891 7892 7893 7894 7895 7896 7897 7898 7899 7900 7901 7902 7903 7904 7905 7906 7907 7908 7909 7910 7911 7912 7913 7914 7915 7916 7917 7918 7919 7920 7921 7922 7923 7924 7925 7926 7927 7928 7929 7930 7931 7932 7933 7934 7935 7936 7937 7938 7939 7940 7941 7942 7943 7944 7945 7946 7947 7948 7949 7950 7951 7952 7953 7954 7955 7956 7957 7958 7959 7960 7961 7962 7963 7964 7965 7966 7967 7968 7969 7970 7971 7972 7973 7974 7975 7976 7977 7978 7979 7980 7981 7982 7983 7984 7985 7986 7987 7988 7989 7990 7991 7992 7993 7994 7995 7996 7997 7998 7999 8000 8001 8002 8003 8004 8005 8006 8007 8008 8009 8010 8011 8012 8013 8014 8015 8016 8017 8018 8019 8020 8021 8022 8023 8024 8025 8026 8027 8028 8029 8030 8031 8032 8033 8034 8035 8036 8037 8038 8039 8040 
8041 8042 8043 8044 8045 8046 8047 8048 8049 8050 8051 8052 8053 8054 8055 8056 8057 8058 8059 8060 8061 8062 8063 8064 8065 8066 8067 8068 8069 8070 8071 8072 8073 8074 8075 8076 8077 8078 8079 8080 8081 8082 8083 8084 8085 8086 8087 8088 8089 8090 8091 8092 8093 8094 8095 8096 8097 8098 8099 8100 8101 8102 8103 8104 8105 8106 8107 8108 8109 8110 8111 8112 8113 8114 8115 8116 8117 8118 8119 8120 8121 8122 8123 8124 8125 8126 8127 8128 8129 8130 8131 8132 8133 8134 8135 8136 8137 8138 8139 8140 8141 8142 8143 8144 8145 8146 8147 8148 8149 8150 8151 8152 8153 8154 8155 8156 8157 8158 8159 8160 8161 8162 8163 8164 8165 8166 8167 8168 8169 8170 8171 8172 8173 8174 8175 8176 8177 8178 8179 8180 8181 8182 8183 8184 8185 8186 8187 8188 8189 8190 8191 8192 8193 8194 8195 8196 8197 8198 8199 8200 8201 8202 8203 8204 8205 8206 8207 8208 8209 8210 8211 8212 8213 8214 8215 8216 8217 8218 8219 8220 8221 8222 8223 8224 8225 8226 8227 8228 8229 8230 8231 8232 8233 8234 8235 8236 8237 8238 8239 8240 8241 8242 8243 8244 8245 8246 8247 8248 8249 8250 8251 8252 8253 8254 8255 8256 8257 8258 8259 8260 8261 8262 8263 8264 8265 8266 8267 8268 8269 8270 8271 8272 8273 8274 8275 8276 8277 8278 8279 8280 8281 8282 8283 8284 8285 8286 8287 8288 8289 8290 8291 8292 8293 8294 8295 8296 8297 8298 8299 8300 8301 8302 8303 8304 8305 8306 8307 8308 8309 8310 8311 8312 8313 8314 8315 8316 8317 8318 8319 8320 8321 8322 8323 8324 8325 8326 8327 8328 8329 8330 8331 8332 8333 8334 8335 8336 8337 8338 8339 8340 8341 8342 8343 8344 8345 8346 8347 8348 8349 8350 8351 8352 8353 8354 8355 8356 8357 8358 8359 8360 8361 8362 8363 8364 8365 8366 8367 8368 8369 8370 8371 8372 8373 8374 8375 8376 8377 8378 8379 8380 8381 8382 8383 8384 8385 8386 8387 8388 8389 8390 8391 8392 8393 8394 8395 8396 8397 8398 8399 8400 8401 8402 8403 8404 8405 8406 8407 8408 8409 8410 8411 8412 8413 8414 8415 8416 8417 8418 8419 8420 8421 8422 8423 8424 8425 8426 8427 8428 8429 8430 8431 8432 8433 8434 8435 8436 8437 8438 8439 8440 8441 8442 8443 8444 8445 8446 8447 8448 8449 8450 8451 8452 8453 8454 8455 8456 8457 8458 8459 8460 8461 8462 8463 8464 8465 8466 8467 8468 8469 8470 8471 8472 8473 8474 8475 8476 8477 8478 8479 8480 8481 8482 8483 8484 8485 8486 8487 8488 8489 8490 8491 8492 8493 8494 8495 8496 8497 8498 8499 8500 8501 8502 8503 8504 8505 8506 8507 8508 8509 8510 8511 8512 8513 8514 8515 8516 8517 8518 8519 8520 8521 8522 8523 8524 8525 8526 8527 8528 8529 8530 8531 8532 8533 8534 8535 8536 8537 8538 8539 8540 8541 8542 8543 8544 8545 8546 8547 8548 8549 8550 8551 8552 8553 8554 8555 8556 8557 8558 8559 8560 8561 8562 8563 8564 8565 8566 8567 8568 8569 8570 8571 8572 8573 8574 8575 8576 8577 8578 8579 8580 8581 8582 8583 8584 8585 8586 8587 8588 8589 8590 8591 8592 8593 8594 8595 8596 8597 8598 8599 8600 8601 8602 8603 8604 8605 8606 8607 8608 8609 8610 8611 8612 8613 8614 8615 8616 8617 8618 8619 8620 8621 8622 8623 8624 8625 8626 8627 8628 8629 8630 8631 8632 8633 8634 8635 8636 8637 8638 8639 8640 8641 8642 8643 8644 8645 8646 8647 8648 8649 8650 8651 8652 8653 8654 8655 8656 8657 8658 8659 8660 8661 8662 8663 8664 8665 8666 8667 8668 8669 8670 8671 8672 8673 8674 8675 8676 8677 8678 8679 8680 8681 8682 8683 8684 8685 8686 8687 8688 8689 8690 8691 8692 8693 8694 8695 8696 8697 8698 8699 8700 8701 8702 8703 8704 8705 8706 8707 8708 8709 8710 8711 8712 8713 8714 8715 8716 8717 8718 8719 8720 8721 8722 8723 8724 8725 8726 8727 8728 8729 8730 8731 8732 8733 8734 8735 8736 8737 8738 8739 8740 8741 8742 8743 8744 8745 8746 8747 8748 8749 8750 8751 
8752 8753 8754 8755 8756 8757 8758 8759 8760 8761 8762 8763 8764 8765 8766 8767 8768 8769 8770 8771 8772 8773 8774 8775 8776 8777 8778 8779 8780 8781 8782 8783 8784 8785 8786 8787 8788 8789 8790 8791 8792 8793 8794 8795 8796 8797 8798 8799 8800 8801 8802 8803 8804 8805 8806 8807 8808 8809 8810 8811 8812 8813 8814 8815 8816 8817 8818 8819 8820 8821 8822 8823 8824 8825 8826 8827 8828 8829 8830 8831 8832 8833 8834 8835 8836 8837 8838 8839 8840 8841 8842 8843 8844 8845 8846 8847 8848 8849 8850 8851 8852 8853 8854 8855 8856 8857 8858 8859 8860 8861 8862 8863 8864 8865 8866 8867 8868 8869 8870 8871 8872 8873 8874 8875 8876 8877 8878 8879 8880 8881 8882 8883 8884 8885 8886 8887 8888 8889 8890 8891 8892 8893 8894 8895 8896 8897 8898 8899 8900 8901 8902 8903 8904 8905 8906 8907 8908 8909 8910 8911 8912 8913 8914 8915 8916 8917 8918 8919 8920 8921 8922 8923 8924 8925 8926 8927 8928 8929 8930 8931 8932 8933 8934 8935 8936 8937 8938 8939 8940 8941 8942 8943 8944 8945 8946 8947 8948 8949 8950 8951 8952 8953 8954 8955 8956 8957 8958 8959 8960 8961 8962 8963 8964 8965 8966 8967 8968 8969 8970 8971 8972 8973 8974 8975 8976 8977 8978 8979 8980 8981 8982 8983 8984 8985 8986 8987 8988 8989 8990 8991 8992 8993 8994 8995 8996 8997 8998 8999 9000 9001 9002 9003 9004 9005 9006 9007 9008 9009 9010 9011 9012 9013 9014 9015 9016 9017 9018 9019 9020 9021 9022 9023 9024 9025 9026 9027 9028 9029 9030 9031 9032 9033 9034 9035 9036 9037 9038 9039 9040 9041 9042 9043 9044 9045 9046 9047 9048 9049 9050 9051 9052 9053 9054 9055 9056 9057 9058 9059 9060 9061 9062 9063 9064 9065 9066 9067 9068 9069 9070 9071 9072 9073 9074 9075 9076 9077 9078 9079 9080 9081 9082 9083 9084 9085 9086 9087 9088 9089 9090 9091 9092 9093 9094 9095 9096 9097 9098 9099 9100 9101 9102 9103 9104 9105 9106 9107 9108 9109 9110 9111 9112 9113 9114 9115 9116 9117 9118 9119 9120 9121 9122 9123 9124 9125 9126 9127 9128 9129 9130 9131 9132 9133 9134 9135 9136 9137 9138 9139 9140 9141 9142 9143 9144 9145 9146 9147 9148 9149 9150 9151 9152 9153 9154 9155 9156 9157 9158 9159 9160 9161 9162 9163 9164 9165 9166 9167 9168 9169 9170 9171 9172 9173 9174 9175 9176 9177 9178 9179 9180 9181 9182 9183 9184 9185 9186 9187 9188 9189 9190 9191 9192 9193 9194 9195 9196 9197 9198 9199 9200 9201 9202 9203 9204 9205 9206 9207 9208 9209 9210 9211 9212 9213 9214 9215 9216 9217 9218 9219 9220 9221 9222 9223 9224 9225 9226 9227 9228 9229 9230 9231 9232 9233 9234 9235 9236 9237 9238 9239 9240 9241 9242 9243 9244 9245 9246 9247 9248 9249 9250 9251 9252 9253 9254 9255 9256 9257 9258 9259 9260 9261 9262 9263 9264 9265 9266 9267 9268 9269 9270 9271 9272 9273 9274 9275 9276 9277 9278 9279 9280 9281 9282 9283 9284 9285 9286 9287 9288 9289 9290 9291 9292 9293 9294 9295 9296 9297 9298 9299 9300 9301 9302 9303 9304 9305 9306 9307 9308 9309 9310 9311 9312 9313 9314 9315 9316 9317 9318 9319 9320 9321 9322 9323 9324 9325 9326 9327 9328 9329 9330 9331 9332 9333 9334 9335 9336 9337 9338 9339 9340 9341 9342 9343 9344 9345 9346 9347 9348 9349 9350 9351 9352 9353 9354 9355 9356 9357 9358 9359 9360 9361 9362 9363 9364 9365 9366 9367 9368 9369 9370 9371 9372 9373 9374 9375 9376 9377 9378 9379 9380 9381 9382 9383 9384 9385 9386 9387 9388 9389 9390 9391 9392 9393 9394 9395 9396 9397 9398 9399 9400 9401 9402 9403 9404 9405 9406 9407 9408 9409 9410 9411 9412 9413 9414 9415 9416 9417 9418 9419 9420 9421 9422 9423 9424 9425 9426 9427 9428 9429 9430 9431 9432 9433 9434 9435 9436 9437 9438 9439 9440 9441 9442 9443 9444 9445 9446 9447 9448 9449 9450 9451 9452 9453 9454 9455 9456 9457 9458 9459 9460 9461 9462 
9463 9464 9465 9466 9467 9468 9469 9470 9471 9472 9473 9474 9475 9476 9477 9478 9479 9480 9481 9482 9483 9484 9485 9486 9487 9488 9489 9490 9491 9492 9493 9494 9495 9496 9497 9498 9499 9500 9501 9502 9503 9504 9505 9506 9507 9508 9509 9510 9511 9512 9513 9514 9515 9516 9517 9518 9519 9520 9521 9522 9523 9524 9525 9526 9527 9528 9529 9530 9531 9532 9533 9534 9535 9536 9537 9538 9539 9540 9541 9542 9543 9544 9545 9546 9547 9548 9549 9550 9551 9552 9553 9554 9555 9556 9557 9558 9559 9560 9561 9562 9563 9564 9565 9566 9567 9568 9569 9570 9571 9572 9573 9574 9575 9576 9577 9578 9579 9580 9581 9582 9583 9584 9585 9586 9587 9588 9589 9590 9591 9592 9593 9594 9595 9596 9597 9598 9599 9600 9601 9602 9603 9604 9605 9606 9607 9608 9609 9610 9611 9612 9613 9614 9615 9616 9617 9618 9619 9620 9621 9622 9623 9624 9625 9626 9627 9628 9629 9630 9631 9632 9633 9634 9635 9636 9637 9638 9639 9640 9641 9642 9643 9644 9645 9646 9647 9648 9649 9650 9651 9652 9653 9654 9655 9656 9657 9658 9659 9660 9661 9662 9663 9664 9665 9666 9667 9668 9669 9670 9671 9672 9673 9674 9675 9676 9677 9678 9679 9680 9681 9682 9683 9684 9685 9686 9687 9688 9689 9690 9691 9692 9693 9694 9695 9696 9697 9698 9699 9700 9701 9702 9703 9704 9705 9706 9707 9708 9709 9710 9711 9712 9713 9714 9715 9716 9717 9718 9719 9720 9721 9722 9723 9724 9725 9726 9727 9728 9729 9730 9731 9732 9733 9734 9735 9736 9737 9738 9739 9740 9741 9742 9743 9744 9745 9746 9747 9748 9749 9750 9751 9752 9753 9754 9755 9756 9757 9758 9759 9760 9761 9762 9763 9764 9765 9766 9767 9768 9769 9770 9771 9772 9773 9774 9775 9776 9777 9778 9779 9780 9781 9782 9783 9784 9785 9786 9787 9788 9789 9790 9791 9792 9793 9794 9795 9796 9797 9798 9799 9800 9801 9802 9803 9804 9805 9806 9807 9808 9809 9810 9811 9812 9813 9814 9815 9816 9817 9818 9819 9820 9821 9822 9823 9824 9825 9826 9827 9828 9829 9830 9831 9832 9833 9834 9835 9836 9837 9838 9839 9840 9841 9842 9843 9844 9845 9846 9847 9848 9849 9850 9851 9852 9853 9854 9855 9856 9857 9858 9859 9860 9861 9862 9863 9864 9865 9866 9867 9868 9869 9870 9871 9872 9873 9874 9875 9876 9877 9878 9879 9880 9881 9882 9883 9884 9885 9886 9887 9888 9889 9890 9891 9892 9893 9894 9895 9896 9897 9898 9899 9900 9901 9902 9903 9904 9905 9906 9907 9908 9909 9910 9911 9912 9913 9914 9915 9916 9917 9918 9919 9920 9921 9922 9923 9924 9925 9926 9927 9928 9929 9930 9931 9932 9933 9934 9935 9936 9937 9938 9939 9940 9941 9942 9943 9944 9945 9946 9947 9948 9949 9950 9951 9952 9953 9954 9955 9956 9957 9958 9959 9960 9961 9962 9963 9964 9965 9966 9967 9968 9969 9970 9971 9972 9973 9974 9975 9976 9977 9978 9979 9980 9981 9982 9983 9984 9985 9986 9987 9988 9989 9990 9991 9992 9993 9994 9995 9996 9997 9998 9999 10000 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010 10011 10012 10013 10014 10015 10016 10017 10018 10019 10020 10021 10022 10023 10024 10025 10026 10027 10028 10029 10030 10031 10032 10033 10034 10035 10036 10037 10038 10039 10040 10041 10042 10043 10044 10045 10046 10047 10048 10049 10050 10051 10052 10053 10054 10055 10056 10057 10058 10059 10060 10061 10062 10063 10064 10065 10066 10067 10068 10069 10070 10071 10072 10073 10074 10075 10076 10077 10078 10079 10080 10081 10082 10083 10084 10085 10086 10087 10088 10089 10090 10091 10092 10093 10094 10095 10096 10097 10098 10099 10100 10101 10102 10103 10104 10105 10106 10107 10108 10109 10110 10111 10112 10113 10114 10115 10116 10117 10118 10119 10120 10121 10122 10123 10124 10125 10126 10127 10128 10129 10130 10131 10132 10133 10134 10135 10136 10137 10138 10139 10140 10141 10142 10143 10144 
10145 10146 10147 10148 10149 10150 10151 10152 10153 10154 10155 10156 10157 10158 10159 10160 10161 10162 10163 10164 10165 10166 10167 10168 10169 10170 10171 10172 10173 10174 10175 10176 10177 10178 10179 10180 10181 10182 10183 10184 10185 10186 10187 10188 10189 10190 10191 10192 10193 10194 10195 10196 10197 10198 10199 10200 10201 10202 10203 10204 10205 10206 10207 10208 10209 10210 10211 10212 10213 10214 10215 10216 10217 10218 10219 10220 10221 10222 10223 10224 10225 10226 10227 10228 10229 10230 10231 10232 10233 10234 10235 10236 10237 10238 10239 10240 10241 10242 10243 10244 10245 10246 10247 10248 10249 10250 10251 10252 10253 10254 10255 10256 10257 10258 10259 10260 10261 10262 10263 10264 10265 10266 10267 10268 10269 10270 10271 10272 10273 10274 10275 10276 10277 10278 10279 10280 10281 10282 10283 10284 10285 10286 10287 10288 10289 10290 10291 10292 10293 10294 10295 10296 10297 10298 10299 10300 10301 10302 10303 10304 10305 10306 10307 10308 10309 10310 10311 10312 10313 10314 10315 10316 10317 10318 10319 10320 10321 10322 10323 10324 10325 10326 10327 10328 10329 10330 10331 10332 10333 10334 10335 10336 10337 10338 10339 10340 10341 10342 10343 10344 10345 10346 10347 10348 10349 10350 10351 10352 10353 10354 10355 10356 10357 10358 10359 10360 10361 10362 10363 10364 10365 10366 10367 10368 10369 10370 10371 10372 10373 10374 10375 10376 10377 10378 10379 10380 10381 10382 10383 10384 10385 10386 10387 10388 10389 10390 10391 10392 10393 10394 10395 10396 10397 10398 10399 10400 10401 10402 10403 10404 10405 10406 10407 10408 10409 10410 10411 10412 10413 10414 10415 10416 10417 10418 10419 10420 10421 10422 10423 10424 10425 10426 10427 10428 10429 10430 10431 10432 10433 10434 10435 10436 10437 10438 10439 10440 10441 10442 10443 10444 10445 10446 10447 10448 10449 10450 10451 10452 10453 10454 10455 10456 10457 10458 10459 10460 10461 10462 10463 10464 10465 10466 10467 10468 10469 10470 10471 10472 10473 10474 10475 10476 10477 10478 10479 10480 10481 10482 10483 10484 10485 10486 10487 10488 10489 10490 10491 10492 10493 10494 10495 10496 10497 10498 10499 10500 10501 10502 10503 10504 10505 10506 10507 10508 10509 10510 10511 10512 10513 10514 10515 10516 10517 10518 10519 10520 10521 10522 10523 10524 10525 10526 10527 10528 10529 10530 10531 10532 10533 10534 10535 10536 10537 10538 10539 10540 10541 10542 10543 10544 10545 10546 10547 10548 10549 10550 10551 10552 10553 10554 10555 10556 10557 10558 10559 10560 10561 10562 10563 10564 10565 10566 10567 10568 10569 10570 10571 10572 10573 10574 10575 10576 10577 10578 10579 10580 10581 10582 10583 10584 10585 10586 10587 10588 10589 10590 10591 10592 10593 10594 10595 10596 10597 10598 10599 10600 10601 10602 10603 10604 10605 10606 10607 10608 10609 10610 10611 10612 10613 10614 10615 10616 10617 10618 10619 10620 10621 10622 10623 10624 10625 10626 10627 10628 10629 10630 10631 10632 10633 10634 10635 10636 10637 10638 10639 10640 10641 10642 10643 10644 10645 10646 10647 10648 10649 10650 10651 10652 10653 10654 10655 10656 10657 10658 10659 10660 10661 10662 10663 10664 10665 10666 10667 10668 10669 10670 10671 10672 10673 10674 10675 10676 10677 10678 10679 10680 10681 10682 10683 10684 10685 10686 10687 10688 10689 10690 10691 10692 10693 10694 10695 10696 10697 10698 10699 10700 10701 10702 10703 10704 10705 10706 10707 10708 10709 10710 10711 10712 10713 10714 10715 10716 10717 10718 10719 10720 10721 10722 10723 10724 10725 10726 10727 10728 10729 10730 10731 10732 10733 10734 10735 10736 
10737 10738 10739 10740 10741 10742 10743 10744 10745 10746 10747 10748 10749 10750 10751 10752 10753 10754 10755 10756 10757 10758 10759 10760 10761 10762 10763 10764 10765 10766 10767 10768 10769 10770 10771 10772 10773 10774 10775 10776 10777 10778 10779 10780 10781 10782 10783 10784 10785 10786 10787 10788 10789 10790 10791 10792 10793 10794 10795 10796 10797 10798 10799 10800 10801 10802 10803 10804 10805 10806 10807 10808 10809 10810 10811 10812 10813 10814 10815 10816 10817 10818 10819 10820 10821 10822 10823 10824 10825 10826 10827 10828 10829 10830 10831 10832 10833 10834 10835 10836 10837 10838 10839 10840 10841 10842 10843 10844 10845 10846 10847 10848 10849 10850 10851 10852 10853 10854 10855 10856 10857 10858 10859 10860 10861 10862 10863 10864 10865 10866 10867 10868 10869 10870 10871 10872 10873 10874 10875 10876 10877 10878 10879 10880 10881 10882 10883 10884 10885 10886 10887 10888 10889 10890 10891 10892 10893 10894 10895 10896 10897 10898 10899 10900 10901 10902 10903 10904 10905 10906 10907 10908 10909 10910 10911 10912 10913 10914 10915 10916 10917 10918 10919 10920 10921 10922 10923 10924 10925 10926 10927 10928 10929 10930 10931 10932 10933 10934 10935 10936 10937 10938 10939 10940 10941 10942 10943 10944 10945 10946 10947 10948 10949 10950 10951 10952 10953 10954 10955 10956 10957 10958 10959 10960 10961 10962 10963 10964 10965 10966 10967 10968 10969 10970 10971 10972 10973 10974 10975 10976 10977 10978 10979 10980 10981 10982 10983 10984 10985 10986 10987 10988 10989 10990 10991 10992 10993 10994 10995 10996 10997 10998 10999 11000 11001 11002 11003 11004 11005 11006 11007 11008 11009 11010 11011 11012 11013 11014 11015 11016 11017 11018 11019 11020 11021 11022 11023 11024 11025 11026 11027 11028 11029 11030 11031 11032 11033 11034 11035 11036 11037 11038 11039 11040 11041 11042 11043 11044 11045 11046 11047 11048 11049 11050 11051 11052 11053 11054 11055 11056 11057 11058 11059 11060 11061 11062 11063 11064 11065 11066 11067 11068 11069 11070 11071 11072 11073 11074 11075 11076 11077 11078 11079 11080 11081 11082 11083 11084 11085 11086 11087 11088 11089 11090 11091 11092 11093 11094 11095 11096 11097 11098 11099 11100 11101 11102 11103 11104 11105 11106 11107 11108 11109 11110 11111 11112 11113 11114 11115 11116 11117 11118 11119 11120 11121 11122 11123 11124 11125 11126 11127 11128 11129 11130 11131 11132 11133 11134 11135 11136 11137 11138 11139 11140 11141 11142 11143 11144 11145 11146 11147 11148 11149 11150 11151 11152 11153 11154 11155 11156 11157 11158 11159 11160 11161 11162 11163 11164 11165 11166 11167 11168 11169 11170 11171 11172 11173 11174 11175 11176 11177 11178 11179 11180 11181 11182 11183 11184 11185 11186 11187 11188 11189 11190 11191 11192 11193 11194 11195 11196 11197 11198 11199 11200 11201 11202 11203 11204 11205 11206 11207 11208 11209 11210 11211 11212 11213 11214 11215 11216 11217 11218 11219 11220 11221 11222 11223 11224 11225 11226 11227 11228 11229 11230 11231 11232 11233 11234 11235 11236 11237 11238 11239 11240 11241 11242 11243 11244 11245 11246 11247 11248 11249 11250 11251 11252 11253 11254 11255 11256 11257 11258 11259 11260 11261 11262 11263 11264 11265 11266 11267 11268 11269 11270 11271 11272 11273 11274 11275 11276 11277 11278 11279 11280 11281 11282 11283 11284 11285 11286 11287 11288 11289 11290 11291 11292 11293 11294 11295 11296 11297 11298 11299 11300 11301 11302 11303 11304 11305 11306 11307 11308 11309 11310 11311 11312 11313 11314 11315 11316 11317 11318 11319 11320 11321 11322 11323 11324 11325 11326 11327 11328 
11329 11330 11331 11332 11333 11334 11335 11336 11337 11338 11339 11340 11341 11342 11343 11344 11345 11346 11347 11348 11349 11350 11351 11352 11353 11354 11355 11356 11357 11358 11359 11360 11361 11362 11363 11364 11365 11366 11367 11368 11369 11370 11371 11372 11373 11374 11375 11376 11377 11378 11379 11380 11381 11382 11383 11384 11385 11386 11387 11388 11389 11390 11391 11392 11393 11394 11395 11396 11397 11398 11399 11400 11401 11402 11403 11404 11405 11406 11407 11408 11409 11410 11411 11412 11413 11414 11415 11416 11417 11418 11419 11420 11421 11422 11423 11424 11425 11426 11427 11428 11429 11430 11431 11432 11433 11434 11435 11436 11437 11438 11439 11440 11441 11442 11443 11444 11445 11446 11447 11448 11449 11450 11451 11452 11453 11454 11455 11456 11457 11458 11459 11460 11461 11462 11463 11464 11465 11466 11467 11468 11469 11470 11471 11472 11473 11474 11475 11476 11477 11478 11479 11480 11481 11482 11483 11484 11485 11486 11487 11488 11489 11490 11491 11492 11493 11494 11495 11496 11497 11498 11499 11500 11501 11502 11503 11504 11505 11506 11507 11508 11509 11510 11511 11512 11513 11514 11515 11516 11517 11518 11519 11520 11521 11522 11523 11524 11525 11526 11527 11528 11529 11530 11531 11532 11533 11534 11535 11536 11537 11538 11539 11540 11541 11542 11543 11544 11545 11546 11547 11548 11549 11550 11551 11552 11553 11554 11555 11556 11557 11558 11559 11560 11561 11562 11563 11564 11565 11566 11567 11568 11569 11570 11571 11572 11573 11574 11575 11576 11577 11578 11579 11580 11581 11582 11583 11584 11585 11586 11587 11588 11589 11590 11591 11592 11593 11594 11595 11596 11597 11598 11599 11600 11601 11602 11603 11604 11605 11606 11607 11608 11609 11610 11611 11612 11613 11614 11615 11616 11617 11618 11619 11620 11621 11622 11623 11624 11625 11626 11627 11628 11629 11630 11631 11632 11633 11634 11635 11636 11637 11638 11639 11640 11641 11642 11643 11644 11645 11646 11647 11648 11649 11650 11651 11652 11653 11654 11655 11656 11657 11658 11659 11660 11661 11662 11663 11664 11665 11666 11667 11668 11669 11670 11671 11672 11673 11674 11675 11676 11677 11678 11679 11680 11681 11682 11683 11684 11685 11686 11687 11688 11689 11690 11691 11692 11693 11694 11695 11696 11697 11698 11699 11700 11701 11702 11703 11704 11705 11706 11707 11708 11709 11710 11711 11712 11713 11714 11715 11716 11717 11718 11719 11720 11721 11722 11723 11724 11725 11726 11727 11728 11729 11730 11731 11732 11733 11734 11735 11736 11737 11738 11739 11740 11741 11742 11743 11744 11745 11746 11747 11748 11749 11750 11751 11752 11753 11754 11755 11756 11757 11758 11759 11760 11761 11762 11763 11764 11765 11766 11767 11768 11769 11770 11771 11772 11773 11774 11775 11776 11777 11778 11779 11780 11781 11782 11783 11784 11785 11786 11787 11788 11789 11790 11791 11792 11793 11794 11795 11796 11797 11798 11799 11800 11801 11802 11803 11804 11805 11806 11807 11808 11809 11810 11811 11812 11813 11814 11815 11816 11817 11818 11819 11820 11821 11822 11823 11824 11825 11826 11827 11828 11829 11830 11831 11832 11833 11834 11835 11836 11837 11838 11839 11840 11841 11842 11843 11844 11845 11846 11847 11848 11849 11850 11851 11852 11853 11854 11855 11856 11857 11858 11859 11860 11861 11862 11863 11864 11865 11866 11867 11868 11869 11870 11871 11872 11873 11874 11875 11876 11877 11878 11879 11880 11881 11882 11883 11884 11885 11886 11887 11888 11889 11890 11891 11892 11893 11894 11895 11896 11897 11898 11899 11900 11901 11902 11903 11904 11905 11906 11907 11908 11909 11910 11911 11912 11913 11914 11915 11916 11917 11918 11919 11920 
// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright (c) 2007-2009 Patrick McHardy <kaber@trash.net>
 *
 * Development of this code funded by Astaro AG (http://www.astaro.com/)
 */

#include <linux/module.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/skbuff.h>
#include <linux/netlink.h>
#include <linux/vmalloc.h>
#include <linux/rhashtable.h>
#include <linux/audit.h>
#include <linux/netfilter.h>
#include <linux/netfilter/nfnetlink.h>
#include <linux/netfilter/nf_tables.h>
#include <net/netfilter/nf_flow_table.h>
#include <net/netfilter/nf_tables_core.h>
#include <net/netfilter/nf_tables.h>
#include <net/netfilter/nf_tables_offload.h>
#include <net/net_namespace.h>
#include <net/sock.h>

#define NFT_MODULE_AUTOLOAD_LIMIT (MODULE_NAME_LEN - sizeof("nft-expr-255-"))
#define NFT_SET_MAX_ANONLEN 16

/* limit compaction to avoid huge kmalloc/krealloc sizes.
 */
#define NFT_MAX_SET_NELEMS ((2048 - sizeof(struct nft_trans_elem)) / sizeof(struct nft_trans_one_elem))

unsigned int nf_tables_net_id __read_mostly;

static LIST_HEAD(nf_tables_expressions);
static LIST_HEAD(nf_tables_objects);
static LIST_HEAD(nf_tables_flowtables);
static LIST_HEAD(nf_tables_gc_list);
static DEFINE_SPINLOCK(nf_tables_destroy_list_lock);
static DEFINE_SPINLOCK(nf_tables_gc_list_lock);

enum {
	NFT_VALIDATE_SKIP = 0,
	NFT_VALIDATE_NEED,
	NFT_VALIDATE_DO,
};

static struct rhltable nft_objname_ht;

static u32 nft_chain_hash(const void *data, u32 len, u32 seed);
static u32 nft_chain_hash_obj(const void *data, u32 len, u32 seed);
static int nft_chain_hash_cmp(struct rhashtable_compare_arg *, const void *);

static u32 nft_objname_hash(const void *data, u32 len, u32 seed);
static u32 nft_objname_hash_obj(const void *data, u32 len, u32 seed);
static int nft_objname_hash_cmp(struct rhashtable_compare_arg *, const void *);

static const struct rhashtable_params nft_chain_ht_params = {
	.head_offset = offsetof(struct nft_chain, rhlhead),
	.key_offset = offsetof(struct nft_chain, name),
	.hashfn = nft_chain_hash,
	.obj_hashfn = nft_chain_hash_obj,
	.obj_cmpfn = nft_chain_hash_cmp,
	.automatic_shrinking = true,
};

static const struct rhashtable_params nft_objname_ht_params = {
	.head_offset = offsetof(struct nft_object, rhlhead),
	.key_offset = offsetof(struct nft_object, key),
	.hashfn = nft_objname_hash,
	.obj_hashfn = nft_objname_hash_obj,
	.obj_cmpfn = nft_objname_hash_cmp,
	.automatic_shrinking = true,
};

struct nft_audit_data {
	struct nft_table *table;
	int entries;
	int op;
	struct list_head list;
};

static const u8 nft2audit_op[NFT_MSG_MAX] = { // enum nf_tables_msg_types
	[NFT_MSG_NEWTABLE] = AUDIT_NFT_OP_TABLE_REGISTER,
	[NFT_MSG_GETTABLE] = AUDIT_NFT_OP_INVALID,
	[NFT_MSG_DELTABLE] = AUDIT_NFT_OP_TABLE_UNREGISTER,
	[NFT_MSG_NEWCHAIN] = AUDIT_NFT_OP_CHAIN_REGISTER,
	[NFT_MSG_GETCHAIN] = AUDIT_NFT_OP_INVALID,
	[NFT_MSG_DELCHAIN] = AUDIT_NFT_OP_CHAIN_UNREGISTER,
	[NFT_MSG_NEWRULE] = AUDIT_NFT_OP_RULE_REGISTER,
	[NFT_MSG_GETRULE] = AUDIT_NFT_OP_INVALID,
	[NFT_MSG_DELRULE] = AUDIT_NFT_OP_RULE_UNREGISTER,
	[NFT_MSG_NEWSET] = AUDIT_NFT_OP_SET_REGISTER,
	[NFT_MSG_GETSET] = AUDIT_NFT_OP_INVALID,
	[NFT_MSG_DELSET] = AUDIT_NFT_OP_SET_UNREGISTER,
	[NFT_MSG_NEWSETELEM] = AUDIT_NFT_OP_SETELEM_REGISTER,
	[NFT_MSG_GETSETELEM] = AUDIT_NFT_OP_INVALID,
	[NFT_MSG_DELSETELEM] = AUDIT_NFT_OP_SETELEM_UNREGISTER,
	[NFT_MSG_NEWGEN] = AUDIT_NFT_OP_GEN_REGISTER,
	[NFT_MSG_GETGEN] = AUDIT_NFT_OP_INVALID,
	[NFT_MSG_TRACE] = AUDIT_NFT_OP_INVALID,
	[NFT_MSG_NEWOBJ] = AUDIT_NFT_OP_OBJ_REGISTER,
	[NFT_MSG_GETOBJ] = AUDIT_NFT_OP_INVALID,
	[NFT_MSG_DELOBJ] = AUDIT_NFT_OP_OBJ_UNREGISTER,
	[NFT_MSG_GETOBJ_RESET] = AUDIT_NFT_OP_OBJ_RESET,
	[NFT_MSG_NEWFLOWTABLE] = AUDIT_NFT_OP_FLOWTABLE_REGISTER,
	[NFT_MSG_GETFLOWTABLE] = AUDIT_NFT_OP_INVALID,
	[NFT_MSG_DELFLOWTABLE] = AUDIT_NFT_OP_FLOWTABLE_UNREGISTER,
	[NFT_MSG_GETSETELEM_RESET] = AUDIT_NFT_OP_SETELEM_RESET,
};

static void nft_validate_state_update(struct nft_table *table, u8 new_validate_state)
{
	switch (table->validate_state) {
	case NFT_VALIDATE_SKIP:
		WARN_ON_ONCE(new_validate_state == NFT_VALIDATE_DO);
		break;
	case NFT_VALIDATE_NEED:
		break;
	case NFT_VALIDATE_DO:
		if (new_validate_state == NFT_VALIDATE_NEED)
			return;
	}

	table->validate_state = new_validate_state;
}

static bool nft_chain_vstate_valid(const struct nft_ctx *ctx,
				   const struct nft_chain *chain)
{
	const struct nft_base_chain *base_chain;
	enum nft_chain_types type;
	u8 hooknum;

	if (WARN_ON_ONCE(!nft_is_base_chain(ctx->chain)))
		return false;
	base_chain = nft_base_chain(ctx->chain);
	hooknum = base_chain->ops.hooknum;
	type = base_chain->type->type;

	/* chain is already validated for this call depth */
	if (chain->vstate.depth >= ctx->level &&
	    chain->vstate.hook_mask[type] & BIT(hooknum))
		return true;

	return false;
}

static void nf_tables_trans_destroy_work(struct work_struct *w);

static void nft_trans_gc_work(struct work_struct *work);
static DECLARE_WORK(trans_gc_work, nft_trans_gc_work);

static void nft_ctx_init(struct nft_ctx *ctx,
			 struct net *net,
			 const struct sk_buff *skb,
			 const struct nlmsghdr *nlh,
			 u8 family,
			 struct nft_table *table,
			 struct nft_chain *chain,
			 const struct nlattr * const *nla)
{
	ctx->net = net;
	ctx->family = family;
	ctx->level = 0;
	ctx->table = table;
	ctx->chain = chain;
	ctx->nla = nla;
	ctx->portid = NETLINK_CB(skb).portid;
	ctx->report = nlmsg_report(nlh);
	ctx->flags = nlh->nlmsg_flags;
	ctx->seq = nlh->nlmsg_seq;

	bitmap_zero(ctx->reg_inited, NFT_REG32_NUM);
}

static struct nft_trans *nft_trans_alloc(const struct nft_ctx *ctx,
					 int msg_type, u32 size)
{
	struct nft_trans *trans;

	trans = kzalloc(size, GFP_KERNEL);
	if (trans == NULL)
		return NULL;

	INIT_LIST_HEAD(&trans->list);
	trans->msg_type = msg_type;

	trans->net = ctx->net;
	trans->table = ctx->table;
	trans->seq = ctx->seq;
	trans->flags = ctx->flags;
	trans->report = ctx->report;

	return trans;
}

static struct nft_trans_binding *nft_trans_get_binding(struct nft_trans *trans)
{
	switch (trans->msg_type) {
	case NFT_MSG_NEWCHAIN:
	case NFT_MSG_NEWSET:
		return container_of(trans, struct nft_trans_binding, nft_trans);
	}

	return NULL;
}

static void nft_trans_list_del(struct nft_trans *trans)
{
	struct nft_trans_binding *trans_binding;

	list_del(&trans->list);

	trans_binding = nft_trans_get_binding(trans);
	if (trans_binding)
		list_del(&trans_binding->binding_list);
}

static void nft_trans_destroy(struct nft_trans *trans)
{
	nft_trans_list_del(trans);
	kfree(trans);
}

static void __nft_set_trans_bind(const struct nft_ctx *ctx, struct nft_set *set,
				 bool bind)
{
	struct nftables_pernet *nft_net;
	struct net *net = ctx->net;
	struct nft_trans *trans;

	if (!nft_set_is_anonymous(set))
		return;

	nft_net = nft_pernet(net);
	list_for_each_entry_reverse(trans, &nft_net->commit_list, list) {
		switch (trans->msg_type) {
		case NFT_MSG_NEWSET:
			if (nft_trans_set(trans) == set)
				nft_trans_set_bound(trans) = bind;
			break;
		case NFT_MSG_NEWSETELEM:
			if (nft_trans_elem_set(trans) == set)
				nft_trans_elem_set_bound(trans) = bind;
			break;
		}
	}
}

static void nft_set_trans_bind(const struct nft_ctx *ctx, struct nft_set *set)
{
	return __nft_set_trans_bind(ctx, set, true);
}

static void nft_set_trans_unbind(const struct nft_ctx *ctx, struct nft_set *set)
{
	return __nft_set_trans_bind(ctx, set, false);
}

static void __nft_chain_trans_bind(const struct nft_ctx *ctx,
				   struct nft_chain *chain, bool bind)
{
	struct nftables_pernet *nft_net;
	struct net *net = ctx->net;
	struct nft_trans *trans;

	if (!nft_chain_binding(chain))
		return;

	nft_net = nft_pernet(net);
	list_for_each_entry_reverse(trans, &nft_net->commit_list, list) {
		switch (trans->msg_type) {
		case NFT_MSG_NEWCHAIN:
			if (nft_trans_chain(trans) == chain)
				nft_trans_chain_bound(trans) = bind;
			break;
		case NFT_MSG_NEWRULE:
			if (nft_trans_rule_chain(trans) == chain)
				nft_trans_rule_bound(trans) = bind;
			break;
		}
	}
}

static void nft_chain_trans_bind(const struct nft_ctx *ctx,
				 struct nft_chain *chain)
{
	__nft_chain_trans_bind(ctx, chain, true);
}

int nf_tables_bind_chain(const struct nft_ctx *ctx, struct nft_chain *chain)
{
	if (!nft_chain_binding(chain))
		return 0;
(nft_chain_binding(ctx->chain)) return -EOPNOTSUPP; if (chain->bound) return -EBUSY; if (!nft_use_inc(&chain->use)) return -EMFILE; chain->bound = true; nft_chain_trans_bind(ctx, chain); return 0; } void nf_tables_unbind_chain(const struct nft_ctx *ctx, struct nft_chain *chain) { __nft_chain_trans_bind(ctx, chain, false); } static int nft_netdev_register_hooks(struct net *net, struct list_head *hook_list) { struct nf_hook_ops *ops; struct nft_hook *hook; int err, j; j = 0; list_for_each_entry(hook, hook_list, list) { list_for_each_entry(ops, &hook->ops_list, list) { err = nf_register_net_hook(net, ops); if (err < 0) goto err_register; j++; } } return 0; err_register: list_for_each_entry(hook, hook_list, list) { list_for_each_entry(ops, &hook->ops_list, list) { if (j-- <= 0) break; nf_unregister_net_hook(net, ops); } } return err; } static void nft_netdev_hook_free_ops(struct nft_hook *hook) { struct nf_hook_ops *ops, *next; list_for_each_entry_safe(ops, next, &hook->ops_list, list) { list_del(&ops->list); kfree(ops); } } static void nft_netdev_hook_free(struct nft_hook *hook) { nft_netdev_hook_free_ops(hook); kfree(hook); } static void __nft_netdev_hook_free_rcu(struct rcu_head *rcu) { struct nft_hook *hook = container_of(rcu, struct nft_hook, rcu); nft_netdev_hook_free(hook); } static void nft_netdev_hook_free_rcu(struct nft_hook *hook) { call_rcu(&hook->rcu, __nft_netdev_hook_free_rcu); } static void nft_netdev_unregister_hooks(struct net *net, struct list_head *hook_list, bool release_netdev) { struct nft_hook *hook, *next; struct nf_hook_ops *ops; list_for_each_entry_safe(hook, next, hook_list, list) { list_for_each_entry(ops, &hook->ops_list, list) nf_unregister_net_hook(net, ops); if (release_netdev) { list_del(&hook->list); nft_netdev_hook_free_rcu(hook); } } } static int nf_tables_register_hook(struct net *net, const struct nft_table *table, struct nft_chain *chain) { struct nft_base_chain *basechain; const struct nf_hook_ops *ops; if (table->flags & NFT_TABLE_F_DORMANT || !nft_is_base_chain(chain)) return 0; basechain = nft_base_chain(chain); ops = &basechain->ops; if (basechain->type->ops_register) return basechain->type->ops_register(net, ops); if (nft_base_chain_netdev(table->family, basechain->ops.hooknum)) return nft_netdev_register_hooks(net, &basechain->hook_list); return nf_register_net_hook(net, &basechain->ops); } static void __nf_tables_unregister_hook(struct net *net, const struct nft_table *table, struct nft_chain *chain, bool release_netdev) { struct nft_base_chain *basechain; const struct nf_hook_ops *ops; if (table->flags & NFT_TABLE_F_DORMANT || !nft_is_base_chain(chain)) return; basechain = nft_base_chain(chain); ops = &basechain->ops; if (basechain->type->ops_unregister) return basechain->type->ops_unregister(net, ops); if (nft_base_chain_netdev(table->family, basechain->ops.hooknum)) nft_netdev_unregister_hooks(net, &basechain->hook_list, release_netdev); else nf_unregister_net_hook(net, &basechain->ops); } static void nf_tables_unregister_hook(struct net *net, const struct nft_table *table, struct nft_chain *chain) { return __nf_tables_unregister_hook(net, table, chain, false); } static bool nft_trans_collapse_set_elem_allowed(const struct nft_trans_elem *a, const struct nft_trans_elem *b) { /* NB: the ->bound equality check is defensive, at this time we only merge * a new nft_trans_elem transaction request with the transaction tail * element, but a->bound != b->bound would imply a NEWRULE transaction * is queued in-between. 
	 *
	 * The set check is mandatory, the NFT_MAX_SET_NELEMS check prevents
	 * huge krealloc() requests.
	 */
	return a->set == b->set &&
	       a->bound == b->bound &&
	       a->nelems < NFT_MAX_SET_NELEMS;
}

static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
					struct nft_trans_elem *tail,
					struct nft_trans_elem *trans)
{
	unsigned int nelems, old_nelems = tail->nelems;
	struct nft_trans_elem *new_trans;

	if (!nft_trans_collapse_set_elem_allowed(tail, trans))
		return false;

	/* "cannot happen", at this time userspace element add
	 * requests always allocate a new transaction element.
	 *
	 * This serves as a reminder to adjust the list_add_tail
	 * logic below in case this ever changes.
	 */
	if (WARN_ON_ONCE(trans->nelems != 1))
		return false;

	if (check_add_overflow(old_nelems, trans->nelems, &nelems))
		return false;

	/* krealloc might free tail which invalidates list pointers */
	list_del_init(&tail->nft_trans.list);

	new_trans = krealloc(tail, struct_size(tail, elems, nelems), GFP_KERNEL);
	if (!new_trans) {
		list_add_tail(&tail->nft_trans.list, &nft_net->commit_list);
		return false;
	}

	/*
	 * new_trans->nft_trans.list contains garbage, but
	 * list_add_tail() doesn't care.
	 */
	new_trans->nelems = nelems;
	new_trans->elems[old_nelems] = trans->elems[0];
	list_add_tail(&new_trans->nft_trans.list, &nft_net->commit_list);

	return true;
}

static bool nft_trans_try_collapse(struct nftables_pernet *nft_net,
				   struct nft_trans *trans)
{
	struct nft_trans *tail;

	if (list_empty(&nft_net->commit_list))
		return false;

	tail = list_last_entry(&nft_net->commit_list, struct nft_trans, list);

	if (tail->msg_type != trans->msg_type)
		return false;

	switch (trans->msg_type) {
	case NFT_MSG_NEWSETELEM:
	case NFT_MSG_DELSETELEM:
		return nft_trans_collapse_set_elem(nft_net,
						   nft_trans_container_elem(tail),
						   nft_trans_container_elem(trans));
	}

	return false;
}

static void nft_trans_commit_list_add_tail(struct net *net, struct nft_trans *trans)
{
	struct nftables_pernet *nft_net = nft_pernet(net);
	struct nft_trans_binding *binding;
	struct nft_trans_set *trans_set;

	list_add_tail(&trans->list, &nft_net->commit_list);

	binding = nft_trans_get_binding(trans);
	if (!binding)
		return;

	switch (trans->msg_type) {
	case NFT_MSG_NEWSET:
		trans_set = nft_trans_container_set(trans);

		if (!nft_trans_set_update(trans) &&
		    nft_set_is_anonymous(nft_trans_set(trans)))
			list_add_tail(&binding->binding_list, &nft_net->binding_list);

		list_add_tail(&trans_set->list_trans_newset, &nft_net->commit_set_list);
		break;
	case NFT_MSG_NEWCHAIN:
		if (!nft_trans_chain_update(trans) &&
		    nft_chain_binding(nft_trans_chain(trans)))
			list_add_tail(&binding->binding_list, &nft_net->binding_list);
		break;
	}
}

static void nft_trans_commit_list_add_elem(struct net *net, struct nft_trans *trans)
{
	struct nftables_pernet *nft_net = nft_pernet(net);

	WARN_ON_ONCE(trans->msg_type != NFT_MSG_NEWSETELEM &&
		     trans->msg_type != NFT_MSG_DELSETELEM);

	if (nft_trans_try_collapse(nft_net, trans)) {
		kfree(trans);
		return;
	}

	nft_trans_commit_list_add_tail(net, trans);
}

static int nft_trans_table_add(struct nft_ctx *ctx, int msg_type)
{
	struct nft_trans *trans;

	trans = nft_trans_alloc(ctx, msg_type, sizeof(struct nft_trans_table));
	if (trans == NULL)
		return -ENOMEM;

	if (msg_type == NFT_MSG_NEWTABLE)
		nft_activate_next(ctx->net, ctx->table);

	nft_trans_commit_list_add_tail(ctx->net, trans);
	return 0;
}

static int nft_deltable(struct nft_ctx *ctx)
{
	int err;

	err = nft_trans_table_add(ctx, NFT_MSG_DELTABLE);
	if (err < 0)
		return err;

	nft_deactivate_next(ctx->net, ctx->table);
	return err;
}

static struct nft_trans *
nft_trans_alloc_chain(const struct nft_ctx *ctx, int msg_type) { struct nft_trans_chain *trans_chain; struct nft_trans *trans; trans = nft_trans_alloc(ctx, msg_type, sizeof(struct nft_trans_chain)); if (!trans) return NULL; trans_chain = nft_trans_container_chain(trans); INIT_LIST_HEAD(&trans_chain->nft_trans_binding.binding_list); trans_chain->chain = ctx->chain; return trans; } static struct nft_trans *nft_trans_chain_add(struct nft_ctx *ctx, int msg_type) { struct nft_trans *trans; trans = nft_trans_alloc_chain(ctx, msg_type); if (trans == NULL) return ERR_PTR(-ENOMEM); if (msg_type == NFT_MSG_NEWCHAIN) { nft_activate_next(ctx->net, ctx->chain); if (ctx->nla[NFTA_CHAIN_ID]) { nft_trans_chain_id(trans) = ntohl(nla_get_be32(ctx->nla[NFTA_CHAIN_ID])); } } nft_trans_commit_list_add_tail(ctx->net, trans); return trans; } static int nft_delchain(struct nft_ctx *ctx) { struct nft_trans *trans; trans = nft_trans_chain_add(ctx, NFT_MSG_DELCHAIN); if (IS_ERR(trans)) return PTR_ERR(trans); nft_use_dec(&ctx->table->use); nft_deactivate_next(ctx->net, ctx->chain); return 0; } void nft_rule_expr_activate(const struct nft_ctx *ctx, struct nft_rule *rule) { struct nft_expr *expr; expr = nft_expr_first(rule); while (nft_expr_more(rule, expr)) { if (expr->ops->activate) expr->ops->activate(ctx, expr); expr = nft_expr_next(expr); } } void nft_rule_expr_deactivate(const struct nft_ctx *ctx, struct nft_rule *rule, enum nft_trans_phase phase) { struct nft_expr *expr; expr = nft_expr_first(rule); while (nft_expr_more(rule, expr)) { if (expr->ops->deactivate) expr->ops->deactivate(ctx, expr, phase); expr = nft_expr_next(expr); } } static int nf_tables_delrule_deactivate(struct nft_ctx *ctx, struct nft_rule *rule) { /* You cannot delete the same rule twice */ if (nft_is_active_next(ctx->net, rule)) { nft_deactivate_next(ctx->net, rule); nft_use_dec(&ctx->chain->use); return 0; } return -ENOENT; } static struct nft_trans *nft_trans_rule_add(struct nft_ctx *ctx, int msg_type, struct nft_rule *rule) { struct nft_trans *trans; trans = nft_trans_alloc(ctx, msg_type, sizeof(struct nft_trans_rule)); if (trans == NULL) return NULL; if (msg_type == NFT_MSG_NEWRULE && ctx->nla[NFTA_RULE_ID] != NULL) { nft_trans_rule_id(trans) = ntohl(nla_get_be32(ctx->nla[NFTA_RULE_ID])); } nft_trans_rule(trans) = rule; nft_trans_rule_chain(trans) = ctx->chain; nft_trans_commit_list_add_tail(ctx->net, trans); return trans; } static int nft_delrule(struct nft_ctx *ctx, struct nft_rule *rule) { struct nft_flow_rule *flow; struct nft_trans *trans; int err; trans = nft_trans_rule_add(ctx, NFT_MSG_DELRULE, rule); if (trans == NULL) return -ENOMEM; if (ctx->chain->flags & NFT_CHAIN_HW_OFFLOAD) { flow = nft_flow_rule_create(ctx->net, rule); if (IS_ERR(flow)) { nft_trans_destroy(trans); return PTR_ERR(flow); } nft_trans_flow_rule(trans) = flow; } err = nf_tables_delrule_deactivate(ctx, rule); if (err < 0) { nft_trans_destroy(trans); return err; } nft_rule_expr_deactivate(ctx, rule, NFT_TRANS_PREPARE); return 0; } static int nft_delrule_by_chain(struct nft_ctx *ctx) { struct nft_rule *rule; int err; list_for_each_entry(rule, &ctx->chain->rules, list) { if (!nft_is_active_next(ctx->net, rule)) continue; err = nft_delrule(ctx, rule); if (err < 0) return err; } return 0; } static int __nft_trans_set_add(const struct nft_ctx *ctx, int msg_type, struct nft_set *set, const struct nft_set_desc *desc) { struct nft_trans_set *trans_set; struct nft_trans *trans; trans = nft_trans_alloc(ctx, msg_type, sizeof(struct nft_trans_set)); if (trans == NULL) return 
-ENOMEM; trans_set = nft_trans_container_set(trans); INIT_LIST_HEAD(&trans_set->nft_trans_binding.binding_list); INIT_LIST_HEAD(&trans_set->list_trans_newset); if (msg_type == NFT_MSG_NEWSET && ctx->nla[NFTA_SET_ID] && !desc) { nft_trans_set_id(trans) = ntohl(nla_get_be32(ctx->nla[NFTA_SET_ID])); nft_activate_next(ctx->net, set); } nft_trans_set(trans) = set; if (desc) { nft_trans_set_update(trans) = true; nft_trans_set_gc_int(trans) = desc->gc_int; nft_trans_set_timeout(trans) = desc->timeout; nft_trans_set_size(trans) = desc->size; } nft_trans_commit_list_add_tail(ctx->net, trans); return 0; } static int nft_trans_set_add(const struct nft_ctx *ctx, int msg_type, struct nft_set *set) { return __nft_trans_set_add(ctx, msg_type, set, NULL); } static int nft_mapelem_deactivate(const struct nft_ctx *ctx, struct nft_set *set, const struct nft_set_iter *iter, struct nft_elem_priv *elem_priv) { struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); if (!nft_set_elem_active(ext, iter->genmask)) return 0; nft_set_elem_change_active(ctx->net, set, ext); nft_setelem_data_deactivate(ctx->net, set, elem_priv); return 0; } struct nft_set_elem_catchall { struct list_head list; struct rcu_head rcu; struct nft_elem_priv *elem; }; static void nft_map_catchall_deactivate(const struct nft_ctx *ctx, struct nft_set *set) { u8 genmask = nft_genmask_next(ctx->net); struct nft_set_elem_catchall *catchall; struct nft_set_ext *ext; list_for_each_entry(catchall, &set->catchall_list, list) { ext = nft_set_elem_ext(set, catchall->elem); if (!nft_set_elem_active(ext, genmask)) continue; nft_set_elem_change_active(ctx->net, set, ext); nft_setelem_data_deactivate(ctx->net, set, catchall->elem); break; } } static void nft_map_deactivate(const struct nft_ctx *ctx, struct nft_set *set) { struct nft_set_iter iter = { .genmask = nft_genmask_next(ctx->net), .type = NFT_ITER_UPDATE, .fn = nft_mapelem_deactivate, }; set->ops->walk(ctx, set, &iter); WARN_ON_ONCE(iter.err); nft_map_catchall_deactivate(ctx, set); } static int nft_delset(const struct nft_ctx *ctx, struct nft_set *set) { int err; err = nft_trans_set_add(ctx, NFT_MSG_DELSET, set); if (err < 0) return err; if (set->flags & (NFT_SET_MAP | NFT_SET_OBJECT)) nft_map_deactivate(ctx, set); nft_deactivate_next(ctx->net, set); nft_use_dec(&ctx->table->use); return err; } static int nft_trans_obj_add(struct nft_ctx *ctx, int msg_type, struct nft_object *obj) { struct nft_trans *trans; trans = nft_trans_alloc(ctx, msg_type, sizeof(struct nft_trans_obj)); if (trans == NULL) return -ENOMEM; if (msg_type == NFT_MSG_NEWOBJ) nft_activate_next(ctx->net, obj); nft_trans_obj(trans) = obj; nft_trans_commit_list_add_tail(ctx->net, trans); return 0; } static int nft_delobj(struct nft_ctx *ctx, struct nft_object *obj) { int err; err = nft_trans_obj_add(ctx, NFT_MSG_DELOBJ, obj); if (err < 0) return err; nft_deactivate_next(ctx->net, obj); nft_use_dec(&ctx->table->use); return err; } static struct nft_trans * nft_trans_flowtable_add(struct nft_ctx *ctx, int msg_type, struct nft_flowtable *flowtable) { struct nft_trans *trans; trans = nft_trans_alloc(ctx, msg_type, sizeof(struct nft_trans_flowtable)); if (trans == NULL) return ERR_PTR(-ENOMEM); if (msg_type == NFT_MSG_NEWFLOWTABLE) nft_activate_next(ctx->net, flowtable); INIT_LIST_HEAD(&nft_trans_flowtable_hooks(trans)); nft_trans_flowtable(trans) = flowtable; nft_trans_commit_list_add_tail(ctx->net, trans); return trans; } static int nft_delflowtable(struct nft_ctx *ctx, struct nft_flowtable *flowtable) { struct nft_trans *trans; trans 
= nft_trans_flowtable_add(ctx, NFT_MSG_DELFLOWTABLE, flowtable); if (IS_ERR(trans)) return PTR_ERR(trans); nft_deactivate_next(ctx->net, flowtable); nft_use_dec(&ctx->table->use); return 0; } static void __nft_reg_track_clobber(struct nft_regs_track *track, u8 dreg) { int i; for (i = track->regs[dreg].num_reg; i > 0; i--) __nft_reg_track_cancel(track, dreg - i); } static void __nft_reg_track_update(struct nft_regs_track *track, const struct nft_expr *expr, u8 dreg, u8 num_reg) { track->regs[dreg].selector = expr; track->regs[dreg].bitwise = NULL; track->regs[dreg].num_reg = num_reg; } void nft_reg_track_update(struct nft_regs_track *track, const struct nft_expr *expr, u8 dreg, u8 len) { unsigned int regcount; int i; __nft_reg_track_clobber(track, dreg); regcount = DIV_ROUND_UP(len, NFT_REG32_SIZE); for (i = 0; i < regcount; i++, dreg++) __nft_reg_track_update(track, expr, dreg, i); } EXPORT_SYMBOL_GPL(nft_reg_track_update); void nft_reg_track_cancel(struct nft_regs_track *track, u8 dreg, u8 len) { unsigned int regcount; int i; __nft_reg_track_clobber(track, dreg); regcount = DIV_ROUND_UP(len, NFT_REG32_SIZE); for (i = 0; i < regcount; i++, dreg++) __nft_reg_track_cancel(track, dreg); } EXPORT_SYMBOL_GPL(nft_reg_track_cancel); void __nft_reg_track_cancel(struct nft_regs_track *track, u8 dreg) { track->regs[dreg].selector = NULL; track->regs[dreg].bitwise = NULL; track->regs[dreg].num_reg = 0; } EXPORT_SYMBOL_GPL(__nft_reg_track_cancel); /* * Tables */ static struct nft_table *nft_table_lookup(const struct net *net, const struct nlattr *nla, u8 family, u8 genmask, u32 nlpid) { struct nftables_pernet *nft_net; struct nft_table *table; if (nla == NULL) return ERR_PTR(-EINVAL); nft_net = nft_pernet(net); list_for_each_entry_rcu(table, &nft_net->tables, list, lockdep_is_held(&nft_net->commit_mutex)) { if (!nla_strcmp(nla, table->name) && table->family == family && nft_active_genmask(table, genmask)) { if (nft_table_has_owner(table) && nlpid && table->nlpid != nlpid) return ERR_PTR(-EPERM); return table; } } return ERR_PTR(-ENOENT); } static struct nft_table *nft_table_lookup_byhandle(const struct net *net, const struct nlattr *nla, int family, u8 genmask, u32 nlpid) { struct nftables_pernet *nft_net; struct nft_table *table; nft_net = nft_pernet(net); list_for_each_entry(table, &nft_net->tables, list) { if (be64_to_cpu(nla_get_be64(nla)) == table->handle && table->family == family && nft_active_genmask(table, genmask)) { if (nft_table_has_owner(table) && nlpid && table->nlpid != nlpid) return ERR_PTR(-EPERM); return table; } } return ERR_PTR(-ENOENT); } static inline u64 nf_tables_alloc_handle(struct nft_table *table) { return ++table->hgenerator; } static const struct nft_chain_type *chain_type[NFPROTO_NUMPROTO][NFT_CHAIN_T_MAX]; static const struct nft_chain_type * __nft_chain_type_get(u8 family, enum nft_chain_types type) { if (family >= NFPROTO_NUMPROTO || type >= NFT_CHAIN_T_MAX) return NULL; return chain_type[family][type]; } static const struct nft_chain_type * __nf_tables_chain_type_lookup(const struct nlattr *nla, u8 family) { const struct nft_chain_type *type; int i; for (i = 0; i < NFT_CHAIN_T_MAX; i++) { type = __nft_chain_type_get(family, i); if (!type) continue; if (!nla_strcmp(nla, type->name)) return type; } return NULL; } struct nft_module_request { struct list_head list; char module[MODULE_NAME_LEN]; bool done; }; #ifdef CONFIG_MODULES __printf(2, 3) int nft_request_module(struct net *net, const char *fmt, ...) 
{ char module_name[MODULE_NAME_LEN]; struct nftables_pernet *nft_net; struct nft_module_request *req; va_list args; int ret; va_start(args, fmt); ret = vsnprintf(module_name, MODULE_NAME_LEN, fmt, args); va_end(args); if (ret >= MODULE_NAME_LEN) return 0; nft_net = nft_pernet(net); list_for_each_entry(req, &nft_net->module_list, list) { if (!strcmp(req->module, module_name)) { if (req->done) return 0; /* A request to load this module already exists. */ return -EAGAIN; } } req = kmalloc(sizeof(*req), GFP_KERNEL); if (!req) return -ENOMEM; req->done = false; strscpy(req->module, module_name, MODULE_NAME_LEN); list_add_tail(&req->list, &nft_net->module_list); return -EAGAIN; } EXPORT_SYMBOL_GPL(nft_request_module); #endif static void lockdep_nfnl_nft_mutex_not_held(void) { #ifdef CONFIG_PROVE_LOCKING if (debug_locks) WARN_ON_ONCE(lockdep_nfnl_is_held(NFNL_SUBSYS_NFTABLES)); #endif } static const struct nft_chain_type * nf_tables_chain_type_lookup(struct net *net, const struct nlattr *nla, u8 family, bool autoload) { const struct nft_chain_type *type; type = __nf_tables_chain_type_lookup(nla, family); if (type != NULL) return type; lockdep_nfnl_nft_mutex_not_held(); #ifdef CONFIG_MODULES if (autoload) { if (nft_request_module(net, "nft-chain-%u-%.*s", family, nla_len(nla), (const char *)nla_data(nla)) == -EAGAIN) return ERR_PTR(-EAGAIN); } #endif return ERR_PTR(-ENOENT); } static unsigned int nft_base_seq(const struct net *net) { return READ_ONCE(net->nft.base_seq); } static __be16 nft_base_seq_be16(const struct net *net) { return htons(nft_base_seq(net) & 0xffff); } static const struct nla_policy nft_table_policy[NFTA_TABLE_MAX + 1] = { [NFTA_TABLE_NAME] = { .type = NLA_STRING, .len = NFT_TABLE_MAXNAMELEN - 1 }, [NFTA_TABLE_FLAGS] = { .type = NLA_U32 }, [NFTA_TABLE_HANDLE] = { .type = NLA_U64 }, [NFTA_TABLE_USERDATA] = { .type = NLA_BINARY, .len = NFT_USERDATA_MAXLEN } }; static int nf_tables_fill_table_info(struct sk_buff *skb, struct net *net, u32 portid, u32 seq, int event, u32 flags, int family, const struct nft_table *table) { struct nlmsghdr *nlh; nlh = nfnl_msg_put(skb, portid, seq, nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event), flags, family, NFNETLINK_V0, nft_base_seq_be16(net)); if (!nlh) goto nla_put_failure; if (nla_put_string(skb, NFTA_TABLE_NAME, table->name) || nla_put_be32(skb, NFTA_TABLE_USE, htonl(table->use)) || nla_put_be64(skb, NFTA_TABLE_HANDLE, cpu_to_be64(table->handle), NFTA_TABLE_PAD)) goto nla_put_failure; if (event == NFT_MSG_DELTABLE || event == NFT_MSG_DESTROYTABLE) { nlmsg_end(skb, nlh); return 0; } if (nla_put_be32(skb, NFTA_TABLE_FLAGS, htonl(table->flags & NFT_TABLE_F_MASK))) goto nla_put_failure; if (nft_table_has_owner(table) && nla_put_be32(skb, NFTA_TABLE_OWNER, htonl(table->nlpid))) goto nla_put_failure; if (table->udata) { if (nla_put(skb, NFTA_TABLE_USERDATA, table->udlen, table->udata)) goto nla_put_failure; } nlmsg_end(skb, nlh); return 0; nla_put_failure: nlmsg_trim(skb, nlh); return -1; } struct nftnl_skb_parms { bool report; }; #define NFT_CB(skb) (*(struct nftnl_skb_parms*)&((skb)->cb)) static void nft_notify_enqueue(struct sk_buff *skb, bool report, struct list_head *notify_list) { NFT_CB(skb).report = report; list_add_tail(&skb->list, notify_list); } static void nf_tables_table_notify(const struct nft_ctx *ctx, int event) { struct nftables_pernet *nft_net; struct sk_buff *skb; u16 flags = 0; int err; if (!ctx->report && !nfnetlink_has_listeners(ctx->net, NFNLGRP_NFTABLES)) return; skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL); if (skb == NULL) goto 
err; if (ctx->flags & (NLM_F_CREATE | NLM_F_EXCL)) flags |= ctx->flags & (NLM_F_CREATE | NLM_F_EXCL); err = nf_tables_fill_table_info(skb, ctx->net, ctx->portid, ctx->seq, event, flags, ctx->family, ctx->table); if (err < 0) { kfree_skb(skb); goto err; } nft_net = nft_pernet(ctx->net); nft_notify_enqueue(skb, ctx->report, &nft_net->notify_list); return; err: nfnetlink_set_err(ctx->net, ctx->portid, NFNLGRP_NFTABLES, -ENOBUFS); } static int nf_tables_dump_tables(struct sk_buff *skb, struct netlink_callback *cb) { const struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh); struct nftables_pernet *nft_net; const struct nft_table *table; unsigned int idx = 0, s_idx = cb->args[0]; struct net *net = sock_net(skb->sk); int family = nfmsg->nfgen_family; rcu_read_lock(); nft_net = nft_pernet(net); cb->seq = nft_base_seq(net); list_for_each_entry_rcu(table, &nft_net->tables, list) { if (family != NFPROTO_UNSPEC && family != table->family) continue; if (idx < s_idx) goto cont; if (idx > s_idx) memset(&cb->args[1], 0, sizeof(cb->args) - sizeof(cb->args[0])); if (!nft_is_active(net, table)) continue; if (nf_tables_fill_table_info(skb, net, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, NFT_MSG_NEWTABLE, NLM_F_MULTI, table->family, table) < 0) goto done; nl_dump_check_consistent(cb, nlmsg_hdr(skb)); cont: idx++; } done: rcu_read_unlock(); cb->args[0] = idx; return skb->len; } static int nft_netlink_dump_start_rcu(struct sock *nlsk, struct sk_buff *skb, const struct nlmsghdr *nlh, struct netlink_dump_control *c) { int err; if (!try_module_get(THIS_MODULE)) return -EINVAL; rcu_read_unlock(); err = netlink_dump_start(nlsk, skb, nlh, c); rcu_read_lock(); module_put(THIS_MODULE); return err; } /* called with rcu_read_lock held */ static int nf_tables_gettable(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_cur(info->net); u8 family = info->nfmsg->nfgen_family; const struct nft_table *table; struct net *net = info->net; struct sk_buff *skb2; int err; if (info->nlh->nlmsg_flags & NLM_F_DUMP) { struct netlink_dump_control c = { .dump = nf_tables_dump_tables, .module = THIS_MODULE, }; return nft_netlink_dump_start_rcu(info->sk, skb, info->nlh, &c); } table = nft_table_lookup(net, nla[NFTA_TABLE_NAME], family, genmask, 0); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_TABLE_NAME]); return PTR_ERR(table); } skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC); if (!skb2) return -ENOMEM; err = nf_tables_fill_table_info(skb2, net, NETLINK_CB(skb).portid, info->nlh->nlmsg_seq, NFT_MSG_NEWTABLE, 0, family, table); if (err < 0) goto err_fill_table_info; return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid); err_fill_table_info: kfree_skb(skb2); return err; } static void nft_table_disable(struct net *net, struct nft_table *table, u32 cnt) { struct nft_chain *chain; u32 i = 0; list_for_each_entry(chain, &table->chains, list) { if (!nft_is_active_next(net, chain)) continue; if (!nft_is_base_chain(chain)) continue; if (cnt && i++ == cnt) break; nf_tables_unregister_hook(net, table, chain); } } static int nf_tables_table_enable(struct net *net, struct nft_table *table) { struct nft_chain *chain; int err, i = 0; list_for_each_entry(chain, &table->chains, list) { if (!nft_is_active_next(net, chain)) continue; if (!nft_is_base_chain(chain)) continue; err = nf_tables_register_hook(net, table, chain); if (err < 0) goto err_register_hooks; i++; } return 0; err_register_hooks: if (i) nft_table_disable(net, table, i); 
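
	/* The nft_table_disable() call above passes a non-zero count so that
	 * only the first @i base chain hooks are unregistered again, i.e.
	 * exactly the ones whose registration succeeded before the failure.
	 */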
return err; } static void nf_tables_table_disable(struct net *net, struct nft_table *table) { table->flags &= ~NFT_TABLE_F_DORMANT; nft_table_disable(net, table, 0); table->flags |= NFT_TABLE_F_DORMANT; } #define __NFT_TABLE_F_INTERNAL (NFT_TABLE_F_MASK + 1) #define __NFT_TABLE_F_WAS_DORMANT (__NFT_TABLE_F_INTERNAL << 0) #define __NFT_TABLE_F_WAS_AWAKEN (__NFT_TABLE_F_INTERNAL << 1) #define __NFT_TABLE_F_WAS_ORPHAN (__NFT_TABLE_F_INTERNAL << 2) #define __NFT_TABLE_F_UPDATE (__NFT_TABLE_F_WAS_DORMANT | \ __NFT_TABLE_F_WAS_AWAKEN | \ __NFT_TABLE_F_WAS_ORPHAN) static bool nft_table_pending_update(const struct nft_ctx *ctx) { struct nftables_pernet *nft_net = nft_pernet(ctx->net); struct nft_trans *trans; if (ctx->table->flags & __NFT_TABLE_F_UPDATE) return true; list_for_each_entry(trans, &nft_net->commit_list, list) { if (trans->table == ctx->table && ((trans->msg_type == NFT_MSG_NEWCHAIN && nft_trans_chain_update(trans)) || (trans->msg_type == NFT_MSG_DELCHAIN && nft_is_base_chain(nft_trans_chain(trans))))) return true; } return false; } static int nf_tables_updtable(struct nft_ctx *ctx) { struct nft_trans *trans; u32 flags; int ret; if (!ctx->nla[NFTA_TABLE_FLAGS]) return 0; flags = ntohl(nla_get_be32(ctx->nla[NFTA_TABLE_FLAGS])); if (flags & ~NFT_TABLE_F_MASK) return -EOPNOTSUPP; if (flags == (ctx->table->flags & NFT_TABLE_F_MASK)) return 0; if ((nft_table_has_owner(ctx->table) && !(flags & NFT_TABLE_F_OWNER)) || (flags & NFT_TABLE_F_OWNER && !nft_table_is_orphan(ctx->table))) return -EOPNOTSUPP; if ((flags ^ ctx->table->flags) & NFT_TABLE_F_PERSIST) return -EOPNOTSUPP; /* No dormant off/on/off/on games in single transaction */ if (nft_table_pending_update(ctx)) return -EINVAL; trans = nft_trans_alloc(ctx, NFT_MSG_NEWTABLE, sizeof(struct nft_trans_table)); if (trans == NULL) return -ENOMEM; if ((flags & NFT_TABLE_F_DORMANT) && !(ctx->table->flags & NFT_TABLE_F_DORMANT)) { ctx->table->flags |= NFT_TABLE_F_DORMANT; if (!(ctx->table->flags & __NFT_TABLE_F_UPDATE)) ctx->table->flags |= __NFT_TABLE_F_WAS_AWAKEN; } else if (!(flags & NFT_TABLE_F_DORMANT) && ctx->table->flags & NFT_TABLE_F_DORMANT) { ctx->table->flags &= ~NFT_TABLE_F_DORMANT; if (!(ctx->table->flags & __NFT_TABLE_F_UPDATE)) { ret = nf_tables_table_enable(ctx->net, ctx->table); if (ret < 0) goto err_register_hooks; ctx->table->flags |= __NFT_TABLE_F_WAS_DORMANT; } } if ((flags & NFT_TABLE_F_OWNER) && !nft_table_has_owner(ctx->table)) { ctx->table->nlpid = ctx->portid; ctx->table->flags |= NFT_TABLE_F_OWNER | __NFT_TABLE_F_WAS_ORPHAN; } nft_trans_table_update(trans) = true; nft_trans_commit_list_add_tail(ctx->net, trans); return 0; err_register_hooks: ctx->table->flags |= NFT_TABLE_F_DORMANT; nft_trans_destroy(trans); return ret; } static u32 nft_chain_hash(const void *data, u32 len, u32 seed) { const char *name = data; return jhash(name, strlen(name), seed); } static u32 nft_chain_hash_obj(const void *data, u32 len, u32 seed) { const struct nft_chain *chain = data; return nft_chain_hash(chain->name, 0, seed); } static int nft_chain_hash_cmp(struct rhashtable_compare_arg *arg, const void *ptr) { const struct nft_chain *chain = ptr; const char *name = arg->key; return strcmp(chain->name, name); } static u32 nft_objname_hash(const void *data, u32 len, u32 seed) { const struct nft_object_hash_key *k = data; seed ^= hash_ptr(k->table, 32); return jhash(k->name, strlen(k->name), seed); } static u32 nft_objname_hash_obj(const void *data, u32 len, u32 seed) { const struct nft_object *obj = data; return nft_objname_hash(&obj->key, 0, 
seed); } static int nft_objname_hash_cmp(struct rhashtable_compare_arg *arg, const void *ptr) { const struct nft_object_hash_key *k = arg->key; const struct nft_object *obj = ptr; if (obj->key.table != k->table) return -1; return strcmp(obj->key.name, k->name); } static bool nft_supported_family(u8 family) { return false #ifdef CONFIG_NF_TABLES_INET || family == NFPROTO_INET #endif #ifdef CONFIG_NF_TABLES_IPV4 || family == NFPROTO_IPV4 #endif #ifdef CONFIG_NF_TABLES_ARP || family == NFPROTO_ARP #endif #ifdef CONFIG_NF_TABLES_NETDEV || family == NFPROTO_NETDEV #endif #if IS_ENABLED(CONFIG_NF_TABLES_BRIDGE) || family == NFPROTO_BRIDGE #endif #ifdef CONFIG_NF_TABLES_IPV6 || family == NFPROTO_IPV6 #endif ; } static int nf_tables_newtable(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct nftables_pernet *nft_net = nft_pernet(info->net); struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_next(info->net); u8 family = info->nfmsg->nfgen_family; struct net *net = info->net; const struct nlattr *attr; struct nft_table *table; struct nft_ctx ctx; u32 flags = 0; int err; if (!nft_supported_family(family)) return -EOPNOTSUPP; lockdep_assert_held(&nft_net->commit_mutex); attr = nla[NFTA_TABLE_NAME]; table = nft_table_lookup(net, attr, family, genmask, NETLINK_CB(skb).portid); if (IS_ERR(table)) { if (PTR_ERR(table) != -ENOENT) return PTR_ERR(table); } else { if (info->nlh->nlmsg_flags & NLM_F_EXCL) { NL_SET_BAD_ATTR(extack, attr); return -EEXIST; } if (info->nlh->nlmsg_flags & NLM_F_REPLACE) return -EOPNOTSUPP; nft_ctx_init(&ctx, net, skb, info->nlh, family, table, NULL, nla); return nf_tables_updtable(&ctx); } if (nla[NFTA_TABLE_FLAGS]) { flags = ntohl(nla_get_be32(nla[NFTA_TABLE_FLAGS])); if (flags & ~NFT_TABLE_F_MASK) return -EOPNOTSUPP; } err = -ENOMEM; table = kzalloc(sizeof(*table), GFP_KERNEL_ACCOUNT); if (table == NULL) goto err_kzalloc; table->validate_state = nft_net->validate_state; table->name = nla_strdup(attr, GFP_KERNEL_ACCOUNT); if (table->name == NULL) goto err_strdup; if (nla[NFTA_TABLE_USERDATA]) { table->udata = nla_memdup(nla[NFTA_TABLE_USERDATA], GFP_KERNEL_ACCOUNT); if (table->udata == NULL) goto err_table_udata; table->udlen = nla_len(nla[NFTA_TABLE_USERDATA]); } err = rhltable_init(&table->chains_ht, &nft_chain_ht_params); if (err) goto err_chain_ht; INIT_LIST_HEAD(&table->chains); INIT_LIST_HEAD(&table->sets); INIT_LIST_HEAD(&table->objects); INIT_LIST_HEAD(&table->flowtables); table->family = family; table->flags = flags; table->handle = ++nft_net->table_handle; if (table->flags & NFT_TABLE_F_OWNER) table->nlpid = NETLINK_CB(skb).portid; nft_ctx_init(&ctx, net, skb, info->nlh, family, table, NULL, nla); err = nft_trans_table_add(&ctx, NFT_MSG_NEWTABLE); if (err < 0) goto err_trans; list_add_tail_rcu(&table->list, &nft_net->tables); return 0; err_trans: rhltable_destroy(&table->chains_ht); err_chain_ht: kfree(table->udata); err_table_udata: kfree(table->name); err_strdup: kfree(table); err_kzalloc: return err; } static int nft_flush_table(struct nft_ctx *ctx) { struct nft_flowtable *flowtable, *nft; struct nft_chain *chain, *nc; struct nft_object *obj, *ne; struct nft_set *set, *ns; int err; list_for_each_entry(chain, &ctx->table->chains, list) { if (!nft_is_active_next(ctx->net, chain)) continue; if (nft_chain_binding(chain)) continue; ctx->chain = chain; err = nft_delrule_by_chain(ctx); if (err < 0) goto out; } list_for_each_entry_safe(set, ns, &ctx->table->sets, list) { if (!nft_is_active_next(ctx->net, set)) 
continue; if (nft_set_is_anonymous(set)) continue; err = nft_delset(ctx, set); if (err < 0) goto out; } list_for_each_entry_safe(flowtable, nft, &ctx->table->flowtables, list) { if (!nft_is_active_next(ctx->net, flowtable)) continue; err = nft_delflowtable(ctx, flowtable); if (err < 0) goto out; } list_for_each_entry_safe(obj, ne, &ctx->table->objects, list) { if (!nft_is_active_next(ctx->net, obj)) continue; err = nft_delobj(ctx, obj); if (err < 0) goto out; } list_for_each_entry_safe(chain, nc, &ctx->table->chains, list) { if (!nft_is_active_next(ctx->net, chain)) continue; if (nft_chain_binding(chain)) continue; ctx->chain = chain; err = nft_delchain(ctx); if (err < 0) goto out; } err = nft_deltable(ctx); out: return err; } static int nft_flush(struct nft_ctx *ctx, int family) { struct nftables_pernet *nft_net = nft_pernet(ctx->net); const struct nlattr * const *nla = ctx->nla; struct nft_table *table, *nt; int err = 0; list_for_each_entry_safe(table, nt, &nft_net->tables, list) { if (family != AF_UNSPEC && table->family != family) continue; ctx->family = table->family; if (!nft_is_active_next(ctx->net, table)) continue; if (nft_table_has_owner(table) && table->nlpid != ctx->portid) continue; if (nla[NFTA_TABLE_NAME] && nla_strcmp(nla[NFTA_TABLE_NAME], table->name) != 0) continue; ctx->table = table; err = nft_flush_table(ctx); if (err < 0) goto out; } out: return err; } static int nf_tables_deltable(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_next(info->net); u8 family = info->nfmsg->nfgen_family; struct net *net = info->net; const struct nlattr *attr; struct nft_table *table; struct nft_ctx ctx; nft_ctx_init(&ctx, net, skb, info->nlh, 0, NULL, NULL, nla); if (family == AF_UNSPEC || (!nla[NFTA_TABLE_NAME] && !nla[NFTA_TABLE_HANDLE])) return nft_flush(&ctx, family); if (nla[NFTA_TABLE_HANDLE]) { attr = nla[NFTA_TABLE_HANDLE]; table = nft_table_lookup_byhandle(net, attr, family, genmask, NETLINK_CB(skb).portid); } else { attr = nla[NFTA_TABLE_NAME]; table = nft_table_lookup(net, attr, family, genmask, NETLINK_CB(skb).portid); } if (IS_ERR(table)) { if (PTR_ERR(table) == -ENOENT && NFNL_MSG_TYPE(info->nlh->nlmsg_type) == NFT_MSG_DESTROYTABLE) return 0; NL_SET_BAD_ATTR(extack, attr); return PTR_ERR(table); } if (info->nlh->nlmsg_flags & NLM_F_NONREC && table->use > 0) return -EBUSY; ctx.family = family; ctx.table = table; return nft_flush_table(&ctx); } static void nf_tables_table_destroy(struct nft_table *table) { if (WARN_ON(table->use > 0)) return; rhltable_destroy(&table->chains_ht); kfree(table->name); kfree(table->udata); kfree(table); } void nft_register_chain_type(const struct nft_chain_type *ctype) { nfnl_lock(NFNL_SUBSYS_NFTABLES); if (WARN_ON(__nft_chain_type_get(ctype->family, ctype->type))) { nfnl_unlock(NFNL_SUBSYS_NFTABLES); return; } chain_type[ctype->family][ctype->type] = ctype; nfnl_unlock(NFNL_SUBSYS_NFTABLES); } EXPORT_SYMBOL_GPL(nft_register_chain_type); void nft_unregister_chain_type(const struct nft_chain_type *ctype) { nfnl_lock(NFNL_SUBSYS_NFTABLES); chain_type[ctype->family][ctype->type] = NULL; nfnl_unlock(NFNL_SUBSYS_NFTABLES); } EXPORT_SYMBOL_GPL(nft_unregister_chain_type); /* * Chains */ static struct nft_chain * nft_chain_lookup_byhandle(const struct nft_table *table, u64 handle, u8 genmask) { struct nft_chain *chain; list_for_each_entry(chain, &table->chains, list) { if (chain->handle == handle && nft_active_genmask(chain, genmask)) return 
chain; } return ERR_PTR(-ENOENT); } static bool lockdep_commit_lock_is_held(const struct net *net) { #ifdef CONFIG_PROVE_LOCKING struct nftables_pernet *nft_net = nft_pernet(net); return lockdep_is_held(&nft_net->commit_mutex); #else return true; #endif } static struct nft_chain *nft_chain_lookup(struct net *net, struct nft_table *table, const struct nlattr *nla, u8 genmask) { char search[NFT_CHAIN_MAXNAMELEN + 1]; struct rhlist_head *tmp, *list; struct nft_chain *chain; if (nla == NULL) return ERR_PTR(-EINVAL); nla_strscpy(search, nla, sizeof(search)); WARN_ON(!rcu_read_lock_held() && !lockdep_commit_lock_is_held(net)); chain = ERR_PTR(-ENOENT); rcu_read_lock(); list = rhltable_lookup(&table->chains_ht, search, nft_chain_ht_params); if (!list) goto out_unlock; rhl_for_each_entry_rcu(chain, tmp, list, rhlhead) { if (nft_active_genmask(chain, genmask)) goto out_unlock; } chain = ERR_PTR(-ENOENT); out_unlock: rcu_read_unlock(); return chain; } static const struct nla_policy nft_chain_policy[NFTA_CHAIN_MAX + 1] = { [NFTA_CHAIN_TABLE] = { .type = NLA_STRING, .len = NFT_TABLE_MAXNAMELEN - 1 }, [NFTA_CHAIN_HANDLE] = { .type = NLA_U64 }, [NFTA_CHAIN_NAME] = { .type = NLA_STRING, .len = NFT_CHAIN_MAXNAMELEN - 1 }, [NFTA_CHAIN_HOOK] = { .type = NLA_NESTED }, [NFTA_CHAIN_POLICY] = { .type = NLA_U32 }, [NFTA_CHAIN_TYPE] = { .type = NLA_STRING, .len = NFT_MODULE_AUTOLOAD_LIMIT }, [NFTA_CHAIN_COUNTERS] = { .type = NLA_NESTED }, [NFTA_CHAIN_FLAGS] = { .type = NLA_U32 }, [NFTA_CHAIN_ID] = { .type = NLA_U32 }, [NFTA_CHAIN_USERDATA] = { .type = NLA_BINARY, .len = NFT_USERDATA_MAXLEN }, }; static const struct nla_policy nft_hook_policy[NFTA_HOOK_MAX + 1] = { [NFTA_HOOK_HOOKNUM] = { .type = NLA_U32 }, [NFTA_HOOK_PRIORITY] = { .type = NLA_U32 }, [NFTA_HOOK_DEV] = { .type = NLA_STRING, .len = IFNAMSIZ - 1 }, }; static int nft_dump_stats(struct sk_buff *skb, struct nft_stats __percpu *stats) { struct nft_stats *cpu_stats, total; struct nlattr *nest; unsigned int seq; u64 pkts, bytes; int cpu; if (!stats) return 0; memset(&total, 0, sizeof(total)); for_each_possible_cpu(cpu) { cpu_stats = per_cpu_ptr(stats, cpu); do { seq = u64_stats_fetch_begin(&cpu_stats->syncp); pkts = cpu_stats->pkts; bytes = cpu_stats->bytes; } while (u64_stats_fetch_retry(&cpu_stats->syncp, seq)); total.pkts += pkts; total.bytes += bytes; } nest = nla_nest_start_noflag(skb, NFTA_CHAIN_COUNTERS); if (nest == NULL) goto nla_put_failure; if (nla_put_be64(skb, NFTA_COUNTER_PACKETS, cpu_to_be64(total.pkts), NFTA_COUNTER_PAD) || nla_put_be64(skb, NFTA_COUNTER_BYTES, cpu_to_be64(total.bytes), NFTA_COUNTER_PAD)) goto nla_put_failure; nla_nest_end(skb, nest); return 0; nla_put_failure: return -ENOSPC; } static bool hook_is_prefix(struct nft_hook *hook) { return strlen(hook->ifname) >= hook->ifnamelen; } static int nft_nla_put_hook_dev(struct sk_buff *skb, struct nft_hook *hook) { int attr = hook_is_prefix(hook) ? 
NFTA_DEVICE_PREFIX : NFTA_DEVICE_NAME; return nla_put_string(skb, attr, hook->ifname); } static int nft_dump_basechain_hook(struct sk_buff *skb, const struct net *net, int family, const struct nft_base_chain *basechain, const struct list_head *hook_list) { const struct nf_hook_ops *ops = &basechain->ops; struct nft_hook *hook, *first = NULL; struct nlattr *nest, *nest_devs; int n = 0; nest = nla_nest_start_noflag(skb, NFTA_CHAIN_HOOK); if (nest == NULL) goto nla_put_failure; if (nla_put_be32(skb, NFTA_HOOK_HOOKNUM, htonl(ops->hooknum))) goto nla_put_failure; if (nla_put_be32(skb, NFTA_HOOK_PRIORITY, htonl(ops->priority))) goto nla_put_failure; if (nft_base_chain_netdev(family, ops->hooknum)) { nest_devs = nla_nest_start_noflag(skb, NFTA_HOOK_DEVS); if (!nest_devs) goto nla_put_failure; if (!hook_list) hook_list = &basechain->hook_list; list_for_each_entry_rcu(hook, hook_list, list, lockdep_commit_lock_is_held(net)) { if (!first) first = hook; if (nft_nla_put_hook_dev(skb, hook)) goto nla_put_failure; n++; } nla_nest_end(skb, nest_devs); if (n == 1 && !hook_is_prefix(first) && nla_put_string(skb, NFTA_HOOK_DEV, first->ifname)) goto nla_put_failure; } nla_nest_end(skb, nest); return 0; nla_put_failure: return -1; } static int nf_tables_fill_chain_info(struct sk_buff *skb, struct net *net, u32 portid, u32 seq, int event, u32 flags, int family, const struct nft_table *table, const struct nft_chain *chain, const struct list_head *hook_list) { struct nlmsghdr *nlh; nlh = nfnl_msg_put(skb, portid, seq, nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event), flags, family, NFNETLINK_V0, nft_base_seq_be16(net)); if (!nlh) goto nla_put_failure; if (nla_put_string(skb, NFTA_CHAIN_TABLE, table->name) || nla_put_string(skb, NFTA_CHAIN_NAME, chain->name) || nla_put_be64(skb, NFTA_CHAIN_HANDLE, cpu_to_be64(chain->handle), NFTA_CHAIN_PAD)) goto nla_put_failure; if (!hook_list && (event == NFT_MSG_DELCHAIN || event == NFT_MSG_DESTROYCHAIN)) { nlmsg_end(skb, nlh); return 0; } if (nft_is_base_chain(chain)) { const struct nft_base_chain *basechain = nft_base_chain(chain); struct nft_stats __percpu *stats; if (nft_dump_basechain_hook(skb, net, family, basechain, hook_list)) goto nla_put_failure; if (nla_put_be32(skb, NFTA_CHAIN_POLICY, htonl(basechain->policy))) goto nla_put_failure; if (nla_put_string(skb, NFTA_CHAIN_TYPE, basechain->type->name)) goto nla_put_failure; stats = rcu_dereference_check(basechain->stats, lockdep_commit_lock_is_held(net)); if (nft_dump_stats(skb, stats)) goto nla_put_failure; } if (chain->flags && nla_put_be32(skb, NFTA_CHAIN_FLAGS, htonl(chain->flags))) goto nla_put_failure; if (nla_put_be32(skb, NFTA_CHAIN_USE, htonl(chain->use))) goto nla_put_failure; if (chain->udata && nla_put(skb, NFTA_CHAIN_USERDATA, chain->udlen, chain->udata)) goto nla_put_failure; nlmsg_end(skb, nlh); return 0; nla_put_failure: nlmsg_trim(skb, nlh); return -1; } static void nf_tables_chain_notify(const struct nft_ctx *ctx, int event, const struct list_head *hook_list) { struct nftables_pernet *nft_net; struct sk_buff *skb; u16 flags = 0; int err; if (!ctx->report && !nfnetlink_has_listeners(ctx->net, NFNLGRP_NFTABLES)) return; skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL); if (skb == NULL) goto err; if (ctx->flags & (NLM_F_CREATE | NLM_F_EXCL)) flags |= ctx->flags & (NLM_F_CREATE | NLM_F_EXCL); err = nf_tables_fill_chain_info(skb, ctx->net, ctx->portid, ctx->seq, event, flags, ctx->family, ctx->table, ctx->chain, hook_list); if (err < 0) { kfree_skb(skb); goto err; } nft_net = nft_pernet(ctx->net); 
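
	/* Nothing is sent from here directly: the skb is only queued on the
	 * per-netns notify_list below and flushed to the NFNLGRP_NFTABLES
	 * listeners later, together with the rest of the batch that
	 * generated this event.
	 */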
nft_notify_enqueue(skb, ctx->report, &nft_net->notify_list); return; err: nfnetlink_set_err(ctx->net, ctx->portid, NFNLGRP_NFTABLES, -ENOBUFS); } static int nf_tables_dump_chains(struct sk_buff *skb, struct netlink_callback *cb) { const struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh); unsigned int idx = 0, s_idx = cb->args[0]; struct net *net = sock_net(skb->sk); int family = nfmsg->nfgen_family; struct nftables_pernet *nft_net; const struct nft_table *table; const struct nft_chain *chain; rcu_read_lock(); nft_net = nft_pernet(net); cb->seq = nft_base_seq(net); list_for_each_entry_rcu(table, &nft_net->tables, list) { if (family != NFPROTO_UNSPEC && family != table->family) continue; list_for_each_entry_rcu(chain, &table->chains, list) { if (idx < s_idx) goto cont; if (idx > s_idx) memset(&cb->args[1], 0, sizeof(cb->args) - sizeof(cb->args[0])); if (!nft_is_active(net, chain)) continue; if (nf_tables_fill_chain_info(skb, net, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, NFT_MSG_NEWCHAIN, NLM_F_MULTI, table->family, table, chain, NULL) < 0) goto done; nl_dump_check_consistent(cb, nlmsg_hdr(skb)); cont: idx++; } } done: rcu_read_unlock(); cb->args[0] = idx; return skb->len; } /* called with rcu_read_lock held */ static int nf_tables_getchain(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_cur(info->net); u8 family = info->nfmsg->nfgen_family; const struct nft_chain *chain; struct net *net = info->net; struct nft_table *table; struct sk_buff *skb2; int err; if (info->nlh->nlmsg_flags & NLM_F_DUMP) { struct netlink_dump_control c = { .dump = nf_tables_dump_chains, .module = THIS_MODULE, }; return nft_netlink_dump_start_rcu(info->sk, skb, info->nlh, &c); } table = nft_table_lookup(net, nla[NFTA_CHAIN_TABLE], family, genmask, 0); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_TABLE]); return PTR_ERR(table); } chain = nft_chain_lookup(net, table, nla[NFTA_CHAIN_NAME], genmask); if (IS_ERR(chain)) { NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_NAME]); return PTR_ERR(chain); } skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC); if (!skb2) return -ENOMEM; err = nf_tables_fill_chain_info(skb2, net, NETLINK_CB(skb).portid, info->nlh->nlmsg_seq, NFT_MSG_NEWCHAIN, 0, family, table, chain, NULL); if (err < 0) goto err_fill_chain_info; return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid); err_fill_chain_info: kfree_skb(skb2); return err; } static const struct nla_policy nft_counter_policy[NFTA_COUNTER_MAX + 1] = { [NFTA_COUNTER_PACKETS] = { .type = NLA_U64 }, [NFTA_COUNTER_BYTES] = { .type = NLA_U64 }, }; static struct nft_stats __percpu *nft_stats_alloc(const struct nlattr *attr) { struct nlattr *tb[NFTA_COUNTER_MAX+1]; struct nft_stats __percpu *newstats; struct nft_stats *stats; int err; err = nla_parse_nested_deprecated(tb, NFTA_COUNTER_MAX, attr, nft_counter_policy, NULL); if (err < 0) return ERR_PTR_PCPU(err); if (!tb[NFTA_COUNTER_BYTES] || !tb[NFTA_COUNTER_PACKETS]) return ERR_PTR_PCPU(-EINVAL); newstats = netdev_alloc_pcpu_stats(struct nft_stats); if (newstats == NULL) return ERR_PTR_PCPU(-ENOMEM); /* Restore old counters on this cpu, no problem. Per-cpu statistics * are not exposed to userspace. 
*/ preempt_disable(); stats = this_cpu_ptr(newstats); stats->bytes = be64_to_cpu(nla_get_be64(tb[NFTA_COUNTER_BYTES])); stats->pkts = be64_to_cpu(nla_get_be64(tb[NFTA_COUNTER_PACKETS])); preempt_enable(); return newstats; } static void nft_chain_stats_replace(struct nft_trans_chain *trans) { const struct nft_trans *t = &trans->nft_trans_binding.nft_trans; struct nft_base_chain *chain = nft_base_chain(trans->chain); if (!trans->stats) return; trans->stats = rcu_replace_pointer(chain->stats, trans->stats, lockdep_commit_lock_is_held(t->net)); if (!trans->stats) static_branch_inc(&nft_counters_enabled); } static void nf_tables_chain_free_chain_rules(struct nft_chain *chain) { struct nft_rule_blob *g0 = rcu_dereference_raw(chain->blob_gen_0); struct nft_rule_blob *g1 = rcu_dereference_raw(chain->blob_gen_1); if (g0 != g1) kvfree(g1); kvfree(g0); /* should be NULL either via abort or via successful commit */ WARN_ON_ONCE(chain->blob_next); kvfree(chain->blob_next); } void nf_tables_chain_destroy(struct nft_chain *chain) { const struct nft_table *table = chain->table; struct nft_hook *hook, *next; if (WARN_ON(chain->use > 0)) return; /* no concurrent access possible anymore */ nf_tables_chain_free_chain_rules(chain); if (nft_is_base_chain(chain)) { struct nft_base_chain *basechain = nft_base_chain(chain); if (nft_base_chain_netdev(table->family, basechain->ops.hooknum)) { list_for_each_entry_safe(hook, next, &basechain->hook_list, list) { list_del_rcu(&hook->list); nft_netdev_hook_free_rcu(hook); } } module_put(basechain->type->owner); if (rcu_access_pointer(basechain->stats)) { static_branch_dec(&nft_counters_enabled); free_percpu(rcu_dereference_raw(basechain->stats)); } kfree(chain->name); kfree(chain->udata); kfree(basechain); } else { kfree(chain->name); kfree(chain->udata); kfree(chain); } } static struct nft_hook *nft_netdev_hook_alloc(struct net *net, const struct nlattr *attr, bool prefix) { struct nf_hook_ops *ops; struct net_device *dev; struct nft_hook *hook; int err; hook = kzalloc(sizeof(struct nft_hook), GFP_KERNEL_ACCOUNT); if (!hook) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&hook->ops_list); err = nla_strscpy(hook->ifname, attr, IFNAMSIZ); if (err < 0) goto err_hook_free; /* include the terminating NUL-char when comparing non-prefixes */ hook->ifnamelen = strlen(hook->ifname) + !prefix; /* nf_tables_netdev_event() is called under rtnl_mutex, this is * indirectly serializing all the other holders of the commit_mutex with * the rtnl_mutex. 
*/ for_each_netdev(net, dev) { if (strncmp(dev->name, hook->ifname, hook->ifnamelen)) continue; ops = kzalloc(sizeof(struct nf_hook_ops), GFP_KERNEL_ACCOUNT); if (!ops) { err = -ENOMEM; goto err_hook_free; } ops->dev = dev; list_add_tail(&ops->list, &hook->ops_list); } return hook; err_hook_free: nft_netdev_hook_free(hook); return ERR_PTR(err); } static struct nft_hook *nft_hook_list_find(struct list_head *hook_list, const struct nft_hook *this) { struct nft_hook *hook; list_for_each_entry(hook, hook_list, list) { if (!strncmp(hook->ifname, this->ifname, min(hook->ifnamelen, this->ifnamelen))) return hook; } return NULL; } static int nf_tables_parse_netdev_hooks(struct net *net, const struct nlattr *attr, struct list_head *hook_list, struct netlink_ext_ack *extack) { struct nft_hook *hook, *next; const struct nlattr *tmp; int rem, n = 0, err; bool prefix; nla_for_each_nested(tmp, attr, rem) { switch (nla_type(tmp)) { case NFTA_DEVICE_NAME: prefix = false; break; case NFTA_DEVICE_PREFIX: prefix = true; break; default: err = -EINVAL; goto err_hook; } hook = nft_netdev_hook_alloc(net, tmp, prefix); if (IS_ERR(hook)) { NL_SET_BAD_ATTR(extack, tmp); err = PTR_ERR(hook); goto err_hook; } if (nft_hook_list_find(hook_list, hook)) { NL_SET_BAD_ATTR(extack, tmp); nft_netdev_hook_free(hook); err = -EEXIST; goto err_hook; } list_add_tail(&hook->list, hook_list); n++; if (n == NFT_NETDEVICE_MAX) { err = -EFBIG; goto err_hook; } } return 0; err_hook: list_for_each_entry_safe(hook, next, hook_list, list) { list_del(&hook->list); nft_netdev_hook_free(hook); } return err; } struct nft_chain_hook { u32 num; s32 priority; const struct nft_chain_type *type; struct list_head list; }; static int nft_chain_parse_netdev(struct net *net, struct nlattr *tb[], struct list_head *hook_list, struct netlink_ext_ack *extack, u32 flags) { struct nft_hook *hook; int err; if (tb[NFTA_HOOK_DEV]) { hook = nft_netdev_hook_alloc(net, tb[NFTA_HOOK_DEV], false); if (IS_ERR(hook)) { NL_SET_BAD_ATTR(extack, tb[NFTA_HOOK_DEV]); return PTR_ERR(hook); } list_add_tail(&hook->list, hook_list); } else if (tb[NFTA_HOOK_DEVS]) { err = nf_tables_parse_netdev_hooks(net, tb[NFTA_HOOK_DEVS], hook_list, extack); if (err < 0) return err; } if (flags & NFT_CHAIN_HW_OFFLOAD && list_empty(hook_list)) return -EINVAL; return 0; } static int nft_chain_parse_hook(struct net *net, struct nft_base_chain *basechain, const struct nlattr * const nla[], struct nft_chain_hook *hook, u8 family, u32 flags, struct netlink_ext_ack *extack) { struct nftables_pernet *nft_net = nft_pernet(net); struct nlattr *ha[NFTA_HOOK_MAX + 1]; const struct nft_chain_type *type; int err; lockdep_assert_held(&nft_net->commit_mutex); lockdep_nfnl_nft_mutex_not_held(); err = nla_parse_nested_deprecated(ha, NFTA_HOOK_MAX, nla[NFTA_CHAIN_HOOK], nft_hook_policy, NULL); if (err < 0) return err; if (!basechain) { if (!ha[NFTA_HOOK_HOOKNUM] || !ha[NFTA_HOOK_PRIORITY]) { NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_NAME]); return -ENOENT; } hook->num = ntohl(nla_get_be32(ha[NFTA_HOOK_HOOKNUM])); hook->priority = ntohl(nla_get_be32(ha[NFTA_HOOK_PRIORITY])); type = __nft_chain_type_get(family, NFT_CHAIN_T_DEFAULT); if (!type) return -EOPNOTSUPP; if (nla[NFTA_CHAIN_TYPE]) { type = nf_tables_chain_type_lookup(net, nla[NFTA_CHAIN_TYPE], family, true); if (IS_ERR(type)) { NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_TYPE]); return PTR_ERR(type); } } if (hook->num >= NFT_MAX_HOOKS || !(type->hook_mask & (1 << hook->num))) return -EOPNOTSUPP; if (type->type == NFT_CHAIN_T_NAT && hook->priority <= 
NF_IP_PRI_CONNTRACK) return -EOPNOTSUPP; } else { if (ha[NFTA_HOOK_HOOKNUM]) { hook->num = ntohl(nla_get_be32(ha[NFTA_HOOK_HOOKNUM])); if (hook->num != basechain->ops.hooknum) return -EOPNOTSUPP; } if (ha[NFTA_HOOK_PRIORITY]) { hook->priority = ntohl(nla_get_be32(ha[NFTA_HOOK_PRIORITY])); if (hook->priority != basechain->ops.priority) return -EOPNOTSUPP; } if (nla[NFTA_CHAIN_TYPE]) { type = __nf_tables_chain_type_lookup(nla[NFTA_CHAIN_TYPE], family); if (!type) { NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_TYPE]); return -ENOENT; } } else { type = basechain->type; } } if (!try_module_get(type->owner)) { if (nla[NFTA_CHAIN_TYPE]) NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_TYPE]); return -ENOENT; } hook->type = type; INIT_LIST_HEAD(&hook->list); if (nft_base_chain_netdev(family, hook->num)) { err = nft_chain_parse_netdev(net, ha, &hook->list, extack, flags); if (err < 0) { module_put(type->owner); return err; } } else if (ha[NFTA_HOOK_DEV] || ha[NFTA_HOOK_DEVS]) { module_put(type->owner); return -EOPNOTSUPP; } return 0; } static void nft_chain_release_hook(struct nft_chain_hook *hook) { struct nft_hook *h, *next; list_for_each_entry_safe(h, next, &hook->list, list) { list_del(&h->list); nft_netdev_hook_free(h); } module_put(hook->type->owner); } static void nft_last_rule(const struct nft_chain *chain, const void *ptr) { struct nft_rule_dp_last *lrule; BUILD_BUG_ON(offsetof(struct nft_rule_dp_last, end) != 0); lrule = (struct nft_rule_dp_last *)ptr; lrule->end.is_last = 1; lrule->chain = chain; /* blob size does not include the trailer rule */ } static struct nft_rule_blob *nf_tables_chain_alloc_rules(const struct nft_chain *chain, unsigned int size) { struct nft_rule_blob *blob; if (size > INT_MAX) return NULL; size += sizeof(struct nft_rule_blob) + sizeof(struct nft_rule_dp_last); blob = kvmalloc(size, GFP_KERNEL_ACCOUNT); if (!blob) return NULL; blob->size = 0; nft_last_rule(chain, blob->data); return blob; } static void nft_basechain_hook_init(struct nf_hook_ops *ops, u8 family, const struct nft_chain_hook *hook, struct nft_chain *chain) { ops->pf = family; ops->hooknum = hook->num; ops->priority = hook->priority; ops->priv = chain; ops->hook = hook->type->hooks[ops->hooknum]; ops->hook_ops_type = NF_HOOK_OP_NF_TABLES; } static int nft_basechain_init(struct nft_base_chain *basechain, u8 family, struct nft_chain_hook *hook, u32 flags) { struct nft_chain *chain; struct nf_hook_ops *ops; struct nft_hook *h; basechain->type = hook->type; INIT_LIST_HEAD(&basechain->hook_list); chain = &basechain->chain; if (nft_base_chain_netdev(family, hook->num)) { list_splice_init(&hook->list, &basechain->hook_list); list_for_each_entry(h, &basechain->hook_list, list) { list_for_each_entry(ops, &h->ops_list, list) nft_basechain_hook_init(ops, family, hook, chain); } } nft_basechain_hook_init(&basechain->ops, family, hook, chain); chain->flags |= NFT_CHAIN_BASE | flags; basechain->policy = NF_ACCEPT; if (chain->flags & NFT_CHAIN_HW_OFFLOAD && !nft_chain_offload_support(basechain)) { list_splice_init(&basechain->hook_list, &hook->list); return -EOPNOTSUPP; } flow_block_init(&basechain->flow_block); return 0; } int nft_chain_add(struct nft_table *table, struct nft_chain *chain) { int err; err = rhltable_insert_key(&table->chains_ht, chain->name, &chain->rhlhead, nft_chain_ht_params); if (err) return err; list_add_tail_rcu(&chain->list, &table->chains); return 0; } static u64 chain_id; static int nf_tables_addchain(struct nft_ctx *ctx, u8 family, u8 policy, u32 flags, struct netlink_ext_ack *extack) { const struct 
nlattr * const *nla = ctx->nla; struct nft_table *table = ctx->table; struct nft_base_chain *basechain; struct net *net = ctx->net; char name[NFT_NAME_MAXLEN]; struct nft_rule_blob *blob; struct nft_trans *trans; struct nft_chain *chain; int err; if (nla[NFTA_CHAIN_HOOK]) { struct nft_stats __percpu *stats = NULL; struct nft_chain_hook hook = {}; if (table->flags & __NFT_TABLE_F_UPDATE) return -EINVAL; if (flags & NFT_CHAIN_BINDING) return -EOPNOTSUPP; err = nft_chain_parse_hook(net, NULL, nla, &hook, family, flags, extack); if (err < 0) return err; basechain = kzalloc(sizeof(*basechain), GFP_KERNEL_ACCOUNT); if (basechain == NULL) { nft_chain_release_hook(&hook); return -ENOMEM; } chain = &basechain->chain; if (nla[NFTA_CHAIN_COUNTERS]) { stats = nft_stats_alloc(nla[NFTA_CHAIN_COUNTERS]); if (IS_ERR_PCPU(stats)) { nft_chain_release_hook(&hook); kfree(basechain); return PTR_ERR_PCPU(stats); } rcu_assign_pointer(basechain->stats, stats); } err = nft_basechain_init(basechain, family, &hook, flags); if (err < 0) { nft_chain_release_hook(&hook); kfree(basechain); free_percpu(stats); return err; } if (stats) static_branch_inc(&nft_counters_enabled); } else { if (flags & NFT_CHAIN_BASE) return -EINVAL; if (flags & NFT_CHAIN_HW_OFFLOAD) return -EOPNOTSUPP; chain = kzalloc(sizeof(*chain), GFP_KERNEL_ACCOUNT); if (chain == NULL) return -ENOMEM; chain->flags = flags; } ctx->chain = chain; INIT_LIST_HEAD(&chain->rules); chain->handle = nf_tables_alloc_handle(table); chain->table = table; if (nla[NFTA_CHAIN_NAME]) { chain->name = nla_strdup(nla[NFTA_CHAIN_NAME], GFP_KERNEL_ACCOUNT); } else { if (!(flags & NFT_CHAIN_BINDING)) { err = -EINVAL; goto err_destroy_chain; } snprintf(name, sizeof(name), "__chain%llu", ++chain_id); chain->name = kstrdup(name, GFP_KERNEL_ACCOUNT); } if (!chain->name) { err = -ENOMEM; goto err_destroy_chain; } if (nla[NFTA_CHAIN_USERDATA]) { chain->udata = nla_memdup(nla[NFTA_CHAIN_USERDATA], GFP_KERNEL_ACCOUNT); if (chain->udata == NULL) { err = -ENOMEM; goto err_destroy_chain; } chain->udlen = nla_len(nla[NFTA_CHAIN_USERDATA]); } blob = nf_tables_chain_alloc_rules(chain, 0); if (!blob) { err = -ENOMEM; goto err_destroy_chain; } RCU_INIT_POINTER(chain->blob_gen_0, blob); RCU_INIT_POINTER(chain->blob_gen_1, blob); if (!nft_use_inc(&table->use)) { err = -EMFILE; goto err_destroy_chain; } trans = nft_trans_chain_add(ctx, NFT_MSG_NEWCHAIN); if (IS_ERR(trans)) { err = PTR_ERR(trans); goto err_trans; } nft_trans_chain_policy(trans) = NFT_CHAIN_POLICY_UNSET; if (nft_is_base_chain(chain)) nft_trans_chain_policy(trans) = policy; err = nft_chain_add(table, chain); if (err < 0) goto err_chain_add; /* This must be LAST to ensure no packets are walking over this chain. 
*/ err = nf_tables_register_hook(net, table, chain); if (err < 0) goto err_register_hook; return 0; err_register_hook: nft_chain_del(chain); err_chain_add: nft_trans_destroy(trans); err_trans: nft_use_dec_restore(&table->use); err_destroy_chain: nf_tables_chain_destroy(chain); return err; } static int nf_tables_updchain(struct nft_ctx *ctx, u8 genmask, u8 policy, u32 flags, const struct nlattr *attr, struct netlink_ext_ack *extack) { const struct nlattr * const *nla = ctx->nla; struct nft_base_chain *basechain = NULL; struct nft_table *table = ctx->table; struct nft_chain *chain = ctx->chain; struct nft_chain_hook hook = {}; struct nft_stats __percpu *stats = NULL; struct nftables_pernet *nft_net; struct nft_hook *h, *next; struct nf_hook_ops *ops; struct nft_trans *trans; bool unregister = false; int err; if (chain->flags ^ flags) return -EOPNOTSUPP; INIT_LIST_HEAD(&hook.list); if (nla[NFTA_CHAIN_HOOK]) { if (!nft_is_base_chain(chain)) { NL_SET_BAD_ATTR(extack, attr); return -EEXIST; } basechain = nft_base_chain(chain); err = nft_chain_parse_hook(ctx->net, basechain, nla, &hook, ctx->family, flags, extack); if (err < 0) return err; if (basechain->type != hook.type) { nft_chain_release_hook(&hook); NL_SET_BAD_ATTR(extack, attr); return -EEXIST; } if (nft_base_chain_netdev(ctx->family, basechain->ops.hooknum)) { list_for_each_entry_safe(h, next, &hook.list, list) { list_for_each_entry(ops, &h->ops_list, list) { ops->pf = basechain->ops.pf; ops->hooknum = basechain->ops.hooknum; ops->priority = basechain->ops.priority; ops->priv = basechain->ops.priv; ops->hook = basechain->ops.hook; } if (nft_hook_list_find(&basechain->hook_list, h)) { list_del(&h->list); nft_netdev_hook_free(h); continue; } nft_net = nft_pernet(ctx->net); list_for_each_entry(trans, &nft_net->commit_list, list) { if (trans->msg_type != NFT_MSG_NEWCHAIN || trans->table != ctx->table || !nft_trans_chain_update(trans)) continue; if (nft_hook_list_find(&nft_trans_chain_hooks(trans), h)) { nft_chain_release_hook(&hook); return -EEXIST; } } } } else { ops = &basechain->ops; if (ops->hooknum != hook.num || ops->priority != hook.priority) { nft_chain_release_hook(&hook); NL_SET_BAD_ATTR(extack, attr); return -EEXIST; } } } if (nla[NFTA_CHAIN_HANDLE] && nla[NFTA_CHAIN_NAME]) { struct nft_chain *chain2; chain2 = nft_chain_lookup(ctx->net, table, nla[NFTA_CHAIN_NAME], genmask); if (!IS_ERR(chain2)) { NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_NAME]); err = -EEXIST; goto err_hooks; } } if (table->flags & __NFT_TABLE_F_UPDATE && !list_empty(&hook.list)) { NL_SET_BAD_ATTR(extack, attr); err = -EOPNOTSUPP; goto err_hooks; } if (!(table->flags & NFT_TABLE_F_DORMANT) && nft_is_base_chain(chain) && !list_empty(&hook.list)) { basechain = nft_base_chain(chain); ops = &basechain->ops; if (nft_base_chain_netdev(table->family, basechain->ops.hooknum)) { err = nft_netdev_register_hooks(ctx->net, &hook.list); if (err < 0) goto err_hooks; unregister = true; } } if (nla[NFTA_CHAIN_COUNTERS]) { if (!nft_is_base_chain(chain)) { err = -EOPNOTSUPP; goto err_hooks; } stats = nft_stats_alloc(nla[NFTA_CHAIN_COUNTERS]); if (IS_ERR_PCPU(stats)) { err = PTR_ERR_PCPU(stats); goto err_hooks; } } err = -ENOMEM; trans = nft_trans_alloc_chain(ctx, NFT_MSG_NEWCHAIN); if (trans == NULL) goto err_trans; nft_trans_chain_stats(trans) = stats; nft_trans_chain_update(trans) = true; if (nla[NFTA_CHAIN_POLICY]) nft_trans_chain_policy(trans) = policy; else nft_trans_chain_policy(trans) = -1; if (nla[NFTA_CHAIN_HANDLE] && nla[NFTA_CHAIN_NAME]) { struct nftables_pernet *nft_net = 
nft_pernet(ctx->net); struct nft_trans *tmp; char *name; err = -ENOMEM; name = nla_strdup(nla[NFTA_CHAIN_NAME], GFP_KERNEL_ACCOUNT); if (!name) goto err_trans; err = -EEXIST; list_for_each_entry(tmp, &nft_net->commit_list, list) { if (tmp->msg_type == NFT_MSG_NEWCHAIN && tmp->table == table && nft_trans_chain_update(tmp) && nft_trans_chain_name(tmp) && strcmp(name, nft_trans_chain_name(tmp)) == 0) { NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_NAME]); kfree(name); goto err_trans; } } nft_trans_chain_name(trans) = name; } nft_trans_basechain(trans) = basechain; INIT_LIST_HEAD(&nft_trans_chain_hooks(trans)); list_splice(&hook.list, &nft_trans_chain_hooks(trans)); if (nla[NFTA_CHAIN_HOOK]) module_put(hook.type->owner); nft_trans_commit_list_add_tail(ctx->net, trans); return 0; err_trans: free_percpu(stats); kfree(trans); err_hooks: if (nla[NFTA_CHAIN_HOOK]) { list_for_each_entry_safe(h, next, &hook.list, list) { if (unregister) { list_for_each_entry(ops, &h->ops_list, list) nf_unregister_net_hook(ctx->net, ops); } list_del(&h->list); nft_netdev_hook_free_rcu(h); } module_put(hook.type->owner); } return err; } static struct nft_chain *nft_chain_lookup_byid(const struct net *net, const struct nft_table *table, const struct nlattr *nla, u8 genmask) { struct nftables_pernet *nft_net = nft_pernet(net); u32 id = ntohl(nla_get_be32(nla)); struct nft_trans *trans; list_for_each_entry(trans, &nft_net->commit_list, list) { if (trans->msg_type == NFT_MSG_NEWCHAIN && nft_trans_chain(trans)->table == table && id == nft_trans_chain_id(trans) && nft_active_genmask(nft_trans_chain(trans), genmask)) return nft_trans_chain(trans); } return ERR_PTR(-ENOENT); } static int nf_tables_newchain(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct nftables_pernet *nft_net = nft_pernet(info->net); struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_next(info->net); u8 family = info->nfmsg->nfgen_family; struct nft_chain *chain = NULL; struct net *net = info->net; const struct nlattr *attr; struct nft_table *table; u8 policy = NF_ACCEPT; struct nft_ctx ctx; u64 handle = 0; u32 flags = 0; lockdep_assert_held(&nft_net->commit_mutex); table = nft_table_lookup(net, nla[NFTA_CHAIN_TABLE], family, genmask, NETLINK_CB(skb).portid); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_TABLE]); return PTR_ERR(table); } chain = NULL; attr = nla[NFTA_CHAIN_NAME]; if (nla[NFTA_CHAIN_HANDLE]) { handle = be64_to_cpu(nla_get_be64(nla[NFTA_CHAIN_HANDLE])); chain = nft_chain_lookup_byhandle(table, handle, genmask); if (IS_ERR(chain)) { NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_HANDLE]); return PTR_ERR(chain); } attr = nla[NFTA_CHAIN_HANDLE]; } else if (nla[NFTA_CHAIN_NAME]) { chain = nft_chain_lookup(net, table, attr, genmask); if (IS_ERR(chain)) { if (PTR_ERR(chain) != -ENOENT) { NL_SET_BAD_ATTR(extack, attr); return PTR_ERR(chain); } chain = NULL; } } else if (!nla[NFTA_CHAIN_ID]) { return -EINVAL; } if (nla[NFTA_CHAIN_POLICY]) { if (chain != NULL && !nft_is_base_chain(chain)) { NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_POLICY]); return -EOPNOTSUPP; } if (chain == NULL && nla[NFTA_CHAIN_HOOK] == NULL) { NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_POLICY]); return -EOPNOTSUPP; } policy = ntohl(nla_get_be32(nla[NFTA_CHAIN_POLICY])); switch (policy) { case NF_DROP: case NF_ACCEPT: break; default: return -EINVAL; } } if (nla[NFTA_CHAIN_FLAGS]) flags = ntohl(nla_get_be32(nla[NFTA_CHAIN_FLAGS])); else if (chain) flags = chain->flags; if (flags & ~NFT_CHAIN_FLAGS) return -EOPNOTSUPP; 
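	/* For illustration: from here the request is dispatched either to
	 * nf_tables_updchain() when the chain already exists, or to
	 * nf_tables_addchain() when it does not. As a rough sketch, a base
	 * chain created from userspace with something like
	 *
	 *	nft add chain inet filter input \
	 *		'{ type filter hook input priority 0; policy accept; }'
	 *
	 * (nft command syntax shown only as an approximation) arrives with
	 * NFTA_CHAIN_TABLE, NFTA_CHAIN_NAME, NFTA_CHAIN_HOOK (nesting
	 * NFTA_HOOK_HOOKNUM and NFTA_HOOK_PRIORITY), NFTA_CHAIN_TYPE and
	 * NFTA_CHAIN_POLICY set, so nf_tables_addchain() takes the
	 * base-chain branch.
	 */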
nft_ctx_init(&ctx, net, skb, info->nlh, family, table, chain, nla); if (chain != NULL) { if (chain->flags & NFT_CHAIN_BINDING) return -EINVAL; if (info->nlh->nlmsg_flags & NLM_F_EXCL) { NL_SET_BAD_ATTR(extack, attr); return -EEXIST; } if (info->nlh->nlmsg_flags & NLM_F_REPLACE) return -EOPNOTSUPP; flags |= chain->flags & NFT_CHAIN_BASE; return nf_tables_updchain(&ctx, genmask, policy, flags, attr, extack); } return nf_tables_addchain(&ctx, family, policy, flags, extack); } static int nft_delchain_hook(struct nft_ctx *ctx, struct nft_base_chain *basechain, struct netlink_ext_ack *extack) { const struct nft_chain *chain = &basechain->chain; const struct nlattr * const *nla = ctx->nla; struct nft_chain_hook chain_hook = {}; struct nft_hook *this, *hook; LIST_HEAD(chain_del_list); struct nft_trans *trans; int err; if (ctx->table->flags & __NFT_TABLE_F_UPDATE) return -EOPNOTSUPP; err = nft_chain_parse_hook(ctx->net, basechain, nla, &chain_hook, ctx->family, chain->flags, extack); if (err < 0) return err; list_for_each_entry(this, &chain_hook.list, list) { hook = nft_hook_list_find(&basechain->hook_list, this); if (!hook) { err = -ENOENT; goto err_chain_del_hook; } list_move(&hook->list, &chain_del_list); } trans = nft_trans_alloc_chain(ctx, NFT_MSG_DELCHAIN); if (!trans) { err = -ENOMEM; goto err_chain_del_hook; } nft_trans_basechain(trans) = basechain; nft_trans_chain_update(trans) = true; INIT_LIST_HEAD(&nft_trans_chain_hooks(trans)); list_splice(&chain_del_list, &nft_trans_chain_hooks(trans)); nft_chain_release_hook(&chain_hook); nft_trans_commit_list_add_tail(ctx->net, trans); return 0; err_chain_del_hook: list_splice(&chain_del_list, &basechain->hook_list); nft_chain_release_hook(&chain_hook); return err; } static int nf_tables_delchain(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_next(info->net); u8 family = info->nfmsg->nfgen_family; struct net *net = info->net; const struct nlattr *attr; struct nft_table *table; struct nft_chain *chain; struct nft_rule *rule; struct nft_ctx ctx; u64 handle; u32 use; int err; table = nft_table_lookup(net, nla[NFTA_CHAIN_TABLE], family, genmask, NETLINK_CB(skb).portid); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_TABLE]); return PTR_ERR(table); } if (nla[NFTA_CHAIN_HANDLE]) { attr = nla[NFTA_CHAIN_HANDLE]; handle = be64_to_cpu(nla_get_be64(attr)); chain = nft_chain_lookup_byhandle(table, handle, genmask); } else { attr = nla[NFTA_CHAIN_NAME]; chain = nft_chain_lookup(net, table, attr, genmask); } if (IS_ERR(chain)) { if (PTR_ERR(chain) == -ENOENT && NFNL_MSG_TYPE(info->nlh->nlmsg_type) == NFT_MSG_DESTROYCHAIN) return 0; NL_SET_BAD_ATTR(extack, attr); return PTR_ERR(chain); } if (nft_chain_binding(chain)) return -EOPNOTSUPP; nft_ctx_init(&ctx, net, skb, info->nlh, family, table, chain, nla); if (nla[NFTA_CHAIN_HOOK]) { if (NFNL_MSG_TYPE(info->nlh->nlmsg_type) == NFT_MSG_DESTROYCHAIN || chain->flags & NFT_CHAIN_HW_OFFLOAD) return -EOPNOTSUPP; if (nft_is_base_chain(chain)) { struct nft_base_chain *basechain = nft_base_chain(chain); if (nft_base_chain_netdev(table->family, basechain->ops.hooknum)) return nft_delchain_hook(&ctx, basechain, extack); } } if (info->nlh->nlmsg_flags & NLM_F_NONREC && chain->use > 0) return -EBUSY; use = chain->use; list_for_each_entry(rule, &chain->rules, list) { if (!nft_is_active_next(net, rule)) continue; use--; err = nft_delrule(&ctx, rule); if (err < 0) return err; } /* There are rules and 
elements that are still holding references to us, * we cannot do a recursive removal in this case. */ if (use > 0) { NL_SET_BAD_ATTR(extack, attr); return -EBUSY; } return nft_delchain(&ctx); } /* * Expressions */ /** * nft_register_expr - register nf_tables expr type * @type: expr type * * Registers the expr type for use with nf_tables. Returns zero on * success or a negative errno code otherwise. */ int nft_register_expr(struct nft_expr_type *type) { if (WARN_ON_ONCE(type->maxattr > NFT_EXPR_MAXATTR)) return -ENOMEM; nfnl_lock(NFNL_SUBSYS_NFTABLES); if (type->family == NFPROTO_UNSPEC) list_add_tail_rcu(&type->list, &nf_tables_expressions); else list_add_rcu(&type->list, &nf_tables_expressions); nfnl_unlock(NFNL_SUBSYS_NFTABLES); return 0; } EXPORT_SYMBOL_GPL(nft_register_expr); /** * nft_unregister_expr - unregister nf_tables expr type * @type: expr type * * Unregisters the expr type for use with nf_tables. */ void nft_unregister_expr(struct nft_expr_type *type) { nfnl_lock(NFNL_SUBSYS_NFTABLES); list_del_rcu(&type->list); nfnl_unlock(NFNL_SUBSYS_NFTABLES); } EXPORT_SYMBOL_GPL(nft_unregister_expr); static const struct nft_expr_type *__nft_expr_type_get(u8 family, struct nlattr *nla) { const struct nft_expr_type *type, *candidate = NULL; list_for_each_entry_rcu(type, &nf_tables_expressions, list) { if (!nla_strcmp(nla, type->name)) { if (!type->family && !candidate) candidate = type; else if (type->family == family) candidate = type; } } return candidate; } #ifdef CONFIG_MODULES static int nft_expr_type_request_module(struct net *net, u8 family, struct nlattr *nla) { if (nft_request_module(net, "nft-expr-%u-%.*s", family, nla_len(nla), (char *)nla_data(nla)) == -EAGAIN) return -EAGAIN; return 0; } #endif static const struct nft_expr_type *nft_expr_type_get(struct net *net, u8 family, struct nlattr *nla) { const struct nft_expr_type *type; if (nla == NULL) return ERR_PTR(-EINVAL); rcu_read_lock(); type = __nft_expr_type_get(family, nla); if (type != NULL && try_module_get(type->owner)) { rcu_read_unlock(); return type; } rcu_read_unlock(); lockdep_nfnl_nft_mutex_not_held(); #ifdef CONFIG_MODULES if (type == NULL) { if (nft_expr_type_request_module(net, family, nla) == -EAGAIN) return ERR_PTR(-EAGAIN); if (nft_request_module(net, "nft-expr-%.*s", nla_len(nla), (char *)nla_data(nla)) == -EAGAIN) return ERR_PTR(-EAGAIN); } #endif return ERR_PTR(-ENOENT); } static const struct nla_policy nft_expr_policy[NFTA_EXPR_MAX + 1] = { [NFTA_EXPR_NAME] = { .type = NLA_STRING, .len = NFT_MODULE_AUTOLOAD_LIMIT }, [NFTA_EXPR_DATA] = { .type = NLA_NESTED }, }; static int nf_tables_fill_expr_info(struct sk_buff *skb, const struct nft_expr *expr, bool reset) { if (nla_put_string(skb, NFTA_EXPR_NAME, expr->ops->type->name)) goto nla_put_failure; if (expr->ops->dump) { struct nlattr *data = nla_nest_start_noflag(skb, NFTA_EXPR_DATA); if (data == NULL) goto nla_put_failure; if (expr->ops->dump(skb, expr, reset) < 0) goto nla_put_failure; nla_nest_end(skb, data); } return skb->len; nla_put_failure: return -1; }; int nft_expr_dump(struct sk_buff *skb, unsigned int attr, const struct nft_expr *expr, bool reset) { struct nlattr *nest; nest = nla_nest_start_noflag(skb, attr); if (!nest) goto nla_put_failure; if (nf_tables_fill_expr_info(skb, expr, reset) < 0) goto nla_put_failure; nla_nest_end(skb, nest); return 0; nla_put_failure: return -1; } struct nft_expr_info { const struct nft_expr_ops *ops; const struct nlattr *attr; struct nlattr *tb[NFT_EXPR_MAXATTR + 1]; }; static int nf_tables_expr_parse(const struct
nft_ctx *ctx, const struct nlattr *nla, struct nft_expr_info *info) { const struct nft_expr_type *type; const struct nft_expr_ops *ops; struct nlattr *tb[NFTA_EXPR_MAX + 1]; int err; err = nla_parse_nested_deprecated(tb, NFTA_EXPR_MAX, nla, nft_expr_policy, NULL); if (err < 0) return err; type = nft_expr_type_get(ctx->net, ctx->family, tb[NFTA_EXPR_NAME]); if (IS_ERR(type)) return PTR_ERR(type); if (tb[NFTA_EXPR_DATA]) { err = nla_parse_nested_deprecated(info->tb, type->maxattr, tb[NFTA_EXPR_DATA], type->policy, NULL); if (err < 0) goto err1; } else memset(info->tb, 0, sizeof(info->tb[0]) * (type->maxattr + 1)); if (type->select_ops != NULL) { ops = type->select_ops(ctx, (const struct nlattr * const *)info->tb); if (IS_ERR(ops)) { err = PTR_ERR(ops); #ifdef CONFIG_MODULES if (err == -EAGAIN) if (nft_expr_type_request_module(ctx->net, ctx->family, tb[NFTA_EXPR_NAME]) != -EAGAIN) err = -ENOENT; #endif goto err1; } } else ops = type->ops; info->attr = nla; info->ops = ops; return 0; err1: module_put(type->owner); return err; } int nft_expr_inner_parse(const struct nft_ctx *ctx, const struct nlattr *nla, struct nft_expr_info *info) { struct nlattr *tb[NFTA_EXPR_MAX + 1]; const struct nft_expr_type *type; int err; err = nla_parse_nested_deprecated(tb, NFTA_EXPR_MAX, nla, nft_expr_policy, NULL); if (err < 0) return err; if (!tb[NFTA_EXPR_DATA] || !tb[NFTA_EXPR_NAME]) return -EINVAL; rcu_read_lock(); type = __nft_expr_type_get(ctx->family, tb[NFTA_EXPR_NAME]); if (!type) { err = -ENOENT; goto out_unlock; } if (!type->inner_ops) { err = -EOPNOTSUPP; goto out_unlock; } err = nla_parse_nested_deprecated(info->tb, type->maxattr, tb[NFTA_EXPR_DATA], type->policy, NULL); if (err < 0) goto out_unlock; info->attr = nla; info->ops = type->inner_ops; /* No module reference will be taken on type->owner. * Presence of type->inner_ops implies that the expression * is builtin, so it cannot go away. 
*/ rcu_read_unlock(); return 0; out_unlock: rcu_read_unlock(); return err; } static int nf_tables_newexpr(const struct nft_ctx *ctx, const struct nft_expr_info *expr_info, struct nft_expr *expr) { const struct nft_expr_ops *ops = expr_info->ops; int err; expr->ops = ops; if (ops->init) { err = ops->init(ctx, expr, (const struct nlattr **)expr_info->tb); if (err < 0) goto err1; } return 0; err1: expr->ops = NULL; return err; } static void nf_tables_expr_destroy(const struct nft_ctx *ctx, struct nft_expr *expr) { const struct nft_expr_type *type = expr->ops->type; if (expr->ops->destroy) expr->ops->destroy(ctx, expr); module_put(type->owner); } static struct nft_expr *nft_expr_init(const struct nft_ctx *ctx, const struct nlattr *nla) { struct nft_expr_info expr_info; struct nft_expr *expr; struct module *owner; int err; err = nf_tables_expr_parse(ctx, nla, &expr_info); if (err < 0) goto err_expr_parse; err = -EOPNOTSUPP; if (!(expr_info.ops->type->flags & NFT_EXPR_STATEFUL)) goto err_expr_stateful; err = -ENOMEM; expr = kzalloc(expr_info.ops->size, GFP_KERNEL_ACCOUNT); if (expr == NULL) goto err_expr_stateful; err = nf_tables_newexpr(ctx, &expr_info, expr); if (err < 0) goto err_expr_new; return expr; err_expr_new: kfree(expr); err_expr_stateful: owner = expr_info.ops->type->owner; if (expr_info.ops->type->release_ops) expr_info.ops->type->release_ops(expr_info.ops); module_put(owner); err_expr_parse: return ERR_PTR(err); } int nft_expr_clone(struct nft_expr *dst, struct nft_expr *src, gfp_t gfp) { int err; if (WARN_ON_ONCE(!src->ops->clone)) return -EINVAL; dst->ops = src->ops; err = src->ops->clone(dst, src, gfp); if (err < 0) return err; __module_get(src->ops->type->owner); return 0; } void nft_expr_destroy(const struct nft_ctx *ctx, struct nft_expr *expr) { nf_tables_expr_destroy(ctx, expr); kfree(expr); } /* * Rules */ static struct nft_rule *__nft_rule_lookup(const struct net *net, const struct nft_chain *chain, u64 handle) { struct nft_rule *rule; // FIXME: this sucks list_for_each_entry_rcu(rule, &chain->rules, list, lockdep_commit_lock_is_held(net)) { if (handle == rule->handle) return rule; } return ERR_PTR(-ENOENT); } static struct nft_rule *nft_rule_lookup(const struct net *net, const struct nft_chain *chain, const struct nlattr *nla) { if (nla == NULL) return ERR_PTR(-EINVAL); return __nft_rule_lookup(net, chain, be64_to_cpu(nla_get_be64(nla))); } static const struct nla_policy nft_rule_policy[NFTA_RULE_MAX + 1] = { [NFTA_RULE_TABLE] = { .type = NLA_STRING, .len = NFT_TABLE_MAXNAMELEN - 1 }, [NFTA_RULE_CHAIN] = { .type = NLA_STRING, .len = NFT_CHAIN_MAXNAMELEN - 1 }, [NFTA_RULE_HANDLE] = { .type = NLA_U64 }, [NFTA_RULE_EXPRESSIONS] = NLA_POLICY_NESTED_ARRAY(nft_expr_policy), [NFTA_RULE_COMPAT] = { .type = NLA_NESTED }, [NFTA_RULE_POSITION] = { .type = NLA_U64 }, [NFTA_RULE_USERDATA] = { .type = NLA_BINARY, .len = NFT_USERDATA_MAXLEN }, [NFTA_RULE_ID] = { .type = NLA_U32 }, [NFTA_RULE_POSITION_ID] = { .type = NLA_U32 }, [NFTA_RULE_CHAIN_ID] = { .type = NLA_U32 }, }; static int nf_tables_fill_rule_info(struct sk_buff *skb, struct net *net, u32 portid, u32 seq, int event, u32 flags, int family, const struct nft_table *table, const struct nft_chain *chain, const struct nft_rule *rule, u64 handle, bool reset) { struct nlmsghdr *nlh; const struct nft_expr *expr, *next; struct nlattr *list; u16 type = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event); nlh = nfnl_msg_put(skb, portid, seq, type, flags, family, NFNETLINK_V0, nft_base_seq_be16(net)); if (!nlh) goto nla_put_failure; if 
(nla_put_string(skb, NFTA_RULE_TABLE, table->name)) goto nla_put_failure; if (nla_put_string(skb, NFTA_RULE_CHAIN, chain->name)) goto nla_put_failure; if (nla_put_be64(skb, NFTA_RULE_HANDLE, cpu_to_be64(rule->handle), NFTA_RULE_PAD)) goto nla_put_failure; if (event != NFT_MSG_DELRULE && handle) { if (nla_put_be64(skb, NFTA_RULE_POSITION, cpu_to_be64(handle), NFTA_RULE_PAD)) goto nla_put_failure; } if (chain->flags & NFT_CHAIN_HW_OFFLOAD) nft_flow_rule_stats(chain, rule); list = nla_nest_start_noflag(skb, NFTA_RULE_EXPRESSIONS); if (list == NULL) goto nla_put_failure; nft_rule_for_each_expr(expr, next, rule) { if (nft_expr_dump(skb, NFTA_LIST_ELEM, expr, reset) < 0) goto nla_put_failure; } nla_nest_end(skb, list); if (rule->udata) { struct nft_userdata *udata = nft_userdata(rule); if (nla_put(skb, NFTA_RULE_USERDATA, udata->len + 1, udata->data) < 0) goto nla_put_failure; } nlmsg_end(skb, nlh); return 0; nla_put_failure: nlmsg_trim(skb, nlh); return -1; } static void nf_tables_rule_notify(const struct nft_ctx *ctx, const struct nft_rule *rule, int event) { struct nftables_pernet *nft_net = nft_pernet(ctx->net); const struct nft_rule *prule; struct sk_buff *skb; u64 handle = 0; u16 flags = 0; int err; if (!ctx->report && !nfnetlink_has_listeners(ctx->net, NFNLGRP_NFTABLES)) return; skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL); if (skb == NULL) goto err; if (event == NFT_MSG_NEWRULE && !list_is_first(&rule->list, &ctx->chain->rules) && !list_is_last(&rule->list, &ctx->chain->rules)) { prule = list_prev_entry(rule, list); handle = prule->handle; } if (ctx->flags & (NLM_F_APPEND | NLM_F_REPLACE)) flags |= NLM_F_APPEND; if (ctx->flags & (NLM_F_CREATE | NLM_F_EXCL)) flags |= ctx->flags & (NLM_F_CREATE | NLM_F_EXCL); err = nf_tables_fill_rule_info(skb, ctx->net, ctx->portid, ctx->seq, event, flags, ctx->family, ctx->table, ctx->chain, rule, handle, false); if (err < 0) { kfree_skb(skb); goto err; } nft_notify_enqueue(skb, ctx->report, &nft_net->notify_list); return; err: nfnetlink_set_err(ctx->net, ctx->portid, NFNLGRP_NFTABLES, -ENOBUFS); } static void audit_log_rule_reset(const struct nft_table *table, unsigned int base_seq, unsigned int nentries) { char *buf = kasprintf(GFP_ATOMIC, "%s:%u", table->name, base_seq); audit_log_nfcfg(buf, table->family, nentries, AUDIT_NFT_OP_RULE_RESET, GFP_ATOMIC); kfree(buf); } struct nft_rule_dump_ctx { unsigned int s_idx; char *table; char *chain; bool reset; }; static int __nf_tables_dump_rules(struct sk_buff *skb, unsigned int *idx, struct netlink_callback *cb, const struct nft_table *table, const struct nft_chain *chain) { struct nft_rule_dump_ctx *ctx = (void *)cb->ctx; struct net *net = sock_net(skb->sk); const struct nft_rule *rule, *prule; unsigned int entries = 0; int ret = 0; u64 handle; prule = NULL; list_for_each_entry_rcu(rule, &chain->rules, list) { if (!nft_is_active(net, rule)) goto cont_skip; if (*idx < ctx->s_idx) goto cont; if (prule) handle = prule->handle; else handle = 0; if (nf_tables_fill_rule_info(skb, net, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, NFT_MSG_NEWRULE, NLM_F_MULTI | NLM_F_APPEND, table->family, table, chain, rule, handle, ctx->reset) < 0) { ret = 1; break; } entries++; nl_dump_check_consistent(cb, nlmsg_hdr(skb)); cont: prule = rule; cont_skip: (*idx)++; } if (ctx->reset && entries) audit_log_rule_reset(table, cb->seq, entries); return ret; } static int nf_tables_dump_rules(struct sk_buff *skb, struct netlink_callback *cb) { const struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh); struct nft_rule_dump_ctx *ctx = (void 
*)cb->ctx; struct nft_table *table; const struct nft_chain *chain; unsigned int idx = 0; struct net *net = sock_net(skb->sk); int family = nfmsg->nfgen_family; struct nftables_pernet *nft_net; rcu_read_lock(); nft_net = nft_pernet(net); cb->seq = nft_base_seq(net); list_for_each_entry_rcu(table, &nft_net->tables, list) { if (family != NFPROTO_UNSPEC && family != table->family) continue; if (ctx->table && strcmp(ctx->table, table->name) != 0) continue; if (ctx->table && ctx->chain) { struct rhlist_head *list, *tmp; list = rhltable_lookup(&table->chains_ht, ctx->chain, nft_chain_ht_params); if (!list) goto done; rhl_for_each_entry_rcu(chain, tmp, list, rhlhead) { if (!nft_is_active(net, chain)) continue; __nf_tables_dump_rules(skb, &idx, cb, table, chain); break; } goto done; } list_for_each_entry_rcu(chain, &table->chains, list) { if (__nf_tables_dump_rules(skb, &idx, cb, table, chain)) goto done; } if (ctx->table) break; } done: rcu_read_unlock(); ctx->s_idx = idx; return skb->len; } static int nf_tables_dumpreset_rules(struct sk_buff *skb, struct netlink_callback *cb) { struct nftables_pernet *nft_net = nft_pernet(sock_net(skb->sk)); int ret; /* The mutex is held to prevent two concurrent dump-and-reset calls from * underrunning counters and quotas. The commit_mutex is used for lack of * a better lock; this is not the transaction path. */ mutex_lock(&nft_net->commit_mutex); ret = nf_tables_dump_rules(skb, cb); mutex_unlock(&nft_net->commit_mutex); return ret; } static int nf_tables_dump_rules_start(struct netlink_callback *cb) { struct nft_rule_dump_ctx *ctx = (void *)cb->ctx; const struct nlattr * const *nla = cb->data; BUILD_BUG_ON(sizeof(*ctx) > sizeof(cb->ctx)); if (nla[NFTA_RULE_TABLE]) { ctx->table = nla_strdup(nla[NFTA_RULE_TABLE], GFP_ATOMIC); if (!ctx->table) return -ENOMEM; } if (nla[NFTA_RULE_CHAIN]) { ctx->chain = nla_strdup(nla[NFTA_RULE_CHAIN], GFP_ATOMIC); if (!ctx->chain) { kfree(ctx->table); return -ENOMEM; } } return 0; } static int nf_tables_dumpreset_rules_start(struct netlink_callback *cb) { struct nft_rule_dump_ctx *ctx = (void *)cb->ctx; ctx->reset = true; return nf_tables_dump_rules_start(cb); } static int nf_tables_dump_rules_done(struct netlink_callback *cb) { struct nft_rule_dump_ctx *ctx = (void *)cb->ctx; kfree(ctx->table); kfree(ctx->chain); return 0; } /* Caller must hold rcu read lock or transaction mutex */ static struct sk_buff * nf_tables_getrule_single(u32 portid, const struct nfnl_info *info, const struct nlattr * const nla[], bool reset) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_cur(info->net); u8 family = info->nfmsg->nfgen_family; const struct nft_chain *chain; const struct nft_rule *rule; struct net *net = info->net; struct nft_table *table; struct sk_buff *skb2; int err; table = nft_table_lookup(net, nla[NFTA_RULE_TABLE], family, genmask, 0); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_TABLE]); return ERR_CAST(table); } chain = nft_chain_lookup(net, table, nla[NFTA_RULE_CHAIN], genmask); if (IS_ERR(chain)) { NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_CHAIN]); return ERR_CAST(chain); } rule = nft_rule_lookup(net, chain, nla[NFTA_RULE_HANDLE]); if (IS_ERR(rule)) { NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_HANDLE]); return ERR_CAST(rule); } skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC); if (!skb2) return ERR_PTR(-ENOMEM); err = nf_tables_fill_rule_info(skb2, net, portid, info->nlh->nlmsg_seq, NFT_MSG_NEWRULE, 0, family, table, chain, rule, 0, reset); if (err < 0) { kfree_skb(skb2); return ERR_PTR(err); } return
skb2; } static int nf_tables_getrule(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { u32 portid = NETLINK_CB(skb).portid; struct net *net = info->net; struct sk_buff *skb2; if (info->nlh->nlmsg_flags & NLM_F_DUMP) { struct netlink_dump_control c = { .start= nf_tables_dump_rules_start, .dump = nf_tables_dump_rules, .done = nf_tables_dump_rules_done, .module = THIS_MODULE, .data = (void *)nla, }; return nft_netlink_dump_start_rcu(info->sk, skb, info->nlh, &c); } skb2 = nf_tables_getrule_single(portid, info, nla, false); if (IS_ERR(skb2)) return PTR_ERR(skb2); return nfnetlink_unicast(skb2, net, portid); } static int nf_tables_getrule_reset(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct nftables_pernet *nft_net = nft_pernet(info->net); u32 portid = NETLINK_CB(skb).portid; struct net *net = info->net; struct sk_buff *skb2; char *buf; if (info->nlh->nlmsg_flags & NLM_F_DUMP) { struct netlink_dump_control c = { .start= nf_tables_dumpreset_rules_start, .dump = nf_tables_dumpreset_rules, .done = nf_tables_dump_rules_done, .module = THIS_MODULE, .data = (void *)nla, }; return nft_netlink_dump_start_rcu(info->sk, skb, info->nlh, &c); } if (!try_module_get(THIS_MODULE)) return -EINVAL; rcu_read_unlock(); mutex_lock(&nft_net->commit_mutex); skb2 = nf_tables_getrule_single(portid, info, nla, true); mutex_unlock(&nft_net->commit_mutex); rcu_read_lock(); module_put(THIS_MODULE); if (IS_ERR(skb2)) return PTR_ERR(skb2); buf = kasprintf(GFP_ATOMIC, "%.*s:%u", nla_len(nla[NFTA_RULE_TABLE]), (char *)nla_data(nla[NFTA_RULE_TABLE]), nft_base_seq(net)); audit_log_nfcfg(buf, info->nfmsg->nfgen_family, 1, AUDIT_NFT_OP_RULE_RESET, GFP_ATOMIC); kfree(buf); return nfnetlink_unicast(skb2, net, portid); } void nf_tables_rule_destroy(const struct nft_ctx *ctx, struct nft_rule *rule) { struct nft_expr *expr, *next; /* * Careful: some expressions might not be initialized in case this * is called on error from nf_tables_newrule(). */ expr = nft_expr_first(rule); while (nft_expr_more(rule, expr)) { next = nft_expr_next(expr); nf_tables_expr_destroy(ctx, expr); expr = next; } kfree(rule); } /* can only be used if rule is no longer visible to dumps */ static void nf_tables_rule_release(const struct nft_ctx *ctx, struct nft_rule *rule) { WARN_ON_ONCE(!lockdep_commit_lock_is_held(ctx->net)); nft_rule_expr_deactivate(ctx, rule, NFT_TRANS_RELEASE); nf_tables_rule_destroy(ctx, rule); } static void nft_chain_vstate_update(const struct nft_ctx *ctx, struct nft_chain *chain) { const struct nft_base_chain *base_chain; enum nft_chain_types type; u8 hooknum; /* ctx->chain must hold the calling base chain. */ if (WARN_ON_ONCE(!nft_is_base_chain(ctx->chain))) { memset(&chain->vstate, 0, sizeof(chain->vstate)); return; } base_chain = nft_base_chain(ctx->chain); hooknum = base_chain->ops.hooknum; type = base_chain->type->type; BUILD_BUG_ON(BIT(NF_INET_NUMHOOKS) > U8_MAX); chain->vstate.hook_mask[type] |= BIT(hooknum); if (chain->vstate.depth < ctx->level) chain->vstate.depth = ctx->level; } /** nft_chain_validate - loop detection and hook validation * * @ctx: context containing call depth and base chain * @chain: chain to validate * * Walk through the rules of the given chain and chase all jumps/gotos * and set lookups until either the jump limit is hit or all reachable * chains have been validated. 
*/ int nft_chain_validate(const struct nft_ctx *ctx, struct nft_chain *chain) { struct nft_expr *expr, *last; struct nft_rule *rule; int err; BUILD_BUG_ON(NFT_JUMP_STACK_SIZE > 255); if (ctx->level == NFT_JUMP_STACK_SIZE) return -EMLINK; if (ctx->level > 0) { /* jumps to base chains are not allowed. */ if (nft_is_base_chain(chain)) return -ELOOP; if (nft_chain_vstate_valid(ctx, chain)) return 0; } list_for_each_entry(rule, &chain->rules, list) { if (fatal_signal_pending(current)) return -EINTR; if (!nft_is_active_next(ctx->net, rule)) continue; nft_rule_for_each_expr(expr, last, rule) { if (!expr->ops->validate) continue; /* This may call nft_chain_validate() recursively, * callers that do so must increment ctx->level. */ err = expr->ops->validate(ctx, expr); if (err < 0) return err; } cond_resched(); } nft_chain_vstate_update(ctx, chain); return 0; } EXPORT_SYMBOL_GPL(nft_chain_validate); static int nft_table_validate(struct net *net, const struct nft_table *table) { struct nft_chain *chain; struct nft_ctx ctx = { .net = net, .family = table->family, }; int err = 0; list_for_each_entry(chain, &table->chains, list) { if (!nft_is_base_chain(chain)) continue; ctx.chain = chain; err = nft_chain_validate(&ctx, chain); if (err < 0) goto err; } err: list_for_each_entry(chain, &table->chains, list) memset(&chain->vstate, 0, sizeof(chain->vstate)); return err; } int nft_setelem_validate(const struct nft_ctx *ctx, struct nft_set *set, const struct nft_set_iter *iter, struct nft_elem_priv *elem_priv) { const struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); struct nft_ctx *pctx = (struct nft_ctx *)ctx; const struct nft_data *data; int err; if (!nft_set_elem_active(ext, iter->genmask)) return 0; if (nft_set_ext_exists(ext, NFT_SET_EXT_FLAGS) && *nft_set_ext_flags(ext) & NFT_SET_ELEM_INTERVAL_END) return 0; data = nft_set_ext_data(ext); switch (data->verdict.code) { case NFT_JUMP: case NFT_GOTO: pctx->level++; err = nft_chain_validate(ctx, data->verdict.chain); if (err < 0) return err; pctx->level--; break; default: break; } return 0; } int nft_set_catchall_validate(const struct nft_ctx *ctx, struct nft_set *set) { struct nft_set_iter dummy_iter = { .genmask = nft_genmask_next(ctx->net), }; struct nft_set_elem_catchall *catchall; struct nft_set_ext *ext; int ret = 0; list_for_each_entry_rcu(catchall, &set->catchall_list, list, lockdep_commit_lock_is_held(ctx->net)) { ext = nft_set_elem_ext(set, catchall->elem); if (!nft_set_elem_active(ext, dummy_iter.genmask)) continue; ret = nft_setelem_validate(ctx, set, &dummy_iter, catchall->elem); if (ret < 0) return ret; } return ret; } static struct nft_rule *nft_rule_lookup_byid(const struct net *net, const struct nft_chain *chain, const struct nlattr *nla); #define NFT_RULE_MAXEXPRS 128 static int nf_tables_newrule(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct nftables_pernet *nft_net = nft_pernet(info->net); struct netlink_ext_ack *extack = info->extack; unsigned int size, i, n, ulen = 0, usize = 0; u8 genmask = nft_genmask_next(info->net); struct nft_rule *rule, *old_rule = NULL; struct nft_expr_info *expr_info = NULL; u8 family = info->nfmsg->nfgen_family; struct nft_flow_rule *flow = NULL; struct net *net = info->net; struct nft_userdata *udata; struct nft_table *table; struct nft_chain *chain; struct nft_trans *trans; u64 handle, pos_handle; struct nft_expr *expr; struct nft_ctx ctx; struct nlattr *tmp; int err, rem; lockdep_assert_held(&nft_net->commit_mutex); table = nft_table_lookup(net, 
nla[NFTA_RULE_TABLE], family, genmask, NETLINK_CB(skb).portid); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_TABLE]); return PTR_ERR(table); } if (nla[NFTA_RULE_CHAIN]) { chain = nft_chain_lookup(net, table, nla[NFTA_RULE_CHAIN], genmask); if (IS_ERR(chain)) { NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_CHAIN]); return PTR_ERR(chain); } } else if (nla[NFTA_RULE_CHAIN_ID]) { chain = nft_chain_lookup_byid(net, table, nla[NFTA_RULE_CHAIN_ID], genmask); if (IS_ERR(chain)) { NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_CHAIN_ID]); return PTR_ERR(chain); } } else { return -EINVAL; } if (nft_chain_is_bound(chain)) return -EOPNOTSUPP; if (nla[NFTA_RULE_HANDLE]) { handle = be64_to_cpu(nla_get_be64(nla[NFTA_RULE_HANDLE])); rule = __nft_rule_lookup(net, chain, handle); if (IS_ERR(rule)) { NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_HANDLE]); return PTR_ERR(rule); } if (info->nlh->nlmsg_flags & NLM_F_EXCL) { NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_HANDLE]); return -EEXIST; } if (info->nlh->nlmsg_flags & NLM_F_REPLACE) old_rule = rule; else return -EOPNOTSUPP; } else { if (!(info->nlh->nlmsg_flags & NLM_F_CREATE) || info->nlh->nlmsg_flags & NLM_F_REPLACE) return -EINVAL; handle = nf_tables_alloc_handle(table); if (nla[NFTA_RULE_POSITION]) { pos_handle = be64_to_cpu(nla_get_be64(nla[NFTA_RULE_POSITION])); old_rule = __nft_rule_lookup(net, chain, pos_handle); if (IS_ERR(old_rule)) { NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_POSITION]); return PTR_ERR(old_rule); } } else if (nla[NFTA_RULE_POSITION_ID]) { old_rule = nft_rule_lookup_byid(net, chain, nla[NFTA_RULE_POSITION_ID]); if (IS_ERR(old_rule)) { NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_POSITION_ID]); return PTR_ERR(old_rule); } } } nft_ctx_init(&ctx, net, skb, info->nlh, family, table, chain, nla); n = 0; size = 0; if (nla[NFTA_RULE_EXPRESSIONS]) { expr_info = kvmalloc_array(NFT_RULE_MAXEXPRS, sizeof(struct nft_expr_info), GFP_KERNEL); if (!expr_info) return -ENOMEM; nla_for_each_nested(tmp, nla[NFTA_RULE_EXPRESSIONS], rem) { err = -EINVAL; if (nla_type(tmp) != NFTA_LIST_ELEM) goto err_release_expr; if (n == NFT_RULE_MAXEXPRS) goto err_release_expr; err = nf_tables_expr_parse(&ctx, tmp, &expr_info[n]); if (err < 0) { NL_SET_BAD_ATTR(extack, tmp); goto err_release_expr; } size += expr_info[n].ops->size; n++; } } /* Check for overflow of dlen field */ err = -EFBIG; if (size >= 1 << 12) goto err_release_expr; if (nla[NFTA_RULE_USERDATA]) { ulen = nla_len(nla[NFTA_RULE_USERDATA]); if (ulen > 0) usize = sizeof(struct nft_userdata) + ulen; } err = -ENOMEM; rule = kzalloc(sizeof(*rule) + size + usize, GFP_KERNEL_ACCOUNT); if (rule == NULL) goto err_release_expr; nft_activate_next(net, rule); rule->handle = handle; rule->dlen = size; rule->udata = ulen ? 
1 : 0; if (ulen) { udata = nft_userdata(rule); udata->len = ulen - 1; nla_memcpy(udata->data, nla[NFTA_RULE_USERDATA], ulen); } expr = nft_expr_first(rule); for (i = 0; i < n; i++) { err = nf_tables_newexpr(&ctx, &expr_info[i], expr); if (err < 0) { NL_SET_BAD_ATTR(extack, expr_info[i].attr); goto err_release_rule; } if (expr_info[i].ops->validate) nft_validate_state_update(table, NFT_VALIDATE_NEED); expr_info[i].ops = NULL; expr = nft_expr_next(expr); } if (chain->flags & NFT_CHAIN_HW_OFFLOAD) { flow = nft_flow_rule_create(net, rule); if (IS_ERR(flow)) { err = PTR_ERR(flow); goto err_release_rule; } } if (!nft_use_inc(&chain->use)) { err = -EMFILE; goto err_destroy_flow; } if (info->nlh->nlmsg_flags & NLM_F_REPLACE) { if (nft_chain_binding(chain)) { err = -EOPNOTSUPP; goto err_destroy_flow_rule; } err = nft_delrule(&ctx, old_rule); if (err < 0) goto err_destroy_flow_rule; trans = nft_trans_rule_add(&ctx, NFT_MSG_NEWRULE, rule); if (trans == NULL) { err = -ENOMEM; goto err_destroy_flow_rule; } list_add_tail_rcu(&rule->list, &old_rule->list); } else { trans = nft_trans_rule_add(&ctx, NFT_MSG_NEWRULE, rule); if (!trans) { err = -ENOMEM; goto err_destroy_flow_rule; } if (info->nlh->nlmsg_flags & NLM_F_APPEND) { if (old_rule) list_add_rcu(&rule->list, &old_rule->list); else list_add_tail_rcu(&rule->list, &chain->rules); } else { if (old_rule) list_add_tail_rcu(&rule->list, &old_rule->list); else list_add_rcu(&rule->list, &chain->rules); } } kvfree(expr_info); if (flow) nft_trans_flow_rule(trans) = flow; if (table->validate_state == NFT_VALIDATE_DO) return nft_table_validate(net, table); return 0; err_destroy_flow_rule: nft_use_dec_restore(&chain->use); err_destroy_flow: if (flow) nft_flow_rule_destroy(flow); err_release_rule: nft_rule_expr_deactivate(&ctx, rule, NFT_TRANS_PREPARE_ERROR); nf_tables_rule_destroy(&ctx, rule); err_release_expr: for (i = 0; i < n; i++) { if (expr_info[i].ops) { module_put(expr_info[i].ops->type->owner); if (expr_info[i].ops->type->release_ops) expr_info[i].ops->type->release_ops(expr_info[i].ops); } } kvfree(expr_info); return err; } static struct nft_rule *nft_rule_lookup_byid(const struct net *net, const struct nft_chain *chain, const struct nlattr *nla) { struct nftables_pernet *nft_net = nft_pernet(net); u32 id = ntohl(nla_get_be32(nla)); struct nft_trans *trans; list_for_each_entry(trans, &nft_net->commit_list, list) { if (trans->msg_type == NFT_MSG_NEWRULE && nft_trans_rule_chain(trans) == chain && id == nft_trans_rule_id(trans)) return nft_trans_rule(trans); } return ERR_PTR(-ENOENT); } static int nf_tables_delrule(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_next(info->net); u8 family = info->nfmsg->nfgen_family; struct nft_chain *chain = NULL; struct net *net = info->net; struct nft_table *table; struct nft_rule *rule; struct nft_ctx ctx; int err = 0; table = nft_table_lookup(net, nla[NFTA_RULE_TABLE], family, genmask, NETLINK_CB(skb).portid); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_TABLE]); return PTR_ERR(table); } if (nla[NFTA_RULE_CHAIN]) { chain = nft_chain_lookup(net, table, nla[NFTA_RULE_CHAIN], genmask); if (IS_ERR(chain)) { if (PTR_ERR(chain) == -ENOENT && NFNL_MSG_TYPE(info->nlh->nlmsg_type) == NFT_MSG_DESTROYRULE) return 0; NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_CHAIN]); return PTR_ERR(chain); } if (nft_chain_binding(chain)) return -EOPNOTSUPP; } nft_ctx_init(&ctx, net, skb, info->nlh, family, table, chain, nla); if 
(chain) { if (nla[NFTA_RULE_HANDLE]) { rule = nft_rule_lookup(info->net, chain, nla[NFTA_RULE_HANDLE]); if (IS_ERR(rule)) { if (PTR_ERR(rule) == -ENOENT && NFNL_MSG_TYPE(info->nlh->nlmsg_type) == NFT_MSG_DESTROYRULE) return 0; NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_HANDLE]); return PTR_ERR(rule); } err = nft_delrule(&ctx, rule); } else if (nla[NFTA_RULE_ID]) { rule = nft_rule_lookup_byid(net, chain, nla[NFTA_RULE_ID]); if (IS_ERR(rule)) { NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_ID]); return PTR_ERR(rule); } err = nft_delrule(&ctx, rule); } else { err = nft_delrule_by_chain(&ctx); } } else { list_for_each_entry(chain, &table->chains, list) { if (!nft_is_active_next(net, chain)) continue; if (nft_chain_binding(chain)) continue; ctx.chain = chain; err = nft_delrule_by_chain(&ctx); if (err < 0) break; } } return err; } /* * Sets */ static const struct nft_set_type *nft_set_types[] = { &nft_set_hash_fast_type, &nft_set_hash_type, &nft_set_rhash_type, &nft_set_bitmap_type, &nft_set_rbtree_type, #if defined(CONFIG_X86_64) && !defined(CONFIG_UML) &nft_set_pipapo_avx2_type, #endif &nft_set_pipapo_type, }; #define NFT_SET_FEATURES (NFT_SET_INTERVAL | NFT_SET_MAP | \ NFT_SET_TIMEOUT | NFT_SET_OBJECT | \ NFT_SET_EVAL) static bool nft_set_ops_candidate(const struct nft_set_type *type, u32 flags) { return (flags & type->features) == (flags & NFT_SET_FEATURES); } /* * Select a set implementation based on the data characteristics and the * given policy. The total memory use might not be known if no size is * given, in that case the amount of memory per element is used. */ static const struct nft_set_ops * nft_select_set_ops(const struct nft_ctx *ctx, u32 flags, const struct nft_set_desc *desc) { struct nftables_pernet *nft_net = nft_pernet(ctx->net); const struct nft_set_ops *ops, *bops; struct nft_set_estimate est, best; const struct nft_set_type *type; int i; lockdep_assert_held(&nft_net->commit_mutex); lockdep_nfnl_nft_mutex_not_held(); bops = NULL; best.size = ~0; best.lookup = ~0; best.space = ~0; for (i = 0; i < ARRAY_SIZE(nft_set_types); i++) { type = nft_set_types[i]; ops = &type->ops; if (!nft_set_ops_candidate(type, flags)) continue; if (!ops->estimate(desc, flags, &est)) continue; switch (desc->policy) { case NFT_SET_POL_PERFORMANCE: if (est.lookup < best.lookup) break; if (est.lookup == best.lookup && est.space < best.space) break; continue; case NFT_SET_POL_MEMORY: if (!desc->size) { if (est.space < best.space) break; if (est.space == best.space && est.lookup < best.lookup) break; } else if (est.size < best.size || !bops) { break; } continue; default: break; } bops = ops; best = est; } if (bops != NULL) return bops; return ERR_PTR(-EOPNOTSUPP); } static const struct nla_policy nft_set_policy[NFTA_SET_MAX + 1] = { [NFTA_SET_TABLE] = { .type = NLA_STRING, .len = NFT_TABLE_MAXNAMELEN - 1 }, [NFTA_SET_NAME] = { .type = NLA_STRING, .len = NFT_SET_MAXNAMELEN - 1 }, [NFTA_SET_FLAGS] = { .type = NLA_U32 }, [NFTA_SET_KEY_TYPE] = { .type = NLA_U32 }, [NFTA_SET_KEY_LEN] = { .type = NLA_U32 }, [NFTA_SET_DATA_TYPE] = { .type = NLA_U32 }, [NFTA_SET_DATA_LEN] = { .type = NLA_U32 }, [NFTA_SET_POLICY] = { .type = NLA_U32 }, [NFTA_SET_DESC] = { .type = NLA_NESTED }, [NFTA_SET_ID] = { .type = NLA_U32 }, [NFTA_SET_TIMEOUT] = { .type = NLA_U64 }, [NFTA_SET_GC_INTERVAL] = { .type = NLA_U32 }, [NFTA_SET_USERDATA] = { .type = NLA_BINARY, .len = NFT_USERDATA_MAXLEN }, [NFTA_SET_OBJ_TYPE] = { .type = NLA_U32 }, [NFTA_SET_HANDLE] = { .type = NLA_U64 }, [NFTA_SET_EXPR] = { .type = NLA_NESTED }, [NFTA_SET_EXPRESSIONS] = 
NLA_POLICY_NESTED_ARRAY(nft_expr_policy), [NFTA_SET_TYPE] = { .type = NLA_REJECT }, [NFTA_SET_COUNT] = { .type = NLA_REJECT }, }; static const struct nla_policy nft_concat_policy[NFTA_SET_FIELD_MAX + 1] = { [NFTA_SET_FIELD_LEN] = { .type = NLA_U32 }, }; static const struct nla_policy nft_set_desc_policy[NFTA_SET_DESC_MAX + 1] = { [NFTA_SET_DESC_SIZE] = { .type = NLA_U32 }, [NFTA_SET_DESC_CONCAT] = NLA_POLICY_NESTED_ARRAY(nft_concat_policy), }; static struct nft_set *nft_set_lookup(const struct net *net, const struct nft_table *table, const struct nlattr *nla, u8 genmask) { struct nft_set *set; if (nla == NULL) return ERR_PTR(-EINVAL); list_for_each_entry_rcu(set, &table->sets, list, lockdep_commit_lock_is_held(net)) { if (!nla_strcmp(nla, set->name) && nft_active_genmask(set, genmask)) return set; } return ERR_PTR(-ENOENT); } static struct nft_set *nft_set_lookup_byhandle(const struct nft_table *table, const struct nlattr *nla, u8 genmask) { struct nft_set *set; list_for_each_entry(set, &table->sets, list) { if (be64_to_cpu(nla_get_be64(nla)) == set->handle && nft_active_genmask(set, genmask)) return set; } return ERR_PTR(-ENOENT); } static struct nft_set *nft_set_lookup_byid(const struct net *net, const struct nft_table *table, const struct nlattr *nla, u8 genmask) { struct nftables_pernet *nft_net = nft_pernet(net); u32 id = ntohl(nla_get_be32(nla)); struct nft_trans_set *trans; /* its likely the id we need is at the tail, not at start */ list_for_each_entry_reverse(trans, &nft_net->commit_set_list, list_trans_newset) { struct nft_set *set = trans->set; if (id == trans->set_id && set->table == table && nft_active_genmask(set, genmask)) return set; } return ERR_PTR(-ENOENT); } struct nft_set *nft_set_lookup_global(const struct net *net, const struct nft_table *table, const struct nlattr *nla_set_name, const struct nlattr *nla_set_id, u8 genmask) { struct nft_set *set; set = nft_set_lookup(net, table, nla_set_name, genmask); if (IS_ERR(set)) { if (!nla_set_id) return set; set = nft_set_lookup_byid(net, table, nla_set_id, genmask); } return set; } EXPORT_SYMBOL_GPL(nft_set_lookup_global); static int nf_tables_set_alloc_name(struct nft_ctx *ctx, struct nft_set *set, const char *name) { const struct nft_set *i; const char *p; unsigned long *inuse; unsigned int n = 0, min = 0; p = strchr(name, '%'); if (p != NULL) { if (p[1] != 'd' || strchr(p + 2, '%')) return -EINVAL; if (strnlen(name, NFT_SET_MAX_ANONLEN) >= NFT_SET_MAX_ANONLEN) return -EINVAL; inuse = (unsigned long *)get_zeroed_page(GFP_KERNEL); if (inuse == NULL) return -ENOMEM; cont: list_for_each_entry(i, &ctx->table->sets, list) { int tmp; if (!nft_is_active_next(ctx->net, i)) continue; if (!sscanf(i->name, name, &tmp)) continue; if (tmp < min || tmp >= min + BITS_PER_BYTE * PAGE_SIZE) continue; set_bit(tmp - min, inuse); } n = find_first_zero_bit(inuse, BITS_PER_BYTE * PAGE_SIZE); if (n >= BITS_PER_BYTE * PAGE_SIZE) { min += BITS_PER_BYTE * PAGE_SIZE; memset(inuse, 0, PAGE_SIZE); goto cont; } free_page((unsigned long)inuse); } set->name = kasprintf(GFP_KERNEL_ACCOUNT, name, min + n); if (!set->name) return -ENOMEM; list_for_each_entry(i, &ctx->table->sets, list) { if (!nft_is_active_next(ctx->net, i)) continue; if (!strcmp(set->name, i->name)) { kfree(set->name); set->name = NULL; return -ENFILE; } } return 0; } int nf_msecs_to_jiffies64(const struct nlattr *nla, u64 *result) { u64 ms = be64_to_cpu(nla_get_be64(nla)); u64 max = (u64)(~((u64)0)); max = div_u64(max, NSEC_PER_MSEC); if (ms >= max) return -ERANGE; ms *= NSEC_PER_MSEC; 
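	/* Worked example (assuming HZ=100, i.e. one jiffy == 10ms, purely for
	 * illustration): a 1ms timeout arrives as ms == 1, passes the overflow
	 * check against U64_MAX / NSEC_PER_MSEC above, and has just been
	 * scaled to 1000000 nanoseconds. nsecs_to_jiffies64(1000000) rounds
	 * down to 0 jiffies, so the "?: !!ms" fallback below turns any
	 * non-zero timeout shorter than one tick into a single jiffy rather
	 * than silently dropping it to "no timeout".
	 */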
*result = nsecs_to_jiffies64(ms) ? : !!ms; return 0; } __be64 nf_jiffies64_to_msecs(u64 input) { return cpu_to_be64(jiffies64_to_msecs(input)); } static int nf_tables_fill_set_concat(struct sk_buff *skb, const struct nft_set *set) { struct nlattr *concat, *field; int i; concat = nla_nest_start_noflag(skb, NFTA_SET_DESC_CONCAT); if (!concat) return -ENOMEM; for (i = 0; i < set->field_count; i++) { field = nla_nest_start_noflag(skb, NFTA_LIST_ELEM); if (!field) return -ENOMEM; if (nla_put_be32(skb, NFTA_SET_FIELD_LEN, htonl(set->field_len[i]))) return -ENOMEM; nla_nest_end(skb, field); } nla_nest_end(skb, concat); return 0; } static u32 nft_set_userspace_size(const struct nft_set_ops *ops, u32 size) { if (ops->usize) return ops->usize(size); return size; } static noinline_for_stack int nf_tables_fill_set_info(struct sk_buff *skb, const struct nft_set *set) { unsigned int nelems; char str[40]; int ret; ret = snprintf(str, sizeof(str), "%ps", set->ops); /* Not expected to happen and harmless: NFTA_SET_TYPE is dumped * to userspace purely for informational/debug purposes. */ DEBUG_NET_WARN_ON_ONCE(ret >= sizeof(str)); if (nla_put_string(skb, NFTA_SET_TYPE, str)) return -EMSGSIZE; nelems = nft_set_userspace_size(set->ops, atomic_read(&set->nelems)); return nla_put_be32(skb, NFTA_SET_COUNT, htonl(nelems)); } static int nf_tables_fill_set(struct sk_buff *skb, const struct nft_ctx *ctx, const struct nft_set *set, u16 event, u16 flags) { u64 timeout = READ_ONCE(set->timeout); u32 gc_int = READ_ONCE(set->gc_int); u32 portid = ctx->portid; struct nlmsghdr *nlh; struct nlattr *nest; u32 seq = ctx->seq; int i; nlh = nfnl_msg_put(skb, portid, seq, nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event), flags, ctx->family, NFNETLINK_V0, nft_base_seq_be16(ctx->net)); if (!nlh) goto nla_put_failure; if (nla_put_string(skb, NFTA_SET_TABLE, ctx->table->name)) goto nla_put_failure; if (nla_put_string(skb, NFTA_SET_NAME, set->name)) goto nla_put_failure; if (nla_put_be64(skb, NFTA_SET_HANDLE, cpu_to_be64(set->handle), NFTA_SET_PAD)) goto nla_put_failure; if (event == NFT_MSG_DELSET || event == NFT_MSG_DESTROYSET) { nlmsg_end(skb, nlh); return 0; } if (set->flags != 0) if (nla_put_be32(skb, NFTA_SET_FLAGS, htonl(set->flags))) goto nla_put_failure; if (nla_put_be32(skb, NFTA_SET_KEY_TYPE, htonl(set->ktype))) goto nla_put_failure; if (nla_put_be32(skb, NFTA_SET_KEY_LEN, htonl(set->klen))) goto nla_put_failure; if (set->flags & NFT_SET_MAP) { if (nla_put_be32(skb, NFTA_SET_DATA_TYPE, htonl(set->dtype))) goto nla_put_failure; if (nla_put_be32(skb, NFTA_SET_DATA_LEN, htonl(set->dlen))) goto nla_put_failure; } if (set->flags & NFT_SET_OBJECT && nla_put_be32(skb, NFTA_SET_OBJ_TYPE, htonl(set->objtype))) goto nla_put_failure; if (timeout && nla_put_be64(skb, NFTA_SET_TIMEOUT, nf_jiffies64_to_msecs(timeout), NFTA_SET_PAD)) goto nla_put_failure; if (gc_int && nla_put_be32(skb, NFTA_SET_GC_INTERVAL, htonl(gc_int))) goto nla_put_failure; if (set->policy != NFT_SET_POL_PERFORMANCE) { if (nla_put_be32(skb, NFTA_SET_POLICY, htonl(set->policy))) goto nla_put_failure; } if (set->udata && nla_put(skb, NFTA_SET_USERDATA, set->udlen, set->udata)) goto nla_put_failure; nest = nla_nest_start_noflag(skb, NFTA_SET_DESC); if (!nest) goto nla_put_failure; if (set->size && nla_put_be32(skb, NFTA_SET_DESC_SIZE, htonl(nft_set_userspace_size(set->ops, set->size)))) goto nla_put_failure; if (set->field_count > 1 && nf_tables_fill_set_concat(skb, set)) goto nla_put_failure; nla_nest_end(skb, nest); if (nf_tables_fill_set_info(skb, set)) goto 
nla_put_failure; if (set->num_exprs == 1) { nest = nla_nest_start_noflag(skb, NFTA_SET_EXPR); if (nf_tables_fill_expr_info(skb, set->exprs[0], false) < 0) goto nla_put_failure; nla_nest_end(skb, nest); } else if (set->num_exprs > 1) { nest = nla_nest_start_noflag(skb, NFTA_SET_EXPRESSIONS); if (nest == NULL) goto nla_put_failure; for (i = 0; i < set->num_exprs; i++) { if (nft_expr_dump(skb, NFTA_LIST_ELEM, set->exprs[i], false) < 0) goto nla_put_failure; } nla_nest_end(skb, nest); } nlmsg_end(skb, nlh); return 0; nla_put_failure: nlmsg_trim(skb, nlh); return -1; } static void nf_tables_set_notify(const struct nft_ctx *ctx, const struct nft_set *set, int event, gfp_t gfp_flags) { struct nftables_pernet *nft_net = nft_pernet(ctx->net); u32 portid = ctx->portid; struct sk_buff *skb; u16 flags = 0; int err; if (!ctx->report && !nfnetlink_has_listeners(ctx->net, NFNLGRP_NFTABLES)) return; skb = nlmsg_new(NLMSG_GOODSIZE, gfp_flags); if (skb == NULL) goto err; if (ctx->flags & (NLM_F_CREATE | NLM_F_EXCL)) flags |= ctx->flags & (NLM_F_CREATE | NLM_F_EXCL); err = nf_tables_fill_set(skb, ctx, set, event, flags); if (err < 0) { kfree_skb(skb); goto err; } nft_notify_enqueue(skb, ctx->report, &nft_net->notify_list); return; err: nfnetlink_set_err(ctx->net, portid, NFNLGRP_NFTABLES, -ENOBUFS); } static int nf_tables_dump_sets(struct sk_buff *skb, struct netlink_callback *cb) { const struct nft_set *set; unsigned int idx, s_idx = cb->args[0]; struct nft_table *table, *cur_table = (struct nft_table *)cb->args[2]; struct net *net = sock_net(skb->sk); struct nft_ctx *ctx = cb->data, ctx_set; struct nftables_pernet *nft_net; if (cb->args[1]) return skb->len; rcu_read_lock(); nft_net = nft_pernet(net); cb->seq = nft_base_seq(net); list_for_each_entry_rcu(table, &nft_net->tables, list) { if (ctx->family != NFPROTO_UNSPEC && ctx->family != table->family) continue; if (ctx->table && ctx->table != table) continue; if (cur_table) { if (cur_table != table) continue; cur_table = NULL; } idx = 0; list_for_each_entry_rcu(set, &table->sets, list) { if (idx < s_idx) goto cont; if (!nft_is_active(net, set)) goto cont; ctx_set = *ctx; ctx_set.table = table; ctx_set.family = table->family; if (nf_tables_fill_set(skb, &ctx_set, set, NFT_MSG_NEWSET, NLM_F_MULTI) < 0) { cb->args[0] = idx; cb->args[2] = (unsigned long) table; goto done; } nl_dump_check_consistent(cb, nlmsg_hdr(skb)); cont: idx++; } if (s_idx) s_idx = 0; } cb->args[1] = 1; done: rcu_read_unlock(); return skb->len; } static int nf_tables_dump_sets_start(struct netlink_callback *cb) { struct nft_ctx *ctx_dump = NULL; ctx_dump = kmemdup(cb->data, sizeof(*ctx_dump), GFP_ATOMIC); if (ctx_dump == NULL) return -ENOMEM; cb->data = ctx_dump; return 0; } static int nf_tables_dump_sets_done(struct netlink_callback *cb) { kfree(cb->data); return 0; } /* called with rcu_read_lock held */ static int nf_tables_getset(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_cur(info->net); u8 family = info->nfmsg->nfgen_family; struct nft_table *table = NULL; struct net *net = info->net; const struct nft_set *set; struct sk_buff *skb2; struct nft_ctx ctx; int err; if (nla[NFTA_SET_TABLE]) { table = nft_table_lookup(net, nla[NFTA_SET_TABLE], family, genmask, 0); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_SET_TABLE]); return PTR_ERR(table); } } nft_ctx_init(&ctx, net, skb, info->nlh, family, table, NULL, nla); if (info->nlh->nlmsg_flags & NLM_F_DUMP) { struct 
netlink_dump_control c = { .start = nf_tables_dump_sets_start, .dump = nf_tables_dump_sets, .done = nf_tables_dump_sets_done, .data = &ctx, .module = THIS_MODULE, }; return nft_netlink_dump_start_rcu(info->sk, skb, info->nlh, &c); } /* Only accept unspec with dump */ if (info->nfmsg->nfgen_family == NFPROTO_UNSPEC) return -EAFNOSUPPORT; if (!nla[NFTA_SET_TABLE]) return -EINVAL; set = nft_set_lookup(net, table, nla[NFTA_SET_NAME], genmask); if (IS_ERR(set)) { NL_SET_BAD_ATTR(extack, nla[NFTA_SET_NAME]); return PTR_ERR(set); } skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC); if (skb2 == NULL) return -ENOMEM; err = nf_tables_fill_set(skb2, &ctx, set, NFT_MSG_NEWSET, 0); if (err < 0) goto err_fill_set_info; return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid); err_fill_set_info: kfree_skb(skb2); return err; } static int nft_set_desc_concat_parse(const struct nlattr *attr, struct nft_set_desc *desc) { struct nlattr *tb[NFTA_SET_FIELD_MAX + 1]; u32 len; int err; if (desc->field_count >= ARRAY_SIZE(desc->field_len)) return -E2BIG; err = nla_parse_nested_deprecated(tb, NFTA_SET_FIELD_MAX, attr, nft_concat_policy, NULL); if (err < 0) return err; if (!tb[NFTA_SET_FIELD_LEN]) return -EINVAL; len = ntohl(nla_get_be32(tb[NFTA_SET_FIELD_LEN])); if (!len || len > U8_MAX) return -EINVAL; desc->field_len[desc->field_count++] = len; return 0; } static int nft_set_desc_concat(struct nft_set_desc *desc, const struct nlattr *nla) { u32 len = 0, num_regs; struct nlattr *attr; int rem, err, i; nla_for_each_nested(attr, nla, rem) { if (nla_type(attr) != NFTA_LIST_ELEM) return -EINVAL; err = nft_set_desc_concat_parse(attr, desc); if (err < 0) return err; } for (i = 0; i < desc->field_count; i++) len += round_up(desc->field_len[i], sizeof(u32)); if (len != desc->klen) return -EINVAL; num_regs = DIV_ROUND_UP(desc->klen, sizeof(u32)); if (num_regs > NFT_REG32_COUNT) return -E2BIG; return 0; } static int nf_tables_set_desc_parse(struct nft_set_desc *desc, const struct nlattr *nla) { struct nlattr *da[NFTA_SET_DESC_MAX + 1]; int err; err = nla_parse_nested_deprecated(da, NFTA_SET_DESC_MAX, nla, nft_set_desc_policy, NULL); if (err < 0) return err; if (da[NFTA_SET_DESC_SIZE] != NULL) desc->size = ntohl(nla_get_be32(da[NFTA_SET_DESC_SIZE])); if (da[NFTA_SET_DESC_CONCAT]) err = nft_set_desc_concat(desc, da[NFTA_SET_DESC_CONCAT]); return err; } static int nft_set_expr_alloc(struct nft_ctx *ctx, struct nft_set *set, const struct nlattr * const *nla, struct nft_expr **exprs, int *num_exprs, u32 flags) { struct nft_expr *expr; int err, i; if (nla[NFTA_SET_EXPR]) { expr = nft_set_elem_expr_alloc(ctx, set, nla[NFTA_SET_EXPR]); if (IS_ERR(expr)) { err = PTR_ERR(expr); goto err_set_expr_alloc; } exprs[0] = expr; (*num_exprs)++; } else if (nla[NFTA_SET_EXPRESSIONS]) { struct nlattr *tmp; int left; if (!(flags & NFT_SET_EXPR)) { err = -EINVAL; goto err_set_expr_alloc; } i = 0; nla_for_each_nested(tmp, nla[NFTA_SET_EXPRESSIONS], left) { if (i == NFT_SET_EXPR_MAX) { err = -E2BIG; goto err_set_expr_alloc; } if (nla_type(tmp) != NFTA_LIST_ELEM) { err = -EINVAL; goto err_set_expr_alloc; } expr = nft_set_elem_expr_alloc(ctx, set, tmp); if (IS_ERR(expr)) { err = PTR_ERR(expr); goto err_set_expr_alloc; } exprs[i++] = expr; (*num_exprs)++; } } return 0; err_set_expr_alloc: for (i = 0; i < *num_exprs; i++) nft_expr_destroy(ctx, exprs[i]); return err; } static bool nft_set_is_same(const struct nft_set *set, const struct nft_set_desc *desc, struct nft_expr *exprs[], u32 num_exprs, u32 flags) { int i; if (set->ktype != desc->ktype || 
set->dtype != desc->dtype || set->flags != flags || set->klen != desc->klen || set->dlen != desc->dlen || set->field_count != desc->field_count || set->num_exprs != num_exprs) return false; for (i = 0; i < desc->field_count; i++) { if (set->field_len[i] != desc->field_len[i]) return false; } for (i = 0; i < num_exprs; i++) { if (set->exprs[i]->ops != exprs[i]->ops) return false; } return true; } static u32 nft_set_kernel_size(const struct nft_set_ops *ops, const struct nft_set_desc *desc) { if (ops->ksize) return ops->ksize(desc->size); return desc->size; } static int nf_tables_newset(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_next(info->net); u8 family = info->nfmsg->nfgen_family; const struct nft_set_ops *ops; struct net *net = info->net; struct nft_set_desc desc; struct nft_table *table; unsigned char *udata; struct nft_set *set; struct nft_ctx ctx; size_t alloc_size; int num_exprs = 0; char *name; int err, i; u16 udlen; u32 flags; u64 size; if (nla[NFTA_SET_TABLE] == NULL || nla[NFTA_SET_NAME] == NULL || nla[NFTA_SET_KEY_LEN] == NULL || nla[NFTA_SET_ID] == NULL) return -EINVAL; memset(&desc, 0, sizeof(desc)); desc.ktype = NFT_DATA_VALUE; if (nla[NFTA_SET_KEY_TYPE] != NULL) { desc.ktype = ntohl(nla_get_be32(nla[NFTA_SET_KEY_TYPE])); if ((desc.ktype & NFT_DATA_RESERVED_MASK) == NFT_DATA_RESERVED_MASK) return -EINVAL; } desc.klen = ntohl(nla_get_be32(nla[NFTA_SET_KEY_LEN])); if (desc.klen == 0 || desc.klen > NFT_DATA_VALUE_MAXLEN) return -EINVAL; flags = 0; if (nla[NFTA_SET_FLAGS] != NULL) { flags = ntohl(nla_get_be32(nla[NFTA_SET_FLAGS])); if (flags & ~(NFT_SET_ANONYMOUS | NFT_SET_CONSTANT | NFT_SET_INTERVAL | NFT_SET_TIMEOUT | NFT_SET_MAP | NFT_SET_EVAL | NFT_SET_OBJECT | NFT_SET_CONCAT | NFT_SET_EXPR)) return -EOPNOTSUPP; /* Only one of these operations is supported */ if ((flags & (NFT_SET_MAP | NFT_SET_OBJECT)) == (NFT_SET_MAP | NFT_SET_OBJECT)) return -EOPNOTSUPP; if ((flags & (NFT_SET_EVAL | NFT_SET_OBJECT)) == (NFT_SET_EVAL | NFT_SET_OBJECT)) return -EOPNOTSUPP; if ((flags & (NFT_SET_ANONYMOUS | NFT_SET_TIMEOUT | NFT_SET_EVAL)) == (NFT_SET_ANONYMOUS | NFT_SET_TIMEOUT)) return -EOPNOTSUPP; if ((flags & (NFT_SET_CONSTANT | NFT_SET_TIMEOUT)) == (NFT_SET_CONSTANT | NFT_SET_TIMEOUT)) return -EOPNOTSUPP; } desc.dtype = 0; if (nla[NFTA_SET_DATA_TYPE] != NULL) { if (!(flags & NFT_SET_MAP)) return -EINVAL; desc.dtype = ntohl(nla_get_be32(nla[NFTA_SET_DATA_TYPE])); if ((desc.dtype & NFT_DATA_RESERVED_MASK) == NFT_DATA_RESERVED_MASK && desc.dtype != NFT_DATA_VERDICT) return -EINVAL; if (desc.dtype != NFT_DATA_VERDICT) { if (nla[NFTA_SET_DATA_LEN] == NULL) return -EINVAL; desc.dlen = ntohl(nla_get_be32(nla[NFTA_SET_DATA_LEN])); if (desc.dlen == 0 || desc.dlen > NFT_DATA_VALUE_MAXLEN) return -EINVAL; } else desc.dlen = sizeof(struct nft_verdict); } else if (flags & NFT_SET_MAP) return -EINVAL; if (nla[NFTA_SET_OBJ_TYPE] != NULL) { if (!(flags & NFT_SET_OBJECT)) return -EINVAL; desc.objtype = ntohl(nla_get_be32(nla[NFTA_SET_OBJ_TYPE])); if (desc.objtype == NFT_OBJECT_UNSPEC || desc.objtype > NFT_OBJECT_MAX) return -EOPNOTSUPP; } else if (flags & NFT_SET_OBJECT) return -EINVAL; else desc.objtype = NFT_OBJECT_UNSPEC; desc.timeout = 0; if (nla[NFTA_SET_TIMEOUT] != NULL) { if (!(flags & NFT_SET_TIMEOUT)) return -EINVAL; if (flags & NFT_SET_ANONYMOUS) return -EOPNOTSUPP; err = nf_msecs_to_jiffies64(nla[NFTA_SET_TIMEOUT], &desc.timeout); if (err) return err; } desc.gc_int = 0; if 
(nla[NFTA_SET_GC_INTERVAL] != NULL) { if (!(flags & NFT_SET_TIMEOUT)) return -EINVAL; if (flags & NFT_SET_ANONYMOUS) return -EOPNOTSUPP; desc.gc_int = ntohl(nla_get_be32(nla[NFTA_SET_GC_INTERVAL])); } desc.policy = NFT_SET_POL_PERFORMANCE; if (nla[NFTA_SET_POLICY] != NULL) { desc.policy = ntohl(nla_get_be32(nla[NFTA_SET_POLICY])); switch (desc.policy) { case NFT_SET_POL_PERFORMANCE: case NFT_SET_POL_MEMORY: break; default: return -EOPNOTSUPP; } } if (nla[NFTA_SET_DESC] != NULL) { err = nf_tables_set_desc_parse(&desc, nla[NFTA_SET_DESC]); if (err < 0) return err; if (desc.field_count > 1) { if (!(flags & NFT_SET_CONCAT)) return -EINVAL; } else if (flags & NFT_SET_CONCAT) { return -EINVAL; } } else if (flags & NFT_SET_CONCAT) { return -EINVAL; } if (nla[NFTA_SET_EXPR] || nla[NFTA_SET_EXPRESSIONS]) desc.expr = true; table = nft_table_lookup(net, nla[NFTA_SET_TABLE], family, genmask, NETLINK_CB(skb).portid); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_SET_TABLE]); return PTR_ERR(table); } nft_ctx_init(&ctx, net, skb, info->nlh, family, table, NULL, nla); set = nft_set_lookup(net, table, nla[NFTA_SET_NAME], genmask); if (IS_ERR(set)) { if (PTR_ERR(set) != -ENOENT) { NL_SET_BAD_ATTR(extack, nla[NFTA_SET_NAME]); return PTR_ERR(set); } } else { struct nft_expr *exprs[NFT_SET_EXPR_MAX] = {}; if (info->nlh->nlmsg_flags & NLM_F_EXCL) { NL_SET_BAD_ATTR(extack, nla[NFTA_SET_NAME]); return -EEXIST; } if (info->nlh->nlmsg_flags & NLM_F_REPLACE) return -EOPNOTSUPP; if (nft_set_is_anonymous(set)) return -EOPNOTSUPP; err = nft_set_expr_alloc(&ctx, set, nla, exprs, &num_exprs, flags); if (err < 0) return err; if (desc.size) desc.size = nft_set_kernel_size(set->ops, &desc); err = 0; if (!nft_set_is_same(set, &desc, exprs, num_exprs, flags)) { NL_SET_BAD_ATTR(extack, nla[NFTA_SET_NAME]); err = -EEXIST; } for (i = 0; i < num_exprs; i++) nft_expr_destroy(&ctx, exprs[i]); if (err < 0) return err; return __nft_trans_set_add(&ctx, NFT_MSG_NEWSET, set, &desc); } if (!(info->nlh->nlmsg_flags & NLM_F_CREATE)) return -ENOENT; ops = nft_select_set_ops(&ctx, flags, &desc); if (IS_ERR(ops)) return PTR_ERR(ops); if (desc.size) desc.size = nft_set_kernel_size(ops, &desc); udlen = 0; if (nla[NFTA_SET_USERDATA]) udlen = nla_len(nla[NFTA_SET_USERDATA]); size = 0; if (ops->privsize != NULL) size = ops->privsize(nla, &desc); alloc_size = sizeof(*set) + size + udlen; if (alloc_size < size || alloc_size > INT_MAX) return -ENOMEM; if (!nft_use_inc(&table->use)) return -EMFILE; set = kvzalloc(alloc_size, GFP_KERNEL_ACCOUNT); if (!set) { err = -ENOMEM; goto err_alloc; } name = nla_strdup(nla[NFTA_SET_NAME], GFP_KERNEL_ACCOUNT); if (!name) { err = -ENOMEM; goto err_set_name; } err = nf_tables_set_alloc_name(&ctx, set, name); kfree(name); if (err < 0) goto err_set_name; udata = NULL; if (udlen) { udata = set->data + size; nla_memcpy(udata, nla[NFTA_SET_USERDATA], udlen); } INIT_LIST_HEAD(&set->bindings); INIT_LIST_HEAD(&set->catchall_list); refcount_set(&set->refs, 1); set->table = table; write_pnet(&set->net, net); set->ops = ops; set->ktype = desc.ktype; set->klen = desc.klen; set->dtype = desc.dtype; set->objtype = desc.objtype; set->dlen = desc.dlen; set->flags = flags; set->size = desc.size; set->policy = desc.policy; set->udlen = udlen; set->udata = udata; set->timeout = desc.timeout; set->gc_int = desc.gc_int; set->field_count = desc.field_count; for (i = 0; i < desc.field_count; i++) set->field_len[i] = desc.field_len[i]; err = ops->init(set, &desc, nla); if (err < 0) goto err_set_init; err = nft_set_expr_alloc(&ctx, 
set, nla, set->exprs, &num_exprs, flags); if (err < 0) goto err_set_destroy; set->num_exprs = num_exprs; set->handle = nf_tables_alloc_handle(table); INIT_LIST_HEAD(&set->pending_update); err = nft_trans_set_add(&ctx, NFT_MSG_NEWSET, set); if (err < 0) goto err_set_expr_alloc; list_add_tail_rcu(&set->list, &table->sets); return 0; err_set_expr_alloc: for (i = 0; i < set->num_exprs; i++) nft_expr_destroy(&ctx, set->exprs[i]); err_set_destroy: ops->destroy(&ctx, set); err_set_init: kfree(set->name); err_set_name: kvfree(set); err_alloc: nft_use_dec_restore(&table->use); return err; } static void nft_set_catchall_destroy(const struct nft_ctx *ctx, struct nft_set *set) { struct nft_set_elem_catchall *next, *catchall; list_for_each_entry_safe(catchall, next, &set->catchall_list, list) { list_del_rcu(&catchall->list); nf_tables_set_elem_destroy(ctx, set, catchall->elem); kfree_rcu(catchall, rcu); } } static void nft_set_put(struct nft_set *set) { if (refcount_dec_and_test(&set->refs)) { kfree(set->name); kvfree(set); } } static void nft_set_destroy(const struct nft_ctx *ctx, struct nft_set *set) { int i; if (WARN_ON(set->use > 0)) return; for (i = 0; i < set->num_exprs; i++) nft_expr_destroy(ctx, set->exprs[i]); set->ops->destroy(ctx, set); nft_set_catchall_destroy(ctx, set); nft_set_put(set); } static int nf_tables_delset(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_next(info->net); u8 family = info->nfmsg->nfgen_family; struct net *net = info->net; const struct nlattr *attr; struct nft_table *table; struct nft_set *set; struct nft_ctx ctx; if (info->nfmsg->nfgen_family == NFPROTO_UNSPEC) return -EAFNOSUPPORT; table = nft_table_lookup(net, nla[NFTA_SET_TABLE], family, genmask, NETLINK_CB(skb).portid); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_SET_TABLE]); return PTR_ERR(table); } if (nla[NFTA_SET_HANDLE]) { attr = nla[NFTA_SET_HANDLE]; set = nft_set_lookup_byhandle(table, attr, genmask); } else { attr = nla[NFTA_SET_NAME]; set = nft_set_lookup(net, table, attr, genmask); } if (IS_ERR(set)) { if (PTR_ERR(set) == -ENOENT && NFNL_MSG_TYPE(info->nlh->nlmsg_type) == NFT_MSG_DESTROYSET) return 0; NL_SET_BAD_ATTR(extack, attr); return PTR_ERR(set); } if (set->use || (info->nlh->nlmsg_flags & NLM_F_NONREC && atomic_read(&set->nelems) > 0)) { NL_SET_BAD_ATTR(extack, attr); return -EBUSY; } nft_ctx_init(&ctx, net, skb, info->nlh, family, table, NULL, nla); return nft_delset(&ctx, set); } static int nft_validate_register_store(const struct nft_ctx *ctx, enum nft_registers reg, const struct nft_data *data, enum nft_data_types type, unsigned int len); static int nft_setelem_data_validate(const struct nft_ctx *ctx, struct nft_set *set, struct nft_elem_priv *elem_priv) { const struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); enum nft_registers dreg; dreg = nft_type_to_reg(set->dtype); return nft_validate_register_store(ctx, dreg, nft_set_ext_data(ext), set->dtype == NFT_DATA_VERDICT ? 
NFT_DATA_VERDICT : NFT_DATA_VALUE, set->dlen); } static int nf_tables_bind_check_setelem(const struct nft_ctx *ctx, struct nft_set *set, const struct nft_set_iter *iter, struct nft_elem_priv *elem_priv) { const struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); if (!nft_set_elem_active(ext, iter->genmask)) return 0; return nft_setelem_data_validate(ctx, set, elem_priv); } static int nft_set_catchall_bind_check(const struct nft_ctx *ctx, struct nft_set *set) { u8 genmask = nft_genmask_next(ctx->net); struct nft_set_elem_catchall *catchall; struct nft_set_ext *ext; int ret = 0; list_for_each_entry_rcu(catchall, &set->catchall_list, list, lockdep_commit_lock_is_held(ctx->net)) { ext = nft_set_elem_ext(set, catchall->elem); if (!nft_set_elem_active(ext, genmask)) continue; ret = nft_setelem_data_validate(ctx, set, catchall->elem); if (ret < 0) break; } return ret; } int nf_tables_bind_set(const struct nft_ctx *ctx, struct nft_set *set, struct nft_set_binding *binding) { struct nft_set_binding *i; struct nft_set_iter iter = { .genmask = nft_genmask_next(ctx->net), .type = NFT_ITER_UPDATE, .fn = nf_tables_bind_check_setelem, }; if (!list_empty(&set->bindings) && nft_set_is_anonymous(set)) return -EBUSY; if (binding->flags & NFT_SET_MAP) { /* If the set is already bound to the same chain all * jumps are already validated for that chain. */ list_for_each_entry(i, &set->bindings, list) { if (i->flags & NFT_SET_MAP && i->chain == binding->chain) goto bind; } set->ops->walk(ctx, set, &iter); if (!iter.err) iter.err = nft_set_catchall_bind_check(ctx, set); if (iter.err < 0) return iter.err; } bind: if (!nft_use_inc(&set->use)) return -EMFILE; binding->chain = ctx->chain; list_add_tail_rcu(&binding->list, &set->bindings); nft_set_trans_bind(ctx, set); return 0; } EXPORT_SYMBOL_GPL(nf_tables_bind_set); static void nf_tables_unbind_set(const struct nft_ctx *ctx, struct nft_set *set, struct nft_set_binding *binding, bool event) { list_del_rcu(&binding->list); if (list_empty(&set->bindings) && nft_set_is_anonymous(set)) { list_del_rcu(&set->list); set->dead = 1; if (event) nf_tables_set_notify(ctx, set, NFT_MSG_DELSET, GFP_KERNEL); } } static void nft_setelem_data_activate(const struct net *net, const struct nft_set *set, struct nft_elem_priv *elem_priv); static int nft_mapelem_activate(const struct nft_ctx *ctx, struct nft_set *set, const struct nft_set_iter *iter, struct nft_elem_priv *elem_priv) { struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); /* called from abort path, reverse check to undo changes. 
*/ if (nft_set_elem_active(ext, iter->genmask)) return 0; nft_clear(ctx->net, ext); nft_setelem_data_activate(ctx->net, set, elem_priv); return 0; } static void nft_map_catchall_activate(const struct nft_ctx *ctx, struct nft_set *set) { u8 genmask = nft_genmask_next(ctx->net); struct nft_set_elem_catchall *catchall; struct nft_set_ext *ext; list_for_each_entry(catchall, &set->catchall_list, list) { ext = nft_set_elem_ext(set, catchall->elem); if (nft_set_elem_active(ext, genmask)) continue; nft_clear(ctx->net, ext); nft_setelem_data_activate(ctx->net, set, catchall->elem); break; } } static void nft_map_activate(const struct nft_ctx *ctx, struct nft_set *set) { struct nft_set_iter iter = { .genmask = nft_genmask_next(ctx->net), .type = NFT_ITER_UPDATE, .fn = nft_mapelem_activate, }; set->ops->walk(ctx, set, &iter); WARN_ON_ONCE(iter.err); nft_map_catchall_activate(ctx, set); } void nf_tables_activate_set(const struct nft_ctx *ctx, struct nft_set *set) { if (nft_set_is_anonymous(set)) { if (set->flags & (NFT_SET_MAP | NFT_SET_OBJECT)) nft_map_activate(ctx, set); nft_clear(ctx->net, set); } nft_use_inc_restore(&set->use); } EXPORT_SYMBOL_GPL(nf_tables_activate_set); void nf_tables_deactivate_set(const struct nft_ctx *ctx, struct nft_set *set, struct nft_set_binding *binding, enum nft_trans_phase phase) { WARN_ON_ONCE(!lockdep_commit_lock_is_held(ctx->net)); switch (phase) { case NFT_TRANS_PREPARE_ERROR: nft_set_trans_unbind(ctx, set); if (nft_set_is_anonymous(set)) nft_deactivate_next(ctx->net, set); else list_del_rcu(&binding->list); nft_use_dec(&set->use); break; case NFT_TRANS_PREPARE: if (nft_set_is_anonymous(set)) { if (set->flags & (NFT_SET_MAP | NFT_SET_OBJECT)) nft_map_deactivate(ctx, set); nft_deactivate_next(ctx->net, set); } nft_use_dec(&set->use); return; case NFT_TRANS_ABORT: case NFT_TRANS_RELEASE: if (nft_set_is_anonymous(set) && set->flags & (NFT_SET_MAP | NFT_SET_OBJECT)) nft_map_deactivate(ctx, set); nft_use_dec(&set->use); fallthrough; default: nf_tables_unbind_set(ctx, set, binding, phase == NFT_TRANS_COMMIT); } } EXPORT_SYMBOL_GPL(nf_tables_deactivate_set); void nf_tables_destroy_set(const struct nft_ctx *ctx, struct nft_set *set) { if (list_empty(&set->bindings) && nft_set_is_anonymous(set)) nft_set_destroy(ctx, set); } EXPORT_SYMBOL_GPL(nf_tables_destroy_set); const struct nft_set_ext_type nft_set_ext_types[] = { [NFT_SET_EXT_KEY] = { .align = __alignof__(u32), }, [NFT_SET_EXT_DATA] = { .align = __alignof__(u32), }, [NFT_SET_EXT_EXPRESSIONS] = { .align = __alignof__(struct nft_set_elem_expr), }, [NFT_SET_EXT_OBJREF] = { .len = sizeof(struct nft_object *), .align = __alignof__(struct nft_object *), }, [NFT_SET_EXT_FLAGS] = { .len = sizeof(u8), .align = __alignof__(u8), }, [NFT_SET_EXT_TIMEOUT] = { .len = sizeof(struct nft_timeout), .align = __alignof__(struct nft_timeout), }, [NFT_SET_EXT_USERDATA] = { .len = sizeof(struct nft_userdata), .align = __alignof__(struct nft_userdata), }, [NFT_SET_EXT_KEY_END] = { .align = __alignof__(u32), }, }; /* * Set elements */ static const struct nla_policy nft_set_elem_policy[NFTA_SET_ELEM_MAX + 1] = { [NFTA_SET_ELEM_KEY] = { .type = NLA_NESTED }, [NFTA_SET_ELEM_DATA] = { .type = NLA_NESTED }, [NFTA_SET_ELEM_FLAGS] = { .type = NLA_U32 }, [NFTA_SET_ELEM_TIMEOUT] = { .type = NLA_U64 }, [NFTA_SET_ELEM_EXPIRATION] = { .type = NLA_U64 }, [NFTA_SET_ELEM_USERDATA] = { .type = NLA_BINARY, .len = NFT_USERDATA_MAXLEN }, [NFTA_SET_ELEM_EXPR] = { .type = NLA_NESTED }, [NFTA_SET_ELEM_OBJREF] = { .type = NLA_STRING, .len = NFT_OBJ_MAXNAMELEN - 1 }, 
[NFTA_SET_ELEM_KEY_END] = { .type = NLA_NESTED }, [NFTA_SET_ELEM_EXPRESSIONS] = NLA_POLICY_NESTED_ARRAY(nft_expr_policy), }; static const struct nla_policy nft_set_elem_list_policy[NFTA_SET_ELEM_LIST_MAX + 1] = { [NFTA_SET_ELEM_LIST_TABLE] = { .type = NLA_STRING, .len = NFT_TABLE_MAXNAMELEN - 1 }, [NFTA_SET_ELEM_LIST_SET] = { .type = NLA_STRING, .len = NFT_SET_MAXNAMELEN - 1 }, [NFTA_SET_ELEM_LIST_ELEMENTS] = NLA_POLICY_NESTED_ARRAY(nft_set_elem_policy), [NFTA_SET_ELEM_LIST_SET_ID] = { .type = NLA_U32 }, }; static int nft_set_elem_expr_dump(struct sk_buff *skb, const struct nft_set *set, const struct nft_set_ext *ext, bool reset) { struct nft_set_elem_expr *elem_expr; u32 size, num_exprs = 0; struct nft_expr *expr; struct nlattr *nest; elem_expr = nft_set_ext_expr(ext); nft_setelem_expr_foreach(expr, elem_expr, size) num_exprs++; if (num_exprs == 1) { expr = nft_setelem_expr_at(elem_expr, 0); if (nft_expr_dump(skb, NFTA_SET_ELEM_EXPR, expr, reset) < 0) return -1; return 0; } else if (num_exprs > 1) { nest = nla_nest_start_noflag(skb, NFTA_SET_ELEM_EXPRESSIONS); if (nest == NULL) goto nla_put_failure; nft_setelem_expr_foreach(expr, elem_expr, size) { expr = nft_setelem_expr_at(elem_expr, size); if (nft_expr_dump(skb, NFTA_LIST_ELEM, expr, reset) < 0) goto nla_put_failure; } nla_nest_end(skb, nest); } return 0; nla_put_failure: return -1; } static int nf_tables_fill_setelem(struct sk_buff *skb, const struct nft_set *set, const struct nft_elem_priv *elem_priv, bool reset) { const struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); unsigned char *b = skb_tail_pointer(skb); struct nlattr *nest; nest = nla_nest_start_noflag(skb, NFTA_LIST_ELEM); if (nest == NULL) goto nla_put_failure; if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY) && nft_data_dump(skb, NFTA_SET_ELEM_KEY, nft_set_ext_key(ext), NFT_DATA_VALUE, set->klen) < 0) goto nla_put_failure; if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY_END) && nft_data_dump(skb, NFTA_SET_ELEM_KEY_END, nft_set_ext_key_end(ext), NFT_DATA_VALUE, set->klen) < 0) goto nla_put_failure; if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA) && nft_data_dump(skb, NFTA_SET_ELEM_DATA, nft_set_ext_data(ext), nft_set_datatype(set), set->dlen) < 0) goto nla_put_failure; if (nft_set_ext_exists(ext, NFT_SET_EXT_EXPRESSIONS) && nft_set_elem_expr_dump(skb, set, ext, reset)) goto nla_put_failure; if (nft_set_ext_exists(ext, NFT_SET_EXT_OBJREF) && nla_put_string(skb, NFTA_SET_ELEM_OBJREF, (*nft_set_ext_obj(ext))->key.name) < 0) goto nla_put_failure; if (nft_set_ext_exists(ext, NFT_SET_EXT_FLAGS) && nla_put_be32(skb, NFTA_SET_ELEM_FLAGS, htonl(*nft_set_ext_flags(ext)))) goto nla_put_failure; if (nft_set_ext_exists(ext, NFT_SET_EXT_TIMEOUT)) { u64 timeout = READ_ONCE(nft_set_ext_timeout(ext)->timeout); u64 set_timeout = READ_ONCE(set->timeout); __be64 msecs = 0; if (set_timeout != timeout) { msecs = nf_jiffies64_to_msecs(timeout); if (nla_put_be64(skb, NFTA_SET_ELEM_TIMEOUT, msecs, NFTA_SET_ELEM_PAD)) goto nla_put_failure; } if (timeout > 0) { u64 expires, now = get_jiffies_64(); expires = READ_ONCE(nft_set_ext_timeout(ext)->expiration); if (time_before64(now, expires)) expires -= now; else expires = 0; if (nla_put_be64(skb, NFTA_SET_ELEM_EXPIRATION, nf_jiffies64_to_msecs(expires), NFTA_SET_ELEM_PAD)) goto nla_put_failure; } } if (nft_set_ext_exists(ext, NFT_SET_EXT_USERDATA)) { struct nft_userdata *udata; udata = nft_set_ext_userdata(ext); if (nla_put(skb, NFTA_SET_ELEM_USERDATA, udata->len + 1, udata->data)) goto nla_put_failure; } nla_nest_end(skb, nest); return 0; 
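/* Element did not fit: trim the message back to the tail pointer saved in 'b' and tell the caller the skb is full (-EMSGSIZE). */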
nla_put_failure: nlmsg_trim(skb, b); return -EMSGSIZE; } struct nft_set_dump_args { const struct netlink_callback *cb; struct nft_set_iter iter; struct sk_buff *skb; bool reset; }; static int nf_tables_dump_setelem(const struct nft_ctx *ctx, struct nft_set *set, const struct nft_set_iter *iter, struct nft_elem_priv *elem_priv) { const struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); struct nft_set_dump_args *args; if (!nft_set_elem_active(ext, iter->genmask)) return 0; if (nft_set_elem_expired(ext) || nft_set_elem_is_dead(ext)) return 0; args = container_of(iter, struct nft_set_dump_args, iter); return nf_tables_fill_setelem(args->skb, set, elem_priv, args->reset); } static void audit_log_nft_set_reset(const struct nft_table *table, unsigned int base_seq, unsigned int nentries) { char *buf = kasprintf(GFP_ATOMIC, "%s:%u", table->name, base_seq); audit_log_nfcfg(buf, table->family, nentries, AUDIT_NFT_OP_SETELEM_RESET, GFP_ATOMIC); kfree(buf); } struct nft_set_dump_ctx { const struct nft_set *set; struct nft_ctx ctx; bool reset; }; static int nft_set_catchall_dump(struct net *net, struct sk_buff *skb, const struct nft_set *set, bool reset, unsigned int base_seq) { struct nft_set_elem_catchall *catchall; u8 genmask = nft_genmask_cur(net); struct nft_set_ext *ext; int ret = 0; list_for_each_entry_rcu(catchall, &set->catchall_list, list) { ext = nft_set_elem_ext(set, catchall->elem); if (!nft_set_elem_active(ext, genmask) || nft_set_elem_expired(ext)) continue; ret = nf_tables_fill_setelem(skb, set, catchall->elem, reset); if (reset && !ret) audit_log_nft_set_reset(set->table, base_seq, 1); break; } return ret; } static int nf_tables_dump_set(struct sk_buff *skb, struct netlink_callback *cb) { struct nft_set_dump_ctx *dump_ctx = cb->data; struct net *net = sock_net(skb->sk); struct nftables_pernet *nft_net; struct nft_table *table; struct nft_set *set; struct nft_set_dump_args args = { .cb = cb, .skb = skb, .reset = dump_ctx->reset, .iter = { .genmask = nft_genmask_cur(net), .type = NFT_ITER_READ, .skip = cb->args[0], .fn = nf_tables_dump_setelem, }, }; bool set_found = false; struct nlmsghdr *nlh; struct nlattr *nest; u32 portid, seq; int event; rcu_read_lock(); nft_net = nft_pernet(net); cb->seq = nft_base_seq(net); list_for_each_entry_rcu(table, &nft_net->tables, list) { if (dump_ctx->ctx.family != NFPROTO_UNSPEC && dump_ctx->ctx.family != table->family) continue; if (table != dump_ctx->ctx.table) continue; list_for_each_entry_rcu(set, &table->sets, list) { if (set == dump_ctx->set) { set_found = true; break; } } break; } if (!set_found) { rcu_read_unlock(); return -ENOENT; } event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, NFT_MSG_NEWSETELEM); portid = NETLINK_CB(cb->skb).portid; seq = cb->nlh->nlmsg_seq; nlh = nfnl_msg_put(skb, portid, seq, event, NLM_F_MULTI, table->family, NFNETLINK_V0, nft_base_seq_be16(net)); if (!nlh) goto nla_put_failure; if (nla_put_string(skb, NFTA_SET_ELEM_LIST_TABLE, table->name)) goto nla_put_failure; if (nla_put_string(skb, NFTA_SET_ELEM_LIST_SET, set->name)) goto nla_put_failure; nest = nla_nest_start_noflag(skb, NFTA_SET_ELEM_LIST_ELEMENTS); if (nest == NULL) goto nla_put_failure; set->ops->walk(&dump_ctx->ctx, set, &args.iter); if (!args.iter.err && args.iter.count == cb->args[0]) args.iter.err = nft_set_catchall_dump(net, skb, set, dump_ctx->reset, cb->seq); nla_nest_end(skb, nest); nlmsg_end(skb, nlh); rcu_read_unlock(); if (args.iter.err && args.iter.err != -EMSGSIZE) return args.iter.err; if (args.iter.count == cb->args[0]) return 0; cb->args[0] = 
args.iter.count; return skb->len; nla_put_failure: rcu_read_unlock(); return -ENOSPC; } static int nf_tables_dumpreset_set(struct sk_buff *skb, struct netlink_callback *cb) { struct nftables_pernet *nft_net = nft_pernet(sock_net(skb->sk)); struct nft_set_dump_ctx *dump_ctx = cb->data; int ret, skip = cb->args[0]; mutex_lock(&nft_net->commit_mutex); ret = nf_tables_dump_set(skb, cb); if (cb->args[0] > skip) audit_log_nft_set_reset(dump_ctx->ctx.table, cb->seq, cb->args[0] - skip); mutex_unlock(&nft_net->commit_mutex); return ret; } static int nf_tables_dump_set_start(struct netlink_callback *cb) { struct nft_set_dump_ctx *dump_ctx = cb->data; cb->data = kmemdup(dump_ctx, sizeof(*dump_ctx), GFP_ATOMIC); return cb->data ? 0 : -ENOMEM; } static int nf_tables_dump_set_done(struct netlink_callback *cb) { kfree(cb->data); return 0; } static int nf_tables_fill_setelem_info(struct sk_buff *skb, const struct nft_ctx *ctx, u32 seq, u32 portid, int event, u16 flags, const struct nft_set *set, const struct nft_elem_priv *elem_priv, bool reset) { struct nlmsghdr *nlh; struct nlattr *nest; int err; event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event); nlh = nfnl_msg_put(skb, portid, seq, event, flags, ctx->family, NFNETLINK_V0, nft_base_seq_be16(ctx->net)); if (!nlh) goto nla_put_failure; if (nla_put_string(skb, NFTA_SET_TABLE, ctx->table->name)) goto nla_put_failure; if (nla_put_string(skb, NFTA_SET_NAME, set->name)) goto nla_put_failure; nest = nla_nest_start_noflag(skb, NFTA_SET_ELEM_LIST_ELEMENTS); if (nest == NULL) goto nla_put_failure; err = nf_tables_fill_setelem(skb, set, elem_priv, reset); if (err < 0) goto nla_put_failure; nla_nest_end(skb, nest); nlmsg_end(skb, nlh); return 0; nla_put_failure: nlmsg_trim(skb, nlh); return -1; } static int nft_setelem_parse_flags(const struct nft_set *set, const struct nlattr *attr, u32 *flags) { if (attr == NULL) return 0; *flags = ntohl(nla_get_be32(attr)); if (*flags & ~(NFT_SET_ELEM_INTERVAL_END | NFT_SET_ELEM_CATCHALL)) return -EOPNOTSUPP; if (!(set->flags & NFT_SET_INTERVAL) && *flags & NFT_SET_ELEM_INTERVAL_END) return -EINVAL; if ((*flags & (NFT_SET_ELEM_INTERVAL_END | NFT_SET_ELEM_CATCHALL)) == (NFT_SET_ELEM_INTERVAL_END | NFT_SET_ELEM_CATCHALL)) return -EINVAL; return 0; } static int nft_setelem_parse_key(struct nft_ctx *ctx, const struct nft_set *set, struct nft_data *key, struct nlattr *attr) { struct nft_data_desc desc = { .type = NFT_DATA_VALUE, .size = NFT_DATA_VALUE_MAXLEN, .len = set->klen, }; return nft_data_init(ctx, key, &desc, attr); } static int nft_setelem_parse_data(struct nft_ctx *ctx, struct nft_set *set, struct nft_data_desc *desc, struct nft_data *data, struct nlattr *attr) { u32 dtype; if (set->dtype == NFT_DATA_VERDICT) dtype = NFT_DATA_VERDICT; else dtype = NFT_DATA_VALUE; desc->type = dtype; desc->size = NFT_DATA_VALUE_MAXLEN; desc->len = set->dlen; desc->flags = NFT_DATA_DESC_SETELEM; return nft_data_init(ctx, data, desc, attr); } static void *nft_setelem_catchall_get(const struct net *net, const struct nft_set *set) { struct nft_set_elem_catchall *catchall; u8 genmask = nft_genmask_cur(net); struct nft_set_ext *ext; void *priv = NULL; list_for_each_entry_rcu(catchall, &set->catchall_list, list) { ext = nft_set_elem_ext(set, catchall->elem); if (!nft_set_elem_active(ext, genmask) || nft_set_elem_expired(ext)) continue; priv = catchall->elem; break; } return priv; } static int nft_setelem_get(struct nft_ctx *ctx, const struct nft_set *set, struct nft_set_elem *elem, u32 flags) { void *priv; if (!(flags & NFT_SET_ELEM_CATCHALL)) { 
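/* Plain element: look it up through the set backend; catchall elements live on set->catchall_list and are handled below. */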
priv = set->ops->get(ctx->net, set, elem, flags); if (IS_ERR(priv)) return PTR_ERR(priv); } else { priv = nft_setelem_catchall_get(ctx->net, set); if (!priv) return -ENOENT; } elem->priv = priv; return 0; } static int nft_get_set_elem(struct nft_ctx *ctx, const struct nft_set *set, const struct nlattr *attr, bool reset) { struct nlattr *nla[NFTA_SET_ELEM_MAX + 1]; struct nft_set_elem elem; struct sk_buff *skb; uint32_t flags = 0; int err; err = nla_parse_nested_deprecated(nla, NFTA_SET_ELEM_MAX, attr, nft_set_elem_policy, NULL); if (err < 0) return err; err = nft_setelem_parse_flags(set, nla[NFTA_SET_ELEM_FLAGS], &flags); if (err < 0) return err; if (!nla[NFTA_SET_ELEM_KEY] && !(flags & NFT_SET_ELEM_CATCHALL)) return -EINVAL; if (nla[NFTA_SET_ELEM_KEY]) { err = nft_setelem_parse_key(ctx, set, &elem.key.val, nla[NFTA_SET_ELEM_KEY]); if (err < 0) return err; } if (nla[NFTA_SET_ELEM_KEY_END]) { err = nft_setelem_parse_key(ctx, set, &elem.key_end.val, nla[NFTA_SET_ELEM_KEY_END]); if (err < 0) return err; } err = nft_setelem_get(ctx, set, &elem, flags); if (err < 0) return err; err = -ENOMEM; skb = nlmsg_new(NLMSG_GOODSIZE, GFP_ATOMIC); if (skb == NULL) return err; err = nf_tables_fill_setelem_info(skb, ctx, ctx->seq, ctx->portid, NFT_MSG_NEWSETELEM, 0, set, elem.priv, reset); if (err < 0) goto err_fill_setelem; return nfnetlink_unicast(skb, ctx->net, ctx->portid); err_fill_setelem: kfree_skb(skb); return err; } static int nft_set_dump_ctx_init(struct nft_set_dump_ctx *dump_ctx, const struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[], bool reset) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_cur(info->net); u8 family = info->nfmsg->nfgen_family; struct net *net = info->net; struct nft_table *table; struct nft_set *set; table = nft_table_lookup(net, nla[NFTA_SET_ELEM_LIST_TABLE], family, genmask, 0); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_SET_ELEM_LIST_TABLE]); return PTR_ERR(table); } set = nft_set_lookup(net, table, nla[NFTA_SET_ELEM_LIST_SET], genmask); if (IS_ERR(set)) { NL_SET_BAD_ATTR(extack, nla[NFTA_SET_ELEM_LIST_SET]); return PTR_ERR(set); } nft_ctx_init(&dump_ctx->ctx, net, skb, info->nlh, family, table, NULL, nla); dump_ctx->set = set; dump_ctx->reset = reset; return 0; } /* called with rcu_read_lock held */ static int nf_tables_getsetelem(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; struct nft_set_dump_ctx dump_ctx; struct nlattr *attr; int rem, err = 0; if (info->nlh->nlmsg_flags & NLM_F_DUMP) { struct netlink_dump_control c = { .start = nf_tables_dump_set_start, .dump = nf_tables_dump_set, .done = nf_tables_dump_set_done, .module = THIS_MODULE, }; err = nft_set_dump_ctx_init(&dump_ctx, skb, info, nla, false); if (err) return err; c.data = &dump_ctx; return nft_netlink_dump_start_rcu(info->sk, skb, info->nlh, &c); } if (!nla[NFTA_SET_ELEM_LIST_ELEMENTS]) return -EINVAL; err = nft_set_dump_ctx_init(&dump_ctx, skb, info, nla, false); if (err) return err; nla_for_each_nested(attr, nla[NFTA_SET_ELEM_LIST_ELEMENTS], rem) { err = nft_get_set_elem(&dump_ctx.ctx, dump_ctx.set, attr, false); if (err < 0) { NL_SET_BAD_ATTR(extack, attr); break; } } return err; } static int nf_tables_getsetelem_reset(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct nftables_pernet *nft_net = nft_pernet(info->net); struct netlink_ext_ack *extack = info->extack; struct nft_set_dump_ctx dump_ctx; int 
rem, err = 0, nelems = 0; struct nlattr *attr; if (info->nlh->nlmsg_flags & NLM_F_DUMP) { struct netlink_dump_control c = { .start = nf_tables_dump_set_start, .dump = nf_tables_dumpreset_set, .done = nf_tables_dump_set_done, .module = THIS_MODULE, }; err = nft_set_dump_ctx_init(&dump_ctx, skb, info, nla, true); if (err) return err; c.data = &dump_ctx; return nft_netlink_dump_start_rcu(info->sk, skb, info->nlh, &c); } if (!nla[NFTA_SET_ELEM_LIST_ELEMENTS]) return -EINVAL; if (!try_module_get(THIS_MODULE)) return -EINVAL; rcu_read_unlock(); mutex_lock(&nft_net->commit_mutex); rcu_read_lock(); err = nft_set_dump_ctx_init(&dump_ctx, skb, info, nla, true); if (err) goto out_unlock; nla_for_each_nested(attr, nla[NFTA_SET_ELEM_LIST_ELEMENTS], rem) { err = nft_get_set_elem(&dump_ctx.ctx, dump_ctx.set, attr, true); if (err < 0) { NL_SET_BAD_ATTR(extack, attr); break; } nelems++; } audit_log_nft_set_reset(dump_ctx.ctx.table, nft_base_seq(info->net), nelems); out_unlock: rcu_read_unlock(); mutex_unlock(&nft_net->commit_mutex); rcu_read_lock(); module_put(THIS_MODULE); return err; } static void nf_tables_setelem_notify(const struct nft_ctx *ctx, const struct nft_set *set, const struct nft_elem_priv *elem_priv, int event) { struct nftables_pernet *nft_net; struct net *net = ctx->net; u32 portid = ctx->portid; struct sk_buff *skb; u16 flags = 0; int err; if (!ctx->report && !nfnetlink_has_listeners(net, NFNLGRP_NFTABLES)) return; skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL); if (skb == NULL) goto err; if (ctx->flags & (NLM_F_CREATE | NLM_F_EXCL)) flags |= ctx->flags & (NLM_F_CREATE | NLM_F_EXCL); err = nf_tables_fill_setelem_info(skb, ctx, 0, portid, event, flags, set, elem_priv, false); if (err < 0) { kfree_skb(skb); goto err; } nft_net = nft_pernet(net); nft_notify_enqueue(skb, ctx->report, &nft_net->notify_list); return; err: nfnetlink_set_err(net, portid, NFNLGRP_NFTABLES, -ENOBUFS); } static struct nft_trans *nft_trans_elem_alloc(const struct nft_ctx *ctx, int msg_type, struct nft_set *set) { struct nft_trans_elem *te; struct nft_trans *trans; trans = nft_trans_alloc(ctx, msg_type, struct_size(te, elems, 1)); if (trans == NULL) return NULL; te = nft_trans_container_elem(trans); te->nelems = 1; te->set = set; return trans; } struct nft_expr *nft_set_elem_expr_alloc(const struct nft_ctx *ctx, const struct nft_set *set, const struct nlattr *attr) { struct nft_expr *expr; int err; expr = nft_expr_init(ctx, attr); if (IS_ERR(expr)) return expr; err = -EOPNOTSUPP; if (expr->ops->type->flags & NFT_EXPR_GC) { if (set->flags & NFT_SET_TIMEOUT) goto err_set_elem_expr; if (!set->ops->gc_init) goto err_set_elem_expr; set->ops->gc_init(set); } return expr; err_set_elem_expr: nft_expr_destroy(ctx, expr); return ERR_PTR(err); } static int nft_set_ext_check(const struct nft_set_ext_tmpl *tmpl, u8 id, u32 len) { len += nft_set_ext_types[id].len; if (len > tmpl->ext_len[id] || len > U8_MAX) return -1; return 0; } static int nft_set_ext_memcpy(const struct nft_set_ext_tmpl *tmpl, u8 id, void *to, const void *from, u32 len) { if (nft_set_ext_check(tmpl, id, len) < 0) return -1; memcpy(to, from, len); return 0; } struct nft_elem_priv *nft_set_elem_init(const struct nft_set *set, const struct nft_set_ext_tmpl *tmpl, const u32 *key, const u32 *key_end, const u32 *data, u64 timeout, u64 expiration, gfp_t gfp) { struct nft_set_ext *ext; void *elem; elem = kzalloc(set->ops->elemsize + tmpl->len, gfp); if (elem == NULL) return ERR_PTR(-ENOMEM); ext = nft_set_elem_ext(set, elem); nft_set_ext_init(ext, tmpl); if 
(nft_set_ext_exists(ext, NFT_SET_EXT_KEY) && nft_set_ext_memcpy(tmpl, NFT_SET_EXT_KEY, nft_set_ext_key(ext), key, set->klen) < 0) goto err_ext_check; if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY_END) && nft_set_ext_memcpy(tmpl, NFT_SET_EXT_KEY_END, nft_set_ext_key_end(ext), key_end, set->klen) < 0) goto err_ext_check; if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA) && nft_set_ext_memcpy(tmpl, NFT_SET_EXT_DATA, nft_set_ext_data(ext), data, set->dlen) < 0) goto err_ext_check; if (nft_set_ext_exists(ext, NFT_SET_EXT_TIMEOUT)) { nft_set_ext_timeout(ext)->timeout = timeout; if (expiration == 0) expiration = timeout; nft_set_ext_timeout(ext)->expiration = get_jiffies_64() + expiration; } return elem; err_ext_check: kfree(elem); return ERR_PTR(-EINVAL); } static void __nft_set_elem_expr_destroy(const struct nft_ctx *ctx, struct nft_expr *expr) { if (expr->ops->destroy_clone) { expr->ops->destroy_clone(ctx, expr); module_put(expr->ops->type->owner); } else { nf_tables_expr_destroy(ctx, expr); } } static void nft_set_elem_expr_destroy(const struct nft_ctx *ctx, struct nft_set_elem_expr *elem_expr) { struct nft_expr *expr; u32 size; nft_setelem_expr_foreach(expr, elem_expr, size) __nft_set_elem_expr_destroy(ctx, expr); } /* Drop references and destroy. Called from gc, dynset and abort path. */ static void __nft_set_elem_destroy(const struct nft_ctx *ctx, const struct nft_set *set, const struct nft_elem_priv *elem_priv, bool destroy_expr) { struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); nft_data_release(nft_set_ext_key(ext), NFT_DATA_VALUE); if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA)) nft_data_release(nft_set_ext_data(ext), set->dtype); if (destroy_expr && nft_set_ext_exists(ext, NFT_SET_EXT_EXPRESSIONS)) nft_set_elem_expr_destroy(ctx, nft_set_ext_expr(ext)); if (nft_set_ext_exists(ext, NFT_SET_EXT_OBJREF)) nft_use_dec(&(*nft_set_ext_obj(ext))->use); kfree(elem_priv); } /* Drop references and destroy. Called from gc and dynset. */ void nft_set_elem_destroy(const struct nft_set *set, const struct nft_elem_priv *elem_priv, bool destroy_expr) { struct nft_ctx ctx = { .net = read_pnet(&set->net), .family = set->table->family, }; __nft_set_elem_destroy(&ctx, set, elem_priv, destroy_expr); } EXPORT_SYMBOL_GPL(nft_set_elem_destroy); /* Drop references and destroy. Called from abort path. */ static void nft_trans_set_elem_destroy(const struct nft_ctx *ctx, struct nft_trans_elem *te) { int i; for (i = 0; i < te->nelems; i++) { /* skip update request, see nft_trans_elems_new_abort() */ if (!te->elems[i].priv) continue; __nft_set_elem_destroy(ctx, te->set, te->elems[i].priv, true); } } /* Destroy element. References have been already dropped in the preparation * path via nft_setelem_data_deactivate(). 
*/ void nf_tables_set_elem_destroy(const struct nft_ctx *ctx, const struct nft_set *set, const struct nft_elem_priv *elem_priv) { struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); if (nft_set_ext_exists(ext, NFT_SET_EXT_EXPRESSIONS)) nft_set_elem_expr_destroy(ctx, nft_set_ext_expr(ext)); kfree(elem_priv); } static void nft_trans_elems_destroy(const struct nft_ctx *ctx, const struct nft_trans_elem *te) { int i; for (i = 0; i < te->nelems; i++) nf_tables_set_elem_destroy(ctx, te->set, te->elems[i].priv); } int nft_set_elem_expr_clone(const struct nft_ctx *ctx, struct nft_set *set, struct nft_expr *expr_array[]) { struct nft_expr *expr; int err, i, k; for (i = 0; i < set->num_exprs; i++) { expr = kzalloc(set->exprs[i]->ops->size, GFP_KERNEL_ACCOUNT); if (!expr) goto err_expr; err = nft_expr_clone(expr, set->exprs[i], GFP_KERNEL_ACCOUNT); if (err < 0) { kfree(expr); goto err_expr; } expr_array[i] = expr; } return 0; err_expr: for (k = i - 1; k >= 0; k--) nft_expr_destroy(ctx, expr_array[k]); return -ENOMEM; } static int nft_set_elem_expr_setup(struct nft_ctx *ctx, const struct nft_set_ext_tmpl *tmpl, const struct nft_set_ext *ext, struct nft_expr *expr_array[], u32 num_exprs) { struct nft_set_elem_expr *elem_expr = nft_set_ext_expr(ext); u32 len = sizeof(struct nft_set_elem_expr); struct nft_expr *expr; int i, err; if (num_exprs == 0) return 0; for (i = 0; i < num_exprs; i++) len += expr_array[i]->ops->size; if (nft_set_ext_check(tmpl, NFT_SET_EXT_EXPRESSIONS, len) < 0) return -EINVAL; for (i = 0; i < num_exprs; i++) { expr = nft_setelem_expr_at(elem_expr, elem_expr->size); err = nft_expr_clone(expr, expr_array[i], GFP_KERNEL_ACCOUNT); if (err < 0) goto err_elem_expr_setup; elem_expr->size += expr_array[i]->ops->size; nft_expr_destroy(ctx, expr_array[i]); expr_array[i] = NULL; } return 0; err_elem_expr_setup: for (; i < num_exprs; i++) { nft_expr_destroy(ctx, expr_array[i]); expr_array[i] = NULL; } return -ENOMEM; } struct nft_set_ext *nft_set_catchall_lookup(const struct net *net, const struct nft_set *set) { struct nft_set_elem_catchall *catchall; u8 genmask = nft_genmask_cur(net); struct nft_set_ext *ext; list_for_each_entry_rcu(catchall, &set->catchall_list, list) { ext = nft_set_elem_ext(set, catchall->elem); if (nft_set_elem_active(ext, genmask) && !nft_set_elem_expired(ext) && !nft_set_elem_is_dead(ext)) return ext; } return NULL; } EXPORT_SYMBOL_GPL(nft_set_catchall_lookup); static int nft_setelem_catchall_insert(const struct net *net, struct nft_set *set, const struct nft_set_elem *elem, struct nft_elem_priv **priv) { struct nft_set_elem_catchall *catchall; u8 genmask = nft_genmask_next(net); struct nft_set_ext *ext; list_for_each_entry(catchall, &set->catchall_list, list) { ext = nft_set_elem_ext(set, catchall->elem); if (nft_set_elem_active(ext, genmask)) { *priv = catchall->elem; return -EEXIST; } } catchall = kmalloc(sizeof(*catchall), GFP_KERNEL_ACCOUNT); if (!catchall) return -ENOMEM; catchall->elem = elem->priv; list_add_tail_rcu(&catchall->list, &set->catchall_list); return 0; } static int nft_setelem_insert(const struct net *net, struct nft_set *set, const struct nft_set_elem *elem, struct nft_elem_priv **elem_priv, unsigned int flags) { int ret; if (flags & NFT_SET_ELEM_CATCHALL) ret = nft_setelem_catchall_insert(net, set, elem, elem_priv); else ret = set->ops->insert(net, set, elem, elem_priv); return ret; } static bool nft_setelem_is_catchall(const struct nft_set *set, const struct nft_elem_priv *elem_priv) { struct nft_set_ext *ext = nft_set_elem_ext(set, 
elem_priv); if (nft_set_ext_exists(ext, NFT_SET_EXT_FLAGS) && *nft_set_ext_flags(ext) & NFT_SET_ELEM_CATCHALL) return true; return false; } static void nft_setelem_activate(struct net *net, struct nft_set *set, struct nft_elem_priv *elem_priv) { struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); if (nft_setelem_is_catchall(set, elem_priv)) { nft_clear(net, ext); } else { set->ops->activate(net, set, elem_priv); } } static void nft_trans_elem_update(const struct nft_set *set, const struct nft_trans_one_elem *elem) { const struct nft_set_ext *ext = nft_set_elem_ext(set, elem->priv); const struct nft_elem_update *update = elem->update; if (update->flags & NFT_TRANS_UPD_TIMEOUT) WRITE_ONCE(nft_set_ext_timeout(ext)->timeout, update->timeout); if (update->flags & NFT_TRANS_UPD_EXPIRATION) WRITE_ONCE(nft_set_ext_timeout(ext)->expiration, get_jiffies_64() + update->expiration); } static void nft_trans_elems_add(const struct nft_ctx *ctx, struct nft_trans_elem *te) { int i; for (i = 0; i < te->nelems; i++) { struct nft_trans_one_elem *elem = &te->elems[i]; if (elem->update) nft_trans_elem_update(te->set, elem); else nft_setelem_activate(ctx->net, te->set, elem->priv); nf_tables_setelem_notify(ctx, te->set, elem->priv, NFT_MSG_NEWSETELEM); kfree(elem->update); } } static int nft_setelem_catchall_deactivate(const struct net *net, struct nft_set *set, struct nft_set_elem *elem) { struct nft_set_elem_catchall *catchall; struct nft_set_ext *ext; list_for_each_entry(catchall, &set->catchall_list, list) { ext = nft_set_elem_ext(set, catchall->elem); if (!nft_is_active_next(net, ext)) continue; kfree(elem->priv); elem->priv = catchall->elem; nft_set_elem_change_active(net, set, ext); return 0; } return -ENOENT; } static int __nft_setelem_deactivate(const struct net *net, struct nft_set *set, struct nft_set_elem *elem) { void *priv; priv = set->ops->deactivate(net, set, elem); if (!priv) return -ENOENT; kfree(elem->priv); elem->priv = priv; set->ndeact++; return 0; } static int nft_setelem_deactivate(const struct net *net, struct nft_set *set, struct nft_set_elem *elem, u32 flags) { int ret; if (flags & NFT_SET_ELEM_CATCHALL) ret = nft_setelem_catchall_deactivate(net, set, elem); else ret = __nft_setelem_deactivate(net, set, elem); return ret; } static void nft_setelem_catchall_destroy(struct nft_set_elem_catchall *catchall) { list_del_rcu(&catchall->list); kfree_rcu(catchall, rcu); } static void nft_setelem_catchall_remove(const struct net *net, const struct nft_set *set, struct nft_elem_priv *elem_priv) { struct nft_set_elem_catchall *catchall, *next; list_for_each_entry_safe(catchall, next, &set->catchall_list, list) { if (catchall->elem == elem_priv) { nft_setelem_catchall_destroy(catchall); break; } } } static void nft_setelem_remove(const struct net *net, const struct nft_set *set, struct nft_elem_priv *elem_priv) { if (nft_setelem_is_catchall(set, elem_priv)) nft_setelem_catchall_remove(net, set, elem_priv); else set->ops->remove(net, set, elem_priv); } static void nft_trans_elems_remove(const struct nft_ctx *ctx, const struct nft_trans_elem *te) { int i; for (i = 0; i < te->nelems; i++) { WARN_ON_ONCE(te->elems[i].update); nf_tables_setelem_notify(ctx, te->set, te->elems[i].priv, te->nft_trans.msg_type); nft_setelem_remove(ctx->net, te->set, te->elems[i].priv); if (!nft_setelem_is_catchall(te->set, te->elems[i].priv)) { atomic_dec(&te->set->nelems); te->set->ndeact--; } } } static bool nft_setelem_valid_key_end(const struct nft_set *set, struct nlattr **nla, u32 flags) { if ((set->flags & 
(NFT_SET_CONCAT | NFT_SET_INTERVAL)) == (NFT_SET_CONCAT | NFT_SET_INTERVAL)) { if (flags & NFT_SET_ELEM_INTERVAL_END) return false; if (nla[NFTA_SET_ELEM_KEY_END] && flags & NFT_SET_ELEM_CATCHALL) return false; } else { if (nla[NFTA_SET_ELEM_KEY_END]) return false; } return true; } static u32 nft_set_maxsize(const struct nft_set *set) { u32 maxsize, delta; if (!set->size) return UINT_MAX; if (set->ops->adjust_maxsize) delta = set->ops->adjust_maxsize(set); else delta = 0; if (check_add_overflow(set->size, set->ndeact, &maxsize)) return UINT_MAX; if (check_add_overflow(maxsize, delta, &maxsize)) return UINT_MAX; return maxsize; } static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set, const struct nlattr *attr, u32 nlmsg_flags) { struct nft_expr *expr_array[NFT_SET_EXPR_MAX] = {}; struct nlattr *nla[NFTA_SET_ELEM_MAX + 1]; u8 genmask = nft_genmask_next(ctx->net); u32 flags = 0, size = 0, num_exprs = 0; struct nft_set_ext_tmpl tmpl; struct nft_set_ext *ext, *ext2; struct nft_set_elem elem; struct nft_set_binding *binding; struct nft_elem_priv *elem_priv; struct nft_object *obj = NULL; struct nft_userdata *udata; struct nft_data_desc desc; enum nft_registers dreg; struct nft_trans *trans; u64 expiration; u64 timeout; int err, i; u8 ulen; err = nla_parse_nested_deprecated(nla, NFTA_SET_ELEM_MAX, attr, nft_set_elem_policy, NULL); if (err < 0) return err; nft_set_ext_prepare(&tmpl); err = nft_setelem_parse_flags(set, nla[NFTA_SET_ELEM_FLAGS], &flags); if (err < 0) return err; if (((flags & NFT_SET_ELEM_CATCHALL) && nla[NFTA_SET_ELEM_KEY]) || (!(flags & NFT_SET_ELEM_CATCHALL) && !nla[NFTA_SET_ELEM_KEY])) return -EINVAL; if (flags != 0) { err = nft_set_ext_add(&tmpl, NFT_SET_EXT_FLAGS); if (err < 0) return err; } if (set->flags & NFT_SET_MAP) { if (nla[NFTA_SET_ELEM_DATA] == NULL && !(flags & NFT_SET_ELEM_INTERVAL_END)) return -EINVAL; } else { if (nla[NFTA_SET_ELEM_DATA] != NULL) return -EINVAL; } if (set->flags & NFT_SET_OBJECT) { if (!nla[NFTA_SET_ELEM_OBJREF] && !(flags & NFT_SET_ELEM_INTERVAL_END)) return -EINVAL; } else { if (nla[NFTA_SET_ELEM_OBJREF]) return -EINVAL; } if (!nft_setelem_valid_key_end(set, nla, flags)) return -EINVAL; if ((flags & NFT_SET_ELEM_INTERVAL_END) && (nla[NFTA_SET_ELEM_DATA] || nla[NFTA_SET_ELEM_OBJREF] || nla[NFTA_SET_ELEM_TIMEOUT] || nla[NFTA_SET_ELEM_EXPIRATION] || nla[NFTA_SET_ELEM_USERDATA] || nla[NFTA_SET_ELEM_EXPR] || nla[NFTA_SET_ELEM_KEY_END] || nla[NFTA_SET_ELEM_EXPRESSIONS])) return -EINVAL; timeout = 0; if (nla[NFTA_SET_ELEM_TIMEOUT] != NULL) { if (!(set->flags & NFT_SET_TIMEOUT)) return -EINVAL; err = nf_msecs_to_jiffies64(nla[NFTA_SET_ELEM_TIMEOUT], &timeout); if (err) return err; } else if (set->flags & NFT_SET_TIMEOUT && !(flags & NFT_SET_ELEM_INTERVAL_END)) { timeout = set->timeout; } expiration = 0; if (nla[NFTA_SET_ELEM_EXPIRATION] != NULL) { if (!(set->flags & NFT_SET_TIMEOUT)) return -EINVAL; if (timeout == 0) return -EOPNOTSUPP; err = nf_msecs_to_jiffies64(nla[NFTA_SET_ELEM_EXPIRATION], &expiration); if (err) return err; if (expiration > timeout) return -ERANGE; } if (nla[NFTA_SET_ELEM_EXPR]) { struct nft_expr *expr; if (set->num_exprs && set->num_exprs != 1) return -EOPNOTSUPP; expr = nft_set_elem_expr_alloc(ctx, set, nla[NFTA_SET_ELEM_EXPR]); if (IS_ERR(expr)) return PTR_ERR(expr); expr_array[0] = expr; num_exprs = 1; if (set->num_exprs && set->exprs[0]->ops != expr->ops) { err = -EOPNOTSUPP; goto err_set_elem_expr; } } else if (nla[NFTA_SET_ELEM_EXPRESSIONS]) { struct nft_expr *expr; struct nlattr *tmp; int left; i = 0; 
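/* Per-element expression list: every nested attribute must be an NFTA_LIST_ELEM, the total may not exceed NFT_SET_EXPR_MAX, and when the set itself defines expressions each entry must match the set's expression at the same index. */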
nla_for_each_nested(tmp, nla[NFTA_SET_ELEM_EXPRESSIONS], left) { if (i == NFT_SET_EXPR_MAX || (set->num_exprs && set->num_exprs == i)) { err = -E2BIG; goto err_set_elem_expr; } if (nla_type(tmp) != NFTA_LIST_ELEM) { err = -EINVAL; goto err_set_elem_expr; } expr = nft_set_elem_expr_alloc(ctx, set, tmp); if (IS_ERR(expr)) { err = PTR_ERR(expr); goto err_set_elem_expr; } expr_array[i] = expr; num_exprs++; if (set->num_exprs && expr->ops != set->exprs[i]->ops) { err = -EOPNOTSUPP; goto err_set_elem_expr; } i++; } if (set->num_exprs && set->num_exprs != i) { err = -EOPNOTSUPP; goto err_set_elem_expr; } } else if (set->num_exprs > 0 && !(flags & NFT_SET_ELEM_INTERVAL_END)) { err = nft_set_elem_expr_clone(ctx, set, expr_array); if (err < 0) goto err_set_elem_expr_clone; num_exprs = set->num_exprs; } if (nla[NFTA_SET_ELEM_KEY]) { err = nft_setelem_parse_key(ctx, set, &elem.key.val, nla[NFTA_SET_ELEM_KEY]); if (err < 0) goto err_set_elem_expr; err = nft_set_ext_add_length(&tmpl, NFT_SET_EXT_KEY, set->klen); if (err < 0) goto err_parse_key; } if (nla[NFTA_SET_ELEM_KEY_END]) { err = nft_setelem_parse_key(ctx, set, &elem.key_end.val, nla[NFTA_SET_ELEM_KEY_END]); if (err < 0) goto err_parse_key; err = nft_set_ext_add_length(&tmpl, NFT_SET_EXT_KEY_END, set->klen); if (err < 0) goto err_parse_key_end; } if (set->flags & NFT_SET_TIMEOUT) { err = nft_set_ext_add(&tmpl, NFT_SET_EXT_TIMEOUT); if (err < 0) goto err_parse_key_end; } if (num_exprs) { for (i = 0; i < num_exprs; i++) size += expr_array[i]->ops->size; err = nft_set_ext_add_length(&tmpl, NFT_SET_EXT_EXPRESSIONS, sizeof(struct nft_set_elem_expr) + size); if (err < 0) goto err_parse_key_end; } if (nla[NFTA_SET_ELEM_OBJREF] != NULL) { obj = nft_obj_lookup(ctx->net, ctx->table, nla[NFTA_SET_ELEM_OBJREF], set->objtype, genmask); if (IS_ERR(obj)) { err = PTR_ERR(obj); obj = NULL; goto err_parse_key_end; } if (!nft_use_inc(&obj->use)) { err = -EMFILE; obj = NULL; goto err_parse_key_end; } err = nft_set_ext_add(&tmpl, NFT_SET_EXT_OBJREF); if (err < 0) goto err_parse_key_end; } if (nla[NFTA_SET_ELEM_DATA] != NULL) { err = nft_setelem_parse_data(ctx, set, &desc, &elem.data.val, nla[NFTA_SET_ELEM_DATA]); if (err < 0) goto err_parse_key_end; dreg = nft_type_to_reg(set->dtype); list_for_each_entry(binding, &set->bindings, list) { struct nft_ctx bind_ctx = { .net = ctx->net, .family = ctx->family, .table = ctx->table, .chain = (struct nft_chain *)binding->chain, }; if (!(binding->flags & NFT_SET_MAP)) continue; err = nft_validate_register_store(&bind_ctx, dreg, &elem.data.val, desc.type, desc.len); if (err < 0) goto err_parse_data; if (desc.type == NFT_DATA_VERDICT && (elem.data.val.verdict.code == NFT_GOTO || elem.data.val.verdict.code == NFT_JUMP)) nft_validate_state_update(ctx->table, NFT_VALIDATE_NEED); } err = nft_set_ext_add_length(&tmpl, NFT_SET_EXT_DATA, desc.len); if (err < 0) goto err_parse_data; } /* The full maximum length of userdata can exceed the maximum * offset value (U8_MAX) for following extensions, therefore it * must be the last extension added.
*/ ulen = 0; if (nla[NFTA_SET_ELEM_USERDATA] != NULL) { ulen = nla_len(nla[NFTA_SET_ELEM_USERDATA]); if (ulen > 0) { err = nft_set_ext_add_length(&tmpl, NFT_SET_EXT_USERDATA, ulen); if (err < 0) goto err_parse_data; } } elem.priv = nft_set_elem_init(set, &tmpl, elem.key.val.data, elem.key_end.val.data, elem.data.val.data, timeout, expiration, GFP_KERNEL_ACCOUNT); if (IS_ERR(elem.priv)) { err = PTR_ERR(elem.priv); goto err_parse_data; } ext = nft_set_elem_ext(set, elem.priv); if (flags) *nft_set_ext_flags(ext) = flags; if (obj) *nft_set_ext_obj(ext) = obj; if (ulen > 0) { if (nft_set_ext_check(&tmpl, NFT_SET_EXT_USERDATA, ulen) < 0) { err = -EINVAL; goto err_elem_free; } udata = nft_set_ext_userdata(ext); udata->len = ulen - 1; nla_memcpy(&udata->data, nla[NFTA_SET_ELEM_USERDATA], ulen); } err = nft_set_elem_expr_setup(ctx, &tmpl, ext, expr_array, num_exprs); if (err < 0) goto err_elem_free; trans = nft_trans_elem_alloc(ctx, NFT_MSG_NEWSETELEM, set); if (trans == NULL) { err = -ENOMEM; goto err_elem_free; } ext->genmask = nft_genmask_cur(ctx->net); err = nft_setelem_insert(ctx->net, set, &elem, &elem_priv, flags); if (err) { if (err == -EEXIST) { ext2 = nft_set_elem_ext(set, elem_priv); if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA) ^ nft_set_ext_exists(ext2, NFT_SET_EXT_DATA) || nft_set_ext_exists(ext, NFT_SET_EXT_OBJREF) ^ nft_set_ext_exists(ext2, NFT_SET_EXT_OBJREF)) goto err_element_clash; if ((nft_set_ext_exists(ext, NFT_SET_EXT_DATA) && nft_set_ext_exists(ext2, NFT_SET_EXT_DATA) && memcmp(nft_set_ext_data(ext), nft_set_ext_data(ext2), set->dlen) != 0) || (nft_set_ext_exists(ext, NFT_SET_EXT_OBJREF) && nft_set_ext_exists(ext2, NFT_SET_EXT_OBJREF) && *nft_set_ext_obj(ext) != *nft_set_ext_obj(ext2))) goto err_element_clash; else if (!(nlmsg_flags & NLM_F_EXCL)) { err = 0; if (nft_set_ext_exists(ext2, NFT_SET_EXT_TIMEOUT)) { struct nft_elem_update update = { }; if (timeout != nft_set_ext_timeout(ext2)->timeout) { update.timeout = timeout; if (expiration == 0) expiration = timeout; update.flags |= NFT_TRANS_UPD_TIMEOUT; } if (expiration) { update.expiration = expiration; update.flags |= NFT_TRANS_UPD_EXPIRATION; } if (update.flags) { struct nft_trans_one_elem *ue; ue = &nft_trans_container_elem(trans)->elems[0]; ue->update = kmemdup(&update, sizeof(update), GFP_KERNEL); if (!ue->update) { err = -ENOMEM; goto err_element_clash; } ue->priv = elem_priv; nft_trans_commit_list_add_elem(ctx->net, trans); goto err_elem_free; } } } } else if (err == -ENOTEMPTY) { /* ENOTEMPTY reports overlapping between this element * and an existing one. 
*/ err = -EEXIST; } goto err_element_clash; } if (!(flags & NFT_SET_ELEM_CATCHALL)) { unsigned int max = nft_set_maxsize(set); if (!atomic_add_unless(&set->nelems, 1, max)) { err = -ENFILE; goto err_set_full; } } nft_trans_container_elem(trans)->elems[0].priv = elem.priv; nft_trans_commit_list_add_elem(ctx->net, trans); return 0; err_set_full: nft_setelem_remove(ctx->net, set, elem.priv); err_element_clash: kfree(trans); err_elem_free: nf_tables_set_elem_destroy(ctx, set, elem.priv); err_parse_data: if (nla[NFTA_SET_ELEM_DATA] != NULL) nft_data_release(&elem.data.val, desc.type); err_parse_key_end: if (obj) nft_use_dec_restore(&obj->use); nft_data_release(&elem.key_end.val, NFT_DATA_VALUE); err_parse_key: nft_data_release(&elem.key.val, NFT_DATA_VALUE); err_set_elem_expr: for (i = 0; i < num_exprs && expr_array[i]; i++) nft_expr_destroy(ctx, expr_array[i]); err_set_elem_expr_clone: return err; } static int nf_tables_newsetelem(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_next(info->net); u8 family = info->nfmsg->nfgen_family; struct net *net = info->net; const struct nlattr *attr; struct nft_table *table; struct nft_set *set; struct nft_ctx ctx; int rem, err; if (nla[NFTA_SET_ELEM_LIST_ELEMENTS] == NULL) return -EINVAL; table = nft_table_lookup(net, nla[NFTA_SET_ELEM_LIST_TABLE], family, genmask, NETLINK_CB(skb).portid); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_SET_ELEM_LIST_TABLE]); return PTR_ERR(table); } set = nft_set_lookup_global(net, table, nla[NFTA_SET_ELEM_LIST_SET], nla[NFTA_SET_ELEM_LIST_SET_ID], genmask); if (IS_ERR(set)) { NL_SET_BAD_ATTR(extack, nla[NFTA_SET_ELEM_LIST_SET]); return PTR_ERR(set); } if (!list_empty(&set->bindings) && (set->flags & (NFT_SET_CONSTANT | NFT_SET_ANONYMOUS))) return -EBUSY; nft_ctx_init(&ctx, net, skb, info->nlh, family, table, NULL, nla); nla_for_each_nested(attr, nla[NFTA_SET_ELEM_LIST_ELEMENTS], rem) { err = nft_add_set_elem(&ctx, set, attr, info->nlh->nlmsg_flags); if (err < 0) { NL_SET_BAD_ATTR(extack, attr); return err; } } if (table->validate_state == NFT_VALIDATE_DO) return nft_table_validate(net, table); return 0; } /** * nft_data_hold - hold a nft_data item * * @data: struct nft_data to hold * @type: type of data * * Hold a nft_data item. NFT_DATA_VALUE types can be silently discarded, * NFT_DATA_VERDICT bumps the reference to chains in case of NFT_JUMP and * NFT_GOTO verdicts. This function must be called on active data objects * from the second phase of the commit protocol.
*/ void nft_data_hold(const struct nft_data *data, enum nft_data_types type) { struct nft_chain *chain; if (type == NFT_DATA_VERDICT) { switch (data->verdict.code) { case NFT_JUMP: case NFT_GOTO: chain = data->verdict.chain; nft_use_inc_restore(&chain->use); break; } } } static int nft_setelem_active_next(const struct net *net, const struct nft_set *set, struct nft_elem_priv *elem_priv) { const struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); u8 genmask = nft_genmask_next(net); return nft_set_elem_active(ext, genmask); } static void nft_setelem_data_activate(const struct net *net, const struct nft_set *set, struct nft_elem_priv *elem_priv) { const struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA)) nft_data_hold(nft_set_ext_data(ext), set->dtype); if (nft_set_ext_exists(ext, NFT_SET_EXT_OBJREF)) nft_use_inc_restore(&(*nft_set_ext_obj(ext))->use); } void nft_setelem_data_deactivate(const struct net *net, const struct nft_set *set, struct nft_elem_priv *elem_priv) { const struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA)) nft_data_release(nft_set_ext_data(ext), set->dtype); if (nft_set_ext_exists(ext, NFT_SET_EXT_OBJREF)) nft_use_dec(&(*nft_set_ext_obj(ext))->use); } /* similar to nft_trans_elems_remove, but called from abort path to undo newsetelem. * No notifications and no ndeact changes. * * Returns true if set had been added to (i.e., elements need to be removed again). */ static bool nft_trans_elems_new_abort(const struct nft_ctx *ctx, struct nft_trans_elem *te) { bool removed = false; int i; for (i = 0; i < te->nelems; i++) { if (te->elems[i].update) { kfree(te->elems[i].update); te->elems[i].update = NULL; /* Update request, so do not release this element */ te->elems[i].priv = NULL; continue; } if (!te->set->ops->abort || nft_setelem_is_catchall(te->set, te->elems[i].priv)) nft_setelem_remove(ctx->net, te->set, te->elems[i].priv); if (!nft_setelem_is_catchall(te->set, te->elems[i].priv)) atomic_dec(&te->set->nelems); removed = true; } return removed; } /* Called from abort path to undo DELSETELEM/DESTROYSETELEM. 
*/ static void nft_trans_elems_destroy_abort(const struct nft_ctx *ctx, const struct nft_trans_elem *te) { int i; for (i = 0; i < te->nelems; i++) { if (!nft_setelem_active_next(ctx->net, te->set, te->elems[i].priv)) { nft_setelem_data_activate(ctx->net, te->set, te->elems[i].priv); nft_setelem_activate(ctx->net, te->set, te->elems[i].priv); } if (!nft_setelem_is_catchall(te->set, te->elems[i].priv)) te->set->ndeact--; } } static int nft_del_setelem(struct nft_ctx *ctx, struct nft_set *set, const struct nlattr *attr) { struct nlattr *nla[NFTA_SET_ELEM_MAX + 1]; struct nft_set_ext_tmpl tmpl; struct nft_set_elem elem; struct nft_set_ext *ext; struct nft_trans *trans; u32 flags = 0; int err; err = nla_parse_nested_deprecated(nla, NFTA_SET_ELEM_MAX, attr, nft_set_elem_policy, NULL); if (err < 0) return err; err = nft_setelem_parse_flags(set, nla[NFTA_SET_ELEM_FLAGS], &flags); if (err < 0) return err; if (!nla[NFTA_SET_ELEM_KEY] && !(flags & NFT_SET_ELEM_CATCHALL)) return -EINVAL; if (!nft_setelem_valid_key_end(set, nla, flags)) return -EINVAL; nft_set_ext_prepare(&tmpl); if (flags != 0) { err = nft_set_ext_add(&tmpl, NFT_SET_EXT_FLAGS); if (err < 0) return err; } if (nla[NFTA_SET_ELEM_KEY]) { err = nft_setelem_parse_key(ctx, set, &elem.key.val, nla[NFTA_SET_ELEM_KEY]); if (err < 0) return err; err = nft_set_ext_add_length(&tmpl, NFT_SET_EXT_KEY, set->klen); if (err < 0) goto fail_elem; } if (nla[NFTA_SET_ELEM_KEY_END]) { err = nft_setelem_parse_key(ctx, set, &elem.key_end.val, nla[NFTA_SET_ELEM_KEY_END]); if (err < 0) goto fail_elem; err = nft_set_ext_add_length(&tmpl, NFT_SET_EXT_KEY_END, set->klen); if (err < 0) goto fail_elem_key_end; } err = -ENOMEM; elem.priv = nft_set_elem_init(set, &tmpl, elem.key.val.data, elem.key_end.val.data, NULL, 0, 0, GFP_KERNEL_ACCOUNT); if (IS_ERR(elem.priv)) { err = PTR_ERR(elem.priv); goto fail_elem_key_end; } ext = nft_set_elem_ext(set, elem.priv); if (flags) *nft_set_ext_flags(ext) = flags; trans = nft_trans_elem_alloc(ctx, NFT_MSG_DELSETELEM, set); if (trans == NULL) goto fail_trans; err = nft_setelem_deactivate(ctx->net, set, &elem, flags); if (err < 0) goto fail_ops; nft_setelem_data_deactivate(ctx->net, set, elem.priv); nft_trans_container_elem(trans)->elems[0].priv = elem.priv; nft_trans_commit_list_add_elem(ctx->net, trans); return 0; fail_ops: kfree(trans); fail_trans: kfree(elem.priv); fail_elem_key_end: nft_data_release(&elem.key_end.val, NFT_DATA_VALUE); fail_elem: nft_data_release(&elem.key.val, NFT_DATA_VALUE); return err; } static int nft_setelem_flush(const struct nft_ctx *ctx, struct nft_set *set, const struct nft_set_iter *iter, struct nft_elem_priv *elem_priv) { const struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv); struct nft_trans *trans; if (!nft_set_elem_active(ext, iter->genmask)) return 0; trans = nft_trans_alloc(ctx, NFT_MSG_DELSETELEM, struct_size_t(struct nft_trans_elem, elems, 1)); if (!trans) return -ENOMEM; set->ops->flush(ctx->net, set, elem_priv); set->ndeact++; nft_setelem_data_deactivate(ctx->net, set, elem_priv); nft_trans_elem_set(trans) = set; nft_trans_container_elem(trans)->nelems = 1; nft_trans_container_elem(trans)->elems[0].priv = elem_priv; nft_trans_commit_list_add_elem(ctx->net, trans); return 0; } static int __nft_set_catchall_flush(const struct nft_ctx *ctx, struct nft_set *set, struct nft_elem_priv *elem_priv) { struct nft_trans *trans; trans = nft_trans_elem_alloc(ctx, NFT_MSG_DELSETELEM, set); if (!trans) return -ENOMEM; nft_setelem_data_deactivate(ctx->net, set, elem_priv); 
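	/* Hand the catchall element over to the transaction log: commit
	 * releases it, abort re-activates it.
	 */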
nft_trans_container_elem(trans)->elems[0].priv = elem_priv; nft_trans_commit_list_add_elem(ctx->net, trans); return 0; } static int nft_set_catchall_flush(const struct nft_ctx *ctx, struct nft_set *set) { u8 genmask = nft_genmask_next(ctx->net); struct nft_set_elem_catchall *catchall; struct nft_set_ext *ext; int ret = 0; list_for_each_entry_rcu(catchall, &set->catchall_list, list, lockdep_commit_lock_is_held(ctx->net)) { ext = nft_set_elem_ext(set, catchall->elem); if (!nft_set_elem_active(ext, genmask)) continue; ret = __nft_set_catchall_flush(ctx, set, catchall->elem); if (ret < 0) break; nft_set_elem_change_active(ctx->net, set, ext); } return ret; } static int nft_set_flush(struct nft_ctx *ctx, struct nft_set *set, u8 genmask) { struct nft_set_iter iter = { .genmask = genmask, .type = NFT_ITER_UPDATE, .fn = nft_setelem_flush, }; set->ops->walk(ctx, set, &iter); if (!iter.err) iter.err = nft_set_catchall_flush(ctx, set); return iter.err; } static int nf_tables_delsetelem(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_next(info->net); u8 family = info->nfmsg->nfgen_family; struct net *net = info->net; const struct nlattr *attr; struct nft_table *table; struct nft_set *set; struct nft_ctx ctx; int rem, err = 0; table = nft_table_lookup(net, nla[NFTA_SET_ELEM_LIST_TABLE], family, genmask, NETLINK_CB(skb).portid); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_SET_ELEM_LIST_TABLE]); return PTR_ERR(table); } set = nft_set_lookup(net, table, nla[NFTA_SET_ELEM_LIST_SET], genmask); if (IS_ERR(set)) { NL_SET_BAD_ATTR(extack, nla[NFTA_SET_ELEM_LIST_SET]); return PTR_ERR(set); } if (nft_set_is_anonymous(set)) return -EOPNOTSUPP; if (!list_empty(&set->bindings) && (set->flags & NFT_SET_CONSTANT)) return -EBUSY; nft_ctx_init(&ctx, net, skb, info->nlh, family, table, NULL, nla); if (!nla[NFTA_SET_ELEM_LIST_ELEMENTS]) return nft_set_flush(&ctx, set, genmask); nla_for_each_nested(attr, nla[NFTA_SET_ELEM_LIST_ELEMENTS], rem) { err = nft_del_setelem(&ctx, set, attr); if (err == -ENOENT && NFNL_MSG_TYPE(info->nlh->nlmsg_type) == NFT_MSG_DESTROYSETELEM) continue; if (err < 0) { NL_SET_BAD_ATTR(extack, attr); return err; } } return 0; } /* * Stateful objects */ /** * nft_register_obj- register nf_tables stateful object type * @obj_type: object type * * Registers the object type for use with nf_tables. Returns zero on * success or a negative errno code otherwise. */ int nft_register_obj(struct nft_object_type *obj_type) { if (obj_type->type == NFT_OBJECT_UNSPEC) return -EINVAL; nfnl_lock(NFNL_SUBSYS_NFTABLES); list_add_rcu(&obj_type->list, &nf_tables_objects); nfnl_unlock(NFNL_SUBSYS_NFTABLES); return 0; } EXPORT_SYMBOL_GPL(nft_register_obj); /** * nft_unregister_obj - unregister nf_tables object type * @obj_type: object type * * Unregisters the object type for use with nf_tables. 
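 *
 * A minimal sketch of how an object type module pairs this with
 * nft_register_obj(); the module and type names here are hypothetical:
 *
 *	static int __init nft_foo_module_init(void)
 *	{
 *		return nft_register_obj(&nft_foo_obj_type);
 *	}
 *
 *	static void __exit nft_foo_module_exit(void)
 *	{
 *		nft_unregister_obj(&nft_foo_obj_type);
 *	}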
*/ void nft_unregister_obj(struct nft_object_type *obj_type) { nfnl_lock(NFNL_SUBSYS_NFTABLES); list_del_rcu(&obj_type->list); nfnl_unlock(NFNL_SUBSYS_NFTABLES); } EXPORT_SYMBOL_GPL(nft_unregister_obj); struct nft_object *nft_obj_lookup(const struct net *net, const struct nft_table *table, const struct nlattr *nla, u32 objtype, u8 genmask) { struct nft_object_hash_key k = { .table = table }; char search[NFT_OBJ_MAXNAMELEN]; struct rhlist_head *tmp, *list; struct nft_object *obj; nla_strscpy(search, nla, sizeof(search)); k.name = search; WARN_ON_ONCE(!rcu_read_lock_held() && !lockdep_commit_lock_is_held(net)); rcu_read_lock(); list = rhltable_lookup(&nft_objname_ht, &k, nft_objname_ht_params); if (!list) goto out; rhl_for_each_entry_rcu(obj, tmp, list, rhlhead) { if (objtype == obj->ops->type->type && nft_active_genmask(obj, genmask)) { rcu_read_unlock(); return obj; } } out: rcu_read_unlock(); return ERR_PTR(-ENOENT); } EXPORT_SYMBOL_GPL(nft_obj_lookup); static struct nft_object *nft_obj_lookup_byhandle(const struct nft_table *table, const struct nlattr *nla, u32 objtype, u8 genmask) { struct nft_object *obj; list_for_each_entry(obj, &table->objects, list) { if (be64_to_cpu(nla_get_be64(nla)) == obj->handle && objtype == obj->ops->type->type && nft_active_genmask(obj, genmask)) return obj; } return ERR_PTR(-ENOENT); } static const struct nla_policy nft_obj_policy[NFTA_OBJ_MAX + 1] = { [NFTA_OBJ_TABLE] = { .type = NLA_STRING, .len = NFT_TABLE_MAXNAMELEN - 1 }, [NFTA_OBJ_NAME] = { .type = NLA_STRING, .len = NFT_OBJ_MAXNAMELEN - 1 }, [NFTA_OBJ_TYPE] = { .type = NLA_U32 }, [NFTA_OBJ_DATA] = { .type = NLA_NESTED }, [NFTA_OBJ_HANDLE] = { .type = NLA_U64}, [NFTA_OBJ_USERDATA] = { .type = NLA_BINARY, .len = NFT_USERDATA_MAXLEN }, }; static struct nft_object *nft_obj_init(const struct nft_ctx *ctx, const struct nft_object_type *type, const struct nlattr *attr) { struct nlattr **tb; const struct nft_object_ops *ops; struct nft_object *obj; int err = -ENOMEM; tb = kmalloc_array(type->maxattr + 1, sizeof(*tb), GFP_KERNEL); if (!tb) goto err1; if (attr) { err = nla_parse_nested_deprecated(tb, type->maxattr, attr, type->policy, NULL); if (err < 0) goto err2; } else { memset(tb, 0, sizeof(tb[0]) * (type->maxattr + 1)); } if (type->select_ops) { ops = type->select_ops(ctx, (const struct nlattr * const *)tb); if (IS_ERR(ops)) { err = PTR_ERR(ops); goto err2; } } else { ops = type->ops; } err = -ENOMEM; obj = kzalloc(sizeof(*obj) + ops->size, GFP_KERNEL_ACCOUNT); if (!obj) goto err2; err = ops->init(ctx, (const struct nlattr * const *)tb, obj); if (err < 0) goto err3; obj->ops = ops; kfree(tb); return obj; err3: kfree(obj); err2: kfree(tb); err1: return ERR_PTR(err); } static int nft_object_dump(struct sk_buff *skb, unsigned int attr, struct nft_object *obj, bool reset) { struct nlattr *nest; nest = nla_nest_start_noflag(skb, attr); if (!nest) goto nla_put_failure; if (obj->ops->dump(skb, obj, reset) < 0) goto nla_put_failure; nla_nest_end(skb, nest); return 0; nla_put_failure: return -1; } static const struct nft_object_type *__nft_obj_type_get(u32 objtype, u8 family) { const struct nft_object_type *type; list_for_each_entry_rcu(type, &nf_tables_objects, list) { if (type->family != NFPROTO_UNSPEC && type->family != family) continue; if (objtype == type->type) return type; } return NULL; } static const struct nft_object_type * nft_obj_type_get(struct net *net, u32 objtype, u8 family) { const struct nft_object_type *type; rcu_read_lock(); type = __nft_obj_type_get(objtype, family); if (type != NULL && 
try_module_get(type->owner)) { rcu_read_unlock(); return type; } rcu_read_unlock(); lockdep_nfnl_nft_mutex_not_held(); #ifdef CONFIG_MODULES if (type == NULL) { if (nft_request_module(net, "nft-obj-%u", objtype) == -EAGAIN) return ERR_PTR(-EAGAIN); } #endif return ERR_PTR(-ENOENT); } static int nf_tables_updobj(const struct nft_ctx *ctx, const struct nft_object_type *type, const struct nlattr *attr, struct nft_object *obj) { struct nft_object *newobj; struct nft_trans *trans; int err = -ENOMEM; /* caller must have obtained type->owner reference. */ trans = nft_trans_alloc(ctx, NFT_MSG_NEWOBJ, sizeof(struct nft_trans_obj)); if (!trans) goto err_trans; newobj = nft_obj_init(ctx, type, attr); if (IS_ERR(newobj)) { err = PTR_ERR(newobj); goto err_free_trans; } nft_trans_obj(trans) = obj; nft_trans_obj_update(trans) = true; nft_trans_obj_newobj(trans) = newobj; nft_trans_commit_list_add_tail(ctx->net, trans); return 0; err_free_trans: kfree(trans); err_trans: module_put(type->owner); return err; } static int nf_tables_newobj(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_next(info->net); u8 family = info->nfmsg->nfgen_family; const struct nft_object_type *type; struct net *net = info->net; struct nft_table *table; struct nft_object *obj; struct nft_ctx ctx; u32 objtype; int err; if (!nla[NFTA_OBJ_TYPE] || !nla[NFTA_OBJ_NAME] || !nla[NFTA_OBJ_DATA]) return -EINVAL; table = nft_table_lookup(net, nla[NFTA_OBJ_TABLE], family, genmask, NETLINK_CB(skb).portid); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_OBJ_TABLE]); return PTR_ERR(table); } objtype = ntohl(nla_get_be32(nla[NFTA_OBJ_TYPE])); obj = nft_obj_lookup(net, table, nla[NFTA_OBJ_NAME], objtype, genmask); if (IS_ERR(obj)) { err = PTR_ERR(obj); if (err != -ENOENT) { NL_SET_BAD_ATTR(extack, nla[NFTA_OBJ_NAME]); return err; } } else { if (info->nlh->nlmsg_flags & NLM_F_EXCL) { NL_SET_BAD_ATTR(extack, nla[NFTA_OBJ_NAME]); return -EEXIST; } if (info->nlh->nlmsg_flags & NLM_F_REPLACE) return -EOPNOTSUPP; if (!obj->ops->update) return 0; type = nft_obj_type_get(net, objtype, family); if (WARN_ON_ONCE(IS_ERR(type))) return PTR_ERR(type); nft_ctx_init(&ctx, net, skb, info->nlh, family, table, NULL, nla); /* type->owner reference is put when transaction object is released. 
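 * On error, nf_tables_updobj() drops the reference itself via module_put().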
*/ return nf_tables_updobj(&ctx, type, nla[NFTA_OBJ_DATA], obj); } nft_ctx_init(&ctx, net, skb, info->nlh, family, table, NULL, nla); if (!nft_use_inc(&table->use)) return -EMFILE; type = nft_obj_type_get(net, objtype, family); if (IS_ERR(type)) { err = PTR_ERR(type); goto err_type; } obj = nft_obj_init(&ctx, type, nla[NFTA_OBJ_DATA]); if (IS_ERR(obj)) { err = PTR_ERR(obj); goto err_init; } obj->key.table = table; obj->handle = nf_tables_alloc_handle(table); obj->key.name = nla_strdup(nla[NFTA_OBJ_NAME], GFP_KERNEL_ACCOUNT); if (!obj->key.name) { err = -ENOMEM; goto err_strdup; } if (nla[NFTA_OBJ_USERDATA]) { obj->udata = nla_memdup(nla[NFTA_OBJ_USERDATA], GFP_KERNEL_ACCOUNT); if (obj->udata == NULL) goto err_userdata; obj->udlen = nla_len(nla[NFTA_OBJ_USERDATA]); } err = nft_trans_obj_add(&ctx, NFT_MSG_NEWOBJ, obj); if (err < 0) goto err_trans; err = rhltable_insert(&nft_objname_ht, &obj->rhlhead, nft_objname_ht_params); if (err < 0) goto err_obj_ht; list_add_tail_rcu(&obj->list, &table->objects); return 0; err_obj_ht: /* queued in transaction log */ INIT_LIST_HEAD(&obj->list); return err; err_trans: kfree(obj->udata); err_userdata: kfree(obj->key.name); err_strdup: if (obj->ops->destroy) obj->ops->destroy(&ctx, obj); kfree(obj); err_init: module_put(type->owner); err_type: nft_use_dec_restore(&table->use); return err; } static int nf_tables_fill_obj_info(struct sk_buff *skb, struct net *net, u32 portid, u32 seq, int event, u32 flags, int family, const struct nft_table *table, struct nft_object *obj, bool reset) { struct nlmsghdr *nlh; nlh = nfnl_msg_put(skb, portid, seq, nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event), flags, family, NFNETLINK_V0, nft_base_seq_be16(net)); if (!nlh) goto nla_put_failure; if (nla_put_string(skb, NFTA_OBJ_TABLE, table->name) || nla_put_string(skb, NFTA_OBJ_NAME, obj->key.name) || nla_put_be32(skb, NFTA_OBJ_TYPE, htonl(obj->ops->type->type)) || nla_put_be64(skb, NFTA_OBJ_HANDLE, cpu_to_be64(obj->handle), NFTA_OBJ_PAD)) goto nla_put_failure; if (event == NFT_MSG_DELOBJ || event == NFT_MSG_DESTROYOBJ) { nlmsg_end(skb, nlh); return 0; } if (nla_put_be32(skb, NFTA_OBJ_USE, htonl(obj->use)) || nft_object_dump(skb, NFTA_OBJ_DATA, obj, reset)) goto nla_put_failure; if (obj->udata && nla_put(skb, NFTA_OBJ_USERDATA, obj->udlen, obj->udata)) goto nla_put_failure; nlmsg_end(skb, nlh); return 0; nla_put_failure: nlmsg_trim(skb, nlh); return -1; } static void audit_log_obj_reset(const struct nft_table *table, unsigned int base_seq, unsigned int nentries) { char *buf = kasprintf(GFP_ATOMIC, "%s:%u", table->name, base_seq); audit_log_nfcfg(buf, table->family, nentries, AUDIT_NFT_OP_OBJ_RESET, GFP_ATOMIC); kfree(buf); } struct nft_obj_dump_ctx { unsigned int s_idx; char *table; u32 type; bool reset; }; static int nf_tables_dump_obj(struct sk_buff *skb, struct netlink_callback *cb) { const struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh); struct nft_obj_dump_ctx *ctx = (void *)cb->ctx; struct net *net = sock_net(skb->sk); int family = nfmsg->nfgen_family; struct nftables_pernet *nft_net; const struct nft_table *table; unsigned int entries = 0; struct nft_object *obj; unsigned int idx = 0; int rc = 0; rcu_read_lock(); nft_net = nft_pernet(net); cb->seq = nft_base_seq(net); list_for_each_entry_rcu(table, &nft_net->tables, list) { if (family != NFPROTO_UNSPEC && family != table->family) continue; entries = 0; list_for_each_entry_rcu(obj, &table->objects, list) { if (!nft_is_active(net, obj)) goto cont; if (idx < ctx->s_idx) goto cont; if (ctx->table && strcmp(ctx->table, table->name)) 
goto cont; if (ctx->type != NFT_OBJECT_UNSPEC && obj->ops->type->type != ctx->type) goto cont; rc = nf_tables_fill_obj_info(skb, net, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, NFT_MSG_NEWOBJ, NLM_F_MULTI | NLM_F_APPEND, table->family, table, obj, ctx->reset); if (rc < 0) break; entries++; nl_dump_check_consistent(cb, nlmsg_hdr(skb)); cont: idx++; } if (ctx->reset && entries) audit_log_obj_reset(table, nft_base_seq(net), entries); if (rc < 0) break; } rcu_read_unlock(); ctx->s_idx = idx; return skb->len; } static int nf_tables_dumpreset_obj(struct sk_buff *skb, struct netlink_callback *cb) { struct nftables_pernet *nft_net = nft_pernet(sock_net(skb->sk)); int ret; mutex_lock(&nft_net->commit_mutex); ret = nf_tables_dump_obj(skb, cb); mutex_unlock(&nft_net->commit_mutex); return ret; } static int nf_tables_dump_obj_start(struct netlink_callback *cb) { struct nft_obj_dump_ctx *ctx = (void *)cb->ctx; const struct nlattr * const *nla = cb->data; BUILD_BUG_ON(sizeof(*ctx) > sizeof(cb->ctx)); if (nla[NFTA_OBJ_TABLE]) { ctx->table = nla_strdup(nla[NFTA_OBJ_TABLE], GFP_ATOMIC); if (!ctx->table) return -ENOMEM; } if (nla[NFTA_OBJ_TYPE]) ctx->type = ntohl(nla_get_be32(nla[NFTA_OBJ_TYPE])); return 0; } static int nf_tables_dumpreset_obj_start(struct netlink_callback *cb) { struct nft_obj_dump_ctx *ctx = (void *)cb->ctx; ctx->reset = true; return nf_tables_dump_obj_start(cb); } static int nf_tables_dump_obj_done(struct netlink_callback *cb) { struct nft_obj_dump_ctx *ctx = (void *)cb->ctx; kfree(ctx->table); return 0; } /* Caller must hold rcu read lock or transaction mutex */ static struct sk_buff * nf_tables_getobj_single(u32 portid, const struct nfnl_info *info, const struct nlattr * const nla[], bool reset) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_cur(info->net); u8 family = info->nfmsg->nfgen_family; const struct nft_table *table; struct net *net = info->net; struct nft_object *obj; struct sk_buff *skb2; u32 objtype; int err; if (!nla[NFTA_OBJ_NAME] || !nla[NFTA_OBJ_TYPE]) return ERR_PTR(-EINVAL); table = nft_table_lookup(net, nla[NFTA_OBJ_TABLE], family, genmask, 0); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_OBJ_TABLE]); return ERR_CAST(table); } objtype = ntohl(nla_get_be32(nla[NFTA_OBJ_TYPE])); obj = nft_obj_lookup(net, table, nla[NFTA_OBJ_NAME], objtype, genmask); if (IS_ERR(obj)) { NL_SET_BAD_ATTR(extack, nla[NFTA_OBJ_NAME]); return ERR_CAST(obj); } skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC); if (!skb2) return ERR_PTR(-ENOMEM); err = nf_tables_fill_obj_info(skb2, net, portid, info->nlh->nlmsg_seq, NFT_MSG_NEWOBJ, 0, family, table, obj, reset); if (err < 0) { kfree_skb(skb2); return ERR_PTR(err); } return skb2; } static int nf_tables_getobj(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { u32 portid = NETLINK_CB(skb).portid; struct sk_buff *skb2; if (info->nlh->nlmsg_flags & NLM_F_DUMP) { struct netlink_dump_control c = { .start = nf_tables_dump_obj_start, .dump = nf_tables_dump_obj, .done = nf_tables_dump_obj_done, .module = THIS_MODULE, .data = (void *)nla, }; return nft_netlink_dump_start_rcu(info->sk, skb, info->nlh, &c); } skb2 = nf_tables_getobj_single(portid, info, nla, false); if (IS_ERR(skb2)) return PTR_ERR(skb2); return nfnetlink_unicast(skb2, info->net, portid); } static int nf_tables_getobj_reset(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct nftables_pernet *nft_net = nft_pernet(info->net); u32 portid = NETLINK_CB(skb).portid; struct net *net 
= info->net; struct sk_buff *skb2; char *buf; if (info->nlh->nlmsg_flags & NLM_F_DUMP) { struct netlink_dump_control c = { .start = nf_tables_dumpreset_obj_start, .dump = nf_tables_dumpreset_obj, .done = nf_tables_dump_obj_done, .module = THIS_MODULE, .data = (void *)nla, }; return nft_netlink_dump_start_rcu(info->sk, skb, info->nlh, &c); } if (!try_module_get(THIS_MODULE)) return -EINVAL; rcu_read_unlock(); mutex_lock(&nft_net->commit_mutex); skb2 = nf_tables_getobj_single(portid, info, nla, true); mutex_unlock(&nft_net->commit_mutex); rcu_read_lock(); module_put(THIS_MODULE); if (IS_ERR(skb2)) return PTR_ERR(skb2); buf = kasprintf(GFP_ATOMIC, "%.*s:%u", nla_len(nla[NFTA_OBJ_TABLE]), (char *)nla_data(nla[NFTA_OBJ_TABLE]), nft_base_seq(net)); audit_log_nfcfg(buf, info->nfmsg->nfgen_family, 1, AUDIT_NFT_OP_OBJ_RESET, GFP_ATOMIC); kfree(buf); return nfnetlink_unicast(skb2, net, portid); } static void nft_obj_destroy(const struct nft_ctx *ctx, struct nft_object *obj) { if (obj->ops->destroy) obj->ops->destroy(ctx, obj); module_put(obj->ops->type->owner); kfree(obj->key.name); kfree(obj->udata); kfree(obj); } static int nf_tables_delobj(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_next(info->net); u8 family = info->nfmsg->nfgen_family; struct net *net = info->net; const struct nlattr *attr; struct nft_table *table; struct nft_object *obj; struct nft_ctx ctx; u32 objtype; if (!nla[NFTA_OBJ_TYPE] || (!nla[NFTA_OBJ_NAME] && !nla[NFTA_OBJ_HANDLE])) return -EINVAL; table = nft_table_lookup(net, nla[NFTA_OBJ_TABLE], family, genmask, NETLINK_CB(skb).portid); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_OBJ_TABLE]); return PTR_ERR(table); } objtype = ntohl(nla_get_be32(nla[NFTA_OBJ_TYPE])); if (nla[NFTA_OBJ_HANDLE]) { attr = nla[NFTA_OBJ_HANDLE]; obj = nft_obj_lookup_byhandle(table, attr, objtype, genmask); } else { attr = nla[NFTA_OBJ_NAME]; obj = nft_obj_lookup(net, table, attr, objtype, genmask); } if (IS_ERR(obj)) { if (PTR_ERR(obj) == -ENOENT && NFNL_MSG_TYPE(info->nlh->nlmsg_type) == NFT_MSG_DESTROYOBJ) return 0; NL_SET_BAD_ATTR(extack, attr); return PTR_ERR(obj); } if (obj->use > 0) { NL_SET_BAD_ATTR(extack, attr); return -EBUSY; } nft_ctx_init(&ctx, net, skb, info->nlh, family, table, NULL, nla); return nft_delobj(&ctx, obj); } static void __nft_obj_notify(struct net *net, const struct nft_table *table, struct nft_object *obj, u32 portid, u32 seq, int event, u16 flags, int family, int report, gfp_t gfp) { struct nftables_pernet *nft_net = nft_pernet(net); struct sk_buff *skb; int err; if (!report && !nfnetlink_has_listeners(net, NFNLGRP_NFTABLES)) return; skb = nlmsg_new(NLMSG_GOODSIZE, gfp); if (skb == NULL) goto err; err = nf_tables_fill_obj_info(skb, net, portid, seq, event, flags & (NLM_F_CREATE | NLM_F_EXCL), family, table, obj, false); if (err < 0) { kfree_skb(skb); goto err; } nft_notify_enqueue(skb, report, &nft_net->notify_list); return; err: nfnetlink_set_err(net, portid, NFNLGRP_NFTABLES, -ENOBUFS); } void nft_obj_notify(struct net *net, const struct nft_table *table, struct nft_object *obj, u32 portid, u32 seq, int event, u16 flags, int family, int report, gfp_t gfp) { char *buf = kasprintf(gfp, "%s:%u", table->name, nft_base_seq(net)); audit_log_nfcfg(buf, family, obj->handle, event == NFT_MSG_NEWOBJ ? 
AUDIT_NFT_OP_OBJ_REGISTER : AUDIT_NFT_OP_OBJ_UNREGISTER, gfp); kfree(buf); __nft_obj_notify(net, table, obj, portid, seq, event, flags, family, report, gfp); } EXPORT_SYMBOL_GPL(nft_obj_notify); static void nf_tables_obj_notify(const struct nft_ctx *ctx, struct nft_object *obj, int event) { __nft_obj_notify(ctx->net, ctx->table, obj, ctx->portid, ctx->seq, event, ctx->flags, ctx->family, ctx->report, GFP_KERNEL); } /* * Flow tables */ void nft_register_flowtable_type(struct nf_flowtable_type *type) { nfnl_lock(NFNL_SUBSYS_NFTABLES); list_add_tail_rcu(&type->list, &nf_tables_flowtables); nfnl_unlock(NFNL_SUBSYS_NFTABLES); } EXPORT_SYMBOL_GPL(nft_register_flowtable_type); void nft_unregister_flowtable_type(struct nf_flowtable_type *type) { nfnl_lock(NFNL_SUBSYS_NFTABLES); list_del_rcu(&type->list); nfnl_unlock(NFNL_SUBSYS_NFTABLES); } EXPORT_SYMBOL_GPL(nft_unregister_flowtable_type); static const struct nla_policy nft_flowtable_policy[NFTA_FLOWTABLE_MAX + 1] = { [NFTA_FLOWTABLE_TABLE] = { .type = NLA_STRING, .len = NFT_NAME_MAXLEN - 1 }, [NFTA_FLOWTABLE_NAME] = { .type = NLA_STRING, .len = NFT_NAME_MAXLEN - 1 }, [NFTA_FLOWTABLE_HOOK] = { .type = NLA_NESTED }, [NFTA_FLOWTABLE_HANDLE] = { .type = NLA_U64 }, [NFTA_FLOWTABLE_FLAGS] = { .type = NLA_U32 }, }; struct nft_flowtable *nft_flowtable_lookup(const struct net *net, const struct nft_table *table, const struct nlattr *nla, u8 genmask) { struct nft_flowtable *flowtable; list_for_each_entry_rcu(flowtable, &table->flowtables, list, lockdep_commit_lock_is_held(net)) { if (!nla_strcmp(nla, flowtable->name) && nft_active_genmask(flowtable, genmask)) return flowtable; } return ERR_PTR(-ENOENT); } EXPORT_SYMBOL_GPL(nft_flowtable_lookup); void nf_tables_deactivate_flowtable(const struct nft_ctx *ctx, struct nft_flowtable *flowtable, enum nft_trans_phase phase) { switch (phase) { case NFT_TRANS_PREPARE_ERROR: case NFT_TRANS_PREPARE: case NFT_TRANS_ABORT: case NFT_TRANS_RELEASE: nft_use_dec(&flowtable->use); fallthrough; default: return; } } EXPORT_SYMBOL_GPL(nf_tables_deactivate_flowtable); static struct nft_flowtable * nft_flowtable_lookup_byhandle(const struct nft_table *table, const struct nlattr *nla, u8 genmask) { struct nft_flowtable *flowtable; list_for_each_entry(flowtable, &table->flowtables, list) { if (be64_to_cpu(nla_get_be64(nla)) == flowtable->handle && nft_active_genmask(flowtable, genmask)) return flowtable; } return ERR_PTR(-ENOENT); } struct nft_flowtable_hook { u32 num; int priority; struct list_head list; }; static const struct nla_policy nft_flowtable_hook_policy[NFTA_FLOWTABLE_HOOK_MAX + 1] = { [NFTA_FLOWTABLE_HOOK_NUM] = { .type = NLA_U32 }, [NFTA_FLOWTABLE_HOOK_PRIORITY] = { .type = NLA_U32 }, [NFTA_FLOWTABLE_HOOK_DEVS] = { .type = NLA_NESTED }, }; static int nft_flowtable_parse_hook(const struct nft_ctx *ctx, const struct nlattr * const nla[], struct nft_flowtable_hook *flowtable_hook, struct nft_flowtable *flowtable, struct netlink_ext_ack *extack, bool add) { struct nlattr *tb[NFTA_FLOWTABLE_HOOK_MAX + 1]; struct nf_hook_ops *ops; struct nft_hook *hook; int hooknum, priority; int err; INIT_LIST_HEAD(&flowtable_hook->list); err = nla_parse_nested_deprecated(tb, NFTA_FLOWTABLE_HOOK_MAX, nla[NFTA_FLOWTABLE_HOOK], nft_flowtable_hook_policy, NULL); if (err < 0) return err; if (add) { if (!tb[NFTA_FLOWTABLE_HOOK_NUM] || !tb[NFTA_FLOWTABLE_HOOK_PRIORITY]) { NL_SET_BAD_ATTR(extack, nla[NFTA_FLOWTABLE_NAME]); return -ENOENT; } hooknum = ntohl(nla_get_be32(tb[NFTA_FLOWTABLE_HOOK_NUM])); if (hooknum != NF_NETDEV_INGRESS) return 
-EOPNOTSUPP; priority = ntohl(nla_get_be32(tb[NFTA_FLOWTABLE_HOOK_PRIORITY])); flowtable_hook->priority = priority; flowtable_hook->num = hooknum; } else { if (tb[NFTA_FLOWTABLE_HOOK_NUM]) { hooknum = ntohl(nla_get_be32(tb[NFTA_FLOWTABLE_HOOK_NUM])); if (hooknum != flowtable->hooknum) return -EOPNOTSUPP; } if (tb[NFTA_FLOWTABLE_HOOK_PRIORITY]) { priority = ntohl(nla_get_be32(tb[NFTA_FLOWTABLE_HOOK_PRIORITY])); if (priority != flowtable->data.priority) return -EOPNOTSUPP; } flowtable_hook->priority = flowtable->data.priority; flowtable_hook->num = flowtable->hooknum; } if (tb[NFTA_FLOWTABLE_HOOK_DEVS]) { err = nf_tables_parse_netdev_hooks(ctx->net, tb[NFTA_FLOWTABLE_HOOK_DEVS], &flowtable_hook->list, extack); if (err < 0) return err; } list_for_each_entry(hook, &flowtable_hook->list, list) { list_for_each_entry(ops, &hook->ops_list, list) { ops->pf = NFPROTO_NETDEV; ops->hooknum = flowtable_hook->num; ops->priority = flowtable_hook->priority; ops->priv = &flowtable->data; ops->hook = flowtable->data.type->hook; ops->hook_ops_type = NF_HOOK_OP_NFT_FT; } } return err; } /* call under rcu_read_lock */ static const struct nf_flowtable_type *__nft_flowtable_type_get(u8 family) { const struct nf_flowtable_type *type; list_for_each_entry_rcu(type, &nf_tables_flowtables, list) { if (family == type->family) return type; } return NULL; } static const struct nf_flowtable_type * nft_flowtable_type_get(struct net *net, u8 family) { const struct nf_flowtable_type *type; rcu_read_lock(); type = __nft_flowtable_type_get(family); if (type != NULL && try_module_get(type->owner)) { rcu_read_unlock(); return type; } rcu_read_unlock(); lockdep_nfnl_nft_mutex_not_held(); #ifdef CONFIG_MODULES if (type == NULL) { if (nft_request_module(net, "nf-flowtable-%u", family) == -EAGAIN) return ERR_PTR(-EAGAIN); } #endif return ERR_PTR(-ENOENT); } /* Only called from error and netdev event paths. 
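 * It removes the netfilter hook first and then unbinds the flow block via
 * the flowtable type's ->setup() callback (FLOW_BLOCK_UNBIND).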
*/ static void nft_unregister_flowtable_ops(struct net *net, struct nft_flowtable *flowtable, struct nf_hook_ops *ops) { nf_unregister_net_hook(net, ops); flowtable->data.type->setup(&flowtable->data, ops->dev, FLOW_BLOCK_UNBIND); } static void __nft_unregister_flowtable_net_hooks(struct net *net, struct nft_flowtable *flowtable, struct list_head *hook_list, bool release_netdev) { struct nft_hook *hook, *next; struct nf_hook_ops *ops; list_for_each_entry_safe(hook, next, hook_list, list) { list_for_each_entry(ops, &hook->ops_list, list) nft_unregister_flowtable_ops(net, flowtable, ops); if (release_netdev) { list_del(&hook->list); nft_netdev_hook_free_rcu(hook); } } } static void nft_unregister_flowtable_net_hooks(struct net *net, struct nft_flowtable *flowtable, struct list_head *hook_list) { __nft_unregister_flowtable_net_hooks(net, flowtable, hook_list, false); } static int nft_register_flowtable_ops(struct net *net, struct nft_flowtable *flowtable, struct nf_hook_ops *ops) { int err; err = flowtable->data.type->setup(&flowtable->data, ops->dev, FLOW_BLOCK_BIND); if (err < 0) return err; err = nf_register_net_hook(net, ops); if (!err) return 0; flowtable->data.type->setup(&flowtable->data, ops->dev, FLOW_BLOCK_UNBIND); return err; } static int nft_register_flowtable_net_hooks(struct net *net, struct nft_table *table, struct list_head *hook_list, struct nft_flowtable *flowtable) { struct nft_hook *hook, *next; struct nft_flowtable *ft; struct nf_hook_ops *ops; int err, i = 0; list_for_each_entry(hook, hook_list, list) { list_for_each_entry(ft, &table->flowtables, list) { if (!nft_is_active_next(net, ft)) continue; if (nft_hook_list_find(&ft->hook_list, hook)) { err = -EEXIST; goto err_unregister_net_hooks; } } list_for_each_entry(ops, &hook->ops_list, list) { err = nft_register_flowtable_ops(net, flowtable, ops); if (err < 0) goto err_unregister_net_hooks; i++; } } return 0; err_unregister_net_hooks: list_for_each_entry_safe(hook, next, hook_list, list) { list_for_each_entry(ops, &hook->ops_list, list) { if (i-- <= 0) break; nft_unregister_flowtable_ops(net, flowtable, ops); } list_del_rcu(&hook->list); nft_netdev_hook_free_rcu(hook); } return err; } static void nft_hooks_destroy(struct list_head *hook_list) { struct nft_hook *hook, *next; list_for_each_entry_safe(hook, next, hook_list, list) { list_del_rcu(&hook->list); nft_netdev_hook_free_rcu(hook); } } static int nft_flowtable_update(struct nft_ctx *ctx, const struct nlmsghdr *nlh, struct nft_flowtable *flowtable, struct netlink_ext_ack *extack) { const struct nlattr * const *nla = ctx->nla; struct nft_flowtable_hook flowtable_hook; struct nftables_pernet *nft_net; struct nft_hook *hook, *next; struct nf_hook_ops *ops; struct nft_trans *trans; bool unregister = false; u32 flags; int err; err = nft_flowtable_parse_hook(ctx, nla, &flowtable_hook, flowtable, extack, false); if (err < 0) return err; list_for_each_entry_safe(hook, next, &flowtable_hook.list, list) { if (nft_hook_list_find(&flowtable->hook_list, hook)) { list_del(&hook->list); nft_netdev_hook_free(hook); continue; } nft_net = nft_pernet(ctx->net); list_for_each_entry(trans, &nft_net->commit_list, list) { if (trans->msg_type != NFT_MSG_NEWFLOWTABLE || trans->table != ctx->table || !nft_trans_flowtable_update(trans)) continue; if (nft_hook_list_find(&nft_trans_flowtable_hooks(trans), hook)) { err = -EEXIST; goto err_flowtable_update_hook; } } } if (nla[NFTA_FLOWTABLE_FLAGS]) { flags = ntohl(nla_get_be32(nla[NFTA_FLOWTABLE_FLAGS])); if (flags & ~NFT_FLOWTABLE_MASK) { err = 
-EOPNOTSUPP; goto err_flowtable_update_hook; } if ((flowtable->data.flags & NFT_FLOWTABLE_HW_OFFLOAD) ^ (flags & NFT_FLOWTABLE_HW_OFFLOAD)) { err = -EOPNOTSUPP; goto err_flowtable_update_hook; } } else { flags = flowtable->data.flags; } err = nft_register_flowtable_net_hooks(ctx->net, ctx->table, &flowtable_hook.list, flowtable); if (err < 0) goto err_flowtable_update_hook; trans = nft_trans_alloc(ctx, NFT_MSG_NEWFLOWTABLE, sizeof(struct nft_trans_flowtable)); if (!trans) { unregister = true; err = -ENOMEM; goto err_flowtable_update_hook; } nft_trans_flowtable_flags(trans) = flags; nft_trans_flowtable(trans) = flowtable; nft_trans_flowtable_update(trans) = true; INIT_LIST_HEAD(&nft_trans_flowtable_hooks(trans)); list_splice(&flowtable_hook.list, &nft_trans_flowtable_hooks(trans)); nft_trans_commit_list_add_tail(ctx->net, trans); return 0; err_flowtable_update_hook: list_for_each_entry_safe(hook, next, &flowtable_hook.list, list) { if (unregister) { list_for_each_entry(ops, &hook->ops_list, list) nft_unregister_flowtable_ops(ctx->net, flowtable, ops); } list_del_rcu(&hook->list); nft_netdev_hook_free_rcu(hook); } return err; } static int nf_tables_newflowtable(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; struct nft_flowtable_hook flowtable_hook; u8 genmask = nft_genmask_next(info->net); u8 family = info->nfmsg->nfgen_family; const struct nf_flowtable_type *type; struct nft_flowtable *flowtable; struct net *net = info->net; struct nft_table *table; struct nft_trans *trans; struct nft_ctx ctx; int err; if (!nla[NFTA_FLOWTABLE_TABLE] || !nla[NFTA_FLOWTABLE_NAME] || !nla[NFTA_FLOWTABLE_HOOK]) return -EINVAL; table = nft_table_lookup(net, nla[NFTA_FLOWTABLE_TABLE], family, genmask, NETLINK_CB(skb).portid); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_FLOWTABLE_TABLE]); return PTR_ERR(table); } flowtable = nft_flowtable_lookup(net, table, nla[NFTA_FLOWTABLE_NAME], genmask); if (IS_ERR(flowtable)) { err = PTR_ERR(flowtable); if (err != -ENOENT) { NL_SET_BAD_ATTR(extack, nla[NFTA_FLOWTABLE_NAME]); return err; } } else { if (info->nlh->nlmsg_flags & NLM_F_EXCL) { NL_SET_BAD_ATTR(extack, nla[NFTA_FLOWTABLE_NAME]); return -EEXIST; } nft_ctx_init(&ctx, net, skb, info->nlh, family, table, NULL, nla); return nft_flowtable_update(&ctx, info->nlh, flowtable, extack); } nft_ctx_init(&ctx, net, skb, info->nlh, family, table, NULL, nla); if (!nft_use_inc(&table->use)) return -EMFILE; flowtable = kzalloc(sizeof(*flowtable), GFP_KERNEL_ACCOUNT); if (!flowtable) { err = -ENOMEM; goto flowtable_alloc; } flowtable->table = table; flowtable->handle = nf_tables_alloc_handle(table); INIT_LIST_HEAD(&flowtable->hook_list); flowtable->name = nla_strdup(nla[NFTA_FLOWTABLE_NAME], GFP_KERNEL_ACCOUNT); if (!flowtable->name) { err = -ENOMEM; goto err1; } type = nft_flowtable_type_get(net, family); if (IS_ERR(type)) { err = PTR_ERR(type); goto err2; } if (nla[NFTA_FLOWTABLE_FLAGS]) { flowtable->data.flags = ntohl(nla_get_be32(nla[NFTA_FLOWTABLE_FLAGS])); if (flowtable->data.flags & ~NFT_FLOWTABLE_MASK) { err = -EOPNOTSUPP; goto err3; } } write_pnet(&flowtable->data.net, net); flowtable->data.type = type; err = type->init(&flowtable->data); if (err < 0) goto err3; err = nft_flowtable_parse_hook(&ctx, nla, &flowtable_hook, flowtable, extack, true); if (err < 0) goto err_flowtable_parse_hooks; list_splice(&flowtable_hook.list, &flowtable->hook_list); flowtable->data.priority = flowtable_hook.priority; flowtable->hooknum = 
flowtable_hook.num; trans = nft_trans_flowtable_add(&ctx, NFT_MSG_NEWFLOWTABLE, flowtable); if (IS_ERR(trans)) { err = PTR_ERR(trans); goto err_flowtable_trans; } /* This must be LAST to ensure no packets are walking over this flowtable. */ err = nft_register_flowtable_net_hooks(ctx.net, table, &flowtable->hook_list, flowtable); if (err < 0) goto err_flowtable_hooks; list_add_tail_rcu(&flowtable->list, &table->flowtables); return 0; err_flowtable_hooks: nft_trans_destroy(trans); err_flowtable_trans: nft_hooks_destroy(&flowtable->hook_list); err_flowtable_parse_hooks: flowtable->data.type->free(&flowtable->data); err3: module_put(type->owner); err2: kfree(flowtable->name); err1: kfree(flowtable); flowtable_alloc: nft_use_dec_restore(&table->use); return err; } static void nft_flowtable_hook_release(struct nft_flowtable_hook *flowtable_hook) { struct nft_hook *this, *next; list_for_each_entry_safe(this, next, &flowtable_hook->list, list) { list_del(&this->list); nft_netdev_hook_free(this); } } static int nft_delflowtable_hook(struct nft_ctx *ctx, struct nft_flowtable *flowtable, struct netlink_ext_ack *extack) { const struct nlattr * const *nla = ctx->nla; struct nft_flowtable_hook flowtable_hook; LIST_HEAD(flowtable_del_list); struct nft_hook *this, *hook; struct nft_trans *trans; int err; err = nft_flowtable_parse_hook(ctx, nla, &flowtable_hook, flowtable, extack, false); if (err < 0) return err; list_for_each_entry(this, &flowtable_hook.list, list) { hook = nft_hook_list_find(&flowtable->hook_list, this); if (!hook) { err = -ENOENT; goto err_flowtable_del_hook; } list_move(&hook->list, &flowtable_del_list); } trans = nft_trans_alloc(ctx, NFT_MSG_DELFLOWTABLE, sizeof(struct nft_trans_flowtable)); if (!trans) { err = -ENOMEM; goto err_flowtable_del_hook; } nft_trans_flowtable(trans) = flowtable; nft_trans_flowtable_update(trans) = true; INIT_LIST_HEAD(&nft_trans_flowtable_hooks(trans)); list_splice(&flowtable_del_list, &nft_trans_flowtable_hooks(trans)); nft_flowtable_hook_release(&flowtable_hook); nft_trans_commit_list_add_tail(ctx->net, trans); return 0; err_flowtable_del_hook: list_splice(&flowtable_del_list, &flowtable->hook_list); nft_flowtable_hook_release(&flowtable_hook); return err; } static int nf_tables_delflowtable(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_next(info->net); u8 family = info->nfmsg->nfgen_family; struct nft_flowtable *flowtable; struct net *net = info->net; const struct nlattr *attr; struct nft_table *table; struct nft_ctx ctx; if (!nla[NFTA_FLOWTABLE_TABLE] || (!nla[NFTA_FLOWTABLE_NAME] && !nla[NFTA_FLOWTABLE_HANDLE])) return -EINVAL; table = nft_table_lookup(net, nla[NFTA_FLOWTABLE_TABLE], family, genmask, NETLINK_CB(skb).portid); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_FLOWTABLE_TABLE]); return PTR_ERR(table); } if (nla[NFTA_FLOWTABLE_HANDLE]) { attr = nla[NFTA_FLOWTABLE_HANDLE]; flowtable = nft_flowtable_lookup_byhandle(table, attr, genmask); } else { attr = nla[NFTA_FLOWTABLE_NAME]; flowtable = nft_flowtable_lookup(net, table, attr, genmask); } if (IS_ERR(flowtable)) { if (PTR_ERR(flowtable) == -ENOENT && NFNL_MSG_TYPE(info->nlh->nlmsg_type) == NFT_MSG_DESTROYFLOWTABLE) return 0; NL_SET_BAD_ATTR(extack, attr); return PTR_ERR(flowtable); } nft_ctx_init(&ctx, net, skb, info->nlh, family, table, NULL, nla); if (nla[NFTA_FLOWTABLE_HOOK]) return nft_delflowtable_hook(&ctx, flowtable, extack); if (flowtable->use > 0) { 
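		/* flowtable is still in use, refuse removal and report -EBUSY */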
NL_SET_BAD_ATTR(extack, attr); return -EBUSY; } return nft_delflowtable(&ctx, flowtable); } static int nf_tables_fill_flowtable_info(struct sk_buff *skb, struct net *net, u32 portid, u32 seq, int event, u32 flags, int family, struct nft_flowtable *flowtable, struct list_head *hook_list) { struct nlattr *nest, *nest_devs; struct nft_hook *hook; struct nlmsghdr *nlh; nlh = nfnl_msg_put(skb, portid, seq, nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event), flags, family, NFNETLINK_V0, nft_base_seq_be16(net)); if (!nlh) goto nla_put_failure; if (nla_put_string(skb, NFTA_FLOWTABLE_TABLE, flowtable->table->name) || nla_put_string(skb, NFTA_FLOWTABLE_NAME, flowtable->name) || nla_put_be64(skb, NFTA_FLOWTABLE_HANDLE, cpu_to_be64(flowtable->handle), NFTA_FLOWTABLE_PAD)) goto nla_put_failure; if (!hook_list && (event == NFT_MSG_DELFLOWTABLE || event == NFT_MSG_DESTROYFLOWTABLE)) { nlmsg_end(skb, nlh); return 0; } if (nla_put_be32(skb, NFTA_FLOWTABLE_USE, htonl(flowtable->use)) || nla_put_be32(skb, NFTA_FLOWTABLE_FLAGS, htonl(flowtable->data.flags))) goto nla_put_failure; nest = nla_nest_start_noflag(skb, NFTA_FLOWTABLE_HOOK); if (!nest) goto nla_put_failure; if (nla_put_be32(skb, NFTA_FLOWTABLE_HOOK_NUM, htonl(flowtable->hooknum)) || nla_put_be32(skb, NFTA_FLOWTABLE_HOOK_PRIORITY, htonl(flowtable->data.priority))) goto nla_put_failure; nest_devs = nla_nest_start_noflag(skb, NFTA_FLOWTABLE_HOOK_DEVS); if (!nest_devs) goto nla_put_failure; if (!hook_list) hook_list = &flowtable->hook_list; list_for_each_entry_rcu(hook, hook_list, list, lockdep_commit_lock_is_held(net)) { if (nft_nla_put_hook_dev(skb, hook)) goto nla_put_failure; } nla_nest_end(skb, nest_devs); nla_nest_end(skb, nest); nlmsg_end(skb, nlh); return 0; nla_put_failure: nlmsg_trim(skb, nlh); return -1; } struct nft_flowtable_filter { char *table; }; static int nf_tables_dump_flowtable(struct sk_buff *skb, struct netlink_callback *cb) { const struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh); struct nft_flowtable_filter *filter = cb->data; unsigned int idx = 0, s_idx = cb->args[0]; struct net *net = sock_net(skb->sk); int family = nfmsg->nfgen_family; struct nft_flowtable *flowtable; struct nftables_pernet *nft_net; const struct nft_table *table; rcu_read_lock(); nft_net = nft_pernet(net); cb->seq = nft_base_seq(net); list_for_each_entry_rcu(table, &nft_net->tables, list) { if (family != NFPROTO_UNSPEC && family != table->family) continue; list_for_each_entry_rcu(flowtable, &table->flowtables, list) { if (!nft_is_active(net, flowtable)) goto cont; if (idx < s_idx) goto cont; if (idx > s_idx) memset(&cb->args[1], 0, sizeof(cb->args) - sizeof(cb->args[0])); if (filter && filter->table && strcmp(filter->table, table->name)) goto cont; if (nf_tables_fill_flowtable_info(skb, net, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, NFT_MSG_NEWFLOWTABLE, NLM_F_MULTI | NLM_F_APPEND, table->family, flowtable, NULL) < 0) goto done; nl_dump_check_consistent(cb, nlmsg_hdr(skb)); cont: idx++; } } done: rcu_read_unlock(); cb->args[0] = idx; return skb->len; } static int nf_tables_dump_flowtable_start(struct netlink_callback *cb) { const struct nlattr * const *nla = cb->data; struct nft_flowtable_filter *filter = NULL; if (nla[NFTA_FLOWTABLE_TABLE]) { filter = kzalloc(sizeof(*filter), GFP_ATOMIC); if (!filter) return -ENOMEM; filter->table = nla_strdup(nla[NFTA_FLOWTABLE_TABLE], GFP_ATOMIC); if (!filter->table) { kfree(filter); return -ENOMEM; } } cb->data = filter; return 0; } static int nf_tables_dump_flowtable_done(struct netlink_callback *cb) { struct 
nft_flowtable_filter *filter = cb->data; if (!filter) return 0; kfree(filter->table); kfree(filter); return 0; } /* called with rcu_read_lock held */ static int nf_tables_getflowtable(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_cur(info->net); u8 family = info->nfmsg->nfgen_family; struct nft_flowtable *flowtable; const struct nft_table *table; struct net *net = info->net; struct sk_buff *skb2; int err; if (info->nlh->nlmsg_flags & NLM_F_DUMP) { struct netlink_dump_control c = { .start = nf_tables_dump_flowtable_start, .dump = nf_tables_dump_flowtable, .done = nf_tables_dump_flowtable_done, .module = THIS_MODULE, .data = (void *)nla, }; return nft_netlink_dump_start_rcu(info->sk, skb, info->nlh, &c); } if (!nla[NFTA_FLOWTABLE_NAME]) return -EINVAL; table = nft_table_lookup(net, nla[NFTA_FLOWTABLE_TABLE], family, genmask, 0); if (IS_ERR(table)) { NL_SET_BAD_ATTR(extack, nla[NFTA_FLOWTABLE_TABLE]); return PTR_ERR(table); } flowtable = nft_flowtable_lookup(net, table, nla[NFTA_FLOWTABLE_NAME], genmask); if (IS_ERR(flowtable)) { NL_SET_BAD_ATTR(extack, nla[NFTA_FLOWTABLE_NAME]); return PTR_ERR(flowtable); } skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC); if (!skb2) return -ENOMEM; err = nf_tables_fill_flowtable_info(skb2, net, NETLINK_CB(skb).portid, info->nlh->nlmsg_seq, NFT_MSG_NEWFLOWTABLE, 0, family, flowtable, NULL); if (err < 0) goto err_fill_flowtable_info; return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid); err_fill_flowtable_info: kfree_skb(skb2); return err; } static void nf_tables_flowtable_notify(struct nft_ctx *ctx, struct nft_flowtable *flowtable, struct list_head *hook_list, int event) { struct nftables_pernet *nft_net = nft_pernet(ctx->net); struct sk_buff *skb; u16 flags = 0; int err; if (!ctx->report && !nfnetlink_has_listeners(ctx->net, NFNLGRP_NFTABLES)) return; skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL); if (skb == NULL) goto err; if (ctx->flags & (NLM_F_CREATE | NLM_F_EXCL)) flags |= ctx->flags & (NLM_F_CREATE | NLM_F_EXCL); err = nf_tables_fill_flowtable_info(skb, ctx->net, ctx->portid, ctx->seq, event, flags, ctx->family, flowtable, hook_list); if (err < 0) { kfree_skb(skb); goto err; } nft_notify_enqueue(skb, ctx->report, &nft_net->notify_list); return; err: nfnetlink_set_err(ctx->net, ctx->portid, NFNLGRP_NFTABLES, -ENOBUFS); } static void nf_tables_flowtable_destroy(struct nft_flowtable *flowtable) { struct nft_hook *hook, *next; flowtable->data.type->free(&flowtable->data); list_for_each_entry_safe(hook, next, &flowtable->hook_list, list) { list_del_rcu(&hook->list); nft_netdev_hook_free_rcu(hook); } kfree(flowtable->name); module_put(flowtable->data.type->owner); kfree(flowtable); } static int nf_tables_fill_gen_info(struct sk_buff *skb, struct net *net, u32 portid, u32 seq) { struct nlmsghdr *nlh; char buf[TASK_COMM_LEN]; int event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, NFT_MSG_NEWGEN); nlh = nfnl_msg_put(skb, portid, seq, event, 0, AF_UNSPEC, NFNETLINK_V0, nft_base_seq_be16(net)); if (!nlh) goto nla_put_failure; if (nla_put_be32(skb, NFTA_GEN_ID, htonl(nft_base_seq(net))) || nla_put_be32(skb, NFTA_GEN_PROC_PID, htonl(task_pid_nr(current))) || nla_put_string(skb, NFTA_GEN_PROC_NAME, get_task_comm(buf, current))) goto nla_put_failure; nlmsg_end(skb, nlh); return 0; nla_put_failure: nlmsg_trim(skb, nlh); return -EMSGSIZE; } struct nf_hook_ops *nft_hook_find_ops(const struct nft_hook *hook, const struct net_device *dev) { struct nf_hook_ops *ops; 
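	/* non-RCU list walk; nft_hook_find_ops_rcu() below is the variant for RCU readers */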
list_for_each_entry(ops, &hook->ops_list, list) { if (ops->dev == dev) return ops; } return NULL; } EXPORT_SYMBOL_GPL(nft_hook_find_ops); struct nf_hook_ops *nft_hook_find_ops_rcu(const struct nft_hook *hook, const struct net_device *dev) { struct nf_hook_ops *ops; list_for_each_entry_rcu(ops, &hook->ops_list, list) { if (ops->dev == dev) return ops; } return NULL; } EXPORT_SYMBOL_GPL(nft_hook_find_ops_rcu); static int nft_flowtable_event(unsigned long event, struct net_device *dev, struct nft_flowtable *flowtable, bool changename) { struct nf_hook_ops *ops; struct nft_hook *hook; bool match; list_for_each_entry(hook, &flowtable->hook_list, list) { ops = nft_hook_find_ops(hook, dev); match = !strncmp(hook->ifname, dev->name, hook->ifnamelen); switch (event) { case NETDEV_UNREGISTER: /* NOP if not found or new name still matching */ if (!ops || (changename && match)) continue; /* flow_offload_netdev_event() cleans up entries for us. */ nft_unregister_flowtable_ops(dev_net(dev), flowtable, ops); list_del_rcu(&ops->list); kfree_rcu(ops, rcu); break; case NETDEV_REGISTER: /* NOP if not matching or already registered */ if (!match || (changename && ops)) continue; ops = kzalloc(sizeof(struct nf_hook_ops), GFP_KERNEL_ACCOUNT); if (!ops) return 1; ops->pf = NFPROTO_NETDEV; ops->hooknum = flowtable->hooknum; ops->priority = flowtable->data.priority; ops->priv = &flowtable->data; ops->hook = flowtable->data.type->hook; ops->hook_ops_type = NF_HOOK_OP_NFT_FT; ops->dev = dev; if (nft_register_flowtable_ops(dev_net(dev), flowtable, ops)) { kfree(ops); return 1; } list_add_tail_rcu(&ops->list, &hook->ops_list); break; } break; } return 0; } static int __nf_tables_flowtable_event(unsigned long event, struct net_device *dev, bool changename) { struct nftables_pernet *nft_net = nft_pernet(dev_net(dev)); struct nft_flowtable *flowtable; struct nft_table *table; list_for_each_entry(table, &nft_net->tables, list) { list_for_each_entry(flowtable, &table->flowtables, list) { if (nft_flowtable_event(event, dev, flowtable, changename)) return 1; } } return 0; } static int nf_tables_flowtable_event(struct notifier_block *this, unsigned long event, void *ptr) { struct net_device *dev = netdev_notifier_info_to_dev(ptr); struct nftables_pernet *nft_net; int ret = NOTIFY_DONE; struct net *net; if (event != NETDEV_REGISTER && event != NETDEV_UNREGISTER && event != NETDEV_CHANGENAME) return NOTIFY_DONE; net = dev_net(dev); nft_net = nft_pernet(net); mutex_lock(&nft_net->commit_mutex); if (event == NETDEV_CHANGENAME) { if (__nf_tables_flowtable_event(NETDEV_REGISTER, dev, true)) { ret = NOTIFY_BAD; goto out_unlock; } __nf_tables_flowtable_event(NETDEV_UNREGISTER, dev, true); } else if (__nf_tables_flowtable_event(event, dev, false)) { ret = NOTIFY_BAD; } out_unlock: mutex_unlock(&nft_net->commit_mutex); return ret; } static struct notifier_block nf_tables_flowtable_notifier = { .notifier_call = nf_tables_flowtable_event, }; static void nf_tables_gen_notify(struct net *net, struct sk_buff *skb, int event) { struct nlmsghdr *nlh = nlmsg_hdr(skb); struct sk_buff *skb2; int err; if (!nlmsg_report(nlh) && !nfnetlink_has_listeners(net, NFNLGRP_NFTABLES)) return; skb2 = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL); if (skb2 == NULL) goto err; err = nf_tables_fill_gen_info(skb2, net, NETLINK_CB(skb).portid, nlh->nlmsg_seq); if (err < 0) { kfree_skb(skb2); goto err; } nfnetlink_send(skb2, net, NETLINK_CB(skb).portid, NFNLGRP_NFTABLES, nlmsg_report(nlh), GFP_KERNEL); return; err: nfnetlink_set_err(net, NETLINK_CB(skb).portid, 
NFNLGRP_NFTABLES, -ENOBUFS); } static int nf_tables_getgen(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const nla[]) { struct sk_buff *skb2; int err; skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC); if (skb2 == NULL) return -ENOMEM; err = nf_tables_fill_gen_info(skb2, info->net, NETLINK_CB(skb).portid, info->nlh->nlmsg_seq); if (err < 0) goto err_fill_gen_info; return nfnetlink_unicast(skb2, info->net, NETLINK_CB(skb).portid); err_fill_gen_info: kfree_skb(skb2); return err; } static const struct nfnl_callback nf_tables_cb[NFT_MSG_MAX] = { [NFT_MSG_NEWTABLE] = { .call = nf_tables_newtable, .type = NFNL_CB_BATCH, .attr_count = NFTA_TABLE_MAX, .policy = nft_table_policy, }, [NFT_MSG_GETTABLE] = { .call = nf_tables_gettable, .type = NFNL_CB_RCU, .attr_count = NFTA_TABLE_MAX, .policy = nft_table_policy, }, [NFT_MSG_DELTABLE] = { .call = nf_tables_deltable, .type = NFNL_CB_BATCH, .attr_count = NFTA_TABLE_MAX, .policy = nft_table_policy, }, [NFT_MSG_DESTROYTABLE] = { .call = nf_tables_deltable, .type = NFNL_CB_BATCH, .attr_count = NFTA_TABLE_MAX, .policy = nft_table_policy, }, [NFT_MSG_NEWCHAIN] = { .call = nf_tables_newchain, .type = NFNL_CB_BATCH, .attr_count = NFTA_CHAIN_MAX, .policy = nft_chain_policy, }, [NFT_MSG_GETCHAIN] = { .call = nf_tables_getchain, .type = NFNL_CB_RCU, .attr_count = NFTA_CHAIN_MAX, .policy = nft_chain_policy, }, [NFT_MSG_DELCHAIN] = { .call = nf_tables_delchain, .type = NFNL_CB_BATCH, .attr_count = NFTA_CHAIN_MAX, .policy = nft_chain_policy, }, [NFT_MSG_DESTROYCHAIN] = { .call = nf_tables_delchain, .type = NFNL_CB_BATCH, .attr_count = NFTA_CHAIN_MAX, .policy = nft_chain_policy, }, [NFT_MSG_NEWRULE] = { .call = nf_tables_newrule, .type = NFNL_CB_BATCH, .attr_count = NFTA_RULE_MAX, .policy = nft_rule_policy, }, [NFT_MSG_GETRULE] = { .call = nf_tables_getrule, .type = NFNL_CB_RCU, .attr_count = NFTA_RULE_MAX, .policy = nft_rule_policy, }, [NFT_MSG_GETRULE_RESET] = { .call = nf_tables_getrule_reset, .type = NFNL_CB_RCU, .attr_count = NFTA_RULE_MAX, .policy = nft_rule_policy, }, [NFT_MSG_DELRULE] = { .call = nf_tables_delrule, .type = NFNL_CB_BATCH, .attr_count = NFTA_RULE_MAX, .policy = nft_rule_policy, }, [NFT_MSG_DESTROYRULE] = { .call = nf_tables_delrule, .type = NFNL_CB_BATCH, .attr_count = NFTA_RULE_MAX, .policy = nft_rule_policy, }, [NFT_MSG_NEWSET] = { .call = nf_tables_newset, .type = NFNL_CB_BATCH, .attr_count = NFTA_SET_MAX, .policy = nft_set_policy, }, [NFT_MSG_GETSET] = { .call = nf_tables_getset, .type = NFNL_CB_RCU, .attr_count = NFTA_SET_MAX, .policy = nft_set_policy, }, [NFT_MSG_DELSET] = { .call = nf_tables_delset, .type = NFNL_CB_BATCH, .attr_count = NFTA_SET_MAX, .policy = nft_set_policy, }, [NFT_MSG_DESTROYSET] = { .call = nf_tables_delset, .type = NFNL_CB_BATCH, .attr_count = NFTA_SET_MAX, .policy = nft_set_policy, }, [NFT_MSG_NEWSETELEM] = { .call = nf_tables_newsetelem, .type = NFNL_CB_BATCH, .attr_count = NFTA_SET_ELEM_LIST_MAX, .policy = nft_set_elem_list_policy, }, [NFT_MSG_GETSETELEM] = { .call = nf_tables_getsetelem, .type = NFNL_CB_RCU, .attr_count = NFTA_SET_ELEM_LIST_MAX, .policy = nft_set_elem_list_policy, }, [NFT_MSG_GETSETELEM_RESET] = { .call = nf_tables_getsetelem_reset, .type = NFNL_CB_RCU, .attr_count = NFTA_SET_ELEM_LIST_MAX, .policy = nft_set_elem_list_policy, }, [NFT_MSG_DELSETELEM] = { .call = nf_tables_delsetelem, .type = NFNL_CB_BATCH, .attr_count = NFTA_SET_ELEM_LIST_MAX, .policy = nft_set_elem_list_policy, }, [NFT_MSG_DESTROYSETELEM] = { .call = nf_tables_delsetelem, .type = NFNL_CB_BATCH, .attr_count = 
NFTA_SET_ELEM_LIST_MAX, .policy = nft_set_elem_list_policy, }, [NFT_MSG_GETGEN] = { .call = nf_tables_getgen, .type = NFNL_CB_RCU, }, [NFT_MSG_NEWOBJ] = { .call = nf_tables_newobj, .type = NFNL_CB_BATCH, .attr_count = NFTA_OBJ_MAX, .policy = nft_obj_policy, }, [NFT_MSG_GETOBJ] = { .call = nf_tables_getobj, .type = NFNL_CB_RCU, .attr_count = NFTA_OBJ_MAX, .policy = nft_obj_policy, }, [NFT_MSG_DELOBJ] = { .call = nf_tables_delobj, .type = NFNL_CB_BATCH, .attr_count = NFTA_OBJ_MAX, .policy = nft_obj_policy, }, [NFT_MSG_DESTROYOBJ] = { .call = nf_tables_delobj, .type = NFNL_CB_BATCH, .attr_count = NFTA_OBJ_MAX, .policy = nft_obj_policy, }, [NFT_MSG_GETOBJ_RESET] = { .call = nf_tables_getobj_reset, .type = NFNL_CB_RCU, .attr_count = NFTA_OBJ_MAX, .policy = nft_obj_policy, }, [NFT_MSG_NEWFLOWTABLE] = { .call = nf_tables_newflowtable, .type = NFNL_CB_BATCH, .attr_count = NFTA_FLOWTABLE_MAX, .policy = nft_flowtable_policy, }, [NFT_MSG_GETFLOWTABLE] = { .call = nf_tables_getflowtable, .type = NFNL_CB_RCU, .attr_count = NFTA_FLOWTABLE_MAX, .policy = nft_flowtable_policy, }, [NFT_MSG_DELFLOWTABLE] = { .call = nf_tables_delflowtable, .type = NFNL_CB_BATCH, .attr_count = NFTA_FLOWTABLE_MAX, .policy = nft_flowtable_policy, }, [NFT_MSG_DESTROYFLOWTABLE] = { .call = nf_tables_delflowtable, .type = NFNL_CB_BATCH, .attr_count = NFTA_FLOWTABLE_MAX, .policy = nft_flowtable_policy, }, }; static int nf_tables_validate(struct net *net) { struct nftables_pernet *nft_net = nft_pernet(net); struct nft_table *table; list_for_each_entry(table, &nft_net->tables, list) { switch (table->validate_state) { case NFT_VALIDATE_SKIP: continue; case NFT_VALIDATE_NEED: nft_validate_state_update(table, NFT_VALIDATE_DO); fallthrough; case NFT_VALIDATE_DO: if (nft_table_validate(net, table) < 0) return -EAGAIN; nft_validate_state_update(table, NFT_VALIDATE_SKIP); break; } } return 0; } /* a drop policy has to be deferred until all rules have been activated, * otherwise a large ruleset that contains a drop-policy base chain will * cause all packets to get dropped until the full transaction has been * processed. * * We defer the drop policy until the transaction has been finalized. 
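 * nft_chain_commit_drop_policy() below therefore only flips the base chain
 * policy to NF_DROP once all new rules have been committed.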
*/ static void nft_chain_commit_drop_policy(struct nft_trans_chain *trans) { struct nft_base_chain *basechain; if (trans->policy != NF_DROP) return; if (!nft_is_base_chain(trans->chain)) return; basechain = nft_base_chain(trans->chain); basechain->policy = NF_DROP; } static void nft_chain_commit_update(struct nft_trans_chain *trans) { struct nft_table *table = trans->nft_trans_binding.nft_trans.table; struct nft_base_chain *basechain; if (trans->name) { rhltable_remove(&table->chains_ht, &trans->chain->rhlhead, nft_chain_ht_params); swap(trans->chain->name, trans->name); rhltable_insert_key(&table->chains_ht, trans->chain->name, &trans->chain->rhlhead, nft_chain_ht_params); } if (!nft_is_base_chain(trans->chain)) return; nft_chain_stats_replace(trans); basechain = nft_base_chain(trans->chain); switch (trans->policy) { case NF_DROP: case NF_ACCEPT: basechain->policy = trans->policy; break; } } static void nft_obj_commit_update(const struct nft_ctx *ctx, struct nft_trans *trans) { struct nft_object *newobj; struct nft_object *obj; obj = nft_trans_obj(trans); newobj = nft_trans_obj_newobj(trans); if (WARN_ON_ONCE(!obj->ops->update)) return; obj->ops->update(obj, newobj); nft_obj_destroy(ctx, newobj); } static void nft_commit_release(struct nft_trans *trans) { struct nft_ctx ctx = { .net = trans->net, }; nft_ctx_update(&ctx, trans); switch (trans->msg_type) { case NFT_MSG_DELTABLE: case NFT_MSG_DESTROYTABLE: nf_tables_table_destroy(trans->table); break; case NFT_MSG_NEWCHAIN: free_percpu(nft_trans_chain_stats(trans)); kfree(nft_trans_chain_name(trans)); break; case NFT_MSG_DELCHAIN: case NFT_MSG_DESTROYCHAIN: if (nft_trans_chain_update(trans)) nft_hooks_destroy(&nft_trans_chain_hooks(trans)); else nf_tables_chain_destroy(nft_trans_chain(trans)); break; case NFT_MSG_DELRULE: case NFT_MSG_DESTROYRULE: nf_tables_rule_destroy(&ctx, nft_trans_rule(trans)); break; case NFT_MSG_DELSET: case NFT_MSG_DESTROYSET: nft_set_destroy(&ctx, nft_trans_set(trans)); break; case NFT_MSG_DELSETELEM: case NFT_MSG_DESTROYSETELEM: nft_trans_elems_destroy(&ctx, nft_trans_container_elem(trans)); break; case NFT_MSG_DELOBJ: case NFT_MSG_DESTROYOBJ: nft_obj_destroy(&ctx, nft_trans_obj(trans)); break; case NFT_MSG_DELFLOWTABLE: case NFT_MSG_DESTROYFLOWTABLE: if (nft_trans_flowtable_update(trans)) nft_hooks_destroy(&nft_trans_flowtable_hooks(trans)); else nf_tables_flowtable_destroy(nft_trans_flowtable(trans)); break; } if (trans->put_net) put_net(trans->net); kfree(trans); } static void nf_tables_trans_destroy_work(struct work_struct *w) { struct nftables_pernet *nft_net = container_of(w, struct nftables_pernet, destroy_work); struct nft_trans *trans, *next; LIST_HEAD(head); spin_lock(&nf_tables_destroy_list_lock); list_splice_init(&nft_net->destroy_list, &head); spin_unlock(&nf_tables_destroy_list_lock); if (list_empty(&head)) return; synchronize_rcu(); list_for_each_entry_safe(trans, next, &head, list) { nft_trans_list_del(trans); nft_commit_release(trans); } } void nf_tables_trans_destroy_flush_work(struct net *net) { struct nftables_pernet *nft_net = nft_pernet(net); flush_work(&nft_net->destroy_work); } EXPORT_SYMBOL_GPL(nf_tables_trans_destroy_flush_work); static bool nft_expr_reduce(struct nft_regs_track *track, const struct nft_expr *expr) { return false; } static int nf_tables_commit_chain_prepare(struct net *net, struct nft_chain *chain) { const struct nft_expr *expr, *last; struct nft_regs_track track = {}; unsigned int size, data_size; void *data, *data_boundary; struct nft_rule_dp *prule; struct nft_rule 
*rule; /* already handled or inactive chain? */ if (chain->blob_next || !nft_is_active_next(net, chain)) return 0; data_size = 0; list_for_each_entry(rule, &chain->rules, list) { if (nft_is_active_next(net, rule)) { data_size += sizeof(*prule) + rule->dlen; if (data_size > INT_MAX) return -ENOMEM; } } chain->blob_next = nf_tables_chain_alloc_rules(chain, data_size); if (!chain->blob_next) return -ENOMEM; data = (void *)chain->blob_next->data; data_boundary = data + data_size; size = 0; list_for_each_entry(rule, &chain->rules, list) { if (!nft_is_active_next(net, rule)) continue; prule = (struct nft_rule_dp *)data; data += offsetof(struct nft_rule_dp, data); if (WARN_ON_ONCE(data > data_boundary)) return -ENOMEM; size = 0; track.last = nft_expr_last(rule); nft_rule_for_each_expr(expr, last, rule) { track.cur = expr; if (nft_expr_reduce(&track, expr)) { expr = track.cur; continue; } if (WARN_ON_ONCE(data + size + expr->ops->size > data_boundary)) return -ENOMEM; memcpy(data + size, expr, expr->ops->size); size += expr->ops->size; } if (WARN_ON_ONCE(size >= 1 << 12)) return -ENOMEM; prule->handle = rule->handle; prule->dlen = size; prule->is_last = 0; data += size; size = 0; chain->blob_next->size += (unsigned long)(data - (void *)prule); } if (WARN_ON_ONCE(data > data_boundary)) return -ENOMEM; prule = (struct nft_rule_dp *)data; nft_last_rule(chain, prule); return 0; } static void nf_tables_commit_chain_prepare_cancel(struct net *net) { struct nftables_pernet *nft_net = nft_pernet(net); struct nft_trans *trans, *next; list_for_each_entry_safe(trans, next, &nft_net->commit_list, list) { if (trans->msg_type == NFT_MSG_NEWRULE || trans->msg_type == NFT_MSG_DELRULE) { struct nft_chain *chain = nft_trans_rule_chain(trans); kvfree(chain->blob_next); chain->blob_next = NULL; } } } static void __nf_tables_commit_chain_free_rules(struct rcu_head *h) { struct nft_rule_dp_last *l = container_of(h, struct nft_rule_dp_last, h); kvfree(l->blob); } static void nf_tables_commit_chain_free_rules_old(struct nft_rule_blob *blob) { struct nft_rule_dp_last *last; /* last rule trailer is after end marker */ last = (void *)blob + sizeof(*blob) + blob->size; last->blob = blob; call_rcu(&last->h, __nf_tables_commit_chain_free_rules); } static void nf_tables_commit_chain(struct net *net, struct nft_chain *chain) { struct nft_rule_blob *g0, *g1; bool next_genbit; next_genbit = nft_gencursor_next(net); g0 = rcu_dereference_protected(chain->blob_gen_0, lockdep_commit_lock_is_held(net)); g1 = rcu_dereference_protected(chain->blob_gen_1, lockdep_commit_lock_is_held(net)); /* No changes to this chain? */ if (chain->blob_next == NULL) { /* chain had no change in last or next generation */ if (g0 == g1) return; /* * chain had no change in this generation; make sure next * one uses same rules as current generation. 
*/ if (next_genbit) { rcu_assign_pointer(chain->blob_gen_1, g0); nf_tables_commit_chain_free_rules_old(g1); } else { rcu_assign_pointer(chain->blob_gen_0, g1); nf_tables_commit_chain_free_rules_old(g0); } return; } if (next_genbit) rcu_assign_pointer(chain->blob_gen_1, chain->blob_next); else rcu_assign_pointer(chain->blob_gen_0, chain->blob_next); chain->blob_next = NULL; if (g0 == g1) return; if (next_genbit) nf_tables_commit_chain_free_rules_old(g1); else nf_tables_commit_chain_free_rules_old(g0); } static void nft_obj_del(struct nft_object *obj) { rhltable_remove(&nft_objname_ht, &obj->rhlhead, nft_objname_ht_params); list_del_rcu(&obj->list); } void nft_chain_del(struct nft_chain *chain) { struct nft_table *table = chain->table; WARN_ON_ONCE(rhltable_remove(&table->chains_ht, &chain->rhlhead, nft_chain_ht_params)); list_del_rcu(&chain->list); } static void nft_trans_gc_setelem_remove(struct nft_ctx *ctx, struct nft_trans_gc *trans) { struct nft_elem_priv **priv = trans->priv; unsigned int i; for (i = 0; i < trans->count; i++) { nft_setelem_data_deactivate(ctx->net, trans->set, priv[i]); nft_setelem_remove(ctx->net, trans->set, priv[i]); } } void nft_trans_gc_destroy(struct nft_trans_gc *trans) { nft_set_put(trans->set); put_net(trans->net); kfree(trans); } static void nft_trans_gc_trans_free(struct rcu_head *rcu) { struct nft_elem_priv *elem_priv; struct nft_trans_gc *trans; struct nft_ctx ctx = {}; unsigned int i; trans = container_of(rcu, struct nft_trans_gc, rcu); ctx.net = read_pnet(&trans->set->net); for (i = 0; i < trans->count; i++) { elem_priv = trans->priv[i]; if (!nft_setelem_is_catchall(trans->set, elem_priv)) atomic_dec(&trans->set->nelems); nf_tables_set_elem_destroy(&ctx, trans->set, elem_priv); } nft_trans_gc_destroy(trans); } static bool nft_trans_gc_work_done(struct nft_trans_gc *trans) { struct nftables_pernet *nft_net; struct nft_ctx ctx = {}; nft_net = nft_pernet(trans->net); mutex_lock(&nft_net->commit_mutex); /* Check for race with transaction, otherwise this batch refers to * stale objects that might not be there anymore. Skip transaction if * set has been destroyed from control plane transaction in case gc * worker loses race. 
*/ if (READ_ONCE(nft_net->gc_seq) != trans->seq || trans->set->dead) { mutex_unlock(&nft_net->commit_mutex); return false; } ctx.net = trans->net; ctx.table = trans->set->table; nft_trans_gc_setelem_remove(&ctx, trans); mutex_unlock(&nft_net->commit_mutex); return true; } static void nft_trans_gc_work(struct work_struct *work) { struct nft_trans_gc *trans, *next; LIST_HEAD(trans_gc_list); spin_lock(&nf_tables_gc_list_lock); list_splice_init(&nf_tables_gc_list, &trans_gc_list); spin_unlock(&nf_tables_gc_list_lock); list_for_each_entry_safe(trans, next, &trans_gc_list, list) { list_del(&trans->list); if (!nft_trans_gc_work_done(trans)) { nft_trans_gc_destroy(trans); continue; } call_rcu(&trans->rcu, nft_trans_gc_trans_free); } } struct nft_trans_gc *nft_trans_gc_alloc(struct nft_set *set, unsigned int gc_seq, gfp_t gfp) { struct net *net = read_pnet(&set->net); struct nft_trans_gc *trans; trans = kzalloc(sizeof(*trans), gfp); if (!trans) return NULL; trans->net = maybe_get_net(net); if (!trans->net) { kfree(trans); return NULL; } refcount_inc(&set->refs); trans->set = set; trans->seq = gc_seq; return trans; } void nft_trans_gc_elem_add(struct nft_trans_gc *trans, void *priv) { trans->priv[trans->count++] = priv; } static void nft_trans_gc_queue_work(struct nft_trans_gc *trans) { spin_lock(&nf_tables_gc_list_lock); list_add_tail(&trans->list, &nf_tables_gc_list); spin_unlock(&nf_tables_gc_list_lock); schedule_work(&trans_gc_work); } static int nft_trans_gc_space(struct nft_trans_gc *trans) { return NFT_TRANS_GC_BATCHCOUNT - trans->count; } struct nft_trans_gc *nft_trans_gc_queue_async(struct nft_trans_gc *gc, unsigned int gc_seq, gfp_t gfp) { struct nft_set *set; if (nft_trans_gc_space(gc)) return gc; set = gc->set; nft_trans_gc_queue_work(gc); return nft_trans_gc_alloc(set, gc_seq, gfp); } void nft_trans_gc_queue_async_done(struct nft_trans_gc *trans) { if (trans->count == 0) { nft_trans_gc_destroy(trans); return; } nft_trans_gc_queue_work(trans); } struct nft_trans_gc *nft_trans_gc_queue_sync(struct nft_trans_gc *gc, gfp_t gfp) { struct nft_set *set; if (WARN_ON_ONCE(!lockdep_commit_lock_is_held(gc->net))) return NULL; if (nft_trans_gc_space(gc)) return gc; set = gc->set; call_rcu(&gc->rcu, nft_trans_gc_trans_free); return nft_trans_gc_alloc(set, 0, gfp); } void nft_trans_gc_queue_sync_done(struct nft_trans_gc *trans) { WARN_ON_ONCE(!lockdep_commit_lock_is_held(trans->net)); if (trans->count == 0) { nft_trans_gc_destroy(trans); return; } call_rcu(&trans->rcu, nft_trans_gc_trans_free); } struct nft_trans_gc *nft_trans_gc_catchall_async(struct nft_trans_gc *gc, unsigned int gc_seq) { struct nft_set_elem_catchall *catchall; const struct nft_set *set = gc->set; struct nft_set_ext *ext; list_for_each_entry_rcu(catchall, &set->catchall_list, list) { ext = nft_set_elem_ext(set, catchall->elem); if (!nft_set_elem_expired(ext)) continue; if (nft_set_elem_is_dead(ext)) goto dead_elem; nft_set_elem_dead(ext); dead_elem: gc = nft_trans_gc_queue_async(gc, gc_seq, GFP_ATOMIC); if (!gc) return NULL; nft_trans_gc_elem_add(gc, catchall->elem); } return gc; } struct nft_trans_gc *nft_trans_gc_catchall_sync(struct nft_trans_gc *gc) { struct nft_set_elem_catchall *catchall, *next; u64 tstamp = nft_net_tstamp(gc->net); const struct nft_set *set = gc->set; struct nft_elem_priv *elem_priv; struct nft_set_ext *ext; WARN_ON_ONCE(!lockdep_commit_lock_is_held(gc->net)); list_for_each_entry_safe(catchall, next, &set->catchall_list, list) { ext = nft_set_elem_ext(set, catchall->elem); if (!__nft_set_elem_expired(ext, 
tstamp)) continue; gc = nft_trans_gc_queue_sync(gc, GFP_KERNEL); if (!gc) return NULL; elem_priv = catchall->elem; nft_setelem_data_deactivate(gc->net, gc->set, elem_priv); nft_setelem_catchall_destroy(catchall); nft_trans_gc_elem_add(gc, elem_priv); } return gc; } static void nf_tables_module_autoload_cleanup(struct net *net) { struct nftables_pernet *nft_net = nft_pernet(net); struct nft_module_request *req, *next; WARN_ON_ONCE(!list_empty(&nft_net->commit_list)); list_for_each_entry_safe(req, next, &nft_net->module_list, list) { WARN_ON_ONCE(!req->done); list_del(&req->list); kfree(req); } } static void nf_tables_commit_release(struct net *net) { struct nftables_pernet *nft_net = nft_pernet(net); struct nft_trans *trans; /* all side effects have to be made visible. * For example, if a chain named 'foo' has been deleted, a * new transaction must not find it anymore. * * Memory reclaim happens asynchronously from work queue * to prevent expensive synchronize_rcu() in commit phase. */ if (list_empty(&nft_net->commit_list)) { nf_tables_module_autoload_cleanup(net); mutex_unlock(&nft_net->commit_mutex); return; } trans = list_last_entry(&nft_net->commit_list, struct nft_trans, list); get_net(trans->net); WARN_ON_ONCE(trans->put_net); trans->put_net = true; spin_lock(&nf_tables_destroy_list_lock); list_splice_tail_init(&nft_net->commit_list, &nft_net->destroy_list); spin_unlock(&nf_tables_destroy_list_lock); nf_tables_module_autoload_cleanup(net); schedule_work(&nft_net->destroy_work); mutex_unlock(&nft_net->commit_mutex); } static void nft_commit_notify(struct net *net, u32 portid) { struct nftables_pernet *nft_net = nft_pernet(net); struct sk_buff *batch_skb = NULL, *nskb, *skb; unsigned char *data; int len; list_for_each_entry_safe(skb, nskb, &nft_net->notify_list, list) { if (!batch_skb) { new_batch: batch_skb = skb; len = NLMSG_GOODSIZE - skb->len; list_del(&skb->list); continue; } len -= skb->len; if (len > 0 && NFT_CB(skb).report == NFT_CB(batch_skb).report) { data = skb_put(batch_skb, skb->len); memcpy(data, skb->data, skb->len); list_del(&skb->list); kfree_skb(skb); continue; } nfnetlink_send(batch_skb, net, portid, NFNLGRP_NFTABLES, NFT_CB(batch_skb).report, GFP_KERNEL); goto new_batch; } if (batch_skb) { nfnetlink_send(batch_skb, net, portid, NFNLGRP_NFTABLES, NFT_CB(batch_skb).report, GFP_KERNEL); } WARN_ON_ONCE(!list_empty(&nft_net->notify_list)); } static int nf_tables_commit_audit_alloc(struct list_head *adl, struct nft_table *table) { struct nft_audit_data *adp; list_for_each_entry(adp, adl, list) { if (adp->table == table) return 0; } adp = kzalloc(sizeof(*adp), GFP_KERNEL); if (!adp) return -ENOMEM; adp->table = table; list_add(&adp->list, adl); return 0; } static void nf_tables_commit_audit_free(struct list_head *adl) { struct nft_audit_data *adp, *adn; list_for_each_entry_safe(adp, adn, adl, list) { list_del(&adp->list); kfree(adp); } } /* nft audit emits the number of elements that get added/removed/updated, * so NEW/DELSETELEM needs to increment based on the total elem count. 
*/ static unsigned int nf_tables_commit_audit_entrycount(const struct nft_trans *trans) { switch (trans->msg_type) { case NFT_MSG_NEWSETELEM: case NFT_MSG_DELSETELEM: return nft_trans_container_elem(trans)->nelems; } return 1; } static void nf_tables_commit_audit_collect(struct list_head *adl, const struct nft_trans *trans, u32 op) { const struct nft_table *table = trans->table; struct nft_audit_data *adp; list_for_each_entry(adp, adl, list) { if (adp->table == table) goto found; } WARN_ONCE(1, "table=%s not expected in commit list", table->name); return; found: adp->entries += nf_tables_commit_audit_entrycount(trans); if (!adp->op || adp->op > op) adp->op = op; } #define AUNFTABLENAMELEN (NFT_TABLE_MAXNAMELEN + 22) static void nf_tables_commit_audit_log(struct list_head *adl, u32 generation) { struct nft_audit_data *adp, *adn; char aubuf[AUNFTABLENAMELEN]; list_for_each_entry_safe(adp, adn, adl, list) { snprintf(aubuf, AUNFTABLENAMELEN, "%s:%u", adp->table->name, generation); audit_log_nfcfg(aubuf, adp->table->family, adp->entries, nft2audit_op[adp->op], GFP_KERNEL); list_del(&adp->list); kfree(adp); } } static void nft_set_commit_update(struct list_head *set_update_list) { struct nft_set *set, *next; list_for_each_entry_safe(set, next, set_update_list, pending_update) { list_del_init(&set->pending_update); if (!set->ops->commit || set->dead) continue; set->ops->commit(set); } } static unsigned int nft_gc_seq_begin(struct nftables_pernet *nft_net) { unsigned int gc_seq; /* Bump gc counter, it becomes odd, this is the busy mark. */ gc_seq = READ_ONCE(nft_net->gc_seq); WRITE_ONCE(nft_net->gc_seq, ++gc_seq); return gc_seq; } static void nft_gc_seq_end(struct nftables_pernet *nft_net, unsigned int gc_seq) { WRITE_ONCE(nft_net->gc_seq, ++gc_seq); } static int nf_tables_commit(struct net *net, struct sk_buff *skb) { struct nftables_pernet *nft_net = nft_pernet(net); const struct nlmsghdr *nlh = nlmsg_hdr(skb); struct nft_trans_binding *trans_binding; struct nft_trans *trans, *next; unsigned int base_seq, gc_seq; LIST_HEAD(set_update_list); struct nft_trans_elem *te; struct nft_chain *chain; struct nft_table *table; struct nft_ctx ctx; LIST_HEAD(adl); int err; if (list_empty(&nft_net->commit_list)) { mutex_unlock(&nft_net->commit_mutex); return 0; } nft_ctx_init(&ctx, net, skb, nlh, NFPROTO_UNSPEC, NULL, NULL, NULL); list_for_each_entry(trans_binding, &nft_net->binding_list, binding_list) { trans = &trans_binding->nft_trans; switch (trans->msg_type) { case NFT_MSG_NEWSET: if (!nft_trans_set_update(trans) && nft_set_is_anonymous(nft_trans_set(trans)) && !nft_trans_set_bound(trans)) { pr_warn_once("nftables ruleset with unbound set\n"); return -EINVAL; } break; case NFT_MSG_NEWCHAIN: if (!nft_trans_chain_update(trans) && nft_chain_binding(nft_trans_chain(trans)) && !nft_trans_chain_bound(trans)) { pr_warn_once("nftables ruleset with unbound chain\n"); return -EINVAL; } break; default: WARN_ONCE(1, "Unhandled bind type %d", trans->msg_type); break; } } /* 0. Validate ruleset, otherwise roll back for error reporting. */ if (nf_tables_validate(net) < 0) { nft_net->validate_state = NFT_VALIDATE_DO; return -EAGAIN; } err = nft_flow_rule_offload_commit(net); if (err < 0) return err; /* 1. 
Allocate space for next generation rules_gen_X[] */ list_for_each_entry_safe(trans, next, &nft_net->commit_list, list) { struct nft_table *table = trans->table; int ret; ret = nf_tables_commit_audit_alloc(&adl, table); if (ret) { nf_tables_commit_chain_prepare_cancel(net); nf_tables_commit_audit_free(&adl); return ret; } if (trans->msg_type == NFT_MSG_NEWRULE || trans->msg_type == NFT_MSG_DELRULE) { chain = nft_trans_rule_chain(trans); ret = nf_tables_commit_chain_prepare(net, chain); if (ret < 0) { nf_tables_commit_chain_prepare_cancel(net); nf_tables_commit_audit_free(&adl); return ret; } } } /* step 2. Make rules_gen_X visible to packet path */ list_for_each_entry(table, &nft_net->tables, list) { list_for_each_entry(chain, &table->chains, list) nf_tables_commit_chain(net, chain); } /* * Bump generation counter, invalidate any dump in progress. * Cannot fail after this point. */ base_seq = nft_base_seq(net); while (++base_seq == 0) ; /* pairs with smp_load_acquire in nft_lookup_eval */ smp_store_release(&net->nft.base_seq, base_seq); gc_seq = nft_gc_seq_begin(nft_net); /* step 3. Start new generation, rules_gen_X now in use. */ net->nft.gencursor = nft_gencursor_next(net); list_for_each_entry_safe(trans, next, &nft_net->commit_list, list) { struct nft_table *table = trans->table; nft_ctx_update(&ctx, trans); nf_tables_commit_audit_collect(&adl, trans, trans->msg_type); switch (trans->msg_type) { case NFT_MSG_NEWTABLE: if (nft_trans_table_update(trans)) { if (!(table->flags & __NFT_TABLE_F_UPDATE)) { nft_trans_destroy(trans); break; } if (table->flags & NFT_TABLE_F_DORMANT) nf_tables_table_disable(net, table); table->flags &= ~__NFT_TABLE_F_UPDATE; } else { nft_clear(net, table); } nf_tables_table_notify(&ctx, NFT_MSG_NEWTABLE); nft_trans_destroy(trans); break; case NFT_MSG_DELTABLE: case NFT_MSG_DESTROYTABLE: list_del_rcu(&table->list); nf_tables_table_notify(&ctx, trans->msg_type); break; case NFT_MSG_NEWCHAIN: if (nft_trans_chain_update(trans)) { nft_chain_commit_update(nft_trans_container_chain(trans)); nf_tables_chain_notify(&ctx, NFT_MSG_NEWCHAIN, &nft_trans_chain_hooks(trans)); list_splice(&nft_trans_chain_hooks(trans), &nft_trans_basechain(trans)->hook_list); /* trans destroyed after rcu grace period */ } else { nft_chain_commit_drop_policy(nft_trans_container_chain(trans)); nft_clear(net, nft_trans_chain(trans)); nf_tables_chain_notify(&ctx, NFT_MSG_NEWCHAIN, NULL); nft_trans_destroy(trans); } break; case NFT_MSG_DELCHAIN: case NFT_MSG_DESTROYCHAIN: if (nft_trans_chain_update(trans)) { nf_tables_chain_notify(&ctx, NFT_MSG_DELCHAIN, &nft_trans_chain_hooks(trans)); if (!(table->flags & NFT_TABLE_F_DORMANT)) { nft_netdev_unregister_hooks(net, &nft_trans_chain_hooks(trans), true); } } else { nft_chain_del(nft_trans_chain(trans)); nf_tables_chain_notify(&ctx, NFT_MSG_DELCHAIN, NULL); nf_tables_unregister_hook(ctx.net, ctx.table, nft_trans_chain(trans)); } break; case NFT_MSG_NEWRULE: nft_clear(net, nft_trans_rule(trans)); nf_tables_rule_notify(&ctx, nft_trans_rule(trans), NFT_MSG_NEWRULE); if (nft_trans_rule_chain(trans)->flags & NFT_CHAIN_HW_OFFLOAD) nft_flow_rule_destroy(nft_trans_flow_rule(trans)); nft_trans_destroy(trans); break; case NFT_MSG_DELRULE: case NFT_MSG_DESTROYRULE: list_del_rcu(&nft_trans_rule(trans)->list); nf_tables_rule_notify(&ctx, nft_trans_rule(trans), trans->msg_type); nft_rule_expr_deactivate(&ctx, nft_trans_rule(trans), NFT_TRANS_COMMIT); if (nft_trans_rule_chain(trans)->flags & NFT_CHAIN_HW_OFFLOAD) nft_flow_rule_destroy(nft_trans_flow_rule(trans)); break; 
case NFT_MSG_NEWSET: list_del(&nft_trans_container_set(trans)->list_trans_newset); if (nft_trans_set_update(trans)) { struct nft_set *set = nft_trans_set(trans); WRITE_ONCE(set->timeout, nft_trans_set_timeout(trans)); WRITE_ONCE(set->gc_int, nft_trans_set_gc_int(trans)); if (nft_trans_set_size(trans)) WRITE_ONCE(set->size, nft_trans_set_size(trans)); } else { nft_clear(net, nft_trans_set(trans)); /* This avoids hitting -EBUSY when deleting the table * from the transaction. */ if (nft_set_is_anonymous(nft_trans_set(trans)) && !list_empty(&nft_trans_set(trans)->bindings)) nft_use_dec(&table->use); } nf_tables_set_notify(&ctx, nft_trans_set(trans), NFT_MSG_NEWSET, GFP_KERNEL); nft_trans_destroy(trans); break; case NFT_MSG_DELSET: case NFT_MSG_DESTROYSET: nft_trans_set(trans)->dead = 1; list_del_rcu(&nft_trans_set(trans)->list); nf_tables_set_notify(&ctx, nft_trans_set(trans), trans->msg_type, GFP_KERNEL); break; case NFT_MSG_NEWSETELEM: te = nft_trans_container_elem(trans); nft_trans_elems_add(&ctx, te); if (te->set->ops->commit && list_empty(&te->set->pending_update)) { list_add_tail(&te->set->pending_update, &set_update_list); } nft_trans_destroy(trans); break; case NFT_MSG_DELSETELEM: case NFT_MSG_DESTROYSETELEM: te = nft_trans_container_elem(trans); nft_trans_elems_remove(&ctx, te); if (te->set->ops->commit && list_empty(&te->set->pending_update)) { list_add_tail(&te->set->pending_update, &set_update_list); } break; case NFT_MSG_NEWOBJ: if (nft_trans_obj_update(trans)) { nft_obj_commit_update(&ctx, trans); nf_tables_obj_notify(&ctx, nft_trans_obj(trans), NFT_MSG_NEWOBJ); } else { nft_clear(net, nft_trans_obj(trans)); nf_tables_obj_notify(&ctx, nft_trans_obj(trans), NFT_MSG_NEWOBJ); nft_trans_destroy(trans); } break; case NFT_MSG_DELOBJ: case NFT_MSG_DESTROYOBJ: nft_obj_del(nft_trans_obj(trans)); nf_tables_obj_notify(&ctx, nft_trans_obj(trans), trans->msg_type); break; case NFT_MSG_NEWFLOWTABLE: if (nft_trans_flowtable_update(trans)) { nft_trans_flowtable(trans)->data.flags = nft_trans_flowtable_flags(trans); nf_tables_flowtable_notify(&ctx, nft_trans_flowtable(trans), &nft_trans_flowtable_hooks(trans), NFT_MSG_NEWFLOWTABLE); list_splice(&nft_trans_flowtable_hooks(trans), &nft_trans_flowtable(trans)->hook_list); } else { nft_clear(net, nft_trans_flowtable(trans)); nf_tables_flowtable_notify(&ctx, nft_trans_flowtable(trans), NULL, NFT_MSG_NEWFLOWTABLE); } nft_trans_destroy(trans); break; case NFT_MSG_DELFLOWTABLE: case NFT_MSG_DESTROYFLOWTABLE: if (nft_trans_flowtable_update(trans)) { nf_tables_flowtable_notify(&ctx, nft_trans_flowtable(trans), &nft_trans_flowtable_hooks(trans), trans->msg_type); nft_unregister_flowtable_net_hooks(net, nft_trans_flowtable(trans), &nft_trans_flowtable_hooks(trans)); } else { list_del_rcu(&nft_trans_flowtable(trans)->list); nf_tables_flowtable_notify(&ctx, nft_trans_flowtable(trans), NULL, trans->msg_type); nft_unregister_flowtable_net_hooks(net, nft_trans_flowtable(trans), &nft_trans_flowtable(trans)->hook_list); } break; } } nft_set_commit_update(&set_update_list); nft_commit_notify(net, NETLINK_CB(skb).portid); nf_tables_gen_notify(net, skb, NFT_MSG_NEWGEN); nf_tables_commit_audit_log(&adl, nft_base_seq(net)); nft_gc_seq_end(nft_net, gc_seq); nft_net->validate_state = NFT_VALIDATE_SKIP; nf_tables_commit_release(net); return 0; } static void nf_tables_module_autoload(struct net *net) { struct nftables_pernet *nft_net = nft_pernet(net); struct nft_module_request *req, *next; LIST_HEAD(module_list); list_splice_init(&nft_net->module_list, &module_list); 
mutex_unlock(&nft_net->commit_mutex); list_for_each_entry_safe(req, next, &module_list, list) { request_module("%s", req->module); req->done = true; } mutex_lock(&nft_net->commit_mutex); list_splice(&module_list, &nft_net->module_list); } static void nf_tables_abort_release(struct nft_trans *trans) { struct nft_ctx ctx = { }; nft_ctx_update(&ctx, trans); switch (trans->msg_type) { case NFT_MSG_NEWTABLE: nf_tables_table_destroy(trans->table); break; case NFT_MSG_NEWCHAIN: if (nft_trans_chain_update(trans)) nft_hooks_destroy(&nft_trans_chain_hooks(trans)); else nf_tables_chain_destroy(nft_trans_chain(trans)); break; case NFT_MSG_NEWRULE: nf_tables_rule_destroy(&ctx, nft_trans_rule(trans)); break; case NFT_MSG_NEWSET: nft_set_destroy(&ctx, nft_trans_set(trans)); break; case NFT_MSG_NEWSETELEM: nft_trans_set_elem_destroy(&ctx, nft_trans_container_elem(trans)); break; case NFT_MSG_NEWOBJ: nft_obj_destroy(&ctx, nft_trans_obj(trans)); break; case NFT_MSG_NEWFLOWTABLE: if (nft_trans_flowtable_update(trans)) nft_hooks_destroy(&nft_trans_flowtable_hooks(trans)); else nf_tables_flowtable_destroy(nft_trans_flowtable(trans)); break; } kfree(trans); } static void nft_set_abort_update(struct list_head *set_update_list) { struct nft_set *set, *next; list_for_each_entry_safe(set, next, set_update_list, pending_update) { list_del_init(&set->pending_update); if (!set->ops->abort) continue; set->ops->abort(set); } } static int __nf_tables_abort(struct net *net, enum nfnl_abort_action action) { struct nftables_pernet *nft_net = nft_pernet(net); struct nft_trans *trans, *next; LIST_HEAD(set_update_list); struct nft_trans_elem *te; struct nft_ctx ctx = { .net = net, }; int err = 0; if (action == NFNL_ABORT_VALIDATE && nf_tables_validate(net) < 0) err = -EAGAIN; list_for_each_entry_safe_reverse(trans, next, &nft_net->commit_list, list) { struct nft_table *table = trans->table; nft_ctx_update(&ctx, trans); switch (trans->msg_type) { case NFT_MSG_NEWTABLE: if (nft_trans_table_update(trans)) { if (!(table->flags & __NFT_TABLE_F_UPDATE)) { nft_trans_destroy(trans); break; } if (table->flags & __NFT_TABLE_F_WAS_DORMANT) { nf_tables_table_disable(net, table); table->flags |= NFT_TABLE_F_DORMANT; } else if (table->flags & __NFT_TABLE_F_WAS_AWAKEN) { table->flags &= ~NFT_TABLE_F_DORMANT; } if (table->flags & __NFT_TABLE_F_WAS_ORPHAN) { table->flags &= ~NFT_TABLE_F_OWNER; table->nlpid = 0; } table->flags &= ~__NFT_TABLE_F_UPDATE; nft_trans_destroy(trans); } else { list_del_rcu(&table->list); } break; case NFT_MSG_DELTABLE: case NFT_MSG_DESTROYTABLE: nft_clear(trans->net, table); nft_trans_destroy(trans); break; case NFT_MSG_NEWCHAIN: if (nft_trans_chain_update(trans)) { if (!(table->flags & NFT_TABLE_F_DORMANT)) { nft_netdev_unregister_hooks(net, &nft_trans_chain_hooks(trans), true); } free_percpu(nft_trans_chain_stats(trans)); kfree(nft_trans_chain_name(trans)); nft_trans_destroy(trans); } else { if (nft_trans_chain_bound(trans)) { nft_trans_destroy(trans); break; } nft_use_dec_restore(&table->use); nft_chain_del(nft_trans_chain(trans)); nf_tables_unregister_hook(trans->net, table, nft_trans_chain(trans)); } break; case NFT_MSG_DELCHAIN: case NFT_MSG_DESTROYCHAIN: if (nft_trans_chain_update(trans)) { list_splice(&nft_trans_chain_hooks(trans), &nft_trans_basechain(trans)->hook_list); } else { nft_use_inc_restore(&table->use); nft_clear(trans->net, nft_trans_chain(trans)); } nft_trans_destroy(trans); break; case NFT_MSG_NEWRULE: if (nft_trans_rule_bound(trans)) { nft_trans_destroy(trans); break; } 
nft_use_dec_restore(&nft_trans_rule_chain(trans)->use); list_del_rcu(&nft_trans_rule(trans)->list); nft_rule_expr_deactivate(&ctx, nft_trans_rule(trans), NFT_TRANS_ABORT); if (nft_trans_rule_chain(trans)->flags & NFT_CHAIN_HW_OFFLOAD) nft_flow_rule_destroy(nft_trans_flow_rule(trans)); break; case NFT_MSG_DELRULE: case NFT_MSG_DESTROYRULE: nft_use_inc_restore(&nft_trans_rule_chain(trans)->use); nft_clear(trans->net, nft_trans_rule(trans)); nft_rule_expr_activate(&ctx, nft_trans_rule(trans)); if (nft_trans_rule_chain(trans)->flags & NFT_CHAIN_HW_OFFLOAD) nft_flow_rule_destroy(nft_trans_flow_rule(trans)); nft_trans_destroy(trans); break; case NFT_MSG_NEWSET: list_del(&nft_trans_container_set(trans)->list_trans_newset); if (nft_trans_set_update(trans)) { nft_trans_destroy(trans); break; } nft_use_dec_restore(&table->use); if (nft_trans_set_bound(trans)) { nft_trans_destroy(trans); break; } nft_trans_set(trans)->dead = 1; list_del_rcu(&nft_trans_set(trans)->list); break; case NFT_MSG_DELSET: case NFT_MSG_DESTROYSET: nft_use_inc_restore(&table->use); nft_clear(trans->net, nft_trans_set(trans)); if (nft_trans_set(trans)->flags & (NFT_SET_MAP | NFT_SET_OBJECT)) nft_map_activate(&ctx, nft_trans_set(trans)); nft_trans_destroy(trans); break; case NFT_MSG_NEWSETELEM: if (nft_trans_elem_set_bound(trans)) { nft_trans_destroy(trans); break; } te = nft_trans_container_elem(trans); if (!nft_trans_elems_new_abort(&ctx, te)) { nft_trans_destroy(trans); break; } if (te->set->ops->abort && list_empty(&te->set->pending_update)) { list_add_tail(&te->set->pending_update, &set_update_list); } break; case NFT_MSG_DELSETELEM: case NFT_MSG_DESTROYSETELEM: te = nft_trans_container_elem(trans); nft_trans_elems_destroy_abort(&ctx, te); if (te->set->ops->abort && list_empty(&te->set->pending_update)) { list_add_tail(&te->set->pending_update, &set_update_list); } nft_trans_destroy(trans); break; case NFT_MSG_NEWOBJ: if (nft_trans_obj_update(trans)) { nft_obj_destroy(&ctx, nft_trans_obj_newobj(trans)); nft_trans_destroy(trans); } else { nft_use_dec_restore(&table->use); nft_obj_del(nft_trans_obj(trans)); } break; case NFT_MSG_DELOBJ: case NFT_MSG_DESTROYOBJ: nft_use_inc_restore(&table->use); nft_clear(trans->net, nft_trans_obj(trans)); nft_trans_destroy(trans); break; case NFT_MSG_NEWFLOWTABLE: if (nft_trans_flowtable_update(trans)) { nft_unregister_flowtable_net_hooks(net, nft_trans_flowtable(trans), &nft_trans_flowtable_hooks(trans)); } else { nft_use_dec_restore(&table->use); list_del_rcu(&nft_trans_flowtable(trans)->list); nft_unregister_flowtable_net_hooks(net, nft_trans_flowtable(trans), &nft_trans_flowtable(trans)->hook_list); } break; case NFT_MSG_DELFLOWTABLE: case NFT_MSG_DESTROYFLOWTABLE: if (nft_trans_flowtable_update(trans)) { list_splice(&nft_trans_flowtable_hooks(trans), &nft_trans_flowtable(trans)->hook_list); } else { nft_use_inc_restore(&table->use); nft_clear(trans->net, nft_trans_flowtable(trans)); } nft_trans_destroy(trans); break; } } WARN_ON_ONCE(!list_empty(&nft_net->commit_set_list)); nft_set_abort_update(&set_update_list); synchronize_rcu(); list_for_each_entry_safe_reverse(trans, next, &nft_net->commit_list, list) { nft_trans_list_del(trans); nf_tables_abort_release(trans); } return err; } static int nf_tables_abort(struct net *net, struct sk_buff *skb, enum nfnl_abort_action action) { struct nftables_pernet *nft_net = nft_pernet(net); unsigned int gc_seq; int ret; gc_seq = nft_gc_seq_begin(nft_net); ret = __nf_tables_abort(net, action); nft_gc_seq_end(nft_net, gc_seq); 
WARN_ON_ONCE(!list_empty(&nft_net->commit_list)); /* module autoload needs to happen after GC sequence update because it * temporarily releases and grabs mutex again. */ if (action == NFNL_ABORT_AUTOLOAD) nf_tables_module_autoload(net); else nf_tables_module_autoload_cleanup(net); mutex_unlock(&nft_net->commit_mutex); return ret; } static bool nf_tables_valid_genid(struct net *net, u32 genid) { struct nftables_pernet *nft_net = nft_pernet(net); bool genid_ok; mutex_lock(&nft_net->commit_mutex); nft_net->tstamp = get_jiffies_64(); genid_ok = genid == 0 || nft_base_seq(net) == genid; if (!genid_ok) mutex_unlock(&nft_net->commit_mutex); /* else, commit mutex has to be released by commit or abort function */ return genid_ok; } static const struct nfnetlink_subsystem nf_tables_subsys = { .name = "nf_tables", .subsys_id = NFNL_SUBSYS_NFTABLES, .cb_count = NFT_MSG_MAX, .cb = nf_tables_cb, .commit = nf_tables_commit, .abort = nf_tables_abort, .valid_genid = nf_tables_valid_genid, .owner = THIS_MODULE, }; int nft_chain_validate_dependency(const struct nft_chain *chain, enum nft_chain_types type) { const struct nft_base_chain *basechain; if (nft_is_base_chain(chain)) { basechain = nft_base_chain(chain); if (basechain->type->type != type) return -EOPNOTSUPP; } return 0; } EXPORT_SYMBOL_GPL(nft_chain_validate_dependency); int nft_chain_validate_hooks(const struct nft_chain *chain, unsigned int hook_flags) { struct nft_base_chain *basechain; if (nft_is_base_chain(chain)) { basechain = nft_base_chain(chain); if ((1 << basechain->ops.hooknum) & hook_flags) return 0; return -EOPNOTSUPP; } return 0; } EXPORT_SYMBOL_GPL(nft_chain_validate_hooks); /** * nft_parse_u32_check - fetch u32 attribute and check for maximum value * * @attr: netlink attribute to fetch value from * @max: maximum value to be stored in dest * @dest: pointer to the variable * * Parse, check and store a given u32 netlink attribute into variable. * This function returns -ERANGE if the value goes over maximum value. * Otherwise a 0 is returned and the attribute value is stored in the * destination variable. */ int nft_parse_u32_check(const struct nlattr *attr, int max, u32 *dest) { u32 val; val = ntohl(nla_get_be32(attr)); if (val > max) return -ERANGE; *dest = val; return 0; } EXPORT_SYMBOL_GPL(nft_parse_u32_check); static int nft_parse_register(const struct nlattr *attr, u32 *preg) { unsigned int reg; reg = ntohl(nla_get_be32(attr)); switch (reg) { case NFT_REG_VERDICT...NFT_REG_4: *preg = reg * NFT_REG_SIZE / NFT_REG32_SIZE; break; case NFT_REG32_00...NFT_REG32_15: *preg = reg + NFT_REG_SIZE / NFT_REG32_SIZE - NFT_REG32_00; break; default: return -ERANGE; } return 0; } /** * nft_dump_register - dump a register value to a netlink attribute * * @skb: socket buffer * @attr: attribute number * @reg: register number * * Construct a netlink attribute containing the register number. For * compatibility reasons, register numbers being a multiple of 4 are * translated to the corresponding 128 bit register numbers. 
*/ int nft_dump_register(struct sk_buff *skb, unsigned int attr, unsigned int reg) { if (reg % (NFT_REG_SIZE / NFT_REG32_SIZE) == 0) reg = reg / (NFT_REG_SIZE / NFT_REG32_SIZE); else reg = reg - NFT_REG_SIZE / NFT_REG32_SIZE + NFT_REG32_00; return nla_put_be32(skb, attr, htonl(reg)); } EXPORT_SYMBOL_GPL(nft_dump_register); static int nft_validate_register_load(enum nft_registers reg, unsigned int len) { if (reg < NFT_REG_1 * NFT_REG_SIZE / NFT_REG32_SIZE) return -EINVAL; if (len == 0) return -EINVAL; if (reg * NFT_REG32_SIZE + len > sizeof_field(struct nft_regs, data)) return -ERANGE; return 0; } int nft_parse_register_load(const struct nft_ctx *ctx, const struct nlattr *attr, u8 *sreg, u32 len) { int err, invalid_reg; u32 reg, next_register; err = nft_parse_register(attr, ®); if (err < 0) return err; err = nft_validate_register_load(reg, len); if (err < 0) return err; next_register = DIV_ROUND_UP(len, NFT_REG32_SIZE) + reg; /* Can't happen: nft_validate_register_load() should have failed */ if (WARN_ON_ONCE(next_register > NFT_REG32_NUM)) return -EINVAL; /* find first register that did not see an earlier store. */ invalid_reg = find_next_zero_bit(ctx->reg_inited, NFT_REG32_NUM, reg); /* invalid register within the range that we're loading from? */ if (invalid_reg < next_register) return -ENODATA; *sreg = reg; return 0; } EXPORT_SYMBOL_GPL(nft_parse_register_load); static void nft_saw_register_store(const struct nft_ctx *__ctx, int reg, unsigned int len) { unsigned int registers = DIV_ROUND_UP(len, NFT_REG32_SIZE); struct nft_ctx *ctx = (struct nft_ctx *)__ctx; if (WARN_ON_ONCE(len == 0 || reg < 0)) return; bitmap_set(ctx->reg_inited, reg, registers); } static int nft_validate_register_store(const struct nft_ctx *ctx, enum nft_registers reg, const struct nft_data *data, enum nft_data_types type, unsigned int len) { switch (reg) { case NFT_REG_VERDICT: if (type != NFT_DATA_VERDICT) return -EINVAL; break; default: if (type != NFT_DATA_VALUE) return -EINVAL; if (reg < NFT_REG_1 * NFT_REG_SIZE / NFT_REG32_SIZE) return -EINVAL; if (len == 0) return -EINVAL; if (reg * NFT_REG32_SIZE + len > sizeof_field(struct nft_regs, data)) return -ERANGE; break; } nft_saw_register_store(ctx, reg, len); return 0; } int nft_parse_register_store(const struct nft_ctx *ctx, const struct nlattr *attr, u8 *dreg, const struct nft_data *data, enum nft_data_types type, unsigned int len) { int err; u32 reg; err = nft_parse_register(attr, ®); if (err < 0) return err; err = nft_validate_register_store(ctx, reg, data, type, len); if (err < 0) return err; *dreg = reg; return 0; } EXPORT_SYMBOL_GPL(nft_parse_register_store); static const struct nla_policy nft_verdict_policy[NFTA_VERDICT_MAX + 1] = { [NFTA_VERDICT_CODE] = { .type = NLA_U32 }, [NFTA_VERDICT_CHAIN] = { .type = NLA_STRING, .len = NFT_CHAIN_MAXNAMELEN - 1 }, [NFTA_VERDICT_CHAIN_ID] = { .type = NLA_U32 }, }; static int nft_verdict_init(const struct nft_ctx *ctx, struct nft_data *data, struct nft_data_desc *desc, const struct nlattr *nla) { u8 genmask = nft_genmask_next(ctx->net); struct nlattr *tb[NFTA_VERDICT_MAX + 1]; struct nft_chain *chain; int err; err = nla_parse_nested_deprecated(tb, NFTA_VERDICT_MAX, nla, nft_verdict_policy, NULL); if (err < 0) return err; if (!tb[NFTA_VERDICT_CODE]) return -EINVAL; /* zero padding hole for memcmp */ memset(data, 0, sizeof(*data)); data->verdict.code = ntohl(nla_get_be32(tb[NFTA_VERDICT_CODE])); switch (data->verdict.code) { case NF_ACCEPT: case NF_DROP: case NF_QUEUE: break; case NFT_CONTINUE: case NFT_BREAK: case 
NFT_RETURN: break; case NFT_JUMP: case NFT_GOTO: if (tb[NFTA_VERDICT_CHAIN]) { chain = nft_chain_lookup(ctx->net, ctx->table, tb[NFTA_VERDICT_CHAIN], genmask); } else if (tb[NFTA_VERDICT_CHAIN_ID]) { chain = nft_chain_lookup_byid(ctx->net, ctx->table, tb[NFTA_VERDICT_CHAIN_ID], genmask); if (IS_ERR(chain)) return PTR_ERR(chain); } else { return -EINVAL; } if (IS_ERR(chain)) return PTR_ERR(chain); if (nft_is_base_chain(chain)) return -EOPNOTSUPP; if (nft_chain_is_bound(chain)) return -EINVAL; if (desc->flags & NFT_DATA_DESC_SETELEM && chain->flags & NFT_CHAIN_BINDING) return -EINVAL; if (!nft_use_inc(&chain->use)) return -EMFILE; data->verdict.chain = chain; break; default: return -EINVAL; } desc->len = sizeof(data->verdict); return 0; } static void nft_verdict_uninit(const struct nft_data *data) { struct nft_chain *chain; switch (data->verdict.code) { case NFT_JUMP: case NFT_GOTO: chain = data->verdict.chain; nft_use_dec(&chain->use); break; } } int nft_verdict_dump(struct sk_buff *skb, int type, const struct nft_verdict *v) { struct nlattr *nest; nest = nla_nest_start_noflag(skb, type); if (!nest) goto nla_put_failure; if (nla_put_be32(skb, NFTA_VERDICT_CODE, htonl(v->code))) goto nla_put_failure; switch (v->code) { case NFT_JUMP: case NFT_GOTO: if (nla_put_string(skb, NFTA_VERDICT_CHAIN, v->chain->name)) goto nla_put_failure; } nla_nest_end(skb, nest); return 0; nla_put_failure: return -1; } static int nft_value_init(const struct nft_ctx *ctx, struct nft_data *data, struct nft_data_desc *desc, const struct nlattr *nla) { unsigned int len; len = nla_len(nla); if (len == 0) return -EINVAL; if (len > desc->size) return -EOVERFLOW; if (desc->len) { if (len != desc->len) return -EINVAL; } else { desc->len = len; } nla_memcpy(data->data, nla, len); return 0; } static int nft_value_dump(struct sk_buff *skb, const struct nft_data *data, unsigned int len) { return nla_put(skb, NFTA_DATA_VALUE, len, data->data); } static const struct nla_policy nft_data_policy[NFTA_DATA_MAX + 1] = { [NFTA_DATA_VALUE] = { .type = NLA_BINARY }, [NFTA_DATA_VERDICT] = { .type = NLA_NESTED }, }; /** * nft_data_init - parse nf_tables data netlink attributes * * @ctx: context of the expression using the data * @data: destination struct nft_data * @desc: data description * @nla: netlink attribute containing data * * Parse the netlink data attributes and initialize a struct nft_data. * The type and length of data are returned in the data description. * * The caller can indicate that it only wants to accept data of type * NFT_DATA_VALUE by passing NULL for the ctx argument. */ int nft_data_init(const struct nft_ctx *ctx, struct nft_data *data, struct nft_data_desc *desc, const struct nlattr *nla) { struct nlattr *tb[NFTA_DATA_MAX + 1]; int err; if (WARN_ON_ONCE(!desc->size)) return -EINVAL; err = nla_parse_nested_deprecated(tb, NFTA_DATA_MAX, nla, nft_data_policy, NULL); if (err < 0) return err; if (tb[NFTA_DATA_VALUE]) { if (desc->type != NFT_DATA_VALUE) return -EINVAL; err = nft_value_init(ctx, data, desc, tb[NFTA_DATA_VALUE]); } else if (tb[NFTA_DATA_VERDICT] && ctx != NULL) { if (desc->type != NFT_DATA_VERDICT) return -EINVAL; err = nft_verdict_init(ctx, data, desc, tb[NFTA_DATA_VERDICT]); } else { err = -EINVAL; } return err; } EXPORT_SYMBOL_GPL(nft_data_init); /** * nft_data_release - release a nft_data item * * @data: struct nft_data to release * @type: type of data * * Release a nft_data item. NFT_DATA_VALUE types can be silently discarded, * all others need to be released by calling this function. 
*/ void nft_data_release(const struct nft_data *data, enum nft_data_types type) { if (type < NFT_DATA_VERDICT) return; switch (type) { case NFT_DATA_VERDICT: return nft_verdict_uninit(data); default: WARN_ON(1); } } EXPORT_SYMBOL_GPL(nft_data_release); int nft_data_dump(struct sk_buff *skb, int attr, const struct nft_data *data, enum nft_data_types type, unsigned int len) { struct nlattr *nest; int err; nest = nla_nest_start_noflag(skb, attr); if (nest == NULL) return -1; switch (type) { case NFT_DATA_VALUE: err = nft_value_dump(skb, data, len); break; case NFT_DATA_VERDICT: err = nft_verdict_dump(skb, NFTA_DATA_VERDICT, &data->verdict); break; default: err = -EINVAL; WARN_ON(1); } nla_nest_end(skb, nest); return err; } EXPORT_SYMBOL_GPL(nft_data_dump); static void __nft_release_hook(struct net *net, struct nft_table *table) { struct nft_flowtable *flowtable; struct nft_chain *chain; list_for_each_entry(chain, &table->chains, list) __nf_tables_unregister_hook(net, table, chain, true); list_for_each_entry(flowtable, &table->flowtables, list) __nft_unregister_flowtable_net_hooks(net, flowtable, &flowtable->hook_list, true); } static void __nft_release_hooks(struct net *net) { struct nftables_pernet *nft_net = nft_pernet(net); struct nft_table *table; list_for_each_entry(table, &nft_net->tables, list) { if (nft_table_has_owner(table)) continue; __nft_release_hook(net, table); } } static void __nft_release_table(struct net *net, struct nft_table *table) { struct nft_flowtable *flowtable, *nf; struct nft_chain *chain, *nc; struct nft_object *obj, *ne; struct nft_rule *rule, *nr; struct nft_set *set, *ns; struct nft_ctx ctx = { .net = net, .family = NFPROTO_NETDEV, }; ctx.family = table->family; ctx.table = table; list_for_each_entry(chain, &table->chains, list) { if (nft_chain_binding(chain)) continue; ctx.chain = chain; list_for_each_entry_safe(rule, nr, &chain->rules, list) { list_del(&rule->list); nft_use_dec(&chain->use); nf_tables_rule_release(&ctx, rule); } } list_for_each_entry_safe(flowtable, nf, &table->flowtables, list) { list_del(&flowtable->list); nft_use_dec(&table->use); nf_tables_flowtable_destroy(flowtable); } list_for_each_entry_safe(set, ns, &table->sets, list) { list_del(&set->list); nft_use_dec(&table->use); if (set->flags & (NFT_SET_MAP | NFT_SET_OBJECT)) nft_map_deactivate(&ctx, set); nft_set_destroy(&ctx, set); } list_for_each_entry_safe(obj, ne, &table->objects, list) { nft_obj_del(obj); nft_use_dec(&table->use); nft_obj_destroy(&ctx, obj); } list_for_each_entry_safe(chain, nc, &table->chains, list) { nft_chain_del(chain); nft_use_dec(&table->use); nf_tables_chain_destroy(chain); } nf_tables_table_destroy(table); } static void __nft_release_tables(struct net *net) { struct nftables_pernet *nft_net = nft_pernet(net); struct nft_table *table, *nt; list_for_each_entry_safe(table, nt, &nft_net->tables, list) { if (nft_table_has_owner(table)) continue; list_del(&table->list); __nft_release_table(net, table); } } static int nft_rcv_nl_event(struct notifier_block *this, unsigned long event, void *ptr) { struct nft_table *table, *to_delete[8]; struct nftables_pernet *nft_net; struct netlink_notify *n = ptr; struct net *net = n->net; unsigned int deleted; bool restart = false; unsigned int gc_seq; if (event != NETLINK_URELEASE || n->protocol != NETLINK_NETFILTER) return NOTIFY_DONE; nft_net = nft_pernet(net); deleted = 0; mutex_lock(&nft_net->commit_mutex); gc_seq = nft_gc_seq_begin(nft_net); nf_tables_trans_destroy_flush_work(net); again: list_for_each_entry(table, 
&nft_net->tables, list) { if (nft_table_has_owner(table) && n->portid == table->nlpid) { if (table->flags & NFT_TABLE_F_PERSIST) { table->flags &= ~NFT_TABLE_F_OWNER; continue; } __nft_release_hook(net, table); list_del_rcu(&table->list); to_delete[deleted++] = table; if (deleted >= ARRAY_SIZE(to_delete)) break; } } if (deleted) { restart = deleted >= ARRAY_SIZE(to_delete); synchronize_rcu(); while (deleted) __nft_release_table(net, to_delete[--deleted]); if (restart) goto again; } nft_gc_seq_end(nft_net, gc_seq); mutex_unlock(&nft_net->commit_mutex); return NOTIFY_DONE; } static struct notifier_block nft_nl_notifier = { .notifier_call = nft_rcv_nl_event, }; static int __net_init nf_tables_init_net(struct net *net) { struct nftables_pernet *nft_net = nft_pernet(net); INIT_LIST_HEAD(&nft_net->tables); INIT_LIST_HEAD(&nft_net->commit_list); INIT_LIST_HEAD(&nft_net->destroy_list); INIT_LIST_HEAD(&nft_net->commit_set_list); INIT_LIST_HEAD(&nft_net->binding_list); INIT_LIST_HEAD(&nft_net->module_list); INIT_LIST_HEAD(&nft_net->notify_list); mutex_init(&nft_net->commit_mutex); net->nft.base_seq = 1; nft_net->gc_seq = 0; nft_net->validate_state = NFT_VALIDATE_SKIP; INIT_WORK(&nft_net->destroy_work, nf_tables_trans_destroy_work); return 0; } static void __net_exit nf_tables_pre_exit_net(struct net *net) { struct nftables_pernet *nft_net = nft_pernet(net); mutex_lock(&nft_net->commit_mutex); __nft_release_hooks(net); mutex_unlock(&nft_net->commit_mutex); } static void __net_exit nf_tables_exit_net(struct net *net) { struct nftables_pernet *nft_net = nft_pernet(net); unsigned int gc_seq; mutex_lock(&nft_net->commit_mutex); gc_seq = nft_gc_seq_begin(nft_net); WARN_ON_ONCE(!list_empty(&nft_net->commit_list)); WARN_ON_ONCE(!list_empty(&nft_net->commit_set_list)); if (!list_empty(&nft_net->module_list)) nf_tables_module_autoload_cleanup(net); cancel_work_sync(&nft_net->destroy_work); __nft_release_tables(net); nft_gc_seq_end(nft_net, gc_seq); mutex_unlock(&nft_net->commit_mutex); WARN_ON_ONCE(!list_empty(&nft_net->tables)); WARN_ON_ONCE(!list_empty(&nft_net->module_list)); WARN_ON_ONCE(!list_empty(&nft_net->notify_list)); WARN_ON_ONCE(!list_empty(&nft_net->destroy_list)); } static void nf_tables_exit_batch(struct list_head *net_exit_list) { flush_work(&trans_gc_work); } static struct pernet_operations nf_tables_net_ops = { .init = nf_tables_init_net, .pre_exit = nf_tables_pre_exit_net, .exit = nf_tables_exit_net, .exit_batch = nf_tables_exit_batch, .id = &nf_tables_net_id, .size = sizeof(struct nftables_pernet), }; static int __init nf_tables_module_init(void) { int err; BUILD_BUG_ON(offsetof(struct nft_trans_table, nft_trans) != 0); BUILD_BUG_ON(offsetof(struct nft_trans_chain, nft_trans_binding.nft_trans) != 0); BUILD_BUG_ON(offsetof(struct nft_trans_rule, nft_trans) != 0); BUILD_BUG_ON(offsetof(struct nft_trans_set, nft_trans_binding.nft_trans) != 0); BUILD_BUG_ON(offsetof(struct nft_trans_elem, nft_trans) != 0); BUILD_BUG_ON(offsetof(struct nft_trans_obj, nft_trans) != 0); BUILD_BUG_ON(offsetof(struct nft_trans_flowtable, nft_trans) != 0); err = register_pernet_subsys(&nf_tables_net_ops); if (err < 0) return err; err = nft_chain_filter_init(); if (err < 0) goto err_chain_filter; err = nf_tables_core_module_init(); if (err < 0) goto err_core_module; err = register_netdevice_notifier(&nf_tables_flowtable_notifier); if (err < 0) goto err_netdev_notifier; err = rhltable_init(&nft_objname_ht, &nft_objname_ht_params); if (err < 0) goto err_rht_objname; err = nft_offload_init(); if (err < 0) goto 
err_offload; err = netlink_register_notifier(&nft_nl_notifier); if (err < 0) goto err_netlink_notifier; /* must be last */ err = nfnetlink_subsys_register(&nf_tables_subsys); if (err < 0) goto err_nfnl_subsys; nft_chain_route_init(); return err; err_nfnl_subsys: netlink_unregister_notifier(&nft_nl_notifier); err_netlink_notifier: nft_offload_exit(); err_offload: rhltable_destroy(&nft_objname_ht); err_rht_objname: unregister_netdevice_notifier(&nf_tables_flowtable_notifier); err_netdev_notifier: nf_tables_core_module_exit(); err_core_module: nft_chain_filter_fini(); err_chain_filter: unregister_pernet_subsys(&nf_tables_net_ops); return err; } static void __exit nf_tables_module_exit(void) { nfnetlink_subsys_unregister(&nf_tables_subsys); netlink_unregister_notifier(&nft_nl_notifier); nft_offload_exit(); unregister_netdevice_notifier(&nf_tables_flowtable_notifier); nft_chain_filter_fini(); nft_chain_route_fini(); unregister_pernet_subsys(&nf_tables_net_ops); cancel_work_sync(&trans_gc_work); rcu_barrier(); rhltable_destroy(&nft_objname_ht); nf_tables_core_module_exit(); } module_init(nf_tables_module_init); module_exit(nf_tables_module_exit); MODULE_LICENSE("GPL"); MODULE_AUTHOR("Patrick McHardy <kaber@trash.net>"); MODULE_DESCRIPTION("Framework for packet filtering and classification"); MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_NFTABLES); |
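
/*
 * Illustrative sketch only (not kernel code): the double-buffered rule-blob
 * scheme that nf_tables_commit_chain() above relies on, reduced to plain C.
 * A staged blob is installed into the generation slot the packet path is not
 * reading, and only then is the generation cursor flipped.  All names here
 * (struct blob, struct chain, gencursor, prepare, commit) are local stand-ins,
 * not kernel symbols; the real code frees the old blob via call_rcu() and
 * keeps the cursor per network namespace rather than as a global.
 */
#include <stdio.h>
#include <stdlib.h>

struct blob { char rules[64]; };

struct chain {
        struct blob *gen[2];    /* models blob_gen_0 / blob_gen_1 */
        struct blob *next;      /* models blob_next, staged during commit */
};

static int gencursor;           /* readers use gen[gencursor] */

/* Stage a new generation of rules without touching the active blob. */
static void prepare(struct chain *c, const char *rules)
{
        c->next = malloc(sizeof(*c->next));
        snprintf(c->next->rules, sizeof(c->next->rules), "%s", rules);
}

/* Publish: fill the idle slot, flip the cursor, then drop the old blob. */
static void commit(struct chain *c)
{
        int next = !gencursor;
        struct blob *old = c->gen[next];

        c->gen[next] = c->next;         /* install into the idle slot */
        c->next = NULL;
        gencursor = next;               /* new lookups see the new rules */
        free(old);                      /* kernel defers this via call_rcu() */
}

int main(void)
{
        struct chain c = {
                .gen = { calloc(1, sizeof(struct blob)),
                         calloc(1, sizeof(struct blob)) },
        };

        prepare(&c, "generation 2 rules");
        commit(&c);
        printf("active: %s\n", c.gen[gencursor]->rules);

        free(c.gen[0]);
        free(c.gen[1]);
        return 0;
}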
// SPDX-License-Identifier: GPL-2.0
/*
 * linux/mm/msync.c
 *
 * Copyright (C) 1994-1999 Linus Torvalds
 */

/*
 * The msync() system call.
 */
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/file.h>
#include <linux/syscalls.h>
#include <linux/sched.h>

/*
 * MS_SYNC syncs the entire file - including mappings.
 *
 * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
 * Nor does it mark the relevant pages dirty (it used to, up to 2.6.17).
 * Now it doesn't do anything, since dirty pages are properly tracked.
 *
 * The application may now run fsync() to
 * write out the dirty pages and wait on the writeout and check the result.
 * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
 * async writeout immediately.
 * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
 * applications.
 */
SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
{
        unsigned long end;
        struct mm_struct *mm = current->mm;
        struct vm_area_struct *vma;
        int unmapped_error = 0;
        int error = -EINVAL;

        start = untagged_addr(start);

        if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
                goto out;
        if (offset_in_page(start))
                goto out;
        if ((flags & MS_ASYNC) && (flags & MS_SYNC))
                goto out;
        error = -ENOMEM;
        len = (len + ~PAGE_MASK) & PAGE_MASK;
        end = start + len;
        if (end < start)
                goto out;
        error = 0;
        if (end == start)
                goto out;
        /*
         * If the interval [start,end) covers some unmapped address ranges,
         * just ignore them, but return -ENOMEM at the end. Besides, if the
         * flag is MS_ASYNC (w/o MS_INVALIDATE) the result would be -ENOMEM
         * anyway and there is nothing left to do, so return immediately.
         */
        mmap_read_lock(mm);
        vma = find_vma(mm, start);
        for (;;) {
                struct file *file;
                loff_t fstart, fend;

                /* Still start < end. */
                error = -ENOMEM;
                if (!vma)
                        goto out_unlock;
                /* Here start < vma->vm_end. */
                if (start < vma->vm_start) {
                        if (flags == MS_ASYNC)
                                goto out_unlock;
                        start = vma->vm_start;
                        if (start >= end)
                                goto out_unlock;
                        unmapped_error = -ENOMEM;
                }
                /* Here vma->vm_start <= start < vma->vm_end. */
                if ((flags & MS_INVALIDATE) &&
                    (vma->vm_flags & VM_LOCKED)) {
                        error = -EBUSY;
                        goto out_unlock;
                }
                file = vma->vm_file;
                fstart = (start - vma->vm_start) +
                         ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
                fend = fstart + (min(end, vma->vm_end) - start) - 1;
                start = vma->vm_end;
                if ((flags & MS_SYNC) && file &&
                                (vma->vm_flags & VM_SHARED)) {
                        get_file(file);
                        mmap_read_unlock(mm);
                        error = vfs_fsync_range(file, fstart, fend, 1);
                        fput(file);
                        if (error || start >= end)
                                goto out;
                        mmap_read_lock(mm);
                        vma = find_vma(mm, start);
                } else {
                        if (start >= end) {
                                error = 0;
                                goto out_unlock;
                        }
                        vma = find_vma(mm, vma->vm_end);
                }
        }
out_unlock:
        mmap_read_unlock(mm);
out:
        return error ? : unmapped_error;
}
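
/*
 * Minimal userspace illustration (a sketch, not part of the file above) of
 * the MS_SYNC/MS_ASYNC semantics described in the comment block: write
 * through a MAP_SHARED mapping, then force the dirty pages out with MS_SYNC.
 * The file path "/tmp/msync-demo" is only an example.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/tmp/msync-demo";
        int fd = open(path, O_RDWR | O_CREAT, 0600);

        if (fd < 0 || ftruncate(fd, 4096) < 0)
                return 1;

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        memcpy(p, "hello", 6);

        /* MS_SYNC waits for the writeback; MS_ASYNC would return at once
         * and, as the comment above explains, does no work of its own. */
        if (msync(p, 4096, MS_SYNC) < 0)
                perror("msync");

        munmap(p, 4096);
        close(fd);
        return 0;
}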
/* SPDX-License-Identifier: GPL-2.0 */
/*
  File: linux/posix_acl.h

  (C) 2002 Andreas Gruenbacher, <a.gruenbacher@computer.org>
*/

#ifndef __LINUX_POSIX_ACL_H
#define __LINUX_POSIX_ACL_H

#include <linux/bug.h>
#include <linux/slab.h>
#include <linux/rcupdate.h>
#include <linux/refcount.h>
#include <uapi/linux/posix_acl.h>

struct user_namespace;

struct posix_acl_entry {
        short                   e_tag;
        unsigned short          e_perm;
        union {
                kuid_t          e_uid;
                kgid_t          e_gid;
        };
};

struct posix_acl {
        /* New members MUST be added within the struct_group() macro below. */
        struct_group_tagged(posix_acl_hdr, hdr,
                refcount_t      a_refcount;
                unsigned int    a_count;
                struct rcu_head a_rcu;
        );
        struct posix_acl_entry  a_entries[] __counted_by(a_count);
};
static_assert(offsetof(struct posix_acl, a_entries) == sizeof(struct posix_acl_hdr),
              "struct member likely outside of struct_group_tagged()");

#define FOREACH_ACL_ENTRY(pa, acl, pe) \
        for(pa=(acl)->a_entries, pe=pa+(acl)->a_count; pa<pe; pa++)

/*
 * Duplicate an ACL handle.
 */
static inline struct posix_acl *
posix_acl_dup(struct posix_acl *acl)
{
        if (acl)
                refcount_inc(&acl->a_refcount);
        return acl;
}

/*
 * Free an ACL handle.
 */
static inline void
posix_acl_release(struct posix_acl *acl)
{
        if (acl && refcount_dec_and_test(&acl->a_refcount))
                kfree_rcu(acl, a_rcu);
}

/* posix_acl.c */

extern void posix_acl_init(struct posix_acl *, int);
extern struct posix_acl *posix_acl_alloc(unsigned int count, gfp_t flags);
extern struct posix_acl *posix_acl_from_mode(umode_t, gfp_t);
extern int posix_acl_equiv_mode(const struct posix_acl *, umode_t *);
extern int __posix_acl_create(struct posix_acl **, gfp_t, umode_t *);
extern int __posix_acl_chmod(struct posix_acl **, gfp_t, umode_t);

extern struct posix_acl *get_posix_acl(struct inode *, int);
int set_posix_acl(struct mnt_idmap *, struct dentry *, int,
                  struct posix_acl *);

struct posix_acl *get_cached_acl_rcu(struct inode *inode, int type);
struct posix_acl *posix_acl_clone(const struct posix_acl *acl, gfp_t flags);

#ifdef CONFIG_FS_POSIX_ACL
int posix_acl_chmod(struct mnt_idmap *, struct dentry *, umode_t);
extern int posix_acl_create(struct inode *, umode_t *, struct posix_acl **,
                struct posix_acl **);
int posix_acl_update_mode(struct mnt_idmap *, struct inode *, umode_t *,
                          struct posix_acl **);

int simple_set_acl(struct mnt_idmap *, struct dentry *,
                   struct posix_acl *, int);
extern int simple_acl_create(struct inode *, struct inode *);

struct posix_acl *get_cached_acl(struct inode *inode, int type);
void set_cached_acl(struct inode *inode, int type, struct posix_acl *acl);
void forget_cached_acl(struct inode *inode, int type);
void forget_all_cached_acls(struct inode *inode);
int posix_acl_valid(struct user_namespace *, const struct posix_acl *);
int posix_acl_permission(struct mnt_idmap *, struct inode *,
                         const struct posix_acl *, int);

static inline void cache_no_acl(struct inode *inode)
{
        inode->i_acl = NULL;
        inode->i_default_acl = NULL;
}

int vfs_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
                const char *acl_name, struct posix_acl *kacl);
struct posix_acl *vfs_get_acl(struct mnt_idmap *idmap,
                              struct dentry *dentry, const char *acl_name);
int vfs_remove_acl(struct mnt_idmap *idmap, struct dentry *dentry,
                   const char *acl_name);
int posix_acl_listxattr(struct inode *inode, char **buffer,
                        ssize_t *remaining_size);
#else
static inline int posix_acl_chmod(struct mnt_idmap *idmap,
                                  struct dentry *dentry, umode_t mode)
{
        return 0;
}

#define simple_set_acl          NULL

static inline int simple_acl_create(struct inode *dir, struct inode *inode)
{
        return 0;
}
static inline void cache_no_acl(struct inode *inode)
{
}

static inline int posix_acl_create(struct inode *inode, umode_t *mode,
                                   struct posix_acl **default_acl,
                                   struct posix_acl **acl)
{
        *default_acl = *acl = NULL;
        return 0;
}

static inline void forget_all_cached_acls(struct inode *inode)
{
}

static inline int vfs_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
                              const char *name, struct posix_acl *acl)
{
        return -EOPNOTSUPP;
}

static inline struct posix_acl *vfs_get_acl(struct mnt_idmap *idmap,
                                            struct dentry *dentry,
                                            const char *acl_name)
{
        return ERR_PTR(-EOPNOTSUPP);
}

static inline int vfs_remove_acl(struct mnt_idmap *idmap,
                                 struct dentry *dentry, const char *acl_name)
{
        return -EOPNOTSUPP;
}
static inline int posix_acl_listxattr(struct inode *inode, char **buffer,
                                      ssize_t *remaining_size)
{
        return 0;
}
#endif /* CONFIG_FS_POSIX_ACL */

struct posix_acl *get_inode_acl(struct inode *inode, int type);

#endif  /* __LINUX_POSIX_ACL_H */
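
/*
 * Sketch only: posix_acl_dup() and posix_acl_release() above implement a
 * plain shared-reference scheme - every holder takes a reference, and the
 * last put frees the object (via kfree_rcu() in the kernel).  This is a
 * simplified userspace model of that pattern; struct acl, acl_dup and
 * acl_release are stand-ins, not the real refcount_t/RCU machinery.
 */
#include <stdio.h>
#include <stdlib.h>

struct acl {
        int refcount;           /* kernel: refcount_t a_refcount */
        unsigned int count;     /* number of entries, as in a_count */
};

static struct acl *acl_dup(struct acl *acl)
{
        if (acl)
                acl->refcount++;
        return acl;
}

static void acl_release(struct acl *acl)
{
        if (acl && --acl->refcount == 0)
                free(acl);      /* kernel: kfree_rcu(acl, a_rcu) */
}

int main(void)
{
        struct acl *acl = calloc(1, sizeof(*acl));

        acl->refcount = 1;
        acl->count = 3;

        struct acl *other = acl_dup(acl);       /* second user of the same ACL */
        printf("entries=%u refs=%d\n", other->count, other->refcount);

        acl_release(other);
        acl_release(acl);                       /* last put frees the object */
        return 0;
}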
| 1 4 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Generic nexthop implementation * * Copyright (c) 2017-19 Cumulus Networks * Copyright (c) 2017-19 David Ahern <dsa@cumulusnetworks.com> */ #ifndef __LINUX_NEXTHOP_H #define __LINUX_NEXTHOP_H #include <linux/netdevice.h> #include <linux/notifier.h> #include <linux/route.h> #include <linux/types.h> #include <net/ip_fib.h> #include <net/ip6_fib.h> #include <net/netlink.h> #define NEXTHOP_VALID_USER_FLAGS RTNH_F_ONLINK struct nexthop; struct nh_config { u32 nh_id; u8 nh_family; u8 nh_protocol; u8 nh_blackhole; u8 nh_fdb; u32 nh_flags; int nh_ifindex; struct net_device *dev; union { __be32 ipv4; struct in6_addr ipv6; } gw; struct nlattr *nh_grp; u16 nh_grp_type; u16 nh_grp_res_num_buckets; unsigned long nh_grp_res_idle_timer; unsigned long nh_grp_res_unbalanced_timer; bool nh_grp_res_has_num_buckets; bool nh_grp_res_has_idle_timer; bool nh_grp_res_has_unbalanced_timer; bool nh_hw_stats; struct nlattr *nh_encap; u16 nh_encap_type; u32 nlflags; struct nl_info nlinfo; }; struct nh_info { struct hlist_node dev_hash; /* entry on netns devhash */ struct nexthop *nh_parent; u8 family; bool reject_nh; bool fdb_nh; union { struct fib_nh_common fib_nhc; struct fib_nh fib_nh; struct fib6_nh fib6_nh; }; }; struct nh_res_bucket { struct nh_grp_entry 
__rcu *nh_entry; atomic_long_t used_time; unsigned long migrated_time; bool occupied; u8 nh_flags; }; struct nh_res_table { struct net *net; u32 nhg_id; struct delayed_work upkeep_dw; /* List of NHGEs that have too few buckets ("uw" for underweight). * Reclaimed buckets will be given to entries in this list. */ struct list_head uw_nh_entries; unsigned long unbalanced_since; u32 idle_timer; u32 unbalanced_timer; u16 num_nh_buckets; struct nh_res_bucket nh_buckets[] __counted_by(num_nh_buckets); }; struct nh_grp_entry_stats { u64_stats_t packets; struct u64_stats_sync syncp; }; struct nh_grp_entry { struct nexthop *nh; struct nh_grp_entry_stats __percpu *stats; u16 weight; union { struct { atomic_t upper_bound; } hthr; struct { /* Member on uw_nh_entries. */ struct list_head uw_nh_entry; u16 count_buckets; u16 wants_buckets; } res; }; struct list_head nh_list; struct nexthop *nh_parent; /* nexthop of group with this entry */ u64 packets_hw; }; struct nh_group { struct nh_group *spare; /* spare group for removals */ u16 num_nh; bool is_multipath; bool hash_threshold; bool resilient; bool fdb_nh; bool has_v4; bool hw_stats; struct nh_res_table __rcu *res_table; struct nh_grp_entry nh_entries[] __counted_by(num_nh); }; struct nexthop { struct rb_node rb_node; /* entry on netns rbtree */ struct list_head fi_list; /* v4 entries using nh */ struct list_head f6i_list; /* v6 entries using nh */ struct list_head fdb_list; /* fdb entries using this nh */ struct list_head grp_list; /* nh group entries using this nh */ struct net *net; u32 id; u8 protocol; /* app managing this nh */ u8 nh_flags; bool is_group; bool dead; spinlock_t lock; /* protect dead and f6i_list */ refcount_t refcnt; struct rcu_head rcu; union { struct nh_info __rcu *nh_info; struct nh_group __rcu *nh_grp; }; }; enum nexthop_event_type { NEXTHOP_EVENT_DEL, NEXTHOP_EVENT_REPLACE, NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, NEXTHOP_EVENT_BUCKET_REPLACE, NEXTHOP_EVENT_HW_STATS_REPORT_DELTA, }; enum nh_notifier_info_type { NH_NOTIFIER_INFO_TYPE_SINGLE, NH_NOTIFIER_INFO_TYPE_GRP, NH_NOTIFIER_INFO_TYPE_RES_TABLE, NH_NOTIFIER_INFO_TYPE_RES_BUCKET, NH_NOTIFIER_INFO_TYPE_GRP_HW_STATS, }; struct nh_notifier_single_info { struct net_device *dev; u8 gw_family; union { __be32 ipv4; struct in6_addr ipv6; }; u32 id; u8 is_reject:1, is_fdb:1, has_encap:1; }; struct nh_notifier_grp_entry_info { u16 weight; struct nh_notifier_single_info nh; }; struct nh_notifier_grp_info { u16 num_nh; bool is_fdb; bool hw_stats; struct nh_notifier_grp_entry_info nh_entries[] __counted_by(num_nh); }; struct nh_notifier_res_bucket_info { u16 bucket_index; unsigned int idle_timer_ms; bool force; struct nh_notifier_single_info old_nh; struct nh_notifier_single_info new_nh; }; struct nh_notifier_res_table_info { u16 num_nh_buckets; bool hw_stats; struct nh_notifier_single_info nhs[] __counted_by(num_nh_buckets); }; struct nh_notifier_grp_hw_stats_entry_info { u32 id; u64 packets; }; struct nh_notifier_grp_hw_stats_info { u16 num_nh; bool hw_stats_used; struct nh_notifier_grp_hw_stats_entry_info stats[] __counted_by(num_nh); }; struct nh_notifier_info { struct net *net; struct netlink_ext_ack *extack; u32 id; enum nh_notifier_info_type type; union { struct nh_notifier_single_info *nh; struct nh_notifier_grp_info *nh_grp; struct nh_notifier_res_table_info *nh_res_table; struct nh_notifier_res_bucket_info *nh_res_bucket; struct nh_notifier_grp_hw_stats_info *nh_grp_hw_stats; }; }; int register_nexthop_notifier(struct net *net, struct notifier_block *nb, struct netlink_ext_ack 
*extack); int __unregister_nexthop_notifier(struct net *net, struct notifier_block *nb); int unregister_nexthop_notifier(struct net *net, struct notifier_block *nb); void nexthop_set_hw_flags(struct net *net, u32 id, bool offload, bool trap); void nexthop_bucket_set_hw_flags(struct net *net, u32 id, u16 bucket_index, bool offload, bool trap); void nexthop_res_grp_activity_update(struct net *net, u32 id, u16 num_buckets, unsigned long *activity); void nh_grp_hw_stats_report_delta(struct nh_notifier_grp_hw_stats_info *info, unsigned int nh_idx, u64 delta_packets); /* caller is holding rcu or rtnl; no reference taken to nexthop */ struct nexthop *nexthop_find_by_id(struct net *net, u32 id); void nexthop_free_rcu(struct rcu_head *head); static inline bool nexthop_get(struct nexthop *nh) { return refcount_inc_not_zero(&nh->refcnt); } static inline void nexthop_put(struct nexthop *nh) { if (refcount_dec_and_test(&nh->refcnt)) call_rcu_hurry(&nh->rcu, nexthop_free_rcu); } static inline bool nexthop_cmp(const struct nexthop *nh1, const struct nexthop *nh2) { return nh1 == nh2; } static inline bool nexthop_is_fdb(const struct nexthop *nh) { if (nh->is_group) { const struct nh_group *nh_grp; nh_grp = rcu_dereference_rtnl(nh->nh_grp); return nh_grp->fdb_nh; } else { const struct nh_info *nhi; nhi = rcu_dereference_rtnl(nh->nh_info); return nhi->fdb_nh; } } static inline bool nexthop_has_v4(const struct nexthop *nh) { if (nh->is_group) { struct nh_group *nh_grp; nh_grp = rcu_dereference_rtnl(nh->nh_grp); return nh_grp->has_v4; } return false; } static inline bool nexthop_is_multipath(const struct nexthop *nh) { if (nh->is_group) { struct nh_group *nh_grp; nh_grp = rcu_dereference_rtnl(nh->nh_grp); return nh_grp->is_multipath; } return false; } struct nexthop *nexthop_select_path(struct nexthop *nh, int hash); static inline unsigned int nexthop_num_path(const struct nexthop *nh) { unsigned int rc = 1; if (nh->is_group) { struct nh_group *nh_grp; nh_grp = rcu_dereference_rtnl(nh->nh_grp); if (nh_grp->is_multipath) rc = nh_grp->num_nh; } return rc; } static inline struct nexthop *nexthop_mpath_select(const struct nh_group *nhg, int nhsel) { /* for_nexthops macros in fib_semantics.c grabs a pointer to * the nexthop before checking nhsel */ if (nhsel >= nhg->num_nh) return NULL; return nhg->nh_entries[nhsel].nh; } static inline int nexthop_mpath_fill_node(struct sk_buff *skb, struct nexthop *nh, u8 rt_family) { struct nh_group *nhg = rcu_dereference_rtnl(nh->nh_grp); int i; for (i = 0; i < nhg->num_nh; i++) { struct nexthop *nhe = nhg->nh_entries[i].nh; struct nh_info *nhi = rcu_dereference_rtnl(nhe->nh_info); struct fib_nh_common *nhc = &nhi->fib_nhc; int weight = nhg->nh_entries[i].weight; if (fib_add_nexthop(skb, nhc, weight, rt_family, 0) < 0) return -EMSGSIZE; } return 0; } /* called with rcu lock */ static inline bool nexthop_is_blackhole(const struct nexthop *nh) { const struct nh_info *nhi; if (nh->is_group) { struct nh_group *nh_grp; nh_grp = rcu_dereference_rtnl(nh->nh_grp); if (nh_grp->num_nh > 1) return false; nh = nh_grp->nh_entries[0].nh; } nhi = rcu_dereference_rtnl(nh->nh_info); return nhi->reject_nh; } static inline void nexthop_path_fib_result(struct fib_result *res, int hash) { struct nh_info *nhi; struct nexthop *nh; nh = nexthop_select_path(res->fi->nh, hash); nhi = rcu_dereference(nh->nh_info); res->nhc = &nhi->fib_nhc; } /* called with rcu read lock or rtnl held */ static inline struct fib_nh_common *nexthop_fib_nhc(struct nexthop *nh, int nhsel) { struct nh_info *nhi; 
BUILD_BUG_ON(offsetof(struct fib_nh, nh_common) != 0); BUILD_BUG_ON(offsetof(struct fib6_nh, nh_common) != 0); if (nh->is_group) { struct nh_group *nh_grp; nh_grp = rcu_dereference_rtnl(nh->nh_grp); if (nh_grp->is_multipath) { nh = nexthop_mpath_select(nh_grp, nhsel); if (!nh) return NULL; } } nhi = rcu_dereference_rtnl(nh->nh_info); return &nhi->fib_nhc; } /* called from fib_table_lookup with rcu_lock */ static inline struct fib_nh_common *nexthop_get_nhc_lookup(const struct nexthop *nh, int fib_flags, const struct flowi4 *flp, int *nhsel) { struct nh_info *nhi; if (nh->is_group) { struct nh_group *nhg = rcu_dereference(nh->nh_grp); int i; for (i = 0; i < nhg->num_nh; i++) { struct nexthop *nhe = nhg->nh_entries[i].nh; nhi = rcu_dereference(nhe->nh_info); if (fib_lookup_good_nhc(&nhi->fib_nhc, fib_flags, flp)) { *nhsel = i; return &nhi->fib_nhc; } } } else { nhi = rcu_dereference(nh->nh_info); if (fib_lookup_good_nhc(&nhi->fib_nhc, fib_flags, flp)) { *nhsel = 0; return &nhi->fib_nhc; } } return NULL; } static inline bool nexthop_uses_dev(const struct nexthop *nh, const struct net_device *dev) { struct nh_info *nhi; if (nh->is_group) { struct nh_group *nhg = rcu_dereference(nh->nh_grp); int i; for (i = 0; i < nhg->num_nh; i++) { struct nexthop *nhe = nhg->nh_entries[i].nh; nhi = rcu_dereference(nhe->nh_info); if (nhc_l3mdev_matches_dev(&nhi->fib_nhc, dev)) return true; } } else { nhi = rcu_dereference(nh->nh_info); if (nhc_l3mdev_matches_dev(&nhi->fib_nhc, dev)) return true; } return false; } static inline unsigned int fib_info_num_path(const struct fib_info *fi) { if (unlikely(fi->nh)) return nexthop_num_path(fi->nh); return fi->fib_nhs; } int fib_check_nexthop(struct nexthop *nh, u8 scope, struct netlink_ext_ack *extack); static inline struct fib_nh_common *fib_info_nhc(struct fib_info *fi, int nhsel) { if (unlikely(fi->nh)) return nexthop_fib_nhc(fi->nh, nhsel); return &fi->fib_nh[nhsel].nh_common; } /* only used when fib_nh is built into fib_info */ static inline struct fib_nh *fib_info_nh(struct fib_info *fi, int nhsel) { WARN_ON(fi->nh); return &fi->fib_nh[nhsel]; } /* * IPv6 variants */ int fib6_check_nexthop(struct nexthop *nh, struct fib6_config *cfg, struct netlink_ext_ack *extack); /* Caller should either hold rcu_read_lock(), or RTNL. */ static inline struct fib6_nh *nexthop_fib6_nh(struct nexthop *nh) { struct nh_info *nhi; if (nh->is_group) { struct nh_group *nh_grp; nh_grp = rcu_dereference_rtnl(nh->nh_grp); nh = nexthop_mpath_select(nh_grp, 0); if (!nh) return NULL; } nhi = rcu_dereference_rtnl(nh->nh_info); if (nhi->family == AF_INET6) return &nhi->fib6_nh; return NULL; } static inline struct net_device *fib6_info_nh_dev(struct fib6_info *f6i) { struct fib6_nh *fib6_nh; fib6_nh = f6i->nh ? 
nexthop_fib6_nh(f6i->nh) : f6i->fib6_nh; return fib6_nh->fib_nh_dev; } static inline void nexthop_path_fib6_result(struct fib6_result *res, int hash) { struct nexthop *nh = res->f6i->nh; struct nh_info *nhi; nh = nexthop_select_path(nh, hash); nhi = rcu_dereference_rtnl(nh->nh_info); if (nhi->reject_nh) { res->fib6_type = RTN_BLACKHOLE; res->fib6_flags |= RTF_REJECT; res->nh = nexthop_fib6_nh(nh); } else { res->nh = &nhi->fib6_nh; } } int nexthop_for_each_fib6_nh(struct nexthop *nh, int (*cb)(struct fib6_nh *nh, void *arg), void *arg); static inline int nexthop_get_family(struct nexthop *nh) { struct nh_info *nhi = rcu_dereference_rtnl(nh->nh_info); return nhi->family; } static inline struct fib_nh_common *nexthop_fdb_nhc(struct nexthop *nh) { struct nh_info *nhi = rcu_dereference_rtnl(nh->nh_info); return &nhi->fib_nhc; } static inline struct fib_nh_common *nexthop_path_fdb_result(struct nexthop *nh, int hash) { struct nh_info *nhi; struct nexthop *nhp; nhp = nexthop_select_path(nh, hash); if (unlikely(!nhp)) return NULL; nhi = rcu_dereference(nhp->nh_info); return &nhi->fib_nhc; } #endif |
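A brief usage sketch, not part of the header: nexthop_find_by_id() hands back a pointer without taking a reference, so a caller that wants to keep the nexthop beyond its RCU (or RTNL) section has to pin it with nexthop_get() and later drop it with nexthop_put(). The helper name nh_example_hold() is hypothetical.

/* Illustrative sketch only: look up a nexthop by ID under RCU and take a
 * reference for use after the read-side section ends.
 */
static struct nexthop *nh_example_hold(struct net *net, u32 id)
{
	struct nexthop *nh;

	rcu_read_lock();
	nh = nexthop_find_by_id(net, id);
	if (nh && !nexthop_get(nh))
		nh = NULL;	/* already being freed, refcount hit zero */
	rcu_read_unlock();

	return nh;		/* caller drops it with nexthop_put() */
}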
// SPDX-License-Identifier: GPL-2.0-or-later /* Provide a way to create a superblock configuration context within the kernel * that allows a superblock to be set up prior to mounting. * * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
* Written by David Howells (dhowells@redhat.com) */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/module.h> #include <linux/fs_context.h> #include <linux/fs_parser.h> #include <linux/fs.h> #include <linux/mount.h> #include <linux/nsproxy.h> #include <linux/slab.h> #include <linux/magic.h> #include <linux/security.h> #include <linux/mnt_namespace.h> #include <linux/pid_namespace.h> #include <linux/user_namespace.h> #include <net/net_namespace.h> #include <asm/sections.h> #include "mount.h" #include "internal.h" enum legacy_fs_param { LEGACY_FS_UNSET_PARAMS, LEGACY_FS_MONOLITHIC_PARAMS, LEGACY_FS_INDIVIDUAL_PARAMS, }; struct legacy_fs_context { char *legacy_data; /* Data page for legacy filesystems */ size_t data_size; enum legacy_fs_param param_type; }; static int legacy_init_fs_context(struct fs_context *fc); static const struct constant_table common_set_sb_flag[] = { { "dirsync", SB_DIRSYNC }, { "lazytime", SB_LAZYTIME }, { "mand", SB_MANDLOCK }, { "ro", SB_RDONLY }, { "sync", SB_SYNCHRONOUS }, { }, }; static const struct constant_table common_clear_sb_flag[] = { { "async", SB_SYNCHRONOUS }, { "nolazytime", SB_LAZYTIME }, { "nomand", SB_MANDLOCK }, { "rw", SB_RDONLY }, { }, }; /* * Check for a common mount option that manipulates s_flags. */ static int vfs_parse_sb_flag(struct fs_context *fc, const char *key) { unsigned int token; token = lookup_constant(common_set_sb_flag, key, 0); if (token) { fc->sb_flags |= token; fc->sb_flags_mask |= token; return 0; } token = lookup_constant(common_clear_sb_flag, key, 0); if (token) { fc->sb_flags &= ~token; fc->sb_flags_mask |= token; return 0; } return -ENOPARAM; } /** * vfs_parse_fs_param_source - Handle setting "source" via parameter * @fc: The filesystem context to modify * @param: The parameter * * This is a simple helper for filesystems to verify that the "source" they * accept is sane. * * Returns 0 on success, -ENOPARAM if this is not "source" parameter, and * -EINVAL otherwise. In the event of failure, supplementary error information * is logged. */ int vfs_parse_fs_param_source(struct fs_context *fc, struct fs_parameter *param) { if (strcmp(param->key, "source") != 0) return -ENOPARAM; if (param->type != fs_value_is_string) return invalf(fc, "Non-string source"); if (fc->source) return invalf(fc, "Multiple sources"); fc->source = param->string; param->string = NULL; return 0; } EXPORT_SYMBOL(vfs_parse_fs_param_source); /** * vfs_parse_fs_param - Add a single parameter to a superblock config * @fc: The filesystem context to modify * @param: The parameter * * A single mount option in string form is applied to the filesystem context * being set up. Certain standard options (for example "ro") are translated * into flag bits without going to the filesystem. The active security module * is allowed to observe and poach options. Any other options are passed over * to the filesystem to parse. * * This may be called multiple times for a context. * * Returns 0 on success and a negative error code on failure. In the event of * failure, supplementary error information may have been set. */ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param) { int ret; if (!param->key) return invalf(fc, "Unnamed parameter\n"); ret = vfs_parse_sb_flag(fc, param->key); if (ret != -ENOPARAM) return ret; ret = security_fs_context_parse_param(fc, param); if (ret != -ENOPARAM) /* Param belongs to the LSM or is disallowed by the LSM; so * don't pass to the FS. 
*/ return ret; if (fc->ops->parse_param) { ret = fc->ops->parse_param(fc, param); if (ret != -ENOPARAM) return ret; } /* If the filesystem doesn't take any arguments, give it the * default handling of source. */ ret = vfs_parse_fs_param_source(fc, param); if (ret != -ENOPARAM) return ret; return invalf(fc, "%s: Unknown parameter '%s'", fc->fs_type->name, param->key); } EXPORT_SYMBOL(vfs_parse_fs_param); /** * vfs_parse_fs_qstr - Convenience function to just parse a string. * @fc: Filesystem context. * @key: Parameter name. * @value: Default value. */ int vfs_parse_fs_qstr(struct fs_context *fc, const char *key, const struct qstr *value) { int ret; struct fs_parameter param = { .key = key, .type = fs_value_is_flag, .size = value ? value->len : 0, }; if (value) { param.string = kmemdup_nul(value->name, value->len, GFP_KERNEL); if (!param.string) return -ENOMEM; param.type = fs_value_is_string; } ret = vfs_parse_fs_param(fc, ¶m); kfree(param.string); return ret; } EXPORT_SYMBOL(vfs_parse_fs_qstr); /** * vfs_parse_monolithic_sep - Parse key[=val][,key[=val]]* mount data * @fc: The superblock configuration to fill in. * @data: The data to parse * @sep: callback for separating next option * * Parse a blob of data that's in key[=val][,key[=val]]* form with a custom * option separator callback. * * Returns 0 on success or the error returned by the ->parse_option() fs_context * operation on failure. */ int vfs_parse_monolithic_sep(struct fs_context *fc, void *data, char *(*sep)(char **)) { char *options = data, *key; int ret = 0; if (!options) return 0; ret = security_sb_eat_lsm_opts(options, &fc->security); if (ret) return ret; while ((key = sep(&options)) != NULL) { if (*key) { char *value = strchr(key, '='); if (value) { if (unlikely(value == key)) continue; *value++ = 0; } ret = vfs_parse_fs_string(fc, key, value); if (ret < 0) break; } } return ret; } EXPORT_SYMBOL(vfs_parse_monolithic_sep); static char *vfs_parse_comma_sep(char **s) { return strsep(s, ","); } /** * generic_parse_monolithic - Parse key[=val][,key[=val]]* mount data * @fc: The superblock configuration to fill in. * @data: The data to parse * * Parse a blob of data that's in key[=val][,key[=val]]* form. This can be * called from the ->monolithic_mount_data() fs_context operation. * * Returns 0 on success or the error returned by the ->parse_option() fs_context * operation on failure. */ int generic_parse_monolithic(struct fs_context *fc, void *data) { return vfs_parse_monolithic_sep(fc, data, vfs_parse_comma_sep); } EXPORT_SYMBOL(generic_parse_monolithic); /** * alloc_fs_context - Create a filesystem context. * @fs_type: The filesystem type. * @reference: The dentry from which this one derives (or NULL) * @sb_flags: Filesystem/superblock flags (SB_*) * @sb_flags_mask: Applicable members of @sb_flags * @purpose: The purpose that this configuration shall be used for. * * Open a filesystem and create a mount context. The mount context is * initialised with the supplied flags and, if a submount/automount from * another superblock (referred to by @reference) is supplied, may have * parameters such as namespaces copied across from that superblock. 
*/ static struct fs_context *alloc_fs_context(struct file_system_type *fs_type, struct dentry *reference, unsigned int sb_flags, unsigned int sb_flags_mask, enum fs_context_purpose purpose) { int (*init_fs_context)(struct fs_context *); struct fs_context *fc; int ret = -ENOMEM; fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL_ACCOUNT); if (!fc) return ERR_PTR(-ENOMEM); fc->purpose = purpose; fc->sb_flags = sb_flags; fc->sb_flags_mask = sb_flags_mask; fc->fs_type = get_filesystem(fs_type); fc->cred = get_current_cred(); fc->net_ns = get_net(current->nsproxy->net_ns); fc->log.prefix = fs_type->name; mutex_init(&fc->uapi_mutex); switch (purpose) { case FS_CONTEXT_FOR_MOUNT: fc->user_ns = get_user_ns(fc->cred->user_ns); break; case FS_CONTEXT_FOR_SUBMOUNT: fc->user_ns = get_user_ns(reference->d_sb->s_user_ns); break; case FS_CONTEXT_FOR_RECONFIGURE: atomic_inc(&reference->d_sb->s_active); fc->user_ns = get_user_ns(reference->d_sb->s_user_ns); fc->root = dget(reference); break; } /* TODO: Make all filesystems support this unconditionally */ init_fs_context = fc->fs_type->init_fs_context; if (!init_fs_context) init_fs_context = legacy_init_fs_context; ret = init_fs_context(fc); if (ret < 0) goto err_fc; fc->need_free = true; return fc; err_fc: put_fs_context(fc); return ERR_PTR(ret); } struct fs_context *fs_context_for_mount(struct file_system_type *fs_type, unsigned int sb_flags) { return alloc_fs_context(fs_type, NULL, sb_flags, 0, FS_CONTEXT_FOR_MOUNT); } EXPORT_SYMBOL(fs_context_for_mount); struct fs_context *fs_context_for_reconfigure(struct dentry *dentry, unsigned int sb_flags, unsigned int sb_flags_mask) { return alloc_fs_context(dentry->d_sb->s_type, dentry, sb_flags, sb_flags_mask, FS_CONTEXT_FOR_RECONFIGURE); } EXPORT_SYMBOL(fs_context_for_reconfigure); /** * fs_context_for_submount: allocate a new fs_context for a submount * @type: file_system_type of the new context * @reference: reference dentry from which to copy relevant info * * Allocate a new fs_context suitable for a submount. This also ensures that * the fc->security object is inherited from @reference (if needed). */ struct fs_context *fs_context_for_submount(struct file_system_type *type, struct dentry *reference) { struct fs_context *fc; int ret; fc = alloc_fs_context(type, reference, 0, 0, FS_CONTEXT_FOR_SUBMOUNT); if (IS_ERR(fc)) return fc; ret = security_fs_context_submount(fc, reference->d_sb); if (ret) { put_fs_context(fc); return ERR_PTR(ret); } return fc; } EXPORT_SYMBOL(fs_context_for_submount); void fc_drop_locked(struct fs_context *fc) { struct super_block *sb = fc->root->d_sb; dput(fc->root); fc->root = NULL; deactivate_locked_super(sb); } static void legacy_fs_context_free(struct fs_context *fc); /** * vfs_dup_fs_context - Duplicate a filesystem context. * @src_fc: The context to copy. 
*/ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc) { struct fs_context *fc; int ret; if (!src_fc->ops->dup) return ERR_PTR(-EOPNOTSUPP); fc = kmemdup(src_fc, sizeof(struct fs_context), GFP_KERNEL); if (!fc) return ERR_PTR(-ENOMEM); mutex_init(&fc->uapi_mutex); fc->fs_private = NULL; fc->s_fs_info = NULL; fc->source = NULL; fc->security = NULL; get_filesystem(fc->fs_type); get_net(fc->net_ns); get_user_ns(fc->user_ns); get_cred(fc->cred); if (fc->log.log) refcount_inc(&fc->log.log->usage); /* Can't call put until we've called ->dup */ ret = fc->ops->dup(fc, src_fc); if (ret < 0) goto err_fc; ret = security_fs_context_dup(fc, src_fc); if (ret < 0) goto err_fc; return fc; err_fc: put_fs_context(fc); return ERR_PTR(ret); } EXPORT_SYMBOL(vfs_dup_fs_context); /** * logfc - Log a message to a filesystem context * @log: The filesystem context to log to, or NULL to use printk. * @prefix: A string to prefix the output with, or NULL. * @level: 'w' for a warning, 'e' for an error. Anything else is a notice. * @fmt: The format of the buffer. */ void logfc(struct fc_log *log, const char *prefix, char level, const char *fmt, ...) { va_list va; struct va_format vaf = {.fmt = fmt, .va = &va}; va_start(va, fmt); if (!log) { switch (level) { case 'w': printk(KERN_WARNING "%s%s%pV\n", prefix ? prefix : "", prefix ? ": " : "", &vaf); break; case 'e': printk(KERN_ERR "%s%s%pV\n", prefix ? prefix : "", prefix ? ": " : "", &vaf); break; case 'i': printk(KERN_INFO "%s%s%pV\n", prefix ? prefix : "", prefix ? ": " : "", &vaf); break; default: printk(KERN_NOTICE "%s%s%pV\n", prefix ? prefix : "", prefix ? ": " : "", &vaf); break; } } else { unsigned int logsize = ARRAY_SIZE(log->buffer); u8 index; char *q = kasprintf(GFP_KERNEL, "%c %s%s%pV\n", level, prefix ? prefix : "", prefix ? ": " : "", &vaf); index = log->head & (logsize - 1); BUILD_BUG_ON(sizeof(log->head) != sizeof(u8) || sizeof(log->tail) != sizeof(u8)); if ((u8)(log->head - log->tail) == logsize) { /* The buffer is full, discard the oldest message */ if (log->need_free & (1 << index)) kfree(log->buffer[index]); log->tail++; } log->buffer[index] = q ? q : "OOM: Can't store error string"; if (q) log->need_free |= 1 << index; else log->need_free &= ~(1 << index); log->head++; } va_end(va); } EXPORT_SYMBOL(logfc); /* * Free a logging structure. */ static void put_fc_log(struct fs_context *fc) { struct fc_log *log = fc->log.log; int i; if (log) { if (refcount_dec_and_test(&log->usage)) { fc->log.log = NULL; for (i = 0; i < ARRAY_SIZE(log->buffer) ; i++) if (log->need_free & (1 << i)) kfree(log->buffer[i]); kfree(log); } } } /** * put_fs_context - Dispose of a superblock configuration context. * @fc: The context to dispose of. */ void put_fs_context(struct fs_context *fc) { struct super_block *sb; if (fc->root) { sb = fc->root->d_sb; dput(fc->root); fc->root = NULL; deactivate_super(sb); } if (fc->need_free && fc->ops && fc->ops->free) fc->ops->free(fc); security_free_mnt_opts(&fc->security); put_net(fc->net_ns); put_user_ns(fc->user_ns); put_cred(fc->cred); put_fc_log(fc); put_filesystem(fc->fs_type); kfree(fc->source); kfree(fc); } EXPORT_SYMBOL(put_fs_context); /* * Free the config for a filesystem that doesn't support fs_context. */ static void legacy_fs_context_free(struct fs_context *fc) { struct legacy_fs_context *ctx = fc->fs_private; if (ctx) { if (ctx->param_type == LEGACY_FS_INDIVIDUAL_PARAMS) kfree(ctx->legacy_data); kfree(ctx); } } /* * Duplicate a legacy config. 
*/ static int legacy_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc) { struct legacy_fs_context *ctx; struct legacy_fs_context *src_ctx = src_fc->fs_private; ctx = kmemdup(src_ctx, sizeof(*src_ctx), GFP_KERNEL); if (!ctx) return -ENOMEM; if (ctx->param_type == LEGACY_FS_INDIVIDUAL_PARAMS) { ctx->legacy_data = kmemdup(src_ctx->legacy_data, src_ctx->data_size, GFP_KERNEL); if (!ctx->legacy_data) { kfree(ctx); return -ENOMEM; } } fc->fs_private = ctx; return 0; } /* * Add a parameter to a legacy config. We build up a comma-separated list of * options. */ static int legacy_parse_param(struct fs_context *fc, struct fs_parameter *param) { struct legacy_fs_context *ctx = fc->fs_private; unsigned int size = ctx->data_size; size_t len = 0; int ret; ret = vfs_parse_fs_param_source(fc, param); if (ret != -ENOPARAM) return ret; if (ctx->param_type == LEGACY_FS_MONOLITHIC_PARAMS) return invalf(fc, "VFS: Legacy: Can't mix monolithic and individual options"); switch (param->type) { case fs_value_is_string: len = 1 + param->size; fallthrough; case fs_value_is_flag: len += strlen(param->key); break; default: return invalf(fc, "VFS: Legacy: Parameter type for '%s' not supported", param->key); } if (size + len + 2 > PAGE_SIZE) return invalf(fc, "VFS: Legacy: Cumulative options too large"); if (strchr(param->key, ',') || (param->type == fs_value_is_string && memchr(param->string, ',', param->size))) return invalf(fc, "VFS: Legacy: Option '%s' contained comma", param->key); if (!ctx->legacy_data) { ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL); if (!ctx->legacy_data) return -ENOMEM; } if (size) ctx->legacy_data[size++] = ','; len = strlen(param->key); memcpy(ctx->legacy_data + size, param->key, len); size += len; if (param->type == fs_value_is_string) { ctx->legacy_data[size++] = '='; memcpy(ctx->legacy_data + size, param->string, param->size); size += param->size; } ctx->legacy_data[size] = '\0'; ctx->data_size = size; ctx->param_type = LEGACY_FS_INDIVIDUAL_PARAMS; return 0; } /* * Add monolithic mount data. */ static int legacy_parse_monolithic(struct fs_context *fc, void *data) { struct legacy_fs_context *ctx = fc->fs_private; if (ctx->param_type != LEGACY_FS_UNSET_PARAMS) { pr_warn("VFS: Can't mix monolithic and individual options\n"); return -EINVAL; } ctx->legacy_data = data; ctx->param_type = LEGACY_FS_MONOLITHIC_PARAMS; if (!ctx->legacy_data) return 0; if (fc->fs_type->fs_flags & FS_BINARY_MOUNTDATA) return 0; return security_sb_eat_lsm_opts(ctx->legacy_data, &fc->security); } /* * Get a mountable root with the legacy mount command. */ static int legacy_get_tree(struct fs_context *fc) { struct legacy_fs_context *ctx = fc->fs_private; struct super_block *sb; struct dentry *root; root = fc->fs_type->mount(fc->fs_type, fc->sb_flags, fc->source, ctx->legacy_data); if (IS_ERR(root)) return PTR_ERR(root); sb = root->d_sb; BUG_ON(!sb); fc->root = root; return 0; } /* * Handle remount. */ static int legacy_reconfigure(struct fs_context *fc) { struct legacy_fs_context *ctx = fc->fs_private; struct super_block *sb = fc->root->d_sb; if (!sb->s_op->remount_fs) return 0; return sb->s_op->remount_fs(sb, &fc->sb_flags, ctx ? 
ctx->legacy_data : NULL); } const struct fs_context_operations legacy_fs_context_ops = { .free = legacy_fs_context_free, .dup = legacy_fs_context_dup, .parse_param = legacy_parse_param, .parse_monolithic = legacy_parse_monolithic, .get_tree = legacy_get_tree, .reconfigure = legacy_reconfigure, }; /* * Initialise a legacy context for a filesystem that doesn't support * fs_context. */ static int legacy_init_fs_context(struct fs_context *fc) { fc->fs_private = kzalloc(sizeof(struct legacy_fs_context), GFP_KERNEL_ACCOUNT); if (!fc->fs_private) return -ENOMEM; fc->ops = &legacy_fs_context_ops; return 0; } int parse_monolithic_mount_data(struct fs_context *fc, void *data) { int (*monolithic_mount_data)(struct fs_context *, void *); monolithic_mount_data = fc->ops->parse_monolithic; if (!monolithic_mount_data) monolithic_mount_data = generic_parse_monolithic; return monolithic_mount_data(fc, data); } /* * Clean up a context after performing an action on it and put it into a state * from where it can be used to reconfigure a superblock. * * Note that here we do only the parts that can't fail; the rest is in * finish_clean_context() below and in between those fs_context is marked * FS_CONTEXT_AWAITING_RECONF. The reason for splitup is that after * successful mount or remount we need to report success to userland. * Trying to do full reinit (for the sake of possible subsequent remount) * and failing to allocate memory would've put us into a nasty situation. * So here we only discard the old state and reinitialization is left * until we actually try to reconfigure. */ void vfs_clean_context(struct fs_context *fc) { if (fc->need_free && fc->ops && fc->ops->free) fc->ops->free(fc); fc->need_free = false; fc->fs_private = NULL; fc->s_fs_info = NULL; fc->sb_flags = 0; security_free_mnt_opts(&fc->security); kfree(fc->source); fc->source = NULL; fc->exclusive = false; fc->purpose = FS_CONTEXT_FOR_RECONFIGURE; fc->phase = FS_CONTEXT_AWAITING_RECONF; } int finish_clean_context(struct fs_context *fc) { int error; if (fc->phase != FS_CONTEXT_AWAITING_RECONF) return 0; if (fc->fs_type->init_fs_context) error = fc->fs_type->init_fs_context(fc); else error = legacy_init_fs_context(fc); if (unlikely(error)) { fc->phase = FS_CONTEXT_FAILED; return error; } fc->need_free = true; fc->phase = FS_CONTEXT_RECONF_PARAMS; return 0; } |
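To show how the pieces above are typically consumed, here is a hedged sketch of an in-kernel mount built on fs_context_for_mount(), vfs_parse_fs_string() and put_fs_context(); fc_mount() lives in fs/namespace.c and is declared in <linux/fs_context.h>, and example_kern_mount() plus the "mode" option are purely illustrative.

/* Illustrative sketch only: allocate a context, feed it one option and mount.
 * Each option is routed through vfs_parse_fs_param() by vfs_parse_fs_string().
 */
static struct vfsmount *example_kern_mount(struct file_system_type *type)
{
	struct fs_context *fc;
	struct vfsmount *mnt;
	int ret;

	fc = fs_context_for_mount(type, SB_KERNMOUNT);
	if (IS_ERR(fc))
		return ERR_CAST(fc);

	ret = vfs_parse_fs_string(fc, "mode", "0755");
	if (ret) {
		put_fs_context(fc);
		return ERR_PTR(ret);
	}

	mnt = fc_mount(fc);	/* vfs_get_tree() + vfs_create_mount() */
	put_fs_context(fc);
	return mnt;
}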
// SPDX-License-Identifier: GPL-2.0 /* * Copyright (c) 2000-2005 Silicon Graphics, Inc. * All Rights Reserved. */ #include "xfs.h" #include "xfs_fs.h" #include "xfs_shared.h" #include "xfs_format.h" #include "xfs_log_format.h" #include "xfs_trans_resv.h" #include "xfs_sb.h" #include "xfs_mount.h" #include "xfs_trans.h" #include "xfs_error.h" #include "xfs_alloc.h" #include "xfs_fsops.h" #include "xfs_trans_space.h" #include "xfs_log.h" #include "xfs_log_priv.h" #include "xfs_ag.h" #include "xfs_ag_resv.h" #include "xfs_trace.h" #include "xfs_rtalloc.h" #include "xfs_rtrmap_btree.h" #include "xfs_rtrefcount_btree.h" #include "xfs_metafile.h" /* * Write new AG headers to disk. Non-transactional, but need to be * written and completed prior to the growfs transaction being logged. * To do this, we use a delayed write buffer list and wait for * submission and IO completion of the list as a whole. This allows the * IO subsystem to merge all the AG headers in a single AG into a single * IO and hide most of the latency of the IO from us.
* * This also means that if we get an error whilst building the buffer * list to write, we can cancel the entire list without having written * anything. */ static int xfs_resizefs_init_new_ags( struct xfs_trans *tp, struct aghdr_init_data *id, xfs_agnumber_t oagcount, xfs_agnumber_t nagcount, xfs_rfsblock_t delta, struct xfs_perag *last_pag, bool *lastag_extended) { struct xfs_mount *mp = tp->t_mountp; xfs_rfsblock_t nb = mp->m_sb.sb_dblocks + delta; int error; *lastag_extended = false; INIT_LIST_HEAD(&id->buffer_list); for (id->agno = nagcount - 1; id->agno >= oagcount; id->agno--, delta -= id->agsize) { if (id->agno == nagcount - 1) id->agsize = nb - (id->agno * (xfs_rfsblock_t)mp->m_sb.sb_agblocks); else id->agsize = mp->m_sb.sb_agblocks; error = xfs_ag_init_headers(mp, id); if (error) { xfs_buf_delwri_cancel(&id->buffer_list); return error; } } error = xfs_buf_delwri_submit(&id->buffer_list); if (error) return error; if (delta) { *lastag_extended = true; error = xfs_ag_extend_space(last_pag, tp, delta); } return error; } /* * growfs operations */ static int xfs_growfs_data_private( struct xfs_mount *mp, /* mount point for filesystem */ struct xfs_growfs_data *in) /* growfs data input struct */ { xfs_agnumber_t oagcount = mp->m_sb.sb_agcount; struct xfs_buf *bp; int error; xfs_agnumber_t nagcount; xfs_agnumber_t nagimax = 0; xfs_rfsblock_t nb, nb_div, nb_mod; int64_t delta; bool lastag_extended = false; struct xfs_trans *tp; struct aghdr_init_data id = {}; struct xfs_perag *last_pag; nb = in->newblocks; error = xfs_sb_validate_fsb_count(&mp->m_sb, nb); if (error) return error; if (nb > mp->m_sb.sb_dblocks) { error = xfs_buf_read_uncached(mp->m_ddev_targp, XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1), XFS_FSS_TO_BB(mp, 1), &bp, NULL); if (error) return error; xfs_buf_relse(bp); } /* Make sure the new fs size won't cause problems with the log. */ error = xfs_growfs_check_rtgeom(mp, nb, mp->m_sb.sb_rblocks, mp->m_sb.sb_rextsize); if (error) return error; nb_div = nb; nb_mod = do_div(nb_div, mp->m_sb.sb_agblocks); if (nb_mod && nb_mod >= XFS_MIN_AG_BLOCKS) nb_div++; else if (nb_mod) nb = nb_div * mp->m_sb.sb_agblocks; if (nb_div > XFS_MAX_AGNUMBER + 1) { nb_div = XFS_MAX_AGNUMBER + 1; nb = nb_div * mp->m_sb.sb_agblocks; } nagcount = nb_div; delta = nb - mp->m_sb.sb_dblocks; /* * Reject filesystems with a single AG because they are not * supported, and reject a shrink operation that would cause a * filesystem to become unsupported. */ if (delta < 0 && nagcount < 2) return -EINVAL; /* No work to do */ if (delta == 0) return 0; /* TODO: shrinking the entire AGs hasn't yet completed */ if (nagcount < oagcount) return -EINVAL; /* allocate the new per-ag structures */ error = xfs_initialize_perag(mp, oagcount, nagcount, nb, &nagimax); if (error) return error; if (delta > 0) error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata, XFS_GROWFS_SPACE_RES(mp), 0, XFS_TRANS_RESERVE, &tp); else error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata, -delta, 0, 0, &tp); if (error) goto out_free_unused_perag; last_pag = xfs_perag_get(mp, oagcount - 1); if (delta > 0) { error = xfs_resizefs_init_new_ags(tp, &id, oagcount, nagcount, delta, last_pag, &lastag_extended); } else { xfs_warn_experimental(mp, XFS_EXPERIMENTAL_SHRINK); error = xfs_ag_shrink_space(last_pag, &tp, -delta); } xfs_perag_put(last_pag); if (error) goto out_trans_cancel; /* * Update changed superblock fields transactionally. These are not * seen by the rest of the world until the transaction commit applies * them atomically to the superblock. 
*/ if (nagcount > oagcount) xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount); if (delta) xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta); if (id.nfree) xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree); /* * Sync sb counters now to reflect the updated values. This is * particularly important for shrink because the write verifier * will fail if sb_fdblocks is ever larger than sb_dblocks. */ if (xfs_has_lazysbcount(mp)) xfs_log_sb(tp); xfs_trans_set_sync(tp); error = xfs_trans_commit(tp); if (error) return error; /* New allocation groups fully initialized, so update mount struct */ if (nagimax) mp->m_maxagi = nagimax; xfs_set_low_space_thresholds(mp); mp->m_alloc_set_aside = xfs_alloc_set_aside(mp); if (delta > 0) { /* * If we expanded the last AG, free the per-AG reservation * so we can reinitialize it with the new size. */ if (lastag_extended) { struct xfs_perag *pag; pag = xfs_perag_get(mp, id.agno); xfs_ag_resv_free(pag); xfs_perag_put(pag); } /* * Reserve AG metadata blocks. ENOSPC here does not mean there * was a growfs failure, just that there still isn't space for * new user data after the grow has been run. */ error = xfs_fs_reserve_ag_blocks(mp); if (error == -ENOSPC) error = 0; /* Compute new maxlevels for rt btrees. */ xfs_rtrmapbt_compute_maxlevels(mp); xfs_rtrefcountbt_compute_maxlevels(mp); } return error; out_trans_cancel: xfs_trans_cancel(tp); out_free_unused_perag: if (nagcount > oagcount) xfs_free_perag_range(mp, oagcount, nagcount); return error; } static int xfs_growfs_log_private( struct xfs_mount *mp, /* mount point for filesystem */ struct xfs_growfs_log *in) /* growfs log input struct */ { xfs_extlen_t nb; nb = in->newblocks; if (nb < XFS_MIN_LOG_BLOCKS || nb < XFS_B_TO_FSB(mp, XFS_MIN_LOG_BYTES)) return -EINVAL; if (nb == mp->m_sb.sb_logblocks && in->isint == (mp->m_sb.sb_logstart != 0)) return -EINVAL; /* * Moving the log is hard, need new interfaces to sync * the log first, hold off all activity while moving it. * Can have shorter or longer log in the same space, * or transform internal to external log or vice versa. 
*/ return -ENOSYS; } static int xfs_growfs_imaxpct( struct xfs_mount *mp, __u32 imaxpct) { struct xfs_trans *tp; int dpct; int error; if (imaxpct > 100) return -EINVAL; error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata, XFS_GROWFS_SPACE_RES(mp), 0, XFS_TRANS_RESERVE, &tp); if (error) return error; dpct = imaxpct - mp->m_sb.sb_imax_pct; xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct); xfs_trans_set_sync(tp); return xfs_trans_commit(tp); } /* * protected versions of growfs function acquire and release locks on the mount * point - exported through ioctls: XFS_IOC_FSGROWFSDATA, XFS_IOC_FSGROWFSLOG, * XFS_IOC_FSGROWFSRT */ int xfs_growfs_data( struct xfs_mount *mp, struct xfs_growfs_data *in) { int error; if (!capable(CAP_SYS_ADMIN)) return -EPERM; if (!mutex_trylock(&mp->m_growlock)) return -EWOULDBLOCK; /* we can't grow the data section when an internal RT section exists */ if (in->newblocks != mp->m_sb.sb_dblocks && mp->m_sb.sb_rtstart) { error = -EINVAL; goto out_unlock; } /* update imaxpct separately to the physical grow of the filesystem */ if (in->imaxpct != mp->m_sb.sb_imax_pct) { error = xfs_growfs_imaxpct(mp, in->imaxpct); if (error) goto out_unlock; } if (in->newblocks != mp->m_sb.sb_dblocks) { error = xfs_growfs_data_private(mp, in); if (error) goto out_unlock; } /* Post growfs calculations needed to reflect new state in operations */ if (mp->m_sb.sb_imax_pct) { uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct; do_div(icount, 100); M_IGEO(mp)->maxicount = XFS_FSB_TO_INO(mp, icount); } else M_IGEO(mp)->maxicount = 0; /* Update secondary superblocks now the physical grow has completed */ error = xfs_update_secondary_sbs(mp); /* * Increment the generation unconditionally, after trying to update the * secondary superblocks, as the new size is live already at this point. */ mp->m_generation++; out_unlock: mutex_unlock(&mp->m_growlock); return error; } int xfs_growfs_log( xfs_mount_t *mp, struct xfs_growfs_log *in) { int error; if (!capable(CAP_SYS_ADMIN)) return -EPERM; if (!mutex_trylock(&mp->m_growlock)) return -EWOULDBLOCK; error = xfs_growfs_log_private(mp, in); mutex_unlock(&mp->m_growlock); return error; } /* * Reserve the requested number of blocks if available. Otherwise return * as many as possible to satisfy the request. The actual number * reserved are returned in outval. */ int xfs_reserve_blocks( struct xfs_mount *mp, enum xfs_free_counter ctr, uint64_t request) { int64_t lcounter, delta; int64_t fdblks_delta = 0; int64_t free; int error = 0; ASSERT(ctr < XC_FREE_NR); /* * With per-cpu counters, this becomes an interesting problem. we need * to work out if we are freeing or allocation blocks first, then we can * do the modification as necessary. * * We do this under the m_sb_lock so that if we are near ENOSPC, we will * hold out any changes while we work out what to do. This means that * the amount of free space can change while we do this, so we need to * retry if we end up trying to reserve more space than is available. */ spin_lock(&mp->m_sb_lock); /* * If our previous reservation was larger than the current value, * then move any unused blocks back to the free pool. Modify the resblks * counters directly since we shouldn't have any problems unreserving * space. 
*/ if (mp->m_free[ctr].res_total > request) { lcounter = mp->m_free[ctr].res_avail - request; if (lcounter > 0) { /* release unused blocks */ fdblks_delta = lcounter; mp->m_free[ctr].res_avail -= lcounter; } mp->m_free[ctr].res_total = request; if (fdblks_delta) { spin_unlock(&mp->m_sb_lock); xfs_add_freecounter(mp, ctr, fdblks_delta); spin_lock(&mp->m_sb_lock); } goto out; } /* * If the request is larger than the current reservation, reserve the * blocks before we update the reserve counters. Sample m_free and * perform a partial reservation if the request exceeds free space. * * The code below estimates how many blocks it can request from * fdblocks to stash in the reserve pool. This is a classic TOCTOU * race since fdblocks updates are not always coordinated via * m_sb_lock. Set the reserve size even if there's not enough free * space to fill it because mod_fdblocks will refill an undersized * reserve when it can. */ free = xfs_sum_freecounter_raw(mp, ctr) - xfs_freecounter_unavailable(mp, ctr); delta = request - mp->m_free[ctr].res_total; mp->m_free[ctr].res_total = request; if (delta > 0 && free > 0) { /* * We'll either succeed in getting space from the free block * count or we'll get an ENOSPC. Don't set the reserved flag * here - we don't want to reserve the extra reserve blocks * from the reserve. * * The desired reserve size can change after we drop the lock. * Use mod_fdblocks to put the space into the reserve or into * fdblocks as appropriate. */ fdblks_delta = min(free, delta); spin_unlock(&mp->m_sb_lock); error = xfs_dec_freecounter(mp, ctr, fdblks_delta, 0); if (!error) xfs_add_freecounter(mp, ctr, fdblks_delta); spin_lock(&mp->m_sb_lock); } out: spin_unlock(&mp->m_sb_lock); return error; } int xfs_fs_goingdown( xfs_mount_t *mp, uint32_t inflags) { switch (inflags) { case XFS_FSOP_GOING_FLAGS_DEFAULT: { if (!bdev_freeze(mp->m_super->s_bdev)) { xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT); bdev_thaw(mp->m_super->s_bdev); } break; } case XFS_FSOP_GOING_FLAGS_LOGFLUSH: xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT); break; case XFS_FSOP_GOING_FLAGS_NOLOGFLUSH: xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT | SHUTDOWN_LOG_IO_ERROR); break; default: return -EINVAL; } return 0; } /* * Force a shutdown of the filesystem instantly while keeping the filesystem * consistent. We don't do an unmount here; just shutdown the shop, make sure * that absolutely nothing persistent happens to this filesystem after this * point. * * The shutdown state change is atomic, resulting in the first and only the * first shutdown call processing the shutdown. This means we only shutdown the * log once as it requires, and we don't spam the logs when multiple concurrent * shutdowns race to set the shutdown flags. 
*/ void xfs_do_force_shutdown( struct xfs_mount *mp, uint32_t flags, char *fname, int lnnum) { int tag; const char *why; if (xfs_set_shutdown(mp)) { xlog_shutdown_wait(mp->m_log); return; } if (mp->m_sb_bp) mp->m_sb_bp->b_flags |= XBF_DONE; if (flags & SHUTDOWN_FORCE_UMOUNT) xfs_alert(mp, "User initiated shutdown received."); if (xlog_force_shutdown(mp->m_log, flags)) { tag = XFS_PTAG_SHUTDOWN_LOGERROR; why = "Log I/O Error"; } else if (flags & SHUTDOWN_CORRUPT_INCORE) { tag = XFS_PTAG_SHUTDOWN_CORRUPT; why = "Corruption of in-memory data"; } else if (flags & SHUTDOWN_CORRUPT_ONDISK) { tag = XFS_PTAG_SHUTDOWN_CORRUPT; why = "Corruption of on-disk metadata"; } else if (flags & SHUTDOWN_DEVICE_REMOVED) { tag = XFS_PTAG_SHUTDOWN_IOERROR; why = "Block device removal"; } else { tag = XFS_PTAG_SHUTDOWN_IOERROR; why = "Metadata I/O Error"; } trace_xfs_force_shutdown(mp, tag, flags, fname, lnnum); xfs_alert_tag(mp, tag, "%s (0x%x) detected at %pS (%s:%d). Shutting down filesystem.", why, flags, __return_address, fname, lnnum); xfs_alert(mp, "Please unmount the filesystem and rectify the problem(s)"); if (xfs_error_level >= XFS_ERRLEVEL_HIGH) xfs_stack_trace(); } /* * Reserve free space for per-AG metadata. */ int xfs_fs_reserve_ag_blocks( struct xfs_mount *mp) { struct xfs_perag *pag = NULL; int error = 0; int err2; mp->m_finobt_nores = false; while ((pag = xfs_perag_next(mp, pag))) { err2 = xfs_ag_resv_init(pag, NULL); if (err2 && !error) error = err2; } if (error && error != -ENOSPC) { xfs_warn(mp, "Error %d reserving per-AG metadata reserve pool.", error); xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); return error; } err2 = xfs_metafile_resv_init(mp); if (err2 && err2 != -ENOSPC) { xfs_warn(mp, "Error %d reserving realtime metadata reserve pool.", err2); xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); if (!error) error = err2; } return error; } /* * Free space reserved for per-AG metadata. */ void xfs_fs_unreserve_ag_blocks( struct xfs_mount *mp) { struct xfs_perag *pag = NULL; xfs_metafile_resv_free(mp); while ((pag = xfs_perag_next(mp, pag))) xfs_ag_resv_free(pag); } |
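For orientation only, a hedged userspace sketch of how the growfs path above is normally reached: xfs_growfs(8) issues XFS_IOC_FSGROWFSDATA against a file descriptor opened on the mount point, which lands in xfs_growfs_data(). The helper name and the <xfs/xfs.h> include path (from xfsprogs) are assumptions of this sketch.

/* Illustrative sketch only: ask the kernel to grow the data section to
 * new_dblocks filesystem blocks while leaving imaxpct unchanged.
 */
#include <stdio.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* xfs_growfs_data_t, XFS_IOC_FSGROWFSDATA (xfsprogs) */

static int example_growfs_data(int mount_fd, unsigned long long new_dblocks,
			       unsigned int imaxpct)
{
	xfs_growfs_data_t in = {
		.newblocks = new_dblocks,
		.imaxpct = imaxpct,
	};

	if (ioctl(mount_fd, XFS_IOC_FSGROWFSDATA, &in) < 0) {
		perror("XFS_IOC_FSGROWFSDATA");
		return -1;
	}
	return 0;
}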
/* SPDX-License-Identifier: GPL-2.0-or-later */ /* * INET An implementation of the TCP/IP protocol suite for the LINUX * operating system. INET is implemented using the BSD Socket * interface as the means of communication with the user level. * * Definitions for the Forwarding Information Base.
* * Authors: A.N.Kuznetsov, <kuznet@ms2.inr.ac.ru> */ #ifndef _NET_IP_FIB_H #define _NET_IP_FIB_H #include <net/flow.h> #include <linux/seq_file.h> #include <linux/rcupdate.h> #include <net/fib_notifier.h> #include <net/fib_rules.h> #include <net/inet_dscp.h> #include <net/inetpeer.h> #include <linux/percpu.h> #include <linux/notifier.h> #include <linux/refcount.h> #include <linux/ip.h> #include <linux/in_route.h> struct fib_config { u8 fc_dst_len; dscp_t fc_dscp; u8 fc_protocol; u8 fc_scope; u8 fc_type; u8 fc_gw_family; /* 2 bytes unused */ u32 fc_table; __be32 fc_dst; union { __be32 fc_gw4; struct in6_addr fc_gw6; }; int fc_oif; u32 fc_flags; u32 fc_priority; __be32 fc_prefsrc; u32 fc_nh_id; struct nlattr *fc_mx; struct rtnexthop *fc_mp; int fc_mx_len; int fc_mp_len; u32 fc_flow; u32 fc_nlflags; struct nl_info fc_nlinfo; struct nlattr *fc_encap; u16 fc_encap_type; }; struct fib_info; struct rtable; struct fib_nh_exception { struct fib_nh_exception __rcu *fnhe_next; int fnhe_genid; __be32 fnhe_daddr; u32 fnhe_pmtu; bool fnhe_mtu_locked; __be32 fnhe_gw; unsigned long fnhe_expires; struct rtable __rcu *fnhe_rth_input; struct rtable __rcu *fnhe_rth_output; unsigned long fnhe_stamp; struct rcu_head rcu; }; struct fnhe_hash_bucket { struct fib_nh_exception __rcu *chain; }; #define FNHE_HASH_SHIFT 11 #define FNHE_HASH_SIZE (1 << FNHE_HASH_SHIFT) #define FNHE_RECLAIM_DEPTH 5 struct fib_nh_common { struct net_device *nhc_dev; netdevice_tracker nhc_dev_tracker; int nhc_oif; unsigned char nhc_scope; u8 nhc_family; u8 nhc_gw_family; unsigned char nhc_flags; struct lwtunnel_state *nhc_lwtstate; union { __be32 ipv4; struct in6_addr ipv6; } nhc_gw; int nhc_weight; atomic_t nhc_upper_bound; /* v4 specific, but allows fib6_nh with v4 routes */ struct rtable __rcu * __percpu *nhc_pcpu_rth_output; struct rtable __rcu *nhc_rth_input; struct fnhe_hash_bucket __rcu *nhc_exceptions; }; struct fib_nh { struct fib_nh_common nh_common; struct hlist_node nh_hash; struct fib_info *nh_parent; #ifdef CONFIG_IP_ROUTE_CLASSID __u32 nh_tclassid; #endif __be32 nh_saddr; int nh_saddr_genid; #define fib_nh_family nh_common.nhc_family #define fib_nh_dev nh_common.nhc_dev #define fib_nh_dev_tracker nh_common.nhc_dev_tracker #define fib_nh_oif nh_common.nhc_oif #define fib_nh_flags nh_common.nhc_flags #define fib_nh_lws nh_common.nhc_lwtstate #define fib_nh_scope nh_common.nhc_scope #define fib_nh_gw_family nh_common.nhc_gw_family #define fib_nh_gw4 nh_common.nhc_gw.ipv4 #define fib_nh_gw6 nh_common.nhc_gw.ipv6 #define fib_nh_weight nh_common.nhc_weight #define fib_nh_upper_bound nh_common.nhc_upper_bound }; /* * This structure contains data shared by many of routes. 
*/ struct nexthop; struct fib_info { struct hlist_node fib_hash; struct hlist_node fib_lhash; struct list_head nh_list; struct net *fib_net; refcount_t fib_treeref; refcount_t fib_clntref; unsigned int fib_flags; unsigned char fib_dead; unsigned char fib_protocol; unsigned char fib_scope; unsigned char fib_type; __be32 fib_prefsrc; u32 fib_tb_id; u32 fib_priority; struct dst_metrics *fib_metrics; #define fib_mtu fib_metrics->metrics[RTAX_MTU-1] #define fib_window fib_metrics->metrics[RTAX_WINDOW-1] #define fib_rtt fib_metrics->metrics[RTAX_RTT-1] #define fib_advmss fib_metrics->metrics[RTAX_ADVMSS-1] int fib_nhs; bool fib_nh_is_v6; bool nh_updated; bool pfsrc_removed; struct nexthop *nh; struct rcu_head rcu; struct fib_nh fib_nh[] __counted_by(fib_nhs); }; int __net_init fib4_semantics_init(struct net *net); void __net_exit fib4_semantics_exit(struct net *net); #ifdef CONFIG_IP_MULTIPLE_TABLES struct fib_rule; #endif struct fib_table; struct fib_result { __be32 prefix; unsigned char prefixlen; unsigned char nh_sel; unsigned char type; unsigned char scope; u32 tclassid; dscp_t dscp; struct fib_nh_common *nhc; struct fib_info *fi; struct fib_table *table; struct hlist_head *fa_head; }; struct fib_result_nl { __be32 fl_addr; /* To be looked up*/ u32 fl_mark; unsigned char fl_tos; unsigned char fl_scope; unsigned char tb_id_in; unsigned char tb_id; /* Results */ unsigned char prefixlen; unsigned char nh_sel; unsigned char type; unsigned char scope; int err; }; #ifdef CONFIG_IP_MULTIPLE_TABLES #define FIB_TABLE_HASHSZ 256 #else #define FIB_TABLE_HASHSZ 2 #endif __be32 fib_info_update_nhc_saddr(struct net *net, struct fib_nh_common *nhc, unsigned char scope); __be32 fib_result_prefsrc(struct net *net, struct fib_result *res); #define FIB_RES_NHC(res) ((res).nhc) #define FIB_RES_DEV(res) (FIB_RES_NHC(res)->nhc_dev) #define FIB_RES_OIF(res) (FIB_RES_NHC(res)->nhc_oif) struct fib_rt_info { struct fib_info *fi; u32 tb_id; __be32 dst; int dst_len; dscp_t dscp; u8 type; u8 offload:1, trap:1, offload_failed:1, unused:5; }; struct fib_entry_notifier_info { struct fib_notifier_info info; /* must be first */ u32 dst; int dst_len; struct fib_info *fi; dscp_t dscp; u8 type; u32 tb_id; }; struct fib_nh_notifier_info { struct fib_notifier_info info; /* must be first */ struct fib_nh *fib_nh; }; int call_fib4_notifier(struct notifier_block *nb, enum fib_event_type event_type, struct fib_notifier_info *info); int call_fib4_notifiers(struct net *net, enum fib_event_type event_type, struct fib_notifier_info *info); int __net_init fib4_notifier_init(struct net *net); void __net_exit fib4_notifier_exit(struct net *net); void fib_info_notify_update(struct net *net, struct nl_info *info); int fib_notify(struct net *net, struct notifier_block *nb, struct netlink_ext_ack *extack); struct fib_table { struct hlist_node tb_hlist; u32 tb_id; int tb_num_default; struct rcu_head rcu; unsigned long *tb_data; unsigned long __data[]; }; struct fib_dump_filter { u32 table_id; /* filter_set is an optimization that an entry is set */ bool filter_set; bool dump_routes; bool dump_exceptions; bool rtnl_held; unsigned char protocol; unsigned char rt_type; unsigned int flags; struct net_device *dev; }; int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp, struct fib_result *res, int fib_flags); int fib_table_insert(struct net *, struct fib_table *, struct fib_config *, struct netlink_ext_ack *extack); int fib_table_delete(struct net *, struct fib_table *, struct fib_config *, struct netlink_ext_ack *extack); int 
fib_table_dump(struct fib_table *table, struct sk_buff *skb, struct netlink_callback *cb, struct fib_dump_filter *filter); int fib_table_flush(struct net *net, struct fib_table *table, bool flush_all); struct fib_table *fib_trie_unmerge(struct fib_table *main_tb); void fib_table_flush_external(struct fib_table *table); void fib_free_table(struct fib_table *tb); #ifndef CONFIG_IP_MULTIPLE_TABLES #define TABLE_LOCAL_INDEX (RT_TABLE_LOCAL & (FIB_TABLE_HASHSZ - 1)) #define TABLE_MAIN_INDEX (RT_TABLE_MAIN & (FIB_TABLE_HASHSZ - 1)) static inline struct fib_table *fib_get_table(struct net *net, u32 id) { struct hlist_node *tb_hlist; struct hlist_head *ptr; ptr = id == RT_TABLE_LOCAL ? &net->ipv4.fib_table_hash[TABLE_LOCAL_INDEX] : &net->ipv4.fib_table_hash[TABLE_MAIN_INDEX]; tb_hlist = rcu_dereference_rtnl(hlist_first_rcu(ptr)); return hlist_entry(tb_hlist, struct fib_table, tb_hlist); } static inline struct fib_table *fib_new_table(struct net *net, u32 id) { return fib_get_table(net, id); } static inline int fib_lookup(struct net *net, const struct flowi4 *flp, struct fib_result *res, unsigned int flags) { struct fib_table *tb; int err = -ENETUNREACH; rcu_read_lock(); tb = fib_get_table(net, RT_TABLE_MAIN); if (tb) err = fib_table_lookup(tb, flp, res, flags | FIB_LOOKUP_NOREF); if (err == -EAGAIN) err = -ENETUNREACH; rcu_read_unlock(); return err; } static inline bool fib4_has_custom_rules(const struct net *net) { return false; } static inline bool fib4_rule_default(const struct fib_rule *rule) { return true; } static inline int fib4_rules_dump(struct net *net, struct notifier_block *nb, struct netlink_ext_ack *extack) { return 0; } static inline unsigned int fib4_rules_seq_read(const struct net *net) { return 0; } static inline bool fib4_rules_early_flow_dissect(struct net *net, struct sk_buff *skb, struct flowi4 *fl4, struct flow_keys *flkeys) { return false; } #else /* CONFIG_IP_MULTIPLE_TABLES */ int __net_init fib4_rules_init(struct net *net); void __net_exit fib4_rules_exit(struct net *net); struct fib_table *fib_new_table(struct net *net, u32 id); struct fib_table *fib_get_table(struct net *net, u32 id); int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res, unsigned int flags); static inline int fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res, unsigned int flags) { struct fib_table *tb; int err = -ENETUNREACH; flags |= FIB_LOOKUP_NOREF; if (net->ipv4.fib_has_custom_rules) return __fib_lookup(net, flp, res, flags); rcu_read_lock(); res->tclassid = 0; tb = rcu_dereference_rtnl(net->ipv4.fib_main); if (tb) err = fib_table_lookup(tb, flp, res, flags); if (!err) goto out; tb = rcu_dereference_rtnl(net->ipv4.fib_default); if (tb) err = fib_table_lookup(tb, flp, res, flags); out: if (err == -EAGAIN) err = -ENETUNREACH; rcu_read_unlock(); return err; } static inline bool fib4_has_custom_rules(const struct net *net) { return net->ipv4.fib_has_custom_rules; } bool fib4_rule_default(const struct fib_rule *rule); int fib4_rules_dump(struct net *net, struct notifier_block *nb, struct netlink_ext_ack *extack); unsigned int fib4_rules_seq_read(const struct net *net); static inline bool fib4_rules_early_flow_dissect(struct net *net, struct sk_buff *skb, struct flowi4 *fl4, struct flow_keys *flkeys) { unsigned int flag = FLOW_DISSECTOR_F_STOP_AT_ENCAP; if (!net->ipv4.fib_rules_require_fldissect) return false; memset(flkeys, 0, sizeof(*flkeys)); __skb_flow_dissect(net, skb, &flow_keys_dissector, flkeys, NULL, 0, 0, 0, flag); fl4->fl4_sport = flkeys->ports.src; 
fl4->fl4_dport = flkeys->ports.dst; fl4->flowi4_proto = flkeys->basic.ip_proto; return true; } #endif /* CONFIG_IP_MULTIPLE_TABLES */ static inline bool fib_dscp_masked_match(dscp_t dscp, const struct flowi4 *fl4) { return dscp == (fl4->flowi4_dscp & INET_DSCP_LEGACY_TOS_MASK); } /* Exported by fib_frontend.c */ extern const struct nla_policy rtm_ipv4_policy[]; void ip_fib_init(void); int fib_gw_from_via(struct fib_config *cfg, struct nlattr *nla, struct netlink_ext_ack *extack); __be32 fib_compute_spec_dst(struct sk_buff *skb); bool fib_info_nh_uses_dev(struct fib_info *fi, const struct net_device *dev); int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, dscp_t dscp, int oif, struct net_device *dev, struct in_device *idev, u32 *itag); static inline enum skb_drop_reason fib_validate_source_reason(struct sk_buff *skb, __be32 src, __be32 dst, dscp_t dscp, int oif, struct net_device *dev, struct in_device *idev, u32 *itag) { int err = fib_validate_source(skb, src, dst, dscp, oif, dev, idev, itag); if (err < 0) return -err; return SKB_NOT_DROPPED_YET; } #ifdef CONFIG_IP_ROUTE_CLASSID static inline int fib_num_tclassid_users(struct net *net) { return atomic_read(&net->ipv4.fib_num_tclassid_users); } #else static inline int fib_num_tclassid_users(struct net *net) { return 0; } #endif int fib_unmerge(struct net *net); static inline bool nhc_l3mdev_matches_dev(const struct fib_nh_common *nhc, const struct net_device *dev) { if (nhc->nhc_dev == dev || l3mdev_master_ifindex_rcu(nhc->nhc_dev) == dev->ifindex) return true; return false; } /* Exported by fib_semantics.c */ int ip_fib_check_default(__be32 gw, struct net_device *dev); int fib_sync_down_dev(struct net_device *dev, unsigned long event, bool force); int fib_sync_down_addr(struct net_device *dev, __be32 local); int fib_sync_up(struct net_device *dev, unsigned char nh_flags); void fib_sync_mtu(struct net_device *dev, u32 orig_mtu); void fib_nhc_update_mtu(struct fib_nh_common *nhc, u32 new, u32 orig); /* Fields used for sysctl_fib_multipath_hash_fields. * Common to IPv4 and IPv6. * * Add new fields at the end. This is user API. 
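 * For example, the default mask 0x0007 (bits 0-2 below) hashes only on the
 * outer source address, destination address and IP protocol.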
*/ #define FIB_MULTIPATH_HASH_FIELD_SRC_IP BIT(0) #define FIB_MULTIPATH_HASH_FIELD_DST_IP BIT(1) #define FIB_MULTIPATH_HASH_FIELD_IP_PROTO BIT(2) #define FIB_MULTIPATH_HASH_FIELD_FLOWLABEL BIT(3) #define FIB_MULTIPATH_HASH_FIELD_SRC_PORT BIT(4) #define FIB_MULTIPATH_HASH_FIELD_DST_PORT BIT(5) #define FIB_MULTIPATH_HASH_FIELD_INNER_SRC_IP BIT(6) #define FIB_MULTIPATH_HASH_FIELD_INNER_DST_IP BIT(7) #define FIB_MULTIPATH_HASH_FIELD_INNER_IP_PROTO BIT(8) #define FIB_MULTIPATH_HASH_FIELD_INNER_FLOWLABEL BIT(9) #define FIB_MULTIPATH_HASH_FIELD_INNER_SRC_PORT BIT(10) #define FIB_MULTIPATH_HASH_FIELD_INNER_DST_PORT BIT(11) #define FIB_MULTIPATH_HASH_FIELD_OUTER_MASK \ (FIB_MULTIPATH_HASH_FIELD_SRC_IP | \ FIB_MULTIPATH_HASH_FIELD_DST_IP | \ FIB_MULTIPATH_HASH_FIELD_IP_PROTO | \ FIB_MULTIPATH_HASH_FIELD_FLOWLABEL | \ FIB_MULTIPATH_HASH_FIELD_SRC_PORT | \ FIB_MULTIPATH_HASH_FIELD_DST_PORT) #define FIB_MULTIPATH_HASH_FIELD_INNER_MASK \ (FIB_MULTIPATH_HASH_FIELD_INNER_SRC_IP | \ FIB_MULTIPATH_HASH_FIELD_INNER_DST_IP | \ FIB_MULTIPATH_HASH_FIELD_INNER_IP_PROTO | \ FIB_MULTIPATH_HASH_FIELD_INNER_FLOWLABEL | \ FIB_MULTIPATH_HASH_FIELD_INNER_SRC_PORT | \ FIB_MULTIPATH_HASH_FIELD_INNER_DST_PORT) #define FIB_MULTIPATH_HASH_FIELD_ALL_MASK \ (FIB_MULTIPATH_HASH_FIELD_OUTER_MASK | \ FIB_MULTIPATH_HASH_FIELD_INNER_MASK) #define FIB_MULTIPATH_HASH_FIELD_DEFAULT_MASK \ (FIB_MULTIPATH_HASH_FIELD_SRC_IP | \ FIB_MULTIPATH_HASH_FIELD_DST_IP | \ FIB_MULTIPATH_HASH_FIELD_IP_PROTO) #ifdef CONFIG_IP_ROUTE_MULTIPATH int fib_multipath_hash(const struct net *net, const struct flowi4 *fl4, const struct sk_buff *skb, struct flow_keys *flkeys); static void fib_multipath_hash_construct_key(siphash_key_t *key, u32 mp_seed) { u64 mp_seed_64 = mp_seed; key->key[0] = (mp_seed_64 << 32) | mp_seed_64; key->key[1] = key->key[0]; } static inline u32 fib_multipath_hash_from_keys(const struct net *net, struct flow_keys *keys) { siphash_aligned_key_t hash_key; u32 mp_seed; mp_seed = READ_ONCE(net->ipv4.sysctl_fib_multipath_hash_seed).mp_seed; fib_multipath_hash_construct_key(&hash_key, mp_seed); return flow_hash_from_keys_seed(keys, &hash_key); } #else static inline u32 fib_multipath_hash_from_keys(const struct net *net, struct flow_keys *keys) { return flow_hash_from_keys(keys); } #endif int fib_check_nh(struct net *net, struct fib_nh *nh, u32 table, u8 scope, struct netlink_ext_ack *extack); void fib_select_multipath(struct fib_result *res, int hash, const struct flowi4 *fl4); void fib_select_path(struct net *net, struct fib_result *res, struct flowi4 *fl4, const struct sk_buff *skb); int fib_nh_init(struct net *net, struct fib_nh *fib_nh, struct fib_config *cfg, int nh_weight, struct netlink_ext_ack *extack); void fib_nh_release(struct net *net, struct fib_nh *fib_nh); int fib_nh_common_init(struct net *net, struct fib_nh_common *nhc, struct nlattr *fc_encap, u16 fc_encap_type, void *cfg, gfp_t gfp_flags, struct netlink_ext_ack *extack); void fib_nh_common_release(struct fib_nh_common *nhc); /* Exported by fib_trie.c */ void fib_alias_hw_flags_set(struct net *net, const struct fib_rt_info *fri); void fib_trie_init(void); struct fib_table *fib_trie_table(u32 id, struct fib_table *alias); bool fib_lookup_good_nhc(const struct fib_nh_common *nhc, int fib_flags, const struct flowi4 *flp); static inline void fib_combine_itag(u32 *itag, const struct fib_result *res) { #ifdef CONFIG_IP_ROUTE_CLASSID struct fib_nh_common *nhc = res->nhc; #ifdef CONFIG_IP_MULTIPLE_TABLES u32 rtag; #endif if (nhc->nhc_family == AF_INET) { struct fib_nh *nh; nh = 
container_of(nhc, struct fib_nh, nh_common); *itag = nh->nh_tclassid << 16; } else { *itag = 0; } #ifdef CONFIG_IP_MULTIPLE_TABLES rtag = res->tclassid; if (*itag == 0) *itag = (rtag<<16); *itag |= (rtag>>16); #endif #endif } void fib_flush(struct net *net); void free_fib_info(struct fib_info *fi); static inline void fib_info_hold(struct fib_info *fi) { refcount_inc(&fi->fib_clntref); } static inline void fib_info_put(struct fib_info *fi) { if (refcount_dec_and_test(&fi->fib_clntref)) free_fib_info(fi); } #ifdef CONFIG_PROC_FS int __net_init fib_proc_init(struct net *net); void __net_exit fib_proc_exit(struct net *net); #else static inline int fib_proc_init(struct net *net) { return 0; } static inline void fib_proc_exit(struct net *net) { } #endif u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr); int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, struct fib_dump_filter *filter, struct netlink_callback *cb); int fib_nexthop_info(struct sk_buff *skb, const struct fib_nh_common *nh, u8 rt_family, unsigned char *flags, bool skip_oif); int fib_add_nexthop(struct sk_buff *skb, const struct fib_nh_common *nh, int nh_weight, u8 rt_family, u32 nh_tclassid); #endif /* _NET_FIB_H */ |
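/*
 * Illustrative sketch only, not part of ip_fib.h: one way a caller could use
 * the fib_lookup() / FIB_RES_*() helpers declared above to resolve an IPv4
 * destination to an outgoing interface index. The function name and the
 * minimal error handling are invented for the example.
 */
static int demo_fib_lookup_oif(struct net *net, __be32 daddr)
{
	struct flowi4 fl4 = { .daddr = daddr };
	struct fib_result res;
	int err;

	/* Hold RCU while res may reference nexthop and device state. */
	rcu_read_lock();
	err = fib_lookup(net, &fl4, &res, 0);
	if (!err)
		err = FIB_RES_OIF(res);	/* ifindex on success, else -errno */
	rcu_read_unlock();

	return err;
}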
| 2 2 2 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 | // SPDX-License-Identifier: GPL-2.0 /* * mm/fadvise.c * * Copyright (C) 2002, Linus Torvalds * * 11Jan2003 Andrew Morton * Initial version. */ #include <linux/kernel.h> #include <linux/file.h> #include <linux/fs.h> #include <linux/mm.h> #include <linux/pagemap.h> #include <linux/backing-dev.h> #include <linux/fadvise.h> #include <linux/writeback.h> #include <linux/syscalls.h> #include <linux/swap.h> #include <asm/unistd.h> #include "internal.h" /* * POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could * deactivate the pages and clear PG_Referenced. */ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) { struct inode *inode; struct address_space *mapping; struct backing_dev_info *bdi; loff_t endbyte; /* inclusive */ pgoff_t start_index; pgoff_t end_index; unsigned long nrpages; inode = file_inode(file); if (S_ISFIFO(inode->i_mode)) return -ESPIPE; mapping = file->f_mapping; if (!mapping || len < 0) return -EINVAL; bdi = inode_to_bdi(mapping->host); if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) { switch (advice) { case POSIX_FADV_NORMAL: case POSIX_FADV_RANDOM: case POSIX_FADV_SEQUENTIAL: case POSIX_FADV_WILLNEED: case POSIX_FADV_NOREUSE: case POSIX_FADV_DONTNEED: /* no bad return value, but ignore advice */ break; default: return -EINVAL; } return 0; } /* * Careful about overflows. Len == 0 means "as much as possible". Use * unsigned math because signed overflows are undefined and UBSan * complains. */ endbyte = (u64)offset + (u64)len; if (!len || endbyte < len) endbyte = LLONG_MAX; else endbyte--; /* inclusive */ switch (advice) { case POSIX_FADV_NORMAL: file->f_ra.ra_pages = bdi->ra_pages; spin_lock(&file->f_lock); file->f_mode &= ~(FMODE_RANDOM | FMODE_NOREUSE); spin_unlock(&file->f_lock); break; case POSIX_FADV_RANDOM: spin_lock(&file->f_lock); file->f_mode |= FMODE_RANDOM; spin_unlock(&file->f_lock); break; case POSIX_FADV_SEQUENTIAL: file->f_ra.ra_pages = bdi->ra_pages * 2; spin_lock(&file->f_lock); file->f_mode &= ~FMODE_RANDOM; spin_unlock(&file->f_lock); break; case POSIX_FADV_WILLNEED: /* First and last PARTIAL page! */ start_index = offset >> PAGE_SHIFT; end_index = endbyte >> PAGE_SHIFT; /* Careful about overflow on the "+1" */ nrpages = end_index - start_index + 1; if (!nrpages) nrpages = ~0UL; force_page_cache_readahead(mapping, file, start_index, nrpages); break; case POSIX_FADV_NOREUSE: spin_lock(&file->f_lock); file->f_mode |= FMODE_NOREUSE; spin_unlock(&file->f_lock); break; case POSIX_FADV_DONTNEED: filemap_flush_range(mapping, offset, endbyte); /* * First and last FULL page! Partial pages are deliberately * preserved on the expectation that it is better to preserve * needed memory than to discard unneeded memory. 
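 * For example (assuming 4 KiB pages), DONTNEED with offset 1000 and len 8000
 * covers bytes 1000-8999 but only invalidates page index 1 (bytes 4096-8191);
 * the partially covered pages 0 and 2 are kept.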
*/ start_index = (offset+(PAGE_SIZE-1)) >> PAGE_SHIFT; end_index = (endbyte >> PAGE_SHIFT); /* * The page at end_index will be inclusively discarded according * by invalidate_mapping_pages(), so subtracting 1 from * end_index means we will skip the last page. But if endbyte * is page aligned or is at the end of file, we should not skip * that page - discarding the last page is safe enough. */ if ((endbyte & ~PAGE_MASK) != ~PAGE_MASK && endbyte != inode->i_size - 1) { /* First page is tricky as 0 - 1 = -1, but pgoff_t * is unsigned, so the end_index >= start_index * check below would be true and we'll discard the whole * file cache which is not what was asked. */ if (end_index == 0) break; end_index--; } if (end_index >= start_index) { unsigned long nr_failed = 0; /* * It's common to FADV_DONTNEED right after * the read or write that instantiates the * pages, in which case there will be some * sitting on the local LRU cache. Try to * avoid the expensive remote drain and the * second cache tree walk below by flushing * them out right away. */ lru_add_drain(); mapping_try_invalidate(mapping, start_index, end_index, &nr_failed); /* * The failures may be due to the folio being * in the LRU cache of a remote CPU. Drain all * caches and try again. */ if (nr_failed) { lru_add_drain_all(); invalidate_mapping_pages(mapping, start_index, end_index); } } break; default: return -EINVAL; } return 0; } EXPORT_SYMBOL(generic_fadvise); int vfs_fadvise(struct file *file, loff_t offset, loff_t len, int advice) { if (file->f_op->fadvise) return file->f_op->fadvise(file, offset, len, advice); return generic_fadvise(file, offset, len, advice); } EXPORT_SYMBOL(vfs_fadvise); #ifdef CONFIG_ADVISE_SYSCALLS int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice) { CLASS(fd, f)(fd); if (fd_empty(f)) return -EBADF; return vfs_fadvise(fd_file(f), offset, len, advice); } SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) { return ksys_fadvise64_64(fd, offset, len, advice); } #ifdef __ARCH_WANT_SYS_FADVISE64 SYSCALL_DEFINE4(fadvise64, int, fd, loff_t, offset, size_t, len, int, advice) { return ksys_fadvise64_64(fd, offset, len, advice); } #endif #if defined(CONFIG_COMPAT) && defined(__ARCH_WANT_COMPAT_FADVISE64_64) COMPAT_SYSCALL_DEFINE6(fadvise64_64, int, fd, compat_arg_u64_dual(offset), compat_arg_u64_dual(len), int, advice) { return ksys_fadvise64_64(fd, compat_arg_u64_glue(offset), compat_arg_u64_glue(len), advice); } #endif #endif |
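/*
 * Userspace view of the path above (illustration only, not kernel code): on
 * most architectures posix_fadvise(2) enters through fadvise64_64/fadvise64
 * and ends up in vfs_fadvise()/generic_fadvise(). A zero length means "to the
 * end of the file", matching the "Len == 0 means as much as possible" case.
 * Build with _POSIX_C_SOURCE >= 200112L for posix_fadvise().
 */
#include <fcntl.h>

static int drop_file_cache(int fd)
{
	/* Ask the kernel to drop cached pages for the whole file. */
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}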
| 76 36 8 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef BTRFS_EXTENT_IO_TREE_H #define BTRFS_EXTENT_IO_TREE_H #include <linux/rbtree.h> #include <linux/spinlock.h> #include <linux/refcount.h> #include <linux/list.h> #include <linux/wait.h> #include "misc.h" struct extent_changeset; struct btrfs_fs_info; struct btrfs_inode; /* Bits for the extent state */ enum { ENUM_BIT(EXTENT_DIRTY), ENUM_BIT(EXTENT_LOCKED), ENUM_BIT(EXTENT_DIO_LOCKED), ENUM_BIT(EXTENT_DIRTY_LOG1), ENUM_BIT(EXTENT_DIRTY_LOG2), ENUM_BIT(EXTENT_DELALLOC), ENUM_BIT(EXTENT_DEFRAG), ENUM_BIT(EXTENT_BOUNDARY), ENUM_BIT(EXTENT_NODATASUM), ENUM_BIT(EXTENT_CLEAR_META_RESV), ENUM_BIT(EXTENT_NEED_WAIT), ENUM_BIT(EXTENT_NORESERVE), ENUM_BIT(EXTENT_QGROUP_RESERVED), ENUM_BIT(EXTENT_CLEAR_DATA_RESV), /* * Must be cleared only during ordered extent completion or on error * paths if we did not manage to submit bios and create the ordered * extents for the range. Should not be cleared during page release * and page invalidation (if there is an ordered extent in flight), * that is left for the ordered extent completion. */ ENUM_BIT(EXTENT_DELALLOC_NEW), /* * Mark that a range is being locked for finishing an ordered extent. * Used together with EXTENT_LOCKED. */ ENUM_BIT(EXTENT_FINISHING_ORDERED), /* * When an ordered extent successfully completes for a region marked as * a new delalloc range, use this flag when clearing a new delalloc * range to indicate that the VFS' inode number of bytes should be * incremented and the inode's new delalloc bytes decremented, in an * atomic way to prevent races with stat(2). */ ENUM_BIT(EXTENT_ADD_INODE_BYTES), /* * Set during truncate when we're clearing an entire range and we just * want the extent states to go away. */ ENUM_BIT(EXTENT_CLEAR_ALL_BITS), /* * This must be last. * * Bit not representing a state but a request for NOWAIT semantics, * e.g. when allocating memory, and must be masked out from the other * bits. 
*/ ENUM_BIT(EXTENT_NOWAIT) }; #define EXTENT_DO_ACCOUNTING (EXTENT_CLEAR_META_RESV | \ EXTENT_CLEAR_DATA_RESV) #define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | \ EXTENT_ADD_INODE_BYTES | \ EXTENT_CLEAR_ALL_BITS) #define EXTENT_LOCK_BITS (EXTENT_LOCKED | EXTENT_DIO_LOCKED) /* * Redefined bits above which are used only in the device allocation tree, * shouldn't be using EXTENT_LOCKED / EXTENT_BOUNDARY / EXTENT_CLEAR_META_RESV * / EXTENT_CLEAR_DATA_RESV because they have special meaning to the bit * manipulation functions */ #define CHUNK_ALLOCATED EXTENT_DIRTY #define CHUNK_TRIMMED EXTENT_DEFRAG #define CHUNK_STATE_MASK (CHUNK_ALLOCATED | \ CHUNK_TRIMMED) enum { IO_TREE_FS_PINNED_EXTENTS, IO_TREE_FS_EXCLUDED_EXTENTS, IO_TREE_BTREE_INODE_IO, IO_TREE_INODE_IO, IO_TREE_RELOC_BLOCKS, IO_TREE_TRANS_DIRTY_PAGES, IO_TREE_ROOT_DIRTY_LOG_PAGES, IO_TREE_INODE_FILE_EXTENT, IO_TREE_LOG_CSUM_RANGE, IO_TREE_SELFTEST, IO_TREE_DEVICE_ALLOC_STATE, }; struct extent_io_tree { struct rb_root state; /* * The fs_info is needed for trace points, a tree attached to an inode * needs the inode. * * owner == IO_TREE_INODE_IO - then inode is valid and fs_info can be * accessed as inode->root->fs_info */ union { struct btrfs_fs_info *fs_info; struct btrfs_inode *inode; }; /* Who owns this io tree, should be one of IO_TREE_* */ u8 owner; spinlock_t lock; }; struct extent_state { u64 start; u64 end; /* inclusive */ struct rb_node rb_node; /* ADD NEW ELEMENTS AFTER THIS */ wait_queue_head_t wq; refcount_t refs; u32 state; #ifdef CONFIG_BTRFS_DEBUG struct list_head leak_list; #endif }; const struct btrfs_inode *btrfs_extent_io_tree_to_inode(const struct extent_io_tree *tree); const struct btrfs_fs_info *btrfs_extent_io_tree_to_fs_info(const struct extent_io_tree *tree); void btrfs_extent_io_tree_init(struct btrfs_fs_info *fs_info, struct extent_io_tree *tree, unsigned int owner); void btrfs_extent_io_tree_release(struct extent_io_tree *tree); int btrfs_lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 bits, struct extent_state **cached); bool btrfs_try_lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 bits, struct extent_state **cached); static inline int btrfs_lock_extent(struct extent_io_tree *tree, u64 start, u64 end, struct extent_state **cached) { return btrfs_lock_extent_bits(tree, start, end, EXTENT_LOCKED, cached); } static inline bool btrfs_try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end, struct extent_state **cached) { return btrfs_try_lock_extent_bits(tree, start, end, EXTENT_LOCKED, cached); } int __init btrfs_extent_state_init_cachep(void); void __cold btrfs_extent_state_free_cachep(void); u64 btrfs_count_range_bits(struct extent_io_tree *tree, u64 *start, u64 search_end, u64 max_bytes, u32 bits, bool contig, struct extent_state **cached_state); void btrfs_free_extent_state(struct extent_state *state); bool btrfs_test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bit, struct extent_state *cached_state); bool btrfs_test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32 bit); void btrfs_get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits, struct extent_state **cached_state); int btrfs_clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 bits, struct extent_changeset *changeset); int btrfs_clear_extent_bit_changeset(struct extent_io_tree *tree, u64 start, u64 end, u32 bits, struct extent_state **cached, struct extent_changeset *changeset); static inline int 
btrfs_clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits, struct extent_state **cached) { return btrfs_clear_extent_bit_changeset(tree, start, end, bits, cached, NULL); } static inline int btrfs_unlock_extent(struct extent_io_tree *tree, u64 start, u64 end, struct extent_state **cached) { return btrfs_clear_extent_bit_changeset(tree, start, end, EXTENT_LOCKED, cached, NULL); } int btrfs_set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 bits, struct extent_changeset *changeset); int btrfs_set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits, struct extent_state **cached_state); static inline int btrfs_clear_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end, struct extent_state **cached) { return btrfs_clear_extent_bit(tree, start, end, EXTENT_DIRTY | EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING, cached); } int btrfs_convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits, u32 clear_bits, struct extent_state **cached_state); bool btrfs_find_first_extent_bit(struct extent_io_tree *tree, u64 start, u64 *start_ret, u64 *end_ret, u32 bits, struct extent_state **cached_state); void btrfs_find_first_clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 *start_ret, u64 *end_ret, u32 bits); bool btrfs_find_contiguous_extent_bit(struct extent_io_tree *tree, u64 start, u64 *start_ret, u64 *end_ret, u32 bits); bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start, u64 *end, u64 max_bytes, struct extent_state **cached_state); static inline int btrfs_lock_dio_extent(struct extent_io_tree *tree, u64 start, u64 end, struct extent_state **cached) { return btrfs_lock_extent_bits(tree, start, end, EXTENT_DIO_LOCKED, cached); } static inline bool btrfs_try_lock_dio_extent(struct extent_io_tree *tree, u64 start, u64 end, struct extent_state **cached) { return btrfs_try_lock_extent_bits(tree, start, end, EXTENT_DIO_LOCKED, cached); } static inline int btrfs_unlock_dio_extent(struct extent_io_tree *tree, u64 start, u64 end, struct extent_state **cached) { return btrfs_clear_extent_bit_changeset(tree, start, end, EXTENT_DIO_LOCKED, cached, NULL); } struct extent_state *btrfs_next_extent_state(struct extent_io_tree *tree, struct extent_state *state); #endif /* BTRFS_EXTENT_IO_TREE_H */ |
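/*
 * Illustrative sketch only, not part of this header: the usual pairing of
 * btrfs_lock_extent() and btrfs_unlock_extent() around a range, with a cached
 * extent_state to avoid a second tree search on unlock. The helper name is
 * invented; note that 'end' is inclusive, as in struct extent_state.
 */
static int demo_lock_range(struct extent_io_tree *tree, u64 start, u64 end)
{
	struct extent_state *cached = NULL;
	int ret;

	ret = btrfs_lock_extent(tree, start, end, &cached);
	if (ret)
		return ret;

	/* ... work on the locked, inclusive [start, end] range ... */

	btrfs_unlock_extent(tree, start, end, &cached);
	return 0;
}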
2339 2340 2341 2342 2343 | // SPDX-License-Identifier: GPL-2.0 /* * fs/ext4/fast_commit.c * * Written by Harshad Shirwadkar <harshadshirwadkar@gmail.com> * * Ext4 fast commits routines. */ #include "ext4.h" #include "ext4_jbd2.h" #include "ext4_extents.h" #include "mballoc.h" #include <linux/lockdep.h> /* * Ext4 Fast Commits * ----------------- * * Ext4 fast commits implement fine grained journalling for Ext4. * * Fast commits are organized as a log of tag-length-value (TLV) structs. (See * struct ext4_fc_tl). Each TLV contains some delta that is replayed TLV by * TLV during the recovery phase. For the scenarios for which we currently * don't have replay code, fast commit falls back to full commits. * Fast commits record delta in one of the following three categories. * * (A) Directory entry updates: * * - EXT4_FC_TAG_UNLINK - records directory entry unlink * - EXT4_FC_TAG_LINK - records directory entry link * - EXT4_FC_TAG_CREAT - records inode and directory entry creation * * (B) File specific data range updates: * * - EXT4_FC_TAG_ADD_RANGE - records addition of new blocks to an inode * - EXT4_FC_TAG_DEL_RANGE - records deletion of blocks from an inode * * (C) Inode metadata (mtime / ctime etc): * * - EXT4_FC_TAG_INODE - record the inode that should be replayed * during recovery. Note that iblocks field is * not replayed and instead derived during * replay. * Commit Operation * ---------------- * With fast commits, we maintain all the directory entry operations in the * order in which they are issued in an in-memory queue. This queue is flushed * to disk during the commit operation. We also maintain a list of inodes * that need to be committed during a fast commit in another in memory queue of * inodes. During the commit operation, we commit in the following order: * * [1] Prepare all the inodes to write out their data by setting * "EXT4_STATE_FC_FLUSHING_DATA". This ensures that inode cannot be * deleted while it is being flushed. * [2] Flush data buffers to disk and clear "EXT4_STATE_FC_FLUSHING_DATA" * state. * [3] Lock the journal by calling jbd2_journal_lock_updates. This ensures that * all the exsiting handles finish and no new handles can start. * [4] Mark all the fast commit eligible inodes as undergoing fast commit * by setting "EXT4_STATE_FC_COMMITTING" state. * [5] Unlock the journal by calling jbd2_journal_unlock_updates. This allows * starting of new handles. If new handles try to start an update on * any of the inodes that are being committed, ext4_fc_track_inode() * will block until those inodes have finished the fast commit. * [6] Commit all the directory entry updates in the fast commit space. * [7] Commit all the changed inodes in the fast commit space and clear * "EXT4_STATE_FC_COMMITTING" for these inodes. * [8] Write tail tag (this tag ensures the atomicity, please read the following * section for more details). * * All the inode updates must be enclosed within jbd2_jounrnal_start() * and jbd2_journal_stop() similar to JBD2 journaling. * * Fast Commit Ineligibility * ------------------------- * * Not all operations are supported by fast commits today (e.g extended * attributes). Fast commit ineligibility is marked by calling * ext4_fc_mark_ineligible(): This makes next fast commit operation to fall back * to full commit. * * Atomicity of commits * -------------------- * In order to guarantee atomicity during the commit operation, fast commit * uses "EXT4_FC_TAG_TAIL" tag that marks a fast commit as complete. 
Tail * tag contains CRC of the contents and TID of the transaction after which * this fast commit should be applied. Recovery code replays fast commit * logs only if there's at least 1 valid tail present. For every fast commit * operation, there is 1 tail. This means, we may end up with multiple tails * in the fast commit space. Here's an example: * * - Create a new file A and remove existing file B * - fsync() * - Append contents to file A * - Truncate file A * - fsync() * * The fast commit space at the end of above operations would look like this: * [HEAD] [CREAT A] [UNLINK B] [TAIL] [ADD_RANGE A] [DEL_RANGE A] [TAIL] * |<--- Fast Commit 1 --->|<--- Fast Commit 2 ---->| * * Replay code should thus check for all the valid tails in the FC area. * * Fast Commit Replay Idempotence * ------------------------------ * * Fast commits tags are idempotent in nature provided the recovery code follows * certain rules. The guiding principle that the commit path follows while * committing is that it stores the result of a particular operation instead of * storing the procedure. * * Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a' * was associated with inode 10. During fast commit, instead of storing this * operation as a procedure "rename a to b", we store the resulting file system * state as a "series" of outcomes: * * - Link dirent b to inode 10 * - Unlink dirent a * - Inode <10> with valid refcount * * Now when recovery code runs, it needs "enforce" this state on the file * system. This is what guarantees idempotence of fast commit replay. * * Let's take an example of a procedure that is not idempotent and see how fast * commits make it idempotent. Consider following sequence of operations: * * rm A; mv B A; read A * (x) (y) (z) * * (x), (y) and (z) are the points at which we can crash. If we store this * sequence of operations as is then the replay is not idempotent. Let's say * while in replay, we crash at (z). During the second replay, file A (which was * actually created as a result of "mv B A" operation) would get deleted. Thus, * file named A would be absent when we try to read A. So, this sequence of * operations is not idempotent. However, as mentioned above, instead of storing * the procedure fast commits store the outcome of each procedure. Thus the fast * commit log for above procedure would be as follows: * * (Let's assume dirent A was linked to inode 10 and dirent B was linked to * inode 11 before the replay) * * [Unlink A] [Link A to inode 11] [Unlink B] [Inode 11] * (w) (x) (y) (z) * * If we crash at (z), we will have file A linked to inode 11. During the second * replay, we will remove file A (inode 11). But we will create it back and make * it point to inode 11. We won't find B, so we'll just skip that step. At this * point, the refcount for inode 11 is not reliable, but that gets fixed by the * replay of last inode 11 tag. Crashes at points (w), (x) and (y) get handled * similarly. Thus, by converting a non-idempotent procedure into a series of * idempotent outcomes, fast commits ensured idempotence during the replay. * * Locking * ------- * sbi->s_fc_lock protects the fast commit inodes queue and the fast commit * dentry queue. ei->i_fc_lock protects the fast commit related info in a given * inode. Most of the code avoids acquiring both the locks, but if one must do * that then sbi->s_fc_lock must be acquired before ei->i_fc_lock. 
* * TODOs * ----- * * 0) Fast commit replay path hardening: Fast commit replay code should use * journal handles to make sure all the updates it does during the replay * path are atomic. With that if we crash during fast commit replay, after * trying to do recovery again, we will find a file system where fast commit * area is invalid (because new full commit would be found). In order to deal * with that, fast commit replay code should ensure that the "FC_REPLAY" * superblock state is persisted before starting the replay, so that after * the crash, fast commit recovery code can look at that flag and perform * fast commit recovery even if that area is invalidated by later full * commits. * * 1) Handle more ineligible cases. * * 2) Change ext4_fc_commit() to lookup logical to physical mapping using extent * status tree. This would get rid of the need to call ext4_fc_track_inode() * before acquiring i_data_sem. To do that we would need to ensure that * modified extents from the extent status tree are not evicted from memory. */ #include <trace/events/ext4.h> static struct kmem_cache *ext4_fc_dentry_cachep; static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate) { BUFFER_TRACE(bh, ""); if (uptodate) { ext4_debug("%s: Block %lld up-to-date", __func__, bh->b_blocknr); set_buffer_uptodate(bh); } else { ext4_debug("%s: Block %lld not up-to-date", __func__, bh->b_blocknr); clear_buffer_uptodate(bh); } unlock_buffer(bh); } static inline void ext4_fc_reset_inode(struct inode *inode) { struct ext4_inode_info *ei = EXT4_I(inode); ei->i_fc_lblk_start = 0; ei->i_fc_lblk_len = 0; } void ext4_fc_init_inode(struct inode *inode) { struct ext4_inode_info *ei = EXT4_I(inode); ext4_fc_reset_inode(inode); ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING); INIT_LIST_HEAD(&ei->i_fc_list); INIT_LIST_HEAD(&ei->i_fc_dilist); init_waitqueue_head(&ei->i_fc_wait); } static bool ext4_fc_disabled(struct super_block *sb) { return (!test_opt2(sb, JOURNAL_FAST_COMMIT) || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)); } /* * Remove inode from fast commit list. If the inode is being committed * we wait until inode commit is done. */ void ext4_fc_del(struct inode *inode) { struct ext4_inode_info *ei = EXT4_I(inode); struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); struct ext4_fc_dentry_update *fc_dentry; wait_queue_head_t *wq; if (ext4_fc_disabled(inode->i_sb)) return; mutex_lock(&sbi->s_fc_lock); if (list_empty(&ei->i_fc_list) && list_empty(&ei->i_fc_dilist)) { mutex_unlock(&sbi->s_fc_lock); return; } /* * Since ext4_fc_del is called from ext4_evict_inode while having a * handle open, there is no need for us to wait here even if a fast * commit is going on. That is because, if this inode is being * committed, ext4_mark_inode_dirty would have waited for inode commit * operation to finish before we come here. So, by the time we come * here, inode's EXT4_STATE_FC_COMMITTING would have been cleared. So, * we shouldn't see EXT4_STATE_FC_COMMITTING to be set on this inode * here. * * We may come here without any handles open in the "no_delete" case of * ext4_evict_inode as well. However, if that happens, we first mark the * file system as fast commit ineligible anyway. So, even in that case, * it is okay to remove the inode from the fc list. 
*/ WARN_ON(ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING) && !ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE)); while (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) { #if (BITS_PER_LONG < 64) DEFINE_WAIT_BIT(wait, &ei->i_state_flags, EXT4_STATE_FC_FLUSHING_DATA); wq = bit_waitqueue(&ei->i_state_flags, EXT4_STATE_FC_FLUSHING_DATA); #else DEFINE_WAIT_BIT(wait, &ei->i_flags, EXT4_STATE_FC_FLUSHING_DATA); wq = bit_waitqueue(&ei->i_flags, EXT4_STATE_FC_FLUSHING_DATA); #endif prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); if (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) { mutex_unlock(&sbi->s_fc_lock); schedule(); mutex_lock(&sbi->s_fc_lock); } finish_wait(wq, &wait.wq_entry); } list_del_init(&ei->i_fc_list); /* * Since this inode is getting removed, let's also remove all FC * dentry create references, since it is not needed to log it anyways. */ if (list_empty(&ei->i_fc_dilist)) { mutex_unlock(&sbi->s_fc_lock); return; } fc_dentry = list_first_entry(&ei->i_fc_dilist, struct ext4_fc_dentry_update, fcd_dilist); WARN_ON(fc_dentry->fcd_op != EXT4_FC_TAG_CREAT); list_del_init(&fc_dentry->fcd_list); list_del_init(&fc_dentry->fcd_dilist); WARN_ON(!list_empty(&ei->i_fc_dilist)); mutex_unlock(&sbi->s_fc_lock); release_dentry_name_snapshot(&fc_dentry->fcd_name); kmem_cache_free(ext4_fc_dentry_cachep, fc_dentry); } /* * Mark file system as fast commit ineligible, and record latest * ineligible transaction tid. This means until the recorded * transaction, commit operation would result in a full jbd2 commit. */ void ext4_fc_mark_ineligible(struct super_block *sb, int reason, handle_t *handle) { struct ext4_sb_info *sbi = EXT4_SB(sb); tid_t tid; bool has_transaction = true; bool is_ineligible; if (ext4_fc_disabled(sb)) return; if (handle && !IS_ERR(handle)) tid = handle->h_transaction->t_tid; else { read_lock(&sbi->s_journal->j_state_lock); if (sbi->s_journal->j_running_transaction) tid = sbi->s_journal->j_running_transaction->t_tid; else has_transaction = false; read_unlock(&sbi->s_journal->j_state_lock); } mutex_lock(&sbi->s_fc_lock); is_ineligible = ext4_test_mount_flag(sb, EXT4_MF_FC_INELIGIBLE); if (has_transaction && (!is_ineligible || tid_gt(tid, sbi->s_fc_ineligible_tid))) sbi->s_fc_ineligible_tid = tid; ext4_set_mount_flag(sb, EXT4_MF_FC_INELIGIBLE); mutex_unlock(&sbi->s_fc_lock); WARN_ON(reason >= EXT4_FC_REASON_MAX); sbi->s_fc_stats.fc_ineligible_reason_count[reason]++; } /* * Generic fast commit tracking function. If this is the first time this we are * called after a full commit, we initialize fast commit fields and then call * __fc_track_fn() with update = 0. If we have already been called after a full * commit, we pass update = 1. Based on that, the track function can determine * if it needs to track a field for the first time or if it needs to just * update the previously tracked value. * * If enqueue is set, this function enqueues the inode in fast commit list. 
*/ static int ext4_fc_track_template( handle_t *handle, struct inode *inode, int (*__fc_track_fn)(handle_t *handle, struct inode *, void *, bool), void *args, int enqueue) { bool update = false; struct ext4_inode_info *ei = EXT4_I(inode); struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); tid_t tid = 0; int ret; tid = handle->h_transaction->t_tid; spin_lock(&ei->i_fc_lock); if (tid == ei->i_sync_tid) { update = true; } else { ext4_fc_reset_inode(inode); ei->i_sync_tid = tid; } ret = __fc_track_fn(handle, inode, args, update); spin_unlock(&ei->i_fc_lock); if (!enqueue) return ret; mutex_lock(&sbi->s_fc_lock); if (list_empty(&EXT4_I(inode)->i_fc_list)) list_add_tail(&EXT4_I(inode)->i_fc_list, (sbi->s_journal->j_flags & JBD2_FULL_COMMIT_ONGOING || sbi->s_journal->j_flags & JBD2_FAST_COMMIT_ONGOING) ? &sbi->s_fc_q[FC_Q_STAGING] : &sbi->s_fc_q[FC_Q_MAIN]); mutex_unlock(&sbi->s_fc_lock); return ret; } struct __track_dentry_update_args { struct dentry *dentry; int op; }; /* __track_fn for directory entry updates. Called with ei->i_fc_lock. */ static int __track_dentry_update(handle_t *handle, struct inode *inode, void *arg, bool update) { struct ext4_fc_dentry_update *node; struct ext4_inode_info *ei = EXT4_I(inode); struct __track_dentry_update_args *dentry_update = (struct __track_dentry_update_args *)arg; struct dentry *dentry = dentry_update->dentry; struct inode *dir = dentry->d_parent->d_inode; struct super_block *sb = inode->i_sb; struct ext4_sb_info *sbi = EXT4_SB(sb); spin_unlock(&ei->i_fc_lock); if (IS_ENCRYPTED(dir)) { ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_ENCRYPTED_FILENAME, handle); spin_lock(&ei->i_fc_lock); return -EOPNOTSUPP; } node = kmem_cache_alloc(ext4_fc_dentry_cachep, GFP_NOFS); if (!node) { ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_NOMEM, handle); spin_lock(&ei->i_fc_lock); return -ENOMEM; } node->fcd_op = dentry_update->op; node->fcd_parent = dir->i_ino; node->fcd_ino = inode->i_ino; take_dentry_name_snapshot(&node->fcd_name, dentry); INIT_LIST_HEAD(&node->fcd_dilist); INIT_LIST_HEAD(&node->fcd_list); mutex_lock(&sbi->s_fc_lock); if (sbi->s_journal->j_flags & JBD2_FULL_COMMIT_ONGOING || sbi->s_journal->j_flags & JBD2_FAST_COMMIT_ONGOING) list_add_tail(&node->fcd_list, &sbi->s_fc_dentry_q[FC_Q_STAGING]); else list_add_tail(&node->fcd_list, &sbi->s_fc_dentry_q[FC_Q_MAIN]); /* * This helps us keep a track of all fc_dentry updates which is part of * this ext4 inode. So in case the inode is getting unlinked, before * even we get a chance to fsync, we could remove all fc_dentry * references while evicting the inode in ext4_fc_del(). * Also with this, we don't need to loop over all the inodes in * sbi->s_fc_q to get the corresponding inode in * ext4_fc_commit_dentry_updates(). 
*/ if (dentry_update->op == EXT4_FC_TAG_CREAT) { WARN_ON(!list_empty(&ei->i_fc_dilist)); list_add_tail(&node->fcd_dilist, &ei->i_fc_dilist); } mutex_unlock(&sbi->s_fc_lock); spin_lock(&ei->i_fc_lock); return 0; } void __ext4_fc_track_unlink(handle_t *handle, struct inode *inode, struct dentry *dentry) { struct __track_dentry_update_args args; int ret; args.dentry = dentry; args.op = EXT4_FC_TAG_UNLINK; ret = ext4_fc_track_template(handle, inode, __track_dentry_update, (void *)&args, 0); trace_ext4_fc_track_unlink(handle, inode, dentry, ret); } void ext4_fc_track_unlink(handle_t *handle, struct dentry *dentry) { struct inode *inode = d_inode(dentry); if (ext4_fc_disabled(inode->i_sb)) return; if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE)) return; __ext4_fc_track_unlink(handle, inode, dentry); } void __ext4_fc_track_link(handle_t *handle, struct inode *inode, struct dentry *dentry) { struct __track_dentry_update_args args; int ret; args.dentry = dentry; args.op = EXT4_FC_TAG_LINK; ret = ext4_fc_track_template(handle, inode, __track_dentry_update, (void *)&args, 0); trace_ext4_fc_track_link(handle, inode, dentry, ret); } void ext4_fc_track_link(handle_t *handle, struct dentry *dentry) { struct inode *inode = d_inode(dentry); if (ext4_fc_disabled(inode->i_sb)) return; if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE)) return; __ext4_fc_track_link(handle, inode, dentry); } void __ext4_fc_track_create(handle_t *handle, struct inode *inode, struct dentry *dentry) { struct __track_dentry_update_args args; int ret; args.dentry = dentry; args.op = EXT4_FC_TAG_CREAT; ret = ext4_fc_track_template(handle, inode, __track_dentry_update, (void *)&args, 0); trace_ext4_fc_track_create(handle, inode, dentry, ret); } void ext4_fc_track_create(handle_t *handle, struct dentry *dentry) { struct inode *inode = d_inode(dentry); if (ext4_fc_disabled(inode->i_sb)) return; if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE)) return; __ext4_fc_track_create(handle, inode, dentry); } /* __track_fn for inode tracking */ static int __track_inode(handle_t *handle, struct inode *inode, void *arg, bool update) { if (update) return -EEXIST; EXT4_I(inode)->i_fc_lblk_len = 0; return 0; } void ext4_fc_track_inode(handle_t *handle, struct inode *inode) { struct ext4_inode_info *ei = EXT4_I(inode); wait_queue_head_t *wq; int ret; if (S_ISDIR(inode->i_mode)) return; if (ext4_fc_disabled(inode->i_sb)) return; if (ext4_should_journal_data(inode)) { ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_INODE_JOURNAL_DATA, handle); return; } if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE)) return; /* * If we come here, we may sleep while waiting for the inode to * commit. We shouldn't be holding i_data_sem when we go to sleep since * the commit path needs to grab the lock while committing the inode. */ lockdep_assert_not_held(&ei->i_data_sem); while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) { #if (BITS_PER_LONG < 64) DEFINE_WAIT_BIT(wait, &ei->i_state_flags, EXT4_STATE_FC_COMMITTING); wq = bit_waitqueue(&ei->i_state_flags, EXT4_STATE_FC_COMMITTING); #else DEFINE_WAIT_BIT(wait, &ei->i_flags, EXT4_STATE_FC_COMMITTING); wq = bit_waitqueue(&ei->i_flags, EXT4_STATE_FC_COMMITTING); #endif prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) schedule(); finish_wait(wq, &wait.wq_entry); } /* * From this point on, this inode will not be committed either * by fast or full commit as long as the handle is open. 
*/ ret = ext4_fc_track_template(handle, inode, __track_inode, NULL, 1); trace_ext4_fc_track_inode(handle, inode, ret); } struct __track_range_args { ext4_lblk_t start, end; }; /* __track_fn for tracking data updates */ static int __track_range(handle_t *handle, struct inode *inode, void *arg, bool update) { struct ext4_inode_info *ei = EXT4_I(inode); ext4_lblk_t oldstart; struct __track_range_args *__arg = (struct __track_range_args *)arg; if (inode->i_ino < EXT4_FIRST_INO(inode->i_sb)) { ext4_debug("Special inode %ld being modified\n", inode->i_ino); return -ECANCELED; } oldstart = ei->i_fc_lblk_start; if (update && ei->i_fc_lblk_len > 0) { ei->i_fc_lblk_start = min(ei->i_fc_lblk_start, __arg->start); ei->i_fc_lblk_len = max(oldstart + ei->i_fc_lblk_len - 1, __arg->end) - ei->i_fc_lblk_start + 1; } else { ei->i_fc_lblk_start = __arg->start; ei->i_fc_lblk_len = __arg->end - __arg->start + 1; } return 0; } void ext4_fc_track_range(handle_t *handle, struct inode *inode, ext4_lblk_t start, ext4_lblk_t end) { struct __track_range_args args; int ret; if (S_ISDIR(inode->i_mode)) return; if (ext4_fc_disabled(inode->i_sb)) return; if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE)) return; if (ext4_has_inline_data(inode)) { ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR, handle); return; } args.start = start; args.end = end; ret = ext4_fc_track_template(handle, inode, __track_range, &args, 1); trace_ext4_fc_track_range(handle, inode, start, end, ret); } static void ext4_fc_submit_bh(struct super_block *sb, bool is_tail) { blk_opf_t write_flags = JBD2_JOURNAL_REQ_FLAGS; struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh; /* Add REQ_FUA | REQ_PREFLUSH only its tail */ if (test_opt(sb, BARRIER) && is_tail) write_flags |= REQ_FUA | REQ_PREFLUSH; lock_buffer(bh); set_buffer_dirty(bh); set_buffer_uptodate(bh); bh->b_end_io = ext4_end_buffer_io_sync; submit_bh(REQ_OP_WRITE | write_flags, bh); EXT4_SB(sb)->s_fc_bh = NULL; } /* Ext4 commit path routines */ /* * Allocate len bytes on a fast commit buffer. * * During the commit time this function is used to manage fast commit * block space. We don't split a fast commit log onto different * blocks. So this function makes sure that if there's not enough space * on the current block, the remaining space in the current block is * marked as unused by adding EXT4_FC_TAG_PAD tag. In that case, * new block is from jbd2 and CRC is updated to reflect the padding * we added. */ static u8 *ext4_fc_reserve_space(struct super_block *sb, int len, u32 *crc) { struct ext4_fc_tl tl; struct ext4_sb_info *sbi = EXT4_SB(sb); struct buffer_head *bh; int bsize = sbi->s_journal->j_blocksize; int ret, off = sbi->s_fc_bytes % bsize; int remaining; u8 *dst; /* * If 'len' is too long to fit in any block alongside a PAD tlv, then we * cannot fulfill the request. */ if (len > bsize - EXT4_FC_TAG_BASE_LEN) return NULL; if (!sbi->s_fc_bh) { ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh); if (ret) return NULL; sbi->s_fc_bh = bh; } dst = sbi->s_fc_bh->b_data + off; /* * Allocate the bytes in the current block if we can do so while still * leaving enough space for a PAD tlv. */ remaining = bsize - EXT4_FC_TAG_BASE_LEN - off; if (len <= remaining) { sbi->s_fc_bytes += len; return dst; } /* * Else, terminate the current block with a PAD tlv, then allocate a new * block and allocate the bytes at the start of that new block. 
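 * After spilling into a new block, s_fc_bytes is advanced by the unused
 * remainder of the old block plus len, so that s_fc_bytes % bsize again
 * equals the number of bytes consumed at the start of the new block.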
*/ tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_PAD); tl.fc_len = cpu_to_le16(remaining); memcpy(dst, &tl, EXT4_FC_TAG_BASE_LEN); memset(dst + EXT4_FC_TAG_BASE_LEN, 0, remaining); *crc = ext4_chksum(*crc, sbi->s_fc_bh->b_data, bsize); ext4_fc_submit_bh(sb, false); ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh); if (ret) return NULL; sbi->s_fc_bh = bh; sbi->s_fc_bytes += bsize - off + len; return sbi->s_fc_bh->b_data; } /* * Complete a fast commit by writing tail tag. * * Writing tail tag marks the end of a fast commit. In order to guarantee * atomicity, after writing tail tag, even if there's space remaining * in the block, next commit shouldn't use it. That's why tail tag * has the length as that of the remaining space on the block. */ static int ext4_fc_write_tail(struct super_block *sb, u32 crc) { struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_fc_tl tl; struct ext4_fc_tail tail; int off, bsize = sbi->s_journal->j_blocksize; u8 *dst; /* * ext4_fc_reserve_space takes care of allocating an extra block if * there's no enough space on this block for accommodating this tail. */ dst = ext4_fc_reserve_space(sb, EXT4_FC_TAG_BASE_LEN + sizeof(tail), &crc); if (!dst) return -ENOSPC; off = sbi->s_fc_bytes % bsize; tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_TAIL); tl.fc_len = cpu_to_le16(bsize - off + sizeof(struct ext4_fc_tail)); sbi->s_fc_bytes = round_up(sbi->s_fc_bytes, bsize); memcpy(dst, &tl, EXT4_FC_TAG_BASE_LEN); dst += EXT4_FC_TAG_BASE_LEN; tail.fc_tid = cpu_to_le32(sbi->s_journal->j_running_transaction->t_tid); memcpy(dst, &tail.fc_tid, sizeof(tail.fc_tid)); dst += sizeof(tail.fc_tid); crc = ext4_chksum(crc, sbi->s_fc_bh->b_data, dst - (u8 *)sbi->s_fc_bh->b_data); tail.fc_crc = cpu_to_le32(crc); memcpy(dst, &tail.fc_crc, sizeof(tail.fc_crc)); dst += sizeof(tail.fc_crc); memset(dst, 0, bsize - off); /* Don't leak uninitialized memory. */ ext4_fc_submit_bh(sb, true); return 0; } /* * Adds tag, length, value and updates CRC. Returns true if tlv was added. * Returns false if there's not enough space. */ static bool ext4_fc_add_tlv(struct super_block *sb, u16 tag, u16 len, u8 *val, u32 *crc) { struct ext4_fc_tl tl; u8 *dst; dst = ext4_fc_reserve_space(sb, EXT4_FC_TAG_BASE_LEN + len, crc); if (!dst) return false; tl.fc_tag = cpu_to_le16(tag); tl.fc_len = cpu_to_le16(len); memcpy(dst, &tl, EXT4_FC_TAG_BASE_LEN); memcpy(dst + EXT4_FC_TAG_BASE_LEN, val, len); return true; } /* Same as above, but adds dentry tlv. */ static bool ext4_fc_add_dentry_tlv(struct super_block *sb, u32 *crc, struct ext4_fc_dentry_update *fc_dentry) { struct ext4_fc_dentry_info fcd; struct ext4_fc_tl tl; int dlen = fc_dentry->fcd_name.name.len; u8 *dst = ext4_fc_reserve_space(sb, EXT4_FC_TAG_BASE_LEN + sizeof(fcd) + dlen, crc); if (!dst) return false; fcd.fc_parent_ino = cpu_to_le32(fc_dentry->fcd_parent); fcd.fc_ino = cpu_to_le32(fc_dentry->fcd_ino); tl.fc_tag = cpu_to_le16(fc_dentry->fcd_op); tl.fc_len = cpu_to_le16(sizeof(fcd) + dlen); memcpy(dst, &tl, EXT4_FC_TAG_BASE_LEN); dst += EXT4_FC_TAG_BASE_LEN; memcpy(dst, &fcd, sizeof(fcd)); dst += sizeof(fcd); memcpy(dst, fc_dentry->fcd_name.name.name, dlen); return true; } /* * Writes inode in the fast commit space under TLV with tag @tag. * Returns 0 on success, error on failure. 
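 *
 * Value layout of the tlv built below: the little-endian tag/length
 * header (EXT4_FC_TAG_INODE), then the 32-bit inode number, then a
 * copy of the raw on-disk inode (EXT4_GOOD_OLD_INODE_SIZE bytes plus
 * i_extra_isize, or the full inode size when the inode carries inline
 * data).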
*/ static int ext4_fc_write_inode(struct inode *inode, u32 *crc) { struct ext4_inode_info *ei = EXT4_I(inode); int inode_len = EXT4_GOOD_OLD_INODE_SIZE; int ret; struct ext4_iloc iloc; struct ext4_fc_inode fc_inode; struct ext4_fc_tl tl; u8 *dst; ret = ext4_get_inode_loc(inode, &iloc); if (ret) return ret; if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA)) inode_len = EXT4_INODE_SIZE(inode->i_sb); else if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) inode_len += ei->i_extra_isize; fc_inode.fc_ino = cpu_to_le32(inode->i_ino); tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE); tl.fc_len = cpu_to_le16(inode_len + sizeof(fc_inode.fc_ino)); ret = -ECANCELED; dst = ext4_fc_reserve_space(inode->i_sb, EXT4_FC_TAG_BASE_LEN + inode_len + sizeof(fc_inode.fc_ino), crc); if (!dst) goto err; memcpy(dst, &tl, EXT4_FC_TAG_BASE_LEN); dst += EXT4_FC_TAG_BASE_LEN; memcpy(dst, &fc_inode, sizeof(fc_inode)); dst += sizeof(fc_inode); memcpy(dst, (u8 *)ext4_raw_inode(&iloc), inode_len); ret = 0; err: brelse(iloc.bh); return ret; } /* * Writes updated data ranges for the inode in question. Updates CRC. * Returns 0 on success, error otherwise. */ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc) { ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size; struct ext4_inode_info *ei = EXT4_I(inode); struct ext4_map_blocks map; struct ext4_fc_add_range fc_ext; struct ext4_fc_del_range lrange; struct ext4_extent *ex; int ret; spin_lock(&ei->i_fc_lock); if (ei->i_fc_lblk_len == 0) { spin_unlock(&ei->i_fc_lock); return 0; } old_blk_size = ei->i_fc_lblk_start; new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1; ei->i_fc_lblk_len = 0; spin_unlock(&ei->i_fc_lock); cur_lblk_off = old_blk_size; ext4_debug("will try writing %d to %d for inode %ld\n", cur_lblk_off, new_blk_size, inode->i_ino); while (cur_lblk_off <= new_blk_size) { map.m_lblk = cur_lblk_off; map.m_len = new_blk_size - cur_lblk_off + 1; ret = ext4_map_blocks(NULL, inode, &map, EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE); if (ret < 0) return -ECANCELED; if (map.m_len == 0) { cur_lblk_off++; continue; } if (ret == 0) { lrange.fc_ino = cpu_to_le32(inode->i_ino); lrange.fc_lblk = cpu_to_le32(map.m_lblk); lrange.fc_len = cpu_to_le32(map.m_len); if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE, sizeof(lrange), (u8 *)&lrange, crc)) return -ENOSPC; } else { unsigned int max = (map.m_flags & EXT4_MAP_UNWRITTEN) ? EXT_UNWRITTEN_MAX_LEN : EXT_INIT_MAX_LEN; /* Limit the number of blocks in one extent */ map.m_len = min(max, map.m_len); fc_ext.fc_ino = cpu_to_le32(inode->i_ino); ex = (struct ext4_extent *)&fc_ext.fc_ex; ex->ee_block = cpu_to_le32(map.m_lblk); ex->ee_len = cpu_to_le16(map.m_len); ext4_ext_store_pblock(ex, map.m_pblk); if (map.m_flags & EXT4_MAP_UNWRITTEN) ext4_ext_mark_unwritten(ex); else ext4_ext_mark_initialized(ex); if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE, sizeof(fc_ext), (u8 *)&fc_ext, crc)) return -ENOSPC; } cur_lblk_off += map.m_len; } return 0; } /* Flushes data of all the inodes in the commit queue. 
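 *
 * Writeback is submitted for every inode on the MAIN queue first and
 * only waited on afterwards, so the individual submissions can
 * proceed in parallel.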
*/ static int ext4_fc_flush_data(journal_t *journal) { struct super_block *sb = journal->j_private; struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_inode_info *ei; int ret = 0; list_for_each_entry(ei, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) { ret = jbd2_submit_inode_data(journal, ei->jinode); if (ret) return ret; } list_for_each_entry(ei, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) { ret = jbd2_wait_inode_data(journal, ei->jinode); if (ret) return ret; } return 0; } /* Commit all the directory entry updates */ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc) { struct super_block *sb = journal->j_private; struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_fc_dentry_update *fc_dentry, *fc_dentry_n; struct inode *inode; struct ext4_inode_info *ei; int ret; if (list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) return 0; list_for_each_entry_safe(fc_dentry, fc_dentry_n, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) { if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT) { if (!ext4_fc_add_dentry_tlv(sb, crc, fc_dentry)) return -ENOSPC; continue; } /* * With fcd_dilist we need not loop in sbi->s_fc_q to get the * corresponding inode. Also, the corresponding inode could have been * deleted, in which case, we don't need to do anything. */ if (list_empty(&fc_dentry->fcd_dilist)) continue; ei = list_first_entry(&fc_dentry->fcd_dilist, struct ext4_inode_info, i_fc_dilist); inode = &ei->vfs_inode; WARN_ON(inode->i_ino != fc_dentry->fcd_ino); /* * We first write the inode and then the create dirent. This * allows the recovery code to create an unnamed inode first * and then link it to a directory entry. This allows us * to use namei.c routines almost as is and simplifies * the recovery code. */ ret = ext4_fc_write_inode(inode, crc); if (ret) return ret; ret = ext4_fc_write_inode_data(inode, crc); if (ret) return ret; if (!ext4_fc_add_dentry_tlv(sb, crc, fc_dentry)) return -ENOSPC; } return 0; } static int ext4_fc_perform_commit(journal_t *journal) { struct super_block *sb = journal->j_private; struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_inode_info *iter; struct ext4_fc_head head; struct inode *inode; struct blk_plug plug; int ret = 0; u32 crc = 0; /* * Step 1: Mark all inodes on s_fc_q[MAIN] with * EXT4_STATE_FC_FLUSHING_DATA. This prevents these inodes from being * freed until the data flush is over. */ mutex_lock(&sbi->s_fc_lock); list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) { ext4_set_inode_state(&iter->vfs_inode, EXT4_STATE_FC_FLUSHING_DATA); } mutex_unlock(&sbi->s_fc_lock); /* Step 2: Flush data for all the eligible inodes. */ ret = ext4_fc_flush_data(journal); /* * Step 3: Clear EXT4_STATE_FC_FLUSHING_DATA flag, before returning * any error from step 2. This ensures that waiters waiting on * EXT4_STATE_FC_FLUSHING_DATA can resume. */ mutex_lock(&sbi->s_fc_lock); list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) { ext4_clear_inode_state(&iter->vfs_inode, EXT4_STATE_FC_FLUSHING_DATA); #if (BITS_PER_LONG < 64) wake_up_bit(&iter->i_state_flags, EXT4_STATE_FC_FLUSHING_DATA); #else wake_up_bit(&iter->i_flags, EXT4_STATE_FC_FLUSHING_DATA); #endif } /* * Make sure clearing of EXT4_STATE_FC_FLUSHING_DATA is visible before * the waiter checks the bit. Pairs with implicit barrier in * prepare_to_wait() in ext4_fc_del(). */ smp_mb(); mutex_unlock(&sbi->s_fc_lock); /* * If we encountered error in Step 2, return it now after clearing * EXT4_STATE_FC_FLUSHING_DATA bit. */ if (ret) return ret; /* Step 4: Mark all inodes as being committed. 
*/ jbd2_journal_lock_updates(journal); /* * The journal is now locked. No more handles can start and all the * previous handles are now drained. We now mark the inodes on the * commit queue as being committed. */ mutex_lock(&sbi->s_fc_lock); list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) { ext4_set_inode_state(&iter->vfs_inode, EXT4_STATE_FC_COMMITTING); } mutex_unlock(&sbi->s_fc_lock); jbd2_journal_unlock_updates(journal); /* * Step 5: If file system device is different from journal device, * issue a cache flush before we start writing fast commit blocks. */ if (journal->j_fs_dev != journal->j_dev) blkdev_issue_flush(journal->j_fs_dev); blk_start_plug(&plug); /* Step 6: Write fast commit blocks to disk. */ if (sbi->s_fc_bytes == 0) { /* * Step 6.1: Add a head tag only if this is the first fast * commit in this TID. */ head.fc_features = cpu_to_le32(EXT4_FC_SUPPORTED_FEATURES); head.fc_tid = cpu_to_le32( sbi->s_journal->j_running_transaction->t_tid); if (!ext4_fc_add_tlv(sb, EXT4_FC_TAG_HEAD, sizeof(head), (u8 *)&head, &crc)) { ret = -ENOSPC; goto out; } } /* Step 6.2: Now write all the dentry updates. */ mutex_lock(&sbi->s_fc_lock); ret = ext4_fc_commit_dentry_updates(journal, &crc); if (ret) goto out; /* Step 6.3: Now write all the changed inodes to disk. */ list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) { inode = &iter->vfs_inode; if (!ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) continue; ret = ext4_fc_write_inode_data(inode, &crc); if (ret) goto out; ret = ext4_fc_write_inode(inode, &crc); if (ret) goto out; } /* Step 6.4: Finally write tail tag to conclude this fast commit. */ ret = ext4_fc_write_tail(sb, crc); out: mutex_unlock(&sbi->s_fc_lock); blk_finish_plug(&plug); return ret; } static void ext4_fc_update_stats(struct super_block *sb, int status, u64 commit_time, int nblks, tid_t commit_tid) { struct ext4_fc_stats *stats = &EXT4_SB(sb)->s_fc_stats; ext4_debug("Fast commit ended with status = %d for tid %u", status, commit_tid); if (status == EXT4_FC_STATUS_OK) { stats->fc_num_commits++; stats->fc_numblks += nblks; if (likely(stats->s_fc_avg_commit_time)) stats->s_fc_avg_commit_time = (commit_time + stats->s_fc_avg_commit_time * 3) / 4; else stats->s_fc_avg_commit_time = commit_time; } else if (status == EXT4_FC_STATUS_FAILED || status == EXT4_FC_STATUS_INELIGIBLE) { if (status == EXT4_FC_STATUS_FAILED) stats->fc_failed_commits++; stats->fc_ineligible_commits++; } else { stats->fc_skipped_commits++; } trace_ext4_fc_commit_stop(sb, nblks, status, commit_tid); } /* * The main commit entry point. Performs a fast commit for transaction * commit_tid if needed. If it's not possible to perform a fast commit * due to various reasons, we fall back to full commit. Returns 0 * on success, error otherwise. 
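 *
 * Rough flow of the function below: establish the journal barrier via
 * jbd2_fc_begin_commit() (retrying if a racing fast commit did not
 * already cover commit_tid), fall back to a full commit if the
 * filesystem has been marked ineligible or a later step fails,
 * otherwise write the fast commit blocks, wait for them and end the
 * commit.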
*/ int ext4_fc_commit(journal_t *journal, tid_t commit_tid) { struct super_block *sb = journal->j_private; struct ext4_sb_info *sbi = EXT4_SB(sb); int nblks = 0, ret, bsize = journal->j_blocksize; int subtid = atomic_read(&sbi->s_fc_subtid); int status = EXT4_FC_STATUS_OK, fc_bufs_before = 0; ktime_t start_time, commit_time; int old_ioprio, journal_ioprio; if (!test_opt2(sb, JOURNAL_FAST_COMMIT)) return jbd2_complete_transaction(journal, commit_tid); trace_ext4_fc_commit_start(sb, commit_tid); start_time = ktime_get(); old_ioprio = get_current_ioprio(); restart_fc: ret = jbd2_fc_begin_commit(journal, commit_tid); if (ret == -EALREADY) { /* There was an ongoing commit, check if we need to restart */ if (atomic_read(&sbi->s_fc_subtid) <= subtid && tid_gt(commit_tid, journal->j_commit_sequence)) goto restart_fc; ext4_fc_update_stats(sb, EXT4_FC_STATUS_SKIPPED, 0, 0, commit_tid); return 0; } else if (ret) { /* * Commit couldn't start. Just update stats and perform a * full commit. */ ext4_fc_update_stats(sb, EXT4_FC_STATUS_FAILED, 0, 0, commit_tid); return jbd2_complete_transaction(journal, commit_tid); } /* * After establishing journal barrier via jbd2_fc_begin_commit(), check * if we are fast commit ineligible. */ if (ext4_test_mount_flag(sb, EXT4_MF_FC_INELIGIBLE)) { status = EXT4_FC_STATUS_INELIGIBLE; goto fallback; } /* * Now that we know that this thread is going to do a fast commit, * elevate the priority to match that of the journal thread. */ if (journal->j_task->io_context) journal_ioprio = sbi->s_journal->j_task->io_context->ioprio; else journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO; set_task_ioprio(current, journal_ioprio); fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize; ret = ext4_fc_perform_commit(journal); if (ret < 0) { status = EXT4_FC_STATUS_FAILED; goto fallback; } nblks = (sbi->s_fc_bytes + bsize - 1) / bsize - fc_bufs_before; ret = jbd2_fc_wait_bufs(journal, nblks); if (ret < 0) { status = EXT4_FC_STATUS_FAILED; goto fallback; } atomic_inc(&sbi->s_fc_subtid); ret = jbd2_fc_end_commit(journal); set_task_ioprio(current, old_ioprio); /* * weight the commit time higher than the average time so we * don't react too strongly to vast changes in the commit time */ commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time)); ext4_fc_update_stats(sb, status, commit_time, nblks, commit_tid); return ret; fallback: set_task_ioprio(current, old_ioprio); ret = jbd2_fc_end_commit_fallback(journal); ext4_fc_update_stats(sb, status, 0, 0, commit_tid); return ret; } /* * Fast commit cleanup routine. This is called after every fast commit and * full commit. full is true if we are called after a full commit. */ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid) { struct super_block *sb = journal->j_private; struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_inode_info *ei; struct ext4_fc_dentry_update *fc_dentry; if (full && sbi->s_fc_bh) sbi->s_fc_bh = NULL; trace_ext4_fc_cleanup(journal, full, tid); jbd2_fc_release_bufs(journal); mutex_lock(&sbi->s_fc_lock); while (!list_empty(&sbi->s_fc_q[FC_Q_MAIN])) { ei = list_first_entry(&sbi->s_fc_q[FC_Q_MAIN], struct ext4_inode_info, i_fc_list); list_del_init(&ei->i_fc_list); ext4_clear_inode_state(&ei->vfs_inode, EXT4_STATE_FC_COMMITTING); if (tid_geq(tid, ei->i_sync_tid)) { ext4_fc_reset_inode(&ei->vfs_inode); } else if (full) { /* * We are called after a full commit, inode has been * modified while the commit was running. Re-enqueue * the inode into STAGING, which will then be splice * back into MAIN. 
This cannot happen during * fastcommit because the journal is locked all the * time in that case (and tid doesn't increase so * tid check above isn't reliable). */ list_add_tail(&ei->i_fc_list, &sbi->s_fc_q[FC_Q_STAGING]); } /* * Make sure clearing of EXT4_STATE_FC_COMMITTING is * visible before we send the wakeup. Pairs with implicit * barrier in prepare_to_wait() in ext4_fc_track_inode(). */ smp_mb(); #if (BITS_PER_LONG < 64) wake_up_bit(&ei->i_state_flags, EXT4_STATE_FC_COMMITTING); #else wake_up_bit(&ei->i_flags, EXT4_STATE_FC_COMMITTING); #endif } while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) { fc_dentry = list_first_entry(&sbi->s_fc_dentry_q[FC_Q_MAIN], struct ext4_fc_dentry_update, fcd_list); list_del_init(&fc_dentry->fcd_list); list_del_init(&fc_dentry->fcd_dilist); release_dentry_name_snapshot(&fc_dentry->fcd_name); kmem_cache_free(ext4_fc_dentry_cachep, fc_dentry); } list_splice_init(&sbi->s_fc_dentry_q[FC_Q_STAGING], &sbi->s_fc_dentry_q[FC_Q_MAIN]); list_splice_init(&sbi->s_fc_q[FC_Q_STAGING], &sbi->s_fc_q[FC_Q_MAIN]); if (tid_geq(tid, sbi->s_fc_ineligible_tid)) { sbi->s_fc_ineligible_tid = 0; ext4_clear_mount_flag(sb, EXT4_MF_FC_INELIGIBLE); } if (full) sbi->s_fc_bytes = 0; mutex_unlock(&sbi->s_fc_lock); trace_ext4_fc_stats(sb); } /* Ext4 Replay Path Routines */ /* Helper struct for dentry replay routines */ struct dentry_info_args { int parent_ino, dname_len, ino, inode_len; char *dname; }; /* Same as struct ext4_fc_tl, but uses native endianness fields */ struct ext4_fc_tl_mem { u16 fc_tag; u16 fc_len; }; static inline void tl_to_darg(struct dentry_info_args *darg, struct ext4_fc_tl_mem *tl, u8 *val) { struct ext4_fc_dentry_info fcd; memcpy(&fcd, val, sizeof(fcd)); darg->parent_ino = le32_to_cpu(fcd.fc_parent_ino); darg->ino = le32_to_cpu(fcd.fc_ino); darg->dname = val + offsetof(struct ext4_fc_dentry_info, fc_dname); darg->dname_len = tl->fc_len - sizeof(struct ext4_fc_dentry_info); } static inline void ext4_fc_get_tl(struct ext4_fc_tl_mem *tl, u8 *val) { struct ext4_fc_tl tl_disk; memcpy(&tl_disk, val, EXT4_FC_TAG_BASE_LEN); tl->fc_len = le16_to_cpu(tl_disk.fc_len); tl->fc_tag = le16_to_cpu(tl_disk.fc_tag); } /* Unlink replay function */ static int ext4_fc_replay_unlink(struct super_block *sb, struct ext4_fc_tl_mem *tl, u8 *val) { struct inode *inode, *old_parent; struct qstr entry; struct dentry_info_args darg; int ret = 0; tl_to_darg(&darg, tl, val); trace_ext4_fc_replay(sb, EXT4_FC_TAG_UNLINK, darg.ino, darg.parent_ino, darg.dname_len); entry.name = darg.dname; entry.len = darg.dname_len; inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL); if (IS_ERR(inode)) { ext4_debug("Inode %d not found", darg.ino); return 0; } old_parent = ext4_iget(sb, darg.parent_ino, EXT4_IGET_NORMAL); if (IS_ERR(old_parent)) { ext4_debug("Dir with inode %d not found", darg.parent_ino); iput(inode); return 0; } ret = __ext4_unlink(old_parent, &entry, inode, NULL); /* -ENOENT ok coz it might not exist anymore. 
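 * The entry may already be gone if the directory block holding it was
 * written out before the crash, or if this tag was replayed once
 * already, so a missing dirent is not treated as an error here.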
*/ if (ret == -ENOENT) ret = 0; iput(old_parent); iput(inode); return ret; } static int ext4_fc_replay_link_internal(struct super_block *sb, struct dentry_info_args *darg, struct inode *inode) { struct inode *dir = NULL; struct dentry *dentry_dir = NULL, *dentry_inode = NULL; struct qstr qstr_dname = QSTR_INIT(darg->dname, darg->dname_len); int ret = 0; dir = ext4_iget(sb, darg->parent_ino, EXT4_IGET_NORMAL); if (IS_ERR(dir)) { ext4_debug("Dir with inode %d not found.", darg->parent_ino); dir = NULL; goto out; } dentry_dir = d_obtain_alias(dir); if (IS_ERR(dentry_dir)) { ext4_debug("Failed to obtain dentry"); dentry_dir = NULL; goto out; } dentry_inode = d_alloc(dentry_dir, &qstr_dname); if (!dentry_inode) { ext4_debug("Inode dentry not created."); ret = -ENOMEM; goto out; } ret = __ext4_link(dir, inode, dentry_inode); /* * It's possible that link already existed since data blocks * for the dir in question got persisted before we crashed OR * we replayed this tag and crashed before the entire replay * could complete. */ if (ret && ret != -EEXIST) { ext4_debug("Failed to link\n"); goto out; } ret = 0; out: if (dentry_dir) { d_drop(dentry_dir); dput(dentry_dir); } else if (dir) { iput(dir); } if (dentry_inode) { d_drop(dentry_inode); dput(dentry_inode); } return ret; } /* Link replay function */ static int ext4_fc_replay_link(struct super_block *sb, struct ext4_fc_tl_mem *tl, u8 *val) { struct inode *inode; struct dentry_info_args darg; int ret = 0; tl_to_darg(&darg, tl, val); trace_ext4_fc_replay(sb, EXT4_FC_TAG_LINK, darg.ino, darg.parent_ino, darg.dname_len); inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL); if (IS_ERR(inode)) { ext4_debug("Inode not found."); return 0; } ret = ext4_fc_replay_link_internal(sb, &darg, inode); iput(inode); return ret; } /* * Record all the modified inodes during replay. We use this later to setup * block bitmaps correctly. 
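 *
 * The list built here is walked by ext4_fc_set_bitmaps_and_counters()
 * once replay is done, to mark the blocks referenced by the surviving
 * extent trees as in use.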
*/ static int ext4_fc_record_modified_inode(struct super_block *sb, int ino) { struct ext4_fc_replay_state *state; int i; state = &EXT4_SB(sb)->s_fc_replay_state; for (i = 0; i < state->fc_modified_inodes_used; i++) if (state->fc_modified_inodes[i] == ino) return 0; if (state->fc_modified_inodes_used == state->fc_modified_inodes_size) { int *fc_modified_inodes; fc_modified_inodes = krealloc(state->fc_modified_inodes, sizeof(int) * (state->fc_modified_inodes_size + EXT4_FC_REPLAY_REALLOC_INCREMENT), GFP_KERNEL); if (!fc_modified_inodes) return -ENOMEM; state->fc_modified_inodes = fc_modified_inodes; state->fc_modified_inodes_size += EXT4_FC_REPLAY_REALLOC_INCREMENT; } state->fc_modified_inodes[state->fc_modified_inodes_used++] = ino; return 0; } /* * Inode replay function */ static int ext4_fc_replay_inode(struct super_block *sb, struct ext4_fc_tl_mem *tl, u8 *val) { struct ext4_fc_inode fc_inode; struct ext4_inode *raw_inode; struct ext4_inode *raw_fc_inode; struct inode *inode = NULL; struct ext4_iloc iloc; int inode_len, ino, ret, tag = tl->fc_tag; struct ext4_extent_header *eh; size_t off_gen = offsetof(struct ext4_inode, i_generation); memcpy(&fc_inode, val, sizeof(fc_inode)); ino = le32_to_cpu(fc_inode.fc_ino); trace_ext4_fc_replay(sb, tag, ino, 0, 0); inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL); if (!IS_ERR(inode)) { ext4_ext_clear_bb(inode); iput(inode); } inode = NULL; ret = ext4_fc_record_modified_inode(sb, ino); if (ret) goto out; raw_fc_inode = (struct ext4_inode *) (val + offsetof(struct ext4_fc_inode, fc_raw_inode)); ret = ext4_get_fc_inode_loc(sb, ino, &iloc); if (ret) goto out; inode_len = tl->fc_len - sizeof(struct ext4_fc_inode); raw_inode = ext4_raw_inode(&iloc); memcpy(raw_inode, raw_fc_inode, offsetof(struct ext4_inode, i_block)); memcpy((u8 *)raw_inode + off_gen, (u8 *)raw_fc_inode + off_gen, inode_len - off_gen); if (le32_to_cpu(raw_inode->i_flags) & EXT4_EXTENTS_FL) { eh = (struct ext4_extent_header *)(&raw_inode->i_block[0]); if (eh->eh_magic != EXT4_EXT_MAGIC) { memset(eh, 0, sizeof(*eh)); eh->eh_magic = EXT4_EXT_MAGIC; eh->eh_max = cpu_to_le16( (sizeof(raw_inode->i_block) - sizeof(struct ext4_extent_header)) / sizeof(struct ext4_extent)); } } else if (le32_to_cpu(raw_inode->i_flags) & EXT4_INLINE_DATA_FL) { memcpy(raw_inode->i_block, raw_fc_inode->i_block, sizeof(raw_inode->i_block)); } /* Immediately update the inode on disk. */ ret = ext4_handle_dirty_metadata(NULL, NULL, iloc.bh); if (ret) goto out; ret = sync_dirty_buffer(iloc.bh); if (ret) goto out; ret = ext4_mark_inode_used(sb, ino); if (ret) goto out; /* Given that we just wrote the inode on disk, this SHOULD succeed. */ inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL); if (IS_ERR(inode)) { ext4_debug("Inode not found."); return -EFSCORRUPTED; } /* * Our allocator could have made different decisions than before * crashing. This should be fixed but until then, we calculate * the number of blocks the inode. */ if (!ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA)) ext4_ext_replay_set_iblocks(inode); inode->i_generation = le32_to_cpu(ext4_raw_inode(&iloc)->i_generation); ext4_reset_inode_seed(inode); ext4_inode_csum_set(inode, ext4_raw_inode(&iloc), EXT4_I(inode)); ret = ext4_handle_dirty_metadata(NULL, NULL, iloc.bh); sync_dirty_buffer(iloc.bh); brelse(iloc.bh); out: iput(inode); if (!ret) blkdev_issue_flush(sb->s_bdev); return 0; } /* * Dentry create replay function. * * EXT4_FC_TAG_CREAT is preceded by EXT4_FC_TAG_INODE_FULL. 
Which means, the * inode for which we are trying to create a dentry here, should already have * been replayed before we start here. */ static int ext4_fc_replay_create(struct super_block *sb, struct ext4_fc_tl_mem *tl, u8 *val) { int ret = 0; struct inode *inode = NULL; struct inode *dir = NULL; struct dentry_info_args darg; tl_to_darg(&darg, tl, val); trace_ext4_fc_replay(sb, EXT4_FC_TAG_CREAT, darg.ino, darg.parent_ino, darg.dname_len); /* This takes care of update group descriptor and other metadata */ ret = ext4_mark_inode_used(sb, darg.ino); if (ret) goto out; inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL); if (IS_ERR(inode)) { ext4_debug("inode %d not found.", darg.ino); inode = NULL; ret = -EINVAL; goto out; } if (S_ISDIR(inode->i_mode)) { /* * If we are creating a directory, we need to make sure that the * dot and dot dot dirents are setup properly. */ dir = ext4_iget(sb, darg.parent_ino, EXT4_IGET_NORMAL); if (IS_ERR(dir)) { ext4_debug("Dir %d not found.", darg.ino); goto out; } ret = ext4_init_new_dir(NULL, dir, inode); iput(dir); if (ret) { ret = 0; goto out; } } ret = ext4_fc_replay_link_internal(sb, &darg, inode); if (ret) goto out; set_nlink(inode, 1); ext4_mark_inode_dirty(NULL, inode); out: iput(inode); return ret; } /* * Record physical disk regions which are in use as per fast commit area, * and used by inodes during replay phase. Our simple replay phase * allocator excludes these regions from allocation. */ int ext4_fc_record_regions(struct super_block *sb, int ino, ext4_lblk_t lblk, ext4_fsblk_t pblk, int len, int replay) { struct ext4_fc_replay_state *state; struct ext4_fc_alloc_region *region; state = &EXT4_SB(sb)->s_fc_replay_state; /* * during replay phase, the fc_regions_valid may not same as * fc_regions_used, update it when do new additions. 
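 * (fc_regions_valid only counts regions recorded up to the last tail
 * tag that passed CRC verification during the scan phase; resetting
 * fc_regions_used here drops any regions recorded after that point.)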
*/ if (replay && state->fc_regions_used != state->fc_regions_valid) state->fc_regions_used = state->fc_regions_valid; if (state->fc_regions_used == state->fc_regions_size) { struct ext4_fc_alloc_region *fc_regions; fc_regions = krealloc(state->fc_regions, sizeof(struct ext4_fc_alloc_region) * (state->fc_regions_size + EXT4_FC_REPLAY_REALLOC_INCREMENT), GFP_KERNEL); if (!fc_regions) return -ENOMEM; state->fc_regions_size += EXT4_FC_REPLAY_REALLOC_INCREMENT; state->fc_regions = fc_regions; } region = &state->fc_regions[state->fc_regions_used++]; region->ino = ino; region->lblk = lblk; region->pblk = pblk; region->len = len; if (replay) state->fc_regions_valid++; return 0; } /* Replay add range tag */ static int ext4_fc_replay_add_range(struct super_block *sb, struct ext4_fc_tl_mem *tl, u8 *val) { struct ext4_fc_add_range fc_add_ex; struct ext4_extent newex, *ex; struct inode *inode; ext4_lblk_t start, cur; int remaining, len; ext4_fsblk_t start_pblk; struct ext4_map_blocks map; struct ext4_ext_path *path = NULL; int ret; memcpy(&fc_add_ex, val, sizeof(fc_add_ex)); ex = (struct ext4_extent *)&fc_add_ex.fc_ex; trace_ext4_fc_replay(sb, EXT4_FC_TAG_ADD_RANGE, le32_to_cpu(fc_add_ex.fc_ino), le32_to_cpu(ex->ee_block), ext4_ext_get_actual_len(ex)); inode = ext4_iget(sb, le32_to_cpu(fc_add_ex.fc_ino), EXT4_IGET_NORMAL); if (IS_ERR(inode)) { ext4_debug("Inode not found."); return 0; } ret = ext4_fc_record_modified_inode(sb, inode->i_ino); if (ret) goto out; start = le32_to_cpu(ex->ee_block); start_pblk = ext4_ext_pblock(ex); len = ext4_ext_get_actual_len(ex); cur = start; remaining = len; ext4_debug("ADD_RANGE, lblk %d, pblk %lld, len %d, unwritten %d, inode %ld\n", start, start_pblk, len, ext4_ext_is_unwritten(ex), inode->i_ino); while (remaining > 0) { map.m_lblk = cur; map.m_len = remaining; map.m_pblk = 0; ret = ext4_map_blocks(NULL, inode, &map, 0); if (ret < 0) goto out; if (ret == 0) { /* Range is not mapped */ path = ext4_find_extent(inode, cur, path, 0); if (IS_ERR(path)) goto out; memset(&newex, 0, sizeof(newex)); newex.ee_block = cpu_to_le32(cur); ext4_ext_store_pblock( &newex, start_pblk + cur - start); newex.ee_len = cpu_to_le16(map.m_len); if (ext4_ext_is_unwritten(ex)) ext4_ext_mark_unwritten(&newex); down_write(&EXT4_I(inode)->i_data_sem); path = ext4_ext_insert_extent(NULL, inode, path, &newex, 0); up_write((&EXT4_I(inode)->i_data_sem)); if (IS_ERR(path)) goto out; goto next; } if (start_pblk + cur - start != map.m_pblk) { /* * Logical to physical mapping changed. This can happen * if this range was removed and then reallocated to * map to new physical blocks during a fast commit. */ ret = ext4_ext_replay_update_ex(inode, cur, map.m_len, ext4_ext_is_unwritten(ex), start_pblk + cur - start); if (ret) goto out; /* * Mark the old blocks as free since they aren't used * anymore. We maintain an array of all the modified * inodes. In case these blocks are still used at either * a different logical range in the same inode or in * some different inode, we will mark them as allocated * at the end of the FC replay using our array of * modified inodes. */ ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, false); goto next; } /* Range is mapped and needs a state change */ ext4_debug("Converting from %ld to %d %lld", map.m_flags & EXT4_MAP_UNWRITTEN, ext4_ext_is_unwritten(ex), map.m_pblk); ret = ext4_ext_replay_update_ex(inode, cur, map.m_len, ext4_ext_is_unwritten(ex), map.m_pblk); if (ret) goto out; /* * We may have split the extent tree while toggling the state. 
* Try to shrink the extent tree now. */ ext4_ext_replay_shrink_inode(inode, start + len); next: cur += map.m_len; remaining -= map.m_len; } ext4_ext_replay_shrink_inode(inode, i_size_read(inode) >> sb->s_blocksize_bits); out: ext4_free_ext_path(path); iput(inode); return 0; } /* Replay DEL_RANGE tag */ static int ext4_fc_replay_del_range(struct super_block *sb, struct ext4_fc_tl_mem *tl, u8 *val) { struct inode *inode; struct ext4_fc_del_range lrange; struct ext4_map_blocks map; ext4_lblk_t cur, remaining; int ret; memcpy(&lrange, val, sizeof(lrange)); cur = le32_to_cpu(lrange.fc_lblk); remaining = le32_to_cpu(lrange.fc_len); trace_ext4_fc_replay(sb, EXT4_FC_TAG_DEL_RANGE, le32_to_cpu(lrange.fc_ino), cur, remaining); inode = ext4_iget(sb, le32_to_cpu(lrange.fc_ino), EXT4_IGET_NORMAL); if (IS_ERR(inode)) { ext4_debug("Inode %d not found", le32_to_cpu(lrange.fc_ino)); return 0; } ret = ext4_fc_record_modified_inode(sb, inode->i_ino); if (ret) goto out; ext4_debug("DEL_RANGE, inode %ld, lblk %d, len %d\n", inode->i_ino, le32_to_cpu(lrange.fc_lblk), le32_to_cpu(lrange.fc_len)); while (remaining > 0) { map.m_lblk = cur; map.m_len = remaining; ret = ext4_map_blocks(NULL, inode, &map, 0); if (ret < 0) goto out; if (ret > 0) { remaining -= ret; cur += ret; ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, false); } else { remaining -= map.m_len; cur += map.m_len; } } down_write(&EXT4_I(inode)->i_data_sem); ret = ext4_ext_remove_space(inode, le32_to_cpu(lrange.fc_lblk), le32_to_cpu(lrange.fc_lblk) + le32_to_cpu(lrange.fc_len) - 1); up_write(&EXT4_I(inode)->i_data_sem); if (ret) goto out; ext4_ext_replay_shrink_inode(inode, i_size_read(inode) >> sb->s_blocksize_bits); ext4_mark_inode_dirty(NULL, inode); out: iput(inode); return 0; } static void ext4_fc_set_bitmaps_and_counters(struct super_block *sb) { struct ext4_fc_replay_state *state; struct inode *inode; struct ext4_ext_path *path = NULL; struct ext4_map_blocks map; int i, ret, j; ext4_lblk_t cur, end; state = &EXT4_SB(sb)->s_fc_replay_state; for (i = 0; i < state->fc_modified_inodes_used; i++) { inode = ext4_iget(sb, state->fc_modified_inodes[i], EXT4_IGET_NORMAL); if (IS_ERR(inode)) { ext4_debug("Inode %d not found.", state->fc_modified_inodes[i]); continue; } cur = 0; end = EXT_MAX_BLOCKS; if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA)) { iput(inode); continue; } while (cur < end) { map.m_lblk = cur; map.m_len = end - cur; ret = ext4_map_blocks(NULL, inode, &map, 0); if (ret < 0) break; if (ret > 0) { path = ext4_find_extent(inode, map.m_lblk, path, 0); if (!IS_ERR(path)) { for (j = 0; j < path->p_depth; j++) ext4_mb_mark_bb(inode->i_sb, path[j].p_block, 1, true); } else { path = NULL; } cur += ret; ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, true); } else { cur = cur + (map.m_len ? map.m_len : 1); } } iput(inode); } ext4_free_ext_path(path); } /* * Check if block is in excluded regions for block allocation. The simple * allocator that runs during replay phase is calls this function to see * if it is okay to use a block. 
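 *
 * The excluded ranges are the physical regions recorded via
 * ext4_fc_record_regions(), primarily from EXT4_FC_TAG_ADD_RANGE tags
 * seen during the scan phase, so the replay-time allocator never
 * reuses blocks that the fast commit log assigns to replayed files.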
*/ bool ext4_fc_replay_check_excluded(struct super_block *sb, ext4_fsblk_t blk) { int i; struct ext4_fc_replay_state *state; state = &EXT4_SB(sb)->s_fc_replay_state; for (i = 0; i < state->fc_regions_valid; i++) { if (state->fc_regions[i].ino == 0 || state->fc_regions[i].len == 0) continue; if (in_range(blk, state->fc_regions[i].pblk, state->fc_regions[i].len)) return true; } return false; } /* Cleanup function called after replay */ void ext4_fc_replay_cleanup(struct super_block *sb) { struct ext4_sb_info *sbi = EXT4_SB(sb); sbi->s_mount_state &= ~EXT4_FC_REPLAY; kfree(sbi->s_fc_replay_state.fc_regions); kfree(sbi->s_fc_replay_state.fc_modified_inodes); } static bool ext4_fc_value_len_isvalid(struct ext4_sb_info *sbi, int tag, int len) { switch (tag) { case EXT4_FC_TAG_ADD_RANGE: return len == sizeof(struct ext4_fc_add_range); case EXT4_FC_TAG_DEL_RANGE: return len == sizeof(struct ext4_fc_del_range); case EXT4_FC_TAG_CREAT: case EXT4_FC_TAG_LINK: case EXT4_FC_TAG_UNLINK: len -= sizeof(struct ext4_fc_dentry_info); return len >= 1 && len <= EXT4_NAME_LEN; case EXT4_FC_TAG_INODE: len -= sizeof(struct ext4_fc_inode); return len >= EXT4_GOOD_OLD_INODE_SIZE && len <= sbi->s_inode_size; case EXT4_FC_TAG_PAD: return true; /* padding can have any length */ case EXT4_FC_TAG_TAIL: return len >= sizeof(struct ext4_fc_tail); case EXT4_FC_TAG_HEAD: return len == sizeof(struct ext4_fc_head); } return false; } /* * Recovery Scan phase handler * * This function is called during the scan phase and is responsible * for doing following things: * - Make sure the fast commit area has valid tags for replay * - Count number of tags that need to be replayed by the replay handler * - Verify CRC * - Create a list of excluded blocks for allocation during replay phase * * This function returns JBD2_FC_REPLAY_CONTINUE to indicate that SCAN is * incomplete and JBD2 should send more blocks. It returns JBD2_FC_REPLAY_STOP * to indicate that scan has finished and JBD2 can now start replay phase. * It returns a negative error to indicate that there was an error. At the end * of a successful scan phase, sbi->s_fc_replay_state.fc_replay_num_tags is set * to indicate the number of tags that need to replayed during the replay phase. */ static int ext4_fc_replay_scan(journal_t *journal, struct buffer_head *bh, int off, tid_t expected_tid) { struct super_block *sb = journal->j_private; struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_fc_replay_state *state; int ret = JBD2_FC_REPLAY_CONTINUE; struct ext4_fc_add_range ext; struct ext4_fc_tl_mem tl; struct ext4_fc_tail tail; __u8 *start, *end, *cur, *val; struct ext4_fc_head head; struct ext4_extent *ex; state = &sbi->s_fc_replay_state; start = (u8 *)bh->b_data; end = start + journal->j_blocksize; if (state->fc_replay_expected_off == 0) { state->fc_cur_tag = 0; state->fc_replay_num_tags = 0; state->fc_crc = 0; state->fc_regions = NULL; state->fc_regions_valid = state->fc_regions_used = state->fc_regions_size = 0; /* Check if we can stop early */ if (le16_to_cpu(((struct ext4_fc_tl *)start)->fc_tag) != EXT4_FC_TAG_HEAD) return 0; } if (off != state->fc_replay_expected_off) { ret = -EFSCORRUPTED; goto out_err; } state->fc_replay_expected_off++; for (cur = start; cur <= end - EXT4_FC_TAG_BASE_LEN; cur = cur + EXT4_FC_TAG_BASE_LEN + tl.fc_len) { ext4_fc_get_tl(&tl, cur); val = cur + EXT4_FC_TAG_BASE_LEN; if (tl.fc_len > end - val || !ext4_fc_value_len_isvalid(sbi, tl.fc_tag, tl.fc_len)) { ret = state->fc_replay_num_tags ? 
JBD2_FC_REPLAY_STOP : -ECANCELED; goto out_err; } ext4_debug("Scan phase, tag:%s, blk %lld\n", tag2str(tl.fc_tag), bh->b_blocknr); switch (tl.fc_tag) { case EXT4_FC_TAG_ADD_RANGE: memcpy(&ext, val, sizeof(ext)); ex = (struct ext4_extent *)&ext.fc_ex; ret = ext4_fc_record_regions(sb, le32_to_cpu(ext.fc_ino), le32_to_cpu(ex->ee_block), ext4_ext_pblock(ex), ext4_ext_get_actual_len(ex), 0); if (ret < 0) break; ret = JBD2_FC_REPLAY_CONTINUE; fallthrough; case EXT4_FC_TAG_DEL_RANGE: case EXT4_FC_TAG_LINK: case EXT4_FC_TAG_UNLINK: case EXT4_FC_TAG_CREAT: case EXT4_FC_TAG_INODE: case EXT4_FC_TAG_PAD: state->fc_cur_tag++; state->fc_crc = ext4_chksum(state->fc_crc, cur, EXT4_FC_TAG_BASE_LEN + tl.fc_len); break; case EXT4_FC_TAG_TAIL: state->fc_cur_tag++; memcpy(&tail, val, sizeof(tail)); state->fc_crc = ext4_chksum(state->fc_crc, cur, EXT4_FC_TAG_BASE_LEN + offsetof(struct ext4_fc_tail, fc_crc)); if (le32_to_cpu(tail.fc_tid) == expected_tid && le32_to_cpu(tail.fc_crc) == state->fc_crc) { state->fc_replay_num_tags = state->fc_cur_tag; state->fc_regions_valid = state->fc_regions_used; } else { ret = state->fc_replay_num_tags ? JBD2_FC_REPLAY_STOP : -EFSBADCRC; } state->fc_crc = 0; break; case EXT4_FC_TAG_HEAD: memcpy(&head, val, sizeof(head)); if (le32_to_cpu(head.fc_features) & ~EXT4_FC_SUPPORTED_FEATURES) { ret = -EOPNOTSUPP; break; } if (le32_to_cpu(head.fc_tid) != expected_tid) { ret = JBD2_FC_REPLAY_STOP; break; } state->fc_cur_tag++; state->fc_crc = ext4_chksum(state->fc_crc, cur, EXT4_FC_TAG_BASE_LEN + tl.fc_len); break; default: ret = state->fc_replay_num_tags ? JBD2_FC_REPLAY_STOP : -ECANCELED; } if (ret < 0 || ret == JBD2_FC_REPLAY_STOP) break; } out_err: trace_ext4_fc_replay_scan(sb, ret, off); return ret; } /* * Main recovery path entry point. * The meaning of return codes is similar as above. 
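 *
 * During PASS_SCAN this simply forwards to ext4_fc_replay_scan().
 * During the replay pass it walks the tlvs in each block and
 * dispatches to the per-tag replay helpers above, stopping once the
 * number of tags validated by the scan phase has been consumed.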
*/ static int ext4_fc_replay(journal_t *journal, struct buffer_head *bh, enum passtype pass, int off, tid_t expected_tid) { struct super_block *sb = journal->j_private; struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_fc_tl_mem tl; __u8 *start, *end, *cur, *val; int ret = JBD2_FC_REPLAY_CONTINUE; struct ext4_fc_replay_state *state = &sbi->s_fc_replay_state; struct ext4_fc_tail tail; if (pass == PASS_SCAN) { state->fc_current_pass = PASS_SCAN; return ext4_fc_replay_scan(journal, bh, off, expected_tid); } if (state->fc_current_pass != pass) { state->fc_current_pass = pass; sbi->s_mount_state |= EXT4_FC_REPLAY; } if (!sbi->s_fc_replay_state.fc_replay_num_tags) { ext4_debug("Replay stops\n"); ext4_fc_set_bitmaps_and_counters(sb); return 0; } #ifdef CONFIG_EXT4_DEBUG if (sbi->s_fc_debug_max_replay && off >= sbi->s_fc_debug_max_replay) { pr_warn("Dropping fc block %d because max_replay set\n", off); return JBD2_FC_REPLAY_STOP; } #endif start = (u8 *)bh->b_data; end = start + journal->j_blocksize; for (cur = start; cur <= end - EXT4_FC_TAG_BASE_LEN; cur = cur + EXT4_FC_TAG_BASE_LEN + tl.fc_len) { ext4_fc_get_tl(&tl, cur); val = cur + EXT4_FC_TAG_BASE_LEN; if (state->fc_replay_num_tags == 0) { ret = JBD2_FC_REPLAY_STOP; ext4_fc_set_bitmaps_and_counters(sb); break; } ext4_debug("Replay phase, tag:%s\n", tag2str(tl.fc_tag)); state->fc_replay_num_tags--; switch (tl.fc_tag) { case EXT4_FC_TAG_LINK: ret = ext4_fc_replay_link(sb, &tl, val); break; case EXT4_FC_TAG_UNLINK: ret = ext4_fc_replay_unlink(sb, &tl, val); break; case EXT4_FC_TAG_ADD_RANGE: ret = ext4_fc_replay_add_range(sb, &tl, val); break; case EXT4_FC_TAG_CREAT: ret = ext4_fc_replay_create(sb, &tl, val); break; case EXT4_FC_TAG_DEL_RANGE: ret = ext4_fc_replay_del_range(sb, &tl, val); break; case EXT4_FC_TAG_INODE: ret = ext4_fc_replay_inode(sb, &tl, val); break; case EXT4_FC_TAG_PAD: trace_ext4_fc_replay(sb, EXT4_FC_TAG_PAD, 0, tl.fc_len, 0); break; case EXT4_FC_TAG_TAIL: trace_ext4_fc_replay(sb, EXT4_FC_TAG_TAIL, 0, tl.fc_len, 0); memcpy(&tail, val, sizeof(tail)); WARN_ON(le32_to_cpu(tail.fc_tid) != expected_tid); break; case EXT4_FC_TAG_HEAD: break; default: trace_ext4_fc_replay(sb, tl.fc_tag, 0, tl.fc_len, 0); ret = -ECANCELED; break; } if (ret < 0) break; ret = JBD2_FC_REPLAY_CONTINUE; } return ret; } void ext4_fc_init(struct super_block *sb, journal_t *journal) { /* * We set replay callback even if fast commit disabled because we may * could still have fast commit blocks that need to be replayed even if * fast commit has now been turned off. 
 */
	journal->j_fc_replay_callback = ext4_fc_replay;

	if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
		return;
	journal->j_fc_cleanup_callback = ext4_fc_cleanup;
}

static const char * const fc_ineligible_reasons[] = {
	[EXT4_FC_REASON_XATTR] = "Extended attributes changed",
	[EXT4_FC_REASON_CROSS_RENAME] = "Cross rename",
	[EXT4_FC_REASON_JOURNAL_FLAG_CHANGE] = "Journal flag changed",
	[EXT4_FC_REASON_NOMEM] = "Insufficient memory",
	[EXT4_FC_REASON_SWAP_BOOT] = "Swap boot",
	[EXT4_FC_REASON_RESIZE] = "Resize",
	[EXT4_FC_REASON_RENAME_DIR] = "Dir renamed",
	[EXT4_FC_REASON_FALLOC_RANGE] = "Falloc range op",
	[EXT4_FC_REASON_INODE_JOURNAL_DATA] = "Data journalling",
	[EXT4_FC_REASON_ENCRYPTED_FILENAME] = "Encrypted filename",
};

int ext4_fc_info_show(struct seq_file *seq, void *v)
{
	struct ext4_sb_info *sbi = EXT4_SB((struct super_block *)seq->private);
	struct ext4_fc_stats *stats = &sbi->s_fc_stats;
	int i;

	if (v != SEQ_START_TOKEN)
		return 0;

	seq_printf(seq,
		"fc stats:\n%ld commits\n%ld ineligible\n%ld numblks\n%lluus avg_commit_time\n",
		   stats->fc_num_commits, stats->fc_ineligible_commits,
		   stats->fc_numblks,
		   div_u64(stats->s_fc_avg_commit_time, 1000));
	seq_puts(seq, "Ineligible reasons:\n");
	for (i = 0; i < EXT4_FC_REASON_MAX; i++)
		seq_printf(seq, "\"%s\":\t%d\n", fc_ineligible_reasons[i],
			stats->fc_ineligible_reason_count[i]);

	return 0;
}

int __init ext4_fc_init_dentry_cache(void)
{
	ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update,
					   SLAB_RECLAIM_ACCOUNT);

	if (ext4_fc_dentry_cachep == NULL)
		return -ENOMEM;

	return 0;
}

void ext4_fc_destroy_dentry_cache(void)
{
	kmem_cache_destroy(ext4_fc_dentry_cachep);
}
4268 4269 4270 4271 4272 4273 4274 4275 4276 4277 4278 4279 4280 4281 4282 4283 4284 4285 4286 4287 4288 4289 4290 4291 4292 4293 4294 4295 4296 4297 4298 4299 4300 4301 4302 4303 4304 4305 4306 4307 4308 4309 4310 4311 4312 4313 4314 4315 4316 4317 4318 4319 4320 4321 4322 4323 4324 4325 4326 4327 4328 4329 4330 4331 4332 4333 4334 4335 4336 4337 4338 4339 4340 4341 4342 4343 4344 4345 4346 4347 4348 4349 4350 4351 4352 4353 4354 4355 4356 4357 4358 4359 4360 4361 4362 4363 4364 4365 4366 4367 4368 4369 4370 4371 4372 4373 4374 4375 4376 4377 4378 4379 4380 4381 4382 4383 4384 4385 4386 4387 4388 4389 4390 4391 4392 4393 4394 4395 4396 4397 4398 4399 4400 4401 4402 4403 4404 4405 4406 4407 4408 4409 4410 4411 4412 4413 4414 4415 4416 4417 4418 4419 4420 4421 4422 4423 4424 4425 4426 4427 4428 4429 4430 4431 4432 4433 4434 4435 4436 4437 4438 4439 4440 4441 4442 4443 4444 4445 4446 4447 4448 4449 4450 4451 4452 4453 4454 4455 4456 4457 4458 4459 4460 4461 4462 4463 4464 4465 4466 4467 4468 4469 4470 4471 4472 4473 4474 4475 4476 4477 4478 4479 4480 4481 4482 4483 4484 4485 4486 4487 4488 4489 4490 4491 4492 4493 4494 4495 4496 4497 4498 4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 4509 4510 4511 4512 4513 4514 4515 4516 4517 4518 4519 4520 4521 4522 4523 4524 4525 4526 4527 4528 4529 4530 4531 4532 4533 4534 4535 4536 4537 4538 4539 4540 4541 4542 4543 4544 4545 4546 4547 4548 4549 4550 4551 4552 4553 4554 4555 4556 4557 4558 4559 4560 4561 4562 4563 4564 4565 4566 4567 4568 4569 4570 4571 4572 4573 4574 4575 4576 4577 4578 4579 4580 4581 4582 4583 4584 4585 4586 4587 4588 4589 4590 4591 4592 4593 4594 4595 4596 4597 4598 4599 4600 4601 4602 4603 4604 4605 4606 4607 4608 4609 4610 4611 4612 4613 4614 4615 4616 4617 4618 4619 4620 4621 4622 4623 4624 4625 4626 4627 4628 4629 4630 4631 4632 4633 4634 4635 4636 4637 4638 4639 4640 4641 4642 4643 4644 4645 4646 4647 4648 4649 4650 4651 4652 4653 4654 4655 4656 4657 4658 4659 4660 4661 4662 4663 4664 4665 4666 4667 4668 4669 4670 4671 4672 4673 4674 4675 4676 4677 4678 4679 4680 4681 4682 4683 4684 4685 4686 4687 4688 4689 4690 4691 4692 4693 4694 4695 4696 4697 4698 4699 4700 4701 4702 4703 4704 4705 4706 4707 4708 4709 4710 4711 4712 4713 4714 4715 4716 4717 4718 4719 4720 4721 4722 4723 4724 4725 4726 4727 4728 4729 4730 4731 4732 4733 4734 4735 4736 4737 4738 4739 4740 4741 4742 4743 4744 4745 4746 4747 4748 4749 4750 4751 4752 4753 4754 4755 4756 4757 4758 4759 4760 4761 4762 4763 4764 4765 4766 4767 4768 4769 4770 4771 4772 4773 4774 4775 4776 4777 4778 4779 4780 4781 4782 4783 4784 4785 4786 4787 4788 4789 4790 4791 4792 4793 4794 4795 4796 4797 4798 4799 4800 4801 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816 4817 4818 4819 4820 4821 4822 4823 4824 4825 4826 4827 4828 4829 4830 4831 4832 4833 4834 4835 4836 4837 4838 4839 4840 4841 4842 4843 4844 4845 4846 4847 4848 4849 4850 4851 4852 4853 4854 4855 4856 4857 4858 4859 4860 4861 4862 4863 4864 4865 4866 4867 4868 4869 4870 4871 4872 4873 4874 4875 4876 4877 4878 4879 4880 4881 4882 4883 4884 4885 4886 4887 4888 4889 4890 4891 4892 4893 4894 4895 4896 4897 4898 4899 4900 4901 4902 4903 4904 4905 4906 4907 4908 4909 4910 4911 4912 4913 4914 4915 4916 4917 4918 4919 4920 4921 4922 4923 4924 4925 4926 4927 4928 4929 4930 4931 4932 4933 4934 4935 4936 4937 4938 4939 4940 4941 4942 4943 4944 4945 4946 4947 4948 4949 4950 4951 4952 4953 4954 4955 4956 4957 4958 4959 4960 4961 4962 4963 4964 4965 4966 4967 4968 4969 4970 4971 4972 4973 4974 4975 4976 4977 4978 
4979 4980 4981 4982 4983 4984 4985 4986 4987 4988 4989 4990 4991 4992 4993 4994 4995 4996 4997 4998 4999 5000 5001 5002 5003 5004 5005 5006 5007 5008 5009 5010 5011 5012 5013 5014 5015 5016 5017 5018 5019 5020 5021 5022 5023 5024 5025 5026 5027 5028 5029 5030 5031 5032 5033 5034 5035 5036 5037 5038 5039 5040 5041 5042 5043 5044 5045 5046 5047 5048 5049 5050 5051 5052 5053 5054 5055 5056 5057 5058 5059 5060 5061 5062 5063 5064 5065 5066 5067 5068 5069 5070 5071 5072 5073 5074 5075 5076 5077 5078 5079 5080 5081 5082 5083 5084 5085 5086 5087 5088 5089 5090 5091 5092 5093 5094 5095 5096 5097 5098 5099 5100 5101 5102 5103 5104 5105 5106 5107 5108 5109 5110 5111 5112 5113 5114 5115 5116 5117 5118 5119 5120 5121 5122 5123 5124 5125 5126 5127 5128 5129 5130 5131 5132 5133 5134 5135 5136 5137 5138 5139 5140 5141 5142 5143 5144 5145 5146 5147 5148 5149 5150 5151 5152 5153 5154 5155 5156 5157 5158 5159 5160 5161 5162 5163 5164 5165 5166 5167 5168 5169 5170 5171 5172 5173 5174 5175 5176 5177 5178 5179 5180 5181 5182 5183 5184 5185 5186 5187 5188 5189 5190 5191 5192 5193 5194 5195 5196 5197 5198 5199 5200 5201 5202 5203 5204 5205 5206 5207 5208 5209 5210 5211 5212 5213 5214 5215 5216 5217 5218 5219 5220 5221 5222 5223 5224 5225 5226 5227 5228 5229 5230 5231 5232 5233 5234 5235 5236 5237 5238 5239 5240 5241 5242 5243 5244 5245 5246 5247 5248 5249 5250 5251 5252 5253 5254 5255 5256 5257 5258 5259 5260 5261 5262 5263 5264 5265 5266 5267 5268 5269 5270 5271 5272 5273 5274 5275 5276 5277 5278 5279 5280 5281 5282 5283 5284 5285 5286 5287 5288 5289 5290 5291 5292 5293 5294 5295 5296 5297 5298 5299 5300 5301 5302 5303 5304 5305 5306 5307 5308 5309 5310 5311 5312 5313 5314 5315 5316 5317 5318 5319 5320 5321 5322 5323 5324 5325 5326 5327 5328 5329 5330 5331 5332 5333 5334 5335 5336 5337 5338 5339 5340 5341 5342 5343 5344 5345 5346 5347 5348 5349 5350 5351 5352 5353 5354 5355 5356 5357 5358 5359 5360 5361 5362 5363 5364 5365 5366 5367 5368 5369 5370 5371 5372 5373 5374 5375 5376 5377 5378 5379 5380 5381 5382 5383 5384 5385 5386 5387 5388 5389 5390 5391 5392 5393 5394 5395 5396 5397 5398 5399 5400 5401 5402 5403 5404 5405 5406 5407 5408 5409 5410 5411 5412 5413 5414 5415 5416 5417 5418 5419 5420 5421 5422 5423 5424 5425 5426 5427 5428 5429 5430 5431 5432 5433 5434 5435 5436 5437 5438 5439 5440 5441 5442 5443 5444 5445 5446 5447 5448 5449 5450 5451 5452 5453 5454 5455 5456 5457 5458 5459 5460 5461 5462 5463 5464 5465 5466 5467 5468 5469 5470 5471 5472 5473 5474 5475 5476 5477 5478 5479 5480 5481 5482 5483 5484 5485 5486 5487 5488 5489 5490 5491 5492 5493 5494 5495 5496 5497 5498 5499 5500 5501 5502 5503 5504 5505 5506 5507 5508 5509 5510 5511 5512 5513 5514 5515 5516 5517 5518 5519 5520 5521 5522 5523 5524 5525 5526 5527 5528 5529 5530 5531 5532 5533 5534 5535 5536 5537 5538 5539 5540 5541 5542 5543 5544 5545 5546 5547 5548 5549 5550 5551 5552 5553 5554 5555 5556 5557 5558 5559 5560 5561 5562 5563 5564 5565 5566 5567 5568 5569 5570 5571 5572 5573 5574 5575 5576 5577 5578 5579 5580 5581 5582 5583 5584 5585 5586 5587 5588 5589 5590 5591 5592 5593 5594 5595 5596 5597 5598 5599 5600 5601 5602 5603 5604 5605 5606 5607 5608 5609 5610 5611 5612 5613 5614 5615 5616 5617 5618 5619 5620 5621 5622 5623 5624 5625 5626 5627 5628 5629 5630 5631 5632 5633 5634 5635 5636 5637 5638 5639 5640 5641 5642 5643 5644 5645 5646 5647 5648 5649 5650 5651 5652 5653 5654 5655 5656 5657 5658 5659 5660 5661 5662 5663 5664 5665 5666 5667 5668 5669 5670 5671 5672 5673 5674 5675 5676 5677 5678 5679 5680 5681 5682 5683 5684 5685 5686 5687 5688 5689 
5690 5691 5692 5693 5694 5695 5696 5697 5698 5699 5700 5701 5702 5703 5704 5705 5706 5707 5708 5709 5710 5711 5712 5713 5714 5715 5716 5717 5718 5719 5720 5721 5722 5723 5724 5725 5726 5727 5728 5729 5730 5731 5732 5733 5734 5735 5736 5737 5738 5739 5740 5741 5742 5743 5744 5745 5746 5747 5748 5749 5750 5751 5752 5753 5754 5755 5756 5757 5758 5759 5760 5761 5762 5763 5764 5765 5766 5767 5768 5769 5770 5771 5772 5773 5774 5775 5776 5777 5778 5779 5780 5781 5782 5783 5784 5785 5786 5787 5788 5789 5790 5791 5792 5793 5794 5795 5796 5797 5798 5799 5800 5801 5802 5803 5804 5805 5806 5807 5808 5809 5810 5811 5812 5813 5814 5815 5816 5817 5818 5819 5820 5821 5822 5823 5824 5825 5826 5827 5828 5829 5830 5831 5832 5833 5834 5835 5836 5837 5838 5839 5840 5841 5842 5843 5844 5845 5846 5847 5848 5849 5850 5851 5852 5853 5854 5855 5856 5857 5858 5859 5860 5861 5862 5863 5864 5865 5866 5867 5868 5869 5870 5871 5872 5873 5874 5875 5876 5877 5878 5879 5880 5881 5882 5883 5884 5885 5886 5887 5888 5889 5890 5891 5892 5893 5894 5895 5896 5897 5898 5899 5900 5901 5902 5903 5904 5905 5906 5907 5908 5909 5910 5911 5912 5913 5914 5915 5916 5917 5918 5919 5920 5921 5922 5923 5924 5925 5926 5927 5928 5929 5930 5931 5932 5933 5934 5935 5936 5937 5938 5939 5940 5941 5942 5943 5944 5945 5946 5947 5948 5949 5950 5951 5952 5953 5954 5955 5956 5957 5958 5959 5960 5961 5962 5963 5964 5965 5966 5967 5968 5969 5970 5971 5972 5973 5974 5975 5976 5977 5978 5979 5980 5981 5982 5983 5984 5985 5986 5987 5988 5989 5990 5991 5992 5993 5994 5995 5996 5997 5998 5999 6000 6001 6002 6003 6004 6005 6006 6007 6008 6009 6010 6011 6012 6013 6014 6015 6016 6017 6018 6019 6020 6021 6022 6023 6024 6025 6026 6027 6028 6029 6030 6031 6032 6033 6034 6035 6036 6037 6038 6039 6040 6041 6042 6043 6044 6045 6046 6047 6048 6049 6050 6051 6052 6053 6054 6055 6056 6057 6058 6059 6060 6061 6062 6063 6064 6065 6066 6067 6068 6069 6070 6071 6072 6073 6074 6075 6076 6077 6078 6079 6080 6081 6082 6083 6084 6085 6086 6087 6088 6089 6090 6091 6092 6093 6094 6095 6096 6097 6098 6099 6100 6101 6102 6103 6104 6105 6106 6107 6108 6109 6110 6111 6112 6113 6114 6115 6116 6117 6118 6119 6120 6121 6122 6123 6124 6125 6126 6127 6128 6129 6130 6131 6132 6133 6134 6135 6136 6137 6138 6139 6140 6141 6142 6143 6144 6145 6146 6147 6148 6149 6150 6151 6152 6153 6154 6155 6156 6157 6158 6159 6160 6161 6162 6163 6164 6165 6166 6167 6168 6169 6170 6171 6172 6173 6174 6175 6176 6177 6178 6179 6180 6181 6182 6183 6184 6185 6186 6187 6188 6189 6190 6191 6192 6193 6194 6195 6196 6197 6198 6199 6200 6201 6202 6203 6204 6205 6206 6207 6208 6209 6210 6211 6212 6213 6214 6215 6216 6217 6218 6219 6220 6221 6222 6223 6224 6225 6226 6227 6228 6229 6230 6231 6232 6233 6234 6235 6236 6237 6238 6239 6240 6241 6242 6243 6244 6245 6246 6247 6248 6249 6250 6251 6252 6253 6254 6255 6256 6257 6258 6259 6260 6261 6262 6263 6264 6265 6266 6267 6268 6269 6270 6271 6272 6273 6274 6275 6276 6277 6278 6279 6280 6281 6282 6283 6284 6285 6286 6287 6288 6289 6290 6291 6292 6293 6294 6295 6296 6297 6298 6299 6300 6301 6302 6303 6304 6305 6306 6307 6308 6309 6310 6311 6312 6313 6314 6315 6316 6317 6318 6319 6320 6321 6322 6323 6324 6325 6326 6327 6328 6329 6330 6331 6332 6333 6334 6335 6336 6337 6338 6339 6340 6341 6342 6343 6344 6345 6346 6347 6348 6349 6350 6351 6352 6353 6354 6355 6356 6357 6358 6359 6360 6361 6362 6363 6364 6365 6366 6367 6368 6369 6370 6371 6372 6373 6374 6375 6376 6377 6378 6379 6380 6381 6382 6383 6384 6385 6386 6387 6388 6389 6390 6391 6392 6393 6394 6395 6396 6397 6398 6399 6400 
6401 6402 6403 6404 6405 6406 6407 6408 6409 6410 6411 6412 6413 6414 6415 6416 6417 6418 6419 6420 6421 6422 6423 6424 6425 6426 6427 6428 6429 6430 6431 6432 6433 6434 6435 6436 6437 6438 6439 6440 6441 6442 6443 6444 6445 6446 6447 6448 6449 6450 6451 6452 6453 6454 6455 6456 6457 6458 6459 6460 6461 6462 6463 6464 6465 6466 6467 6468 6469 6470 6471 6472 6473 6474 6475 6476 6477 6478 6479 6480 6481 6482 6483 6484 6485 6486 6487 6488 6489 6490 6491 6492 6493 6494 6495 6496 6497 6498 6499 6500 6501 6502 6503 6504 6505 6506 6507 6508 6509 6510 6511 6512 6513 6514 6515 6516 6517 6518 6519 6520 6521 6522 6523 6524 6525 6526 6527 6528 6529 6530 6531 6532 6533 6534 6535 6536 6537 6538 6539 6540 6541 6542 6543 6544 6545 6546 6547 6548 6549 6550 6551 6552 6553 6554 6555 6556 6557 6558 6559 6560 6561 6562 6563 6564 6565 6566 6567 6568 6569 6570 6571 6572 6573 6574 6575 6576 6577 6578 6579 6580 6581 6582 6583 6584 6585 6586 6587 6588 6589 6590 6591 6592 6593 6594 6595 6596 6597 6598 6599 6600 6601 6602 6603 6604 6605 6606 6607 6608 6609 6610 6611 6612 6613 6614 6615 6616 6617 6618 6619 6620 6621 6622 6623 6624 6625 6626 6627 6628 6629 6630 6631 6632 6633 6634 6635 6636 6637 6638 6639 6640 6641 6642 6643 6644 6645 6646 6647 6648 6649 6650 6651 6652 6653 6654 6655 6656 6657 6658 6659 6660 6661 6662 6663 6664 6665 6666 6667 6668 6669 6670 6671 6672 6673 6674 6675 6676 6677 6678 6679 6680 6681 6682 6683 6684 6685 6686 6687 6688 6689 6690 6691 6692 6693 6694 6695 6696 6697 6698 6699 6700 6701 6702 6703 6704 6705 6706 6707 6708 6709 6710 6711 6712 6713 6714 6715 6716 6717 6718 6719 6720 6721 6722 6723 6724 6725 6726 6727 6728 6729 6730 6731 6732 6733 6734 6735 6736 6737 6738 6739 6740 6741 6742 6743 6744 6745 6746 6747 6748 6749 6750 6751 6752 6753 6754 6755 6756 6757 6758 6759 6760 6761 6762 6763 6764 6765 6766 6767 6768 6769 6770 6771 6772 6773 6774 6775 6776 6777 6778 6779 6780 6781 6782 6783 6784 6785 6786 6787 6788 6789 6790 6791 6792 6793 6794 6795 6796 6797 6798 6799 6800 6801 6802 6803 6804 6805 6806 6807 6808 6809 6810 6811 6812 6813 6814 6815 6816 6817 6818 6819 6820 6821 6822 6823 6824 6825 6826 6827 6828 6829 6830 6831 6832 6833 6834 6835 6836 6837 6838 6839 6840 6841 6842 6843 6844 6845 6846 6847 6848 6849 6850 6851 6852 6853 6854 6855 6856 6857 6858 6859 6860 6861 6862 6863 6864 6865 6866 6867 6868 6869 6870 6871 6872 6873 6874 6875 6876 6877 6878 6879 6880 6881 6882 6883 6884 6885 6886 6887 6888 6889 6890 6891 6892 6893 6894 6895 6896 6897 6898 6899 6900 6901 6902 6903 6904 6905 6906 6907 6908 6909 6910 6911 6912 6913 6914 6915 6916 6917 6918 6919 6920 6921 6922 6923 6924 6925 6926 6927 6928 6929 6930 6931 6932 6933 6934 6935 6936 6937 6938 6939 6940 6941 6942 6943 6944 6945 6946 6947 6948 6949 6950 6951 6952 6953 6954 6955 6956 6957 6958 6959 6960 6961 6962 6963 6964 6965 6966 6967 6968 6969 6970 6971 6972 6973 6974 6975 6976 6977 6978 6979 6980 6981 6982 6983 6984 6985 6986 6987 6988 6989 6990 6991 6992 6993 6994 6995 6996 6997 6998 6999 7000 7001 7002 7003 7004 7005 7006 7007 7008 7009 7010 7011 7012 7013 7014 7015 7016 7017 7018 7019 7020 7021 7022 7023 7024 7025 7026 7027 7028 7029 7030 7031 7032 7033 7034 7035 7036 7037 7038 7039 7040 7041 7042 7043 7044 7045 7046 7047 7048 7049 7050 7051 7052 7053 7054 7055 7056 7057 7058 7059 7060 7061 7062 7063 7064 7065 7066 7067 7068 7069 7070 7071 7072 7073 7074 7075 7076 7077 7078 7079 7080 7081 7082 7083 7084 7085 7086 7087 7088 7089 7090 7091 7092 7093 7094 7095 7096 7097 7098 7099 7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 
7112 7113 7114 7115 7116 7117 7118 7119 7120 7121 7122 7123 7124 7125 7126 7127 7128 7129 7130 7131 7132 7133 7134 7135 7136 7137 7138 7139 7140 7141 7142 7143 7144 7145 7146 7147 7148 7149 7150 7151 7152 7153 7154 7155 7156 7157 7158 7159 7160 7161 7162 7163 7164 7165 7166 7167 7168 7169 7170 7171 7172 7173 7174 7175 7176 7177 7178 7179 7180 7181 7182 7183 7184 7185 7186 7187 7188 7189 7190 7191 7192 7193 7194 7195 7196 7197 7198 7199 7200 7201 7202 7203 7204 7205 7206 7207 7208 7209 7210 7211 7212 7213 7214 7215 7216 7217 7218 7219 7220 7221 7222 7223 7224 7225 7226 7227 7228 7229 7230 7231 7232 7233 7234 7235 7236 7237 7238 7239 7240 7241 7242 7243 7244 7245 7246 7247 7248 7249 7250 7251 7252 7253 7254 7255 7256 7257 7258 7259 7260 7261 7262 7263 7264 7265 7266 7267 7268 7269 7270 7271 7272 7273 7274 7275 7276 7277 7278 7279 7280 7281 7282 7283 7284 7285 7286 7287 7288 7289 7290 7291 7292 7293 7294 7295 7296 7297 7298 7299 7300 7301 7302 7303 7304 7305 7306 7307 7308 7309 7310 7311 7312 7313 7314 7315 7316 7317 7318 7319 7320 7321 7322 7323 7324 7325 7326 7327 7328 7329 7330 7331 7332 7333 7334 7335 7336 7337 7338 7339 7340 7341 7342 7343 7344 7345 7346 7347 7348 7349 7350 7351 7352 7353 7354 7355 7356 7357 7358 7359 7360 7361 7362 7363 7364 7365 7366 7367 7368 7369 7370 7371 7372 7373 7374 7375 7376 7377 7378 7379 7380 7381 7382 7383 7384 7385 7386 7387 7388 7389 7390 7391 7392 7393 7394 7395 7396 7397 7398 7399 7400 7401 7402 7403 7404 7405 7406 7407 7408 7409 7410 7411 7412 7413 7414 7415 7416 7417 7418 7419 7420 7421 7422 7423 7424 7425 7426 7427 7428 7429 7430 7431 7432 7433 7434 7435 7436 7437 7438 7439 7440 7441 7442 7443 7444 7445 7446 7447 7448 7449 7450 7451 7452 7453 7454 7455 7456 7457 7458 7459 7460 7461 7462 7463 7464 7465 7466 7467 7468 7469 7470 7471 7472 7473 7474 7475 7476 7477 7478 7479 7480 7481 7482 7483 7484 7485 7486 7487 7488 7489 7490 7491 7492 7493 7494 7495 7496 7497 7498 7499 7500 7501 7502 7503 7504 7505 7506 7507 7508 7509 7510 7511 7512 7513 7514 7515 7516 7517 7518 7519 7520 7521 7522 7523 7524 7525 7526 7527 7528 7529 7530 7531 7532 7533 7534 7535 7536 7537 7538 7539 7540 7541 7542 7543 7544 7545 7546 7547 7548 7549 7550 7551 7552 7553 7554 7555 7556 7557 7558 7559 7560 7561 7562 7563 7564 7565 7566 7567 7568 7569 7570 7571 7572 7573 7574 7575 7576 7577 7578 7579 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 7590 7591 7592 7593 7594 7595 7596 7597 7598 7599 7600 7601 7602 7603 7604 7605 7606 7607 7608 7609 7610 7611 7612 7613 7614 7615 7616 7617 7618 7619 7620 7621 7622 7623 7624 7625 7626 7627 7628 7629 7630 7631 7632 7633 7634 7635 7636 7637 7638 7639 7640 7641 7642 7643 7644 7645 7646 7647 7648 7649 7650 7651 7652 7653 7654 7655 7656 7657 7658 7659 7660 7661 7662 7663 7664 7665 7666 7667 7668 7669 7670 7671 7672 7673 7674 7675 7676 7677 7678 7679 7680 7681 7682 7683 7684 7685 7686 7687 7688 7689 7690 7691 7692 7693 7694 7695 7696 7697 7698 7699 7700 7701 7702 7703 7704 7705 7706 7707 7708 7709 7710 7711 7712 7713 7714 7715 7716 7717 7718 7719 7720 7721 7722 7723 7724 7725 7726 7727 7728 7729 7730 7731 7732 7733 7734 7735 7736 7737 7738 7739 7740 7741 7742 7743 7744 7745 7746 7747 7748 7749 7750 7751 7752 7753 7754 7755 7756 7757 7758 7759 7760 7761 7762 7763 7764 7765 7766 7767 7768 7769 7770 7771 7772 7773 7774 7775 7776 7777 7778 7779 7780 7781 7782 7783 7784 7785 7786 7787 7788 7789 7790 7791 7792 7793 7794 7795 7796 7797 7798 7799 7800 7801 7802 7803 7804 7805 7806 7807 7808 7809 7810 7811 7812 7813 7814 7815 7816 7817 7818 7819 7820 7821 7822 
// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 2008 Oracle.  All rights reserved.
 */

#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/blkdev.h>
#include <linux/list_sort.h>
#include <linux/iversion.h>
#include "misc.h"
#include "ctree.h"
#include "tree-log.h"
#include "disk-io.h"
#include "locking.h"
#include "backref.h"
#include "compression.h"
#include "qgroup.h"
#include "block-group.h"
#include "space-info.h"
#include "inode-item.h"
#include "fs.h"
#include "accessors.h"
#include "extent-tree.h"
#include "root-tree.h"
#include "dir-item.h"
#include "file-item.h"
#include "file.h"
#include "orphan.h"
#include "print-tree.h"
#include "tree-checker.h"
#include "delayed-inode.h"

#define MAX_CONFLICT_INODES 10

/* magic values for the inode_only field in btrfs_log_inode:
 *
 * LOG_INODE_ALL means to log everything
 * LOG_INODE_EXISTS means to log just enough to recreate the inode
 * during log replay
 */
enum {
	LOG_INODE_ALL,
	LOG_INODE_EXISTS,
};

/*
 * directory trouble cases
 *
 * 1) on rename or unlink, if the inode being unlinked isn't in the fsync
 * log, we must force a full commit before doing an fsync of the directory
 * where the unlink was done.
 * ---> record transid of last unlink/rename per directory
 *
 * mkdir foo/some_dir
 * normal commit
 * rename foo/some_dir foo2/some_dir
 * mkdir foo/some_dir
 * fsync foo/some_dir/some_file
 *
 * The fsync above will unlink the original some_dir without recording
 * it in its new location (foo2).  After a crash, some_dir will be gone
 * unless the fsync of some_file forces a full commit
 *
 * 2) we must log any new names for any file or dir that is in the fsync
 * log. ---> check inode while renaming/linking.
 *
 * 2a) we must log any new names for any file or dir during rename
 * when the directory they are being removed from was logged.
 * ---> check inode and old parent dir during rename
 *
 * 2a is actually the more important variant.
 * With the extra logging a crash might unlink the old name without
 * recreating the new one
 *
 * 3) after a crash, we must go through any directories with a link count
 * of zero and redo the rm -rf
 *
 * mkdir f1/foo
 * normal commit
 * rm -rf f1/foo
 * fsync(f1)
 *
 * The directory f1 was fully removed from the FS, but fsync was never
 * called on f1, only its parent dir.  After a crash the rm -rf must
 * be replayed.  This must be able to recurse down the entire
 * directory tree.  The inode link count fixup code takes care of the
 * ugly details.
 */

/*
 * stages for the tree walking.  The first
 * stage (0) is to only pin down the blocks we find,
 * the second stage (1) is to make sure that all the inodes
 * we find in the log are created in the subvolume.
 *
 * The last stage is to deal with directories and links and extents
 * and all the other fun semantics
 */
enum {
	LOG_WALK_PIN_ONLY,
	LOG_WALK_REPLAY_INODES,
	LOG_WALK_REPLAY_DIR_INDEX,
	LOG_WALK_REPLAY_ALL,
};

/*
 * The walk control struct is used to pass state down the chain when processing
 * the log tree.  The stage field tells us which part of the log tree processing
 * we are currently doing.
 */
struct walk_control {
	/*
	 * Signal that we are freeing the metadata extents of a log tree.
	 * This is used at transaction commit time while freeing a log tree.
	 */
	bool free;

	/*
	 * Signal that we are pinning the metadata extents of a log tree and the
	 * data extents its leaves point to (if using mixed block groups).
	 * This happens in the first stage of log replay to ensure that during
	 * replay, while we are modifying subvolume trees, we don't overwrite
	 * the metadata extents of log trees.
	 */
	bool pin;

	/* What stage of the replay code we're currently in. */
	int stage;

	/*
	 * Ignore any items from the inode currently being processed.  Needs
	 * to be set every time we find a BTRFS_INODE_ITEM_KEY.
	 */
	bool ignore_cur_inode;

	/*
	 * The root we are currently replaying to.  This is NULL for the replay
	 * stage LOG_WALK_PIN_ONLY.
	 */
	struct btrfs_root *root;

	/* The log tree we are currently processing (not NULL for any stage). */
	struct btrfs_root *log;

	/* The transaction handle used for replaying all log trees. */
	struct btrfs_trans_handle *trans;

	/*
	 * The function that gets used to process blocks we find in the tree.
	 * Note the extent_buffer might not be up to date when it is passed in,
	 * and it must be checked or read if you need the data inside it.
	 */
	int (*process_func)(struct extent_buffer *eb, struct walk_control *wc,
			    u64 gen, int level);

	/*
	 * The following are used only when stage is >= LOG_WALK_REPLAY_INODES
	 * and by the replay_one_buffer() callback.
	 */

	/* The current log leaf being processed. */
	struct extent_buffer *log_leaf;
	/* The key being processed of the current log leaf. */
	struct btrfs_key log_key;
	/* The slot being processed of the current log leaf. */
	int log_slot;
	/* A path used for searches and modifications to subvolume trees. */
	struct btrfs_path *subvol_path;
};

static void do_abort_log_replay(struct walk_control *wc, const char *function,
				unsigned int line, int error, const char *fmt, ...)
{
	struct btrfs_fs_info *fs_info = wc->trans->fs_info;
	struct va_format vaf;
	va_list args;

	/*
	 * Do nothing if we already aborted, to avoid dumping leaves again which
	 * can be verbose.  Furthermore, only the first call is useful since it
	 * is where we have a problem.
	 * Note that we do not use the flag BTRFS_FS_STATE_TRANS_ABORTED because
	 * log replay calls functions that are outside of tree-log.c that can
	 * abort transactions (such as btrfs_add_link() for example), so if that
	 * happens we still want to dump all log replay specific information below.
	 */
	if (test_and_set_bit(BTRFS_FS_STATE_LOG_REPLAY_ABORTED, &fs_info->fs_state))
		return;

	btrfs_abort_transaction(wc->trans, error);

	if (wc->subvol_path && wc->subvol_path->nodes[0]) {
		btrfs_crit(fs_info,
			   "subvolume (root %llu) leaf currently being processed:",
			   btrfs_root_id(wc->root));
		btrfs_print_leaf(wc->subvol_path->nodes[0]);
	}

	if (wc->log_leaf) {
		btrfs_crit(fs_info,
			   "log tree (for root %llu) leaf currently being processed (slot %d key " BTRFS_KEY_FMT "):",
			   btrfs_root_id(wc->root), wc->log_slot,
			   BTRFS_KEY_FMT_VALUE(&wc->log_key));
		btrfs_print_leaf(wc->log_leaf);
	}

	va_start(args, fmt);
	vaf.fmt = fmt;
	vaf.va = &args;

	btrfs_crit(fs_info,
		   "log replay failed in %s:%u for root %llu, stage %d, with error %d: %pV",
		   function, line, btrfs_root_id(wc->root), wc->stage, error, &vaf);

	va_end(args);
}

/*
 * Use this for aborting a transaction during log replay while we are down the
 * call chain of replay_one_buffer(), so that we get a lot more useful
 * information for debugging issues when compared to a plain call to
 * btrfs_abort_transaction().
 */
#define btrfs_abort_log_replay(wc, error, fmt, args...) \
	do_abort_log_replay((wc), __func__, __LINE__, (error), fmt, ##args)
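
/*
 * Illustrative example only (mirrors the call sites that appear later in this
 * file, e.g. in overwrite_item()): a typical use of the macro looks roughly
 * like
 *
 *	ret = btrfs_search_slot(NULL, root, &wc->log_key, wc->subvol_path, 0, 0);
 *	if (ret < 0) {
 *		btrfs_abort_log_replay(wc, ret,
 *				"failed to search subvolume tree for key " BTRFS_KEY_FMT,
 *				BTRFS_KEY_FMT_VALUE(&wc->log_key));
 *		return ret;
 *	}
 *
 * so that the failing function, line number, replay stage and the leaves
 * currently being processed are all dumped on the first failure.
 */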
static int btrfs_log_inode(struct btrfs_trans_handle *trans,
			   struct btrfs_inode *inode, int inode_only,
			   struct btrfs_log_ctx *ctx);
static int link_to_fixup_dir(struct walk_control *wc, u64 objectid);
static noinline int replay_dir_deletes(struct walk_control *wc, u64 dirid,
				       bool del_all);
static void wait_log_commit(struct btrfs_root *root, int transid);

/*
 * tree logging is a special write ahead log used to make sure that
 * fsyncs and O_SYNCs can happen without doing full tree commits.
 *
 * Full tree commits are expensive because they require commonly
 * modified blocks to be recowed, creating many dirty pages in the
 * extent tree and a 4x-6x higher write load than ext3.
 *
 * Instead of doing a tree commit on every fsync, we use the
 * key ranges and transaction ids to find items for a given file or directory
 * that have changed in this transaction.  Those items are copied into
 * a special tree (one per subvolume root), that tree is written to disk
 * and then the fsync is considered complete.
 *
 * After a crash, items are copied out of the log-tree back into the
 * subvolume tree.  Any file data extents found are recorded in the extent
 * allocation tree, and the log-tree freed.
 *
 * The log tree is read three times: once to pin down all the extents it is
 * using in ram, once to create all the inodes logged in the tree
 * and once to do all the other items.
 */
static struct btrfs_inode *btrfs_iget_logging(u64 objectid, struct btrfs_root *root)
{
	unsigned int nofs_flag;
	struct btrfs_inode *inode;

	/* Only meant to be called for subvolume roots and not for log roots. */
	ASSERT(btrfs_is_fstree(btrfs_root_id(root)), "root_id=%llu", btrfs_root_id(root));

	/*
	 * We're holding a transaction handle whether we are logging or
	 * replaying a log tree, so we must make sure NOFS semantics apply
	 * because btrfs_alloc_inode() may be triggered and it uses GFP_KERNEL
	 * to allocate an inode, which can recurse back into the filesystem and
	 * attempt a transaction commit, resulting in a deadlock.
	 */
	nofs_flag = memalloc_nofs_save();
	inode = btrfs_iget(objectid, root);
	memalloc_nofs_restore(nofs_flag);

	return inode;
}

/*
 * start a sub transaction and setup the log tree
 * this increments the log tree writer count to make the people
 * syncing the tree wait for us to finish
 */
static int start_log_trans(struct btrfs_trans_handle *trans,
			   struct btrfs_root *root,
			   struct btrfs_log_ctx *ctx)
{
	struct btrfs_fs_info *fs_info = root->fs_info;
	struct btrfs_root *tree_root = fs_info->tree_root;
	const bool zoned = btrfs_is_zoned(fs_info);
	int ret = 0;
	bool created = false;

	/*
	 * First check if the log root tree was already created.  If not, create
	 * it before locking the root's log_mutex, just to keep lockdep happy.
	 */
	if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state)) {
		mutex_lock(&tree_root->log_mutex);
		if (!fs_info->log_root_tree) {
			ret = btrfs_init_log_root_tree(trans, fs_info);
			if (!ret) {
				set_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state);
				created = true;
			}
		}
		mutex_unlock(&tree_root->log_mutex);
		if (ret)
			return ret;
	}

	mutex_lock(&root->log_mutex);

again:
	if (root->log_root) {
		int index = (root->log_transid + 1) % 2;

		if (btrfs_need_log_full_commit(trans)) {
			ret = BTRFS_LOG_FORCE_COMMIT;
			goto out;
		}

		if (zoned && atomic_read(&root->log_commit[index])) {
			wait_log_commit(root, root->log_transid - 1);
			goto again;
		}

		if (!root->log_start_pid) {
			clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
			root->log_start_pid = current->pid;
		} else if (root->log_start_pid != current->pid) {
			set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
		}
	} else {
		/*
		 * This means fs_info->log_root_tree was already created
		 * for some other FS trees.  Do the full commit not to mix
		 * nodes from multiple log transactions to do sequential
		 * writing.
		 */
		if (zoned && !created) {
			ret = BTRFS_LOG_FORCE_COMMIT;
			goto out;
		}

		ret = btrfs_add_log_tree(trans, root);
		if (ret)
			goto out;

		set_bit(BTRFS_ROOT_HAS_LOG_TREE, &root->state);
		clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
		root->log_start_pid = current->pid;
	}

	atomic_inc(&root->log_writers);
	if (!ctx->logging_new_name) {
		int index = root->log_transid % 2;

		list_add_tail(&ctx->list, &root->log_ctxs[index]);
		ctx->log_transid = root->log_transid;
	}

out:
	mutex_unlock(&root->log_mutex);
	return ret;
}

/*
 * returns 0 if there was a log transaction running and we were able
 * to join, or returns -ENOENT if there were no transactions
 * in progress
 */
static int join_running_log_trans(struct btrfs_root *root)
{
	const bool zoned = btrfs_is_zoned(root->fs_info);
	int ret = -ENOENT;

	if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &root->state))
		return ret;

	mutex_lock(&root->log_mutex);
again:
	if (root->log_root) {
		int index = (root->log_transid + 1) % 2;

		ret = 0;
		if (zoned && atomic_read(&root->log_commit[index])) {
			wait_log_commit(root, root->log_transid - 1);
			goto again;
		}
		atomic_inc(&root->log_writers);
	}
	mutex_unlock(&root->log_mutex);

	return ret;
}

/*
 * This either makes the current running log transaction wait
 * until you call btrfs_end_log_trans() or it makes any future
 * log transactions wait until you call btrfs_end_log_trans()
 */
void btrfs_pin_log_trans(struct btrfs_root *root)
{
	atomic_inc(&root->log_writers);
}

/*
 * indicate we're done making changes to the log tree
 * and wake up anyone waiting to do a sync
 */
void btrfs_end_log_trans(struct btrfs_root *root)
{
	if (atomic_dec_and_test(&root->log_writers)) {
		/* atomic_dec_and_test implies a barrier */
		cond_wake_up_nomb(&root->log_writer_wait);
	}
}
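
/*
 * Illustrative pairing only: a caller that needs to keep the current log
 * transaction from committing while it does some work would typically do
 *
 *	btrfs_pin_log_trans(root);
 *	... make changes that must not race with a log commit ...
 *	btrfs_end_log_trans(root);
 *
 * since both start_log_trans() and btrfs_pin_log_trans() take a reference on
 * root->log_writers that btrfs_end_log_trans() drops.
 */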
/*
 * process_func used to pin down extents, write them or wait on them
 */
static int process_one_buffer(struct extent_buffer *eb,
			      struct walk_control *wc, u64 gen, int level)
{
	struct btrfs_root *log = wc->log;
	struct btrfs_trans_handle *trans = wc->trans;
	struct btrfs_fs_info *fs_info = log->fs_info;
	int ret = 0;

	/*
	 * If this fs is mixed then we need to be able to process the leaves to
	 * pin down any logged extents, so we have to read the block.
	 */
	if (btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
		struct btrfs_tree_parent_check check = {
			.level = level,
			.transid = gen
		};

		ret = btrfs_read_extent_buffer(eb, &check);
		if (unlikely(ret)) {
			if (trans)
				btrfs_abort_transaction(trans, ret);
			else
				btrfs_handle_fs_error(fs_info, ret, NULL);
			return ret;
		}
	}

	if (wc->pin) {
		ASSERT(trans != NULL);
		ret = btrfs_pin_extent_for_log_replay(trans, eb);
		if (unlikely(ret)) {
			btrfs_abort_transaction(trans, ret);
			return ret;
		}

		if (btrfs_buffer_uptodate(eb, gen, false) && level == 0) {
			ret = btrfs_exclude_logged_extents(eb);
			if (ret)
				btrfs_abort_transaction(trans, ret);
		}
	}
	return ret;
}
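
/*
 * Sketch for clarity (an assumption, the actual setup lives elsewhere in this
 * file): during the first stage of log replay a walk_control is wired up
 * roughly as
 *
 *	struct walk_control wc = {
 *		.pin = true,
 *		.stage = LOG_WALK_PIN_ONLY,
 *		.process_func = process_one_buffer,
 *	};
 *
 * so that walking the log tree only pins the extents it references and does
 * not yet modify any subvolume tree.
 */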
/*
 * Item overwrite used by log replay.  The given log tree leaf, slot and key
 * from the walk_control structure all refer to the source data we are copying
 * out.
 *
 * The given root is for the tree we are copying into, and path is a scratch
 * path for use in this function (it should be released on entry and will be
 * released on exit).
 *
 * If the key is already in the destination tree the existing item is
 * overwritten.  If the existing item isn't big enough, it is extended.
 * If it is too large, it is truncated.
 *
 * If the key isn't in the destination yet, a new item is inserted.
 */
static int overwrite_item(struct walk_control *wc)
{
	struct btrfs_trans_handle *trans = wc->trans;
	struct btrfs_root *root = wc->root;
	int ret;
	u32 item_size;
	u64 saved_i_size = 0;
	int save_old_i_size = 0;
	unsigned long src_ptr;
	unsigned long dst_ptr;
	struct extent_buffer *dst_eb;
	int dst_slot;
	const bool is_inode_item = (wc->log_key.type == BTRFS_INODE_ITEM_KEY);

	/*
	 * This is only used during log replay, so the root is always from a
	 * fs/subvolume tree.  In case we ever need to support a log root, then
	 * we'll have to clone the leaf in the path, release the path and use
	 * the leaf before writing into the log tree.  See the comments at
	 * copy_items() for more details.
	 */
	ASSERT(btrfs_root_id(root) != BTRFS_TREE_LOG_OBJECTID,
	       "root_id=%llu", btrfs_root_id(root));

	item_size = btrfs_item_size(wc->log_leaf, wc->log_slot);
	src_ptr = btrfs_item_ptr_offset(wc->log_leaf, wc->log_slot);

	/* Look for the key in the destination tree. */
	ret = btrfs_search_slot(NULL, root, &wc->log_key, wc->subvol_path, 0, 0);
	if (ret < 0) {
		btrfs_abort_log_replay(wc, ret,
				       "failed to search subvolume tree for key " BTRFS_KEY_FMT " root %llu",
				       BTRFS_KEY_FMT_VALUE(&wc->log_key),
				       btrfs_root_id(root));
		return ret;
	}

	dst_eb = wc->subvol_path->nodes[0];
	dst_slot = wc->subvol_path->slots[0];

	if (ret == 0) {
		char *src_copy;
		const u32 dst_size = btrfs_item_size(dst_eb, dst_slot);

		if (dst_size != item_size)
			goto insert;

		if (item_size == 0) {
			btrfs_release_path(wc->subvol_path);
			return 0;
		}
		src_copy = kmalloc(item_size, GFP_NOFS);
		if (!src_copy) {
			btrfs_abort_log_replay(wc, -ENOMEM,
					       "failed to allocate memory for log leaf item");
			return -ENOMEM;
		}

		read_extent_buffer(wc->log_leaf, src_copy, src_ptr, item_size);
		dst_ptr = btrfs_item_ptr_offset(dst_eb, dst_slot);
		ret = memcmp_extent_buffer(dst_eb, src_copy, dst_ptr, item_size);
		kfree(src_copy);
		/*
		 * they have the same contents, just return, this saves
		 * us from cowing blocks in the destination tree and doing
		 * extra writes that may not have been done by a previous
		 * sync
		 */
		if (ret == 0) {
			btrfs_release_path(wc->subvol_path);
			return 0;
		}

		/*
		 * We need to load the old nbytes into the inode so when we
		 * replay the extents we've logged we get the right nbytes.
		 */
		if (is_inode_item) {
			struct btrfs_inode_item *item;
			u64 nbytes;
			u32 mode;

			item = btrfs_item_ptr(dst_eb, dst_slot,
					      struct btrfs_inode_item);
			nbytes = btrfs_inode_nbytes(dst_eb, item);
			item = btrfs_item_ptr(wc->log_leaf, wc->log_slot,
					      struct btrfs_inode_item);
			btrfs_set_inode_nbytes(wc->log_leaf, item, nbytes);

			/*
			 * If this is a directory we need to reset the i_size to
			 * 0 so that we can set it up properly when replaying
			 * the rest of the items in this log.
			 */
			mode = btrfs_inode_mode(wc->log_leaf, item);
			if (S_ISDIR(mode))
				btrfs_set_inode_size(wc->log_leaf, item, 0);
		}
	} else if (is_inode_item) {
		struct btrfs_inode_item *item;
		u32 mode;

		/*
		 * New inode, set nbytes to 0 so that the nbytes comes out
		 * properly when we replay the extents.
		 */
		item = btrfs_item_ptr(wc->log_leaf, wc->log_slot,
				      struct btrfs_inode_item);
		btrfs_set_inode_nbytes(wc->log_leaf, item, 0);

		/*
		 * If this is a directory we need to reset the i_size to 0 so
		 * that we can set it up properly when replaying the rest of
		 * the items in this log.
		 */
		mode = btrfs_inode_mode(wc->log_leaf, item);
		if (S_ISDIR(mode))
			btrfs_set_inode_size(wc->log_leaf, item, 0);
	}
insert:
	btrfs_release_path(wc->subvol_path);
	/* try to insert the key into the destination tree */
	wc->subvol_path->skip_release_on_error = true;
	ret = btrfs_insert_empty_item(trans, root, wc->subvol_path,
				      &wc->log_key, item_size);
	wc->subvol_path->skip_release_on_error = false;

	dst_eb = wc->subvol_path->nodes[0];
	dst_slot = wc->subvol_path->slots[0];

	/* make sure any existing item is the correct size */
	if (ret == -EEXIST || ret == -EOVERFLOW) {
		const u32 found_size = btrfs_item_size(dst_eb, dst_slot);

		if (found_size > item_size)
			btrfs_truncate_item(trans, wc->subvol_path, item_size, 1);
		else if (found_size < item_size)
			btrfs_extend_item(trans, wc->subvol_path,
					  item_size - found_size);
	} else if (ret) {
		btrfs_abort_log_replay(wc, ret,
				       "failed to insert item for key " BTRFS_KEY_FMT,
				       BTRFS_KEY_FMT_VALUE(&wc->log_key));
		return ret;
	}
	dst_ptr = btrfs_item_ptr_offset(dst_eb, dst_slot);

	/* don't overwrite an existing inode if the generation number
	 * was logged as zero.  This is done when the tree logging code
	 * is just logging an inode to make sure it exists after recovery.
	 *
	 * Also, don't overwrite i_size on directories during replay.
	 * log replay inserts and removes directory items based on the
	 * state of the tree found in the subvolume, and i_size is modified
	 * as it goes
	 */
	if (is_inode_item && ret == -EEXIST) {
		struct btrfs_inode_item *src_item;
		struct btrfs_inode_item *dst_item;

		src_item = (struct btrfs_inode_item *)src_ptr;
		dst_item = (struct btrfs_inode_item *)dst_ptr;

		if (btrfs_inode_generation(wc->log_leaf, src_item) == 0) {
			const u64 ino_size = btrfs_inode_size(wc->log_leaf, src_item);

			/*
			 * For regular files an ino_size == 0 is used only when
			 * logging that an inode exists, as part of a directory
			 * fsync, and the inode wasn't fsynced before.  In this
			 * case don't set the size of the inode in the fs/subvol
			 * tree, otherwise we would be throwing valid data away.
			 */
			if (S_ISREG(btrfs_inode_mode(wc->log_leaf, src_item)) &&
			    S_ISREG(btrfs_inode_mode(dst_eb, dst_item)) &&
			    ino_size != 0)
				btrfs_set_inode_size(dst_eb, dst_item, ino_size);
			goto no_copy;
		}

		if (S_ISDIR(btrfs_inode_mode(wc->log_leaf, src_item)) &&
		    S_ISDIR(btrfs_inode_mode(dst_eb, dst_item))) {
			save_old_i_size = 1;
			saved_i_size = btrfs_inode_size(dst_eb, dst_item);
		}
	}

	copy_extent_buffer(dst_eb, wc->log_leaf, dst_ptr, src_ptr, item_size);

	if (save_old_i_size) {
		struct btrfs_inode_item *dst_item;

		dst_item = (struct btrfs_inode_item *)dst_ptr;
		btrfs_set_inode_size(dst_eb, dst_item, saved_i_size);
	}

	/* make sure the generation is filled in */
	if (is_inode_item) {
		struct btrfs_inode_item *dst_item;

		dst_item = (struct btrfs_inode_item *)dst_ptr;
		if (btrfs_inode_generation(dst_eb, dst_item) == 0)
			btrfs_set_inode_generation(dst_eb, dst_item,
						   trans->transid);
	}
no_copy:
	btrfs_release_path(wc->subvol_path);
	return 0;
}

static int read_alloc_one_name(struct extent_buffer *eb, void *start, int len,
			       struct fscrypt_str *name)
{
	char *buf;

	buf = kmalloc(len, GFP_NOFS);
	if (!buf)
		return -ENOMEM;

	read_extent_buffer(eb, buf, (unsigned long)start, len);
	name->name = buf;
	name->len = len;
	return 0;
}
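
/*
 * Illustrative note: read_alloc_one_name() hands ownership of the allocated
 * buffer to the caller, so callers in this file are expected to free it once
 * they are done, e.g.
 *
 *	struct fscrypt_str name;
 *
 *	ret = read_alloc_one_name(leaf, di + 1, btrfs_dir_name_len(leaf, di), &name);
 *	if (ret)
 *		return ret;
 *	... use name.name / name.len ...
 *	kfree(name.name);
 */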
/*
 * replays a single extent in 'eb' at 'slot' with 'key' into the
 * subvolume 'root'.  path is released on entry and should be released
 * on exit.
 *
 * extents in the log tree have not been allocated out of the extent
 * tree yet.  So, this completes the allocation, taking a reference
 * as required if the extent already exists or creating a new extent
 * if it isn't in the extent allocation tree yet.
 *
 * The extent is inserted into the file, dropping any existing extents
 * from the file that overlap the new one.
 */
static noinline int replay_one_extent(struct walk_control *wc)
{
	struct btrfs_trans_handle *trans = wc->trans;
	struct btrfs_root *root = wc->root;
	struct btrfs_drop_extents_args drop_args = { 0 };
	struct btrfs_fs_info *fs_info = root->fs_info;
	int found_type;
	u64 extent_end;
	const u64 start = wc->log_key.offset;
	u64 nbytes = 0;
	u64 csum_start;
	u64 csum_end;
	LIST_HEAD(ordered_sums);
	u64 offset;
	unsigned long dest_offset;
	struct btrfs_key ins;
	struct btrfs_file_extent_item *item;
	struct btrfs_inode *inode = NULL;
	int ret = 0;

	item = btrfs_item_ptr(wc->log_leaf, wc->log_slot,
			      struct btrfs_file_extent_item);
	found_type = btrfs_file_extent_type(wc->log_leaf, item);

	if (found_type == BTRFS_FILE_EXTENT_REG ||
	    found_type == BTRFS_FILE_EXTENT_PREALLOC) {
		extent_end = start + btrfs_file_extent_num_bytes(wc->log_leaf, item);
		/* Holes don't take up space. */
		if (btrfs_file_extent_disk_bytenr(wc->log_leaf, item) != 0)
			nbytes = btrfs_file_extent_num_bytes(wc->log_leaf, item);
	} else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
		nbytes = btrfs_file_extent_ram_bytes(wc->log_leaf, item);
		extent_end = ALIGN(start + nbytes, fs_info->sectorsize);
	} else {
		btrfs_abort_log_replay(wc, -EUCLEAN,
				       "unexpected extent type=%d root=%llu inode=%llu offset=%llu",
				       found_type, btrfs_root_id(root),
				       wc->log_key.objectid, wc->log_key.offset);
		return -EUCLEAN;
	}

	inode = btrfs_iget_logging(wc->log_key.objectid, root);
	if (IS_ERR(inode)) {
		ret = PTR_ERR(inode);
		btrfs_abort_log_replay(wc, ret,
				       "failed to get inode %llu for root %llu",
				       wc->log_key.objectid, btrfs_root_id(root));
		return ret;
	}

	/*
	 * first check to see if we already have this extent in the
	 * file.  This must be done before the btrfs_drop_extents run
	 * so we don't try to drop this extent.
	 */
	ret = btrfs_lookup_file_extent(trans, root, wc->subvol_path,
				       btrfs_ino(inode), start, 0);

	if (ret == 0 &&
	    (found_type == BTRFS_FILE_EXTENT_REG ||
	     found_type == BTRFS_FILE_EXTENT_PREALLOC)) {
		struct extent_buffer *leaf = wc->subvol_path->nodes[0];
		struct btrfs_file_extent_item existing;
		unsigned long ptr;

		ptr = btrfs_item_ptr_offset(leaf, wc->subvol_path->slots[0]);
		read_extent_buffer(leaf, &existing, ptr, sizeof(existing));

		/*
		 * we already have a pointer to this exact extent,
		 * we don't have to do anything
		 */
		if (memcmp_extent_buffer(wc->log_leaf, &existing,
					 (unsigned long)item, sizeof(existing)) == 0) {
			btrfs_release_path(wc->subvol_path);
			goto out;
		}
	}
	btrfs_release_path(wc->subvol_path);

	/* drop any overlapping extents */
	drop_args.start = start;
	drop_args.end = extent_end;
	drop_args.drop_cache = true;
	drop_args.path = wc->subvol_path;
	ret = btrfs_drop_extents(trans, root, inode, &drop_args);
	if (ret) {
		btrfs_abort_log_replay(wc, ret,
				       "failed to drop extents for inode %llu range [%llu, %llu) root %llu",
				       wc->log_key.objectid, start, extent_end,
				       btrfs_root_id(root));
		goto out;
	}

	if (found_type == BTRFS_FILE_EXTENT_INLINE) {
		/* inline extents are easy, we just overwrite them */
		ret = overwrite_item(wc);
		if (ret)
			goto out;
		goto update_inode;
	}

	/*
	 * If not an inline extent, it can only be a regular or prealloc one.
	 * We have checked that above and returned -EUCLEAN if not.
	 */

	/* A hole and NO_HOLES feature enabled, nothing else to do. */
	if (btrfs_file_extent_disk_bytenr(wc->log_leaf, item) == 0 &&
	    btrfs_fs_incompat(fs_info, NO_HOLES))
		goto update_inode;

	ret = btrfs_insert_empty_item(trans, root, wc->subvol_path,
				      &wc->log_key, sizeof(*item));
	if (ret) {
		btrfs_abort_log_replay(wc, ret,
				       "failed to insert item with key " BTRFS_KEY_FMT " root %llu",
				       BTRFS_KEY_FMT_VALUE(&wc->log_key),
				       btrfs_root_id(root));
		goto out;
	}
	dest_offset = btrfs_item_ptr_offset(wc->subvol_path->nodes[0],
					    wc->subvol_path->slots[0]);
	copy_extent_buffer(wc->subvol_path->nodes[0], wc->log_leaf, dest_offset,
			   (unsigned long)item, sizeof(*item));

	/*
	 * We have an explicit hole and NO_HOLES is not enabled.  We have added
	 * the hole file extent item to the subvolume tree, so we don't have
	 * anything else to do other than update the file extent item range and
	 * update the inode item.
	 */
	if (btrfs_file_extent_disk_bytenr(wc->log_leaf, item) == 0) {
		btrfs_release_path(wc->subvol_path);
		goto update_inode;
	}

	ins.objectid = btrfs_file_extent_disk_bytenr(wc->log_leaf, item);
	ins.type = BTRFS_EXTENT_ITEM_KEY;
	ins.offset = btrfs_file_extent_disk_num_bytes(wc->log_leaf, item);
	offset = wc->log_key.offset - btrfs_file_extent_offset(wc->log_leaf, item);

	/*
	 * Manually record dirty extent, as here we did a shallow file extent
	 * item copy and skip normal backref update, but modifying extent tree
	 * all by ourselves.  So need to manually record dirty extent for qgroup,
	 * as the owner of the file extent changed from log tree (doesn't affect
	 * qgroup) to fs/file tree (affects qgroup).
	 */
	ret = btrfs_qgroup_trace_extent(trans, ins.objectid, ins.offset);
	if (ret < 0) {
		btrfs_abort_log_replay(wc, ret,
				       "failed to trace extent for bytenr %llu disk_num_bytes %llu inode %llu root %llu",
				       ins.objectid, ins.offset,
				       wc->log_key.objectid, btrfs_root_id(root));
		goto out;
	}

	/*
	 * Is this extent already allocated in the extent tree?
	 * If so, just add a reference.
	 */
	ret = btrfs_lookup_data_extent(fs_info, ins.objectid, ins.offset);
	if (ret < 0) {
		btrfs_abort_log_replay(wc, ret,
				       "failed to lookup data extent for bytenr %llu disk_num_bytes %llu inode %llu root %llu",
				       ins.objectid, ins.offset,
				       wc->log_key.objectid, btrfs_root_id(root));
		goto out;
	} else if (ret == 0) {
		struct btrfs_ref ref = {
			.action = BTRFS_ADD_DELAYED_REF,
			.bytenr = ins.objectid,
			.num_bytes = ins.offset,
			.owning_root = btrfs_root_id(root),
			.ref_root = btrfs_root_id(root),
		};

		btrfs_init_data_ref(&ref, wc->log_key.objectid, offset, 0, false);
		ret = btrfs_inc_extent_ref(trans, &ref);
		if (ret) {
			btrfs_abort_log_replay(wc, ret,
					       "failed to increment data extent for bytenr %llu disk_num_bytes %llu inode %llu root %llu",
					       ins.objectid, ins.offset,
					       wc->log_key.objectid,
					       btrfs_root_id(root));
			goto out;
		}
	} else {
		/* Insert the extent pointer in the extent tree. */
		ret = btrfs_alloc_logged_file_extent(trans, btrfs_root_id(root),
						     wc->log_key.objectid,
						     offset, &ins);
		if (ret) {
			btrfs_abort_log_replay(wc, ret,
					       "failed to allocate logged data extent for bytenr %llu disk_num_bytes %llu offset %llu inode %llu root %llu",
					       ins.objectid, ins.offset, offset,
					       wc->log_key.objectid,
					       btrfs_root_id(root));
			goto out;
		}
	}
	btrfs_release_path(wc->subvol_path);

	if (btrfs_file_extent_compression(wc->log_leaf, item)) {
		csum_start = ins.objectid;
		csum_end = csum_start + ins.offset;
	} else {
		csum_start = ins.objectid +
			btrfs_file_extent_offset(wc->log_leaf, item);
		csum_end = csum_start +
			btrfs_file_extent_num_bytes(wc->log_leaf, item);
	}

	ret = btrfs_lookup_csums_list(root->log_root, csum_start, csum_end - 1,
				      &ordered_sums, false);
	if (ret < 0) {
		btrfs_abort_log_replay(wc, ret,
				       "failed to lookup csums for range [%llu, %llu) inode %llu root %llu",
				       csum_start, csum_end,
				       wc->log_key.objectid, btrfs_root_id(root));
		goto out;
	}
	ret = 0;
	/*
	 * Now delete all existing csums in the csum root that cover our range.
	 * We do this because we can have an extent that is completely
	 * referenced by one file extent item and partially referenced by
	 * another file extent item (like after using the clone or extent_same
	 * ioctls).  In this case if we end up doing the replay of the one that
	 * partially references the extent first, and we do not do the csum
	 * deletion below, we can get 2 csum items in the csum tree that overlap
	 * each other.
	 * For example, imagine our log has the two following file
	 * extent items:
	 *
	 * key (257 EXTENT_DATA 409600)
	 *     extent data disk byte 12845056 nr 102400
	 *     extent data offset 20480 nr 20480 ram 102400
	 *
	 * key (257 EXTENT_DATA 819200)
	 *     extent data disk byte 12845056 nr 102400
	 *     extent data offset 0 nr 102400 ram 102400
	 *
	 * Where the second one fully references the 100K extent that starts at
	 * disk byte 12845056, and the log tree has a single csum item that
	 * covers the entire range of the extent:
	 *
	 * key (EXTENT_CSUM EXTENT_CSUM 12845056) itemsize 100
	 *
	 * After the first file extent item is replayed, the csum tree gets the
	 * following csum item:
	 *
	 * key (EXTENT_CSUM EXTENT_CSUM 12865536) itemsize 20
	 *
	 * Which covers the 20K sub-range starting at offset 20K of our extent.
	 * Now when we replay the second file extent item, if we do not delete
	 * existing csum items that cover any of its blocks, we end up getting
	 * two csum items in our csum tree that overlap each other:
	 *
	 * key (EXTENT_CSUM EXTENT_CSUM 12845056) itemsize 100
	 * key (EXTENT_CSUM EXTENT_CSUM 12865536) itemsize 20
	 *
	 * Which is a problem, because after this anyone trying to look up
	 * the checksum of any block of our extent starting at an offset of 40K
	 * or higher, will end up looking at the second csum item only, which
	 * does not contain the checksum for any block starting at offset 40K or
	 * higher of our extent.
	 */
	while (!list_empty(&ordered_sums)) {
		struct btrfs_ordered_sum *sums;
		struct btrfs_root *csum_root;

		sums = list_first_entry(&ordered_sums, struct btrfs_ordered_sum, list);
		csum_root = btrfs_csum_root(fs_info, sums->logical);
		if (!ret) {
			ret = btrfs_del_csums(trans, csum_root, sums->logical,
					      sums->len);
			if (ret)
				btrfs_abort_log_replay(wc, ret,
						       "failed to delete csums for range [%llu, %llu) inode %llu root %llu",
						       sums->logical,
						       sums->logical + sums->len,
						       wc->log_key.objectid,
						       btrfs_root_id(root));
		}
		if (!ret) {
			ret = btrfs_csum_file_blocks(trans, csum_root, sums);
			if (ret)
				btrfs_abort_log_replay(wc, ret,
						       "failed to add csums for range [%llu, %llu) inode %llu root %llu",
						       sums->logical,
						       sums->logical + sums->len,
						       wc->log_key.objectid,
						       btrfs_root_id(root));
		}
		list_del(&sums->list);
		kfree(sums);
	}
	if (ret)
		goto out;

update_inode:
	ret = btrfs_inode_set_file_extent_range(inode, start, extent_end - start);
	if (ret) {
		btrfs_abort_log_replay(wc, ret,
				       "failed to set file extent range [%llu, %llu) inode %llu root %llu",
				       start, extent_end,
				       wc->log_key.objectid, btrfs_root_id(root));
		goto out;
	}

	btrfs_update_inode_bytes(inode, nbytes, drop_args.bytes_found);
	ret = btrfs_update_inode(trans, inode);
	if (ret)
		btrfs_abort_log_replay(wc, ret,
				       "failed to update inode %llu root %llu",
				       wc->log_key.objectid, btrfs_root_id(root));
out:
	iput(&inode->vfs_inode);
	return ret;
}

static int unlink_inode_for_log_replay(struct walk_control *wc,
				       struct btrfs_inode *dir,
				       struct btrfs_inode *inode,
				       const struct fscrypt_str *name)
{
	struct btrfs_trans_handle *trans = wc->trans;
	int ret;

	ret = btrfs_unlink_inode(trans, dir, inode, name);
	if (ret) {
		btrfs_abort_log_replay(wc, ret,
				       "failed to unlink inode %llu parent dir %llu name %.*s root %llu",
				       btrfs_ino(inode), btrfs_ino(dir),
				       name->len, name->name,
				       btrfs_root_id(inode->root));
		return ret;
	}

	/*
	 * Whenever we need to check if a name exists or not, we check the
	 * fs/subvolume tree.  So after an unlink we must run delayed items, so
	 * that future checks for a name during log replay see that the name
	 * does not exist anymore.
	 */
	ret = btrfs_run_delayed_items(trans);
	if (ret)
		btrfs_abort_log_replay(wc, ret,
				       "failed to run delayed items current inode %llu parent dir %llu name %.*s root %llu",
				       btrfs_ino(inode), btrfs_ino(dir),
				       name->len, name->name,
				       btrfs_root_id(inode->root));

	return ret;
}

/*
 * when cleaning up conflicts between the directory names in the
 * subvolume, directory names in the log and directory names in the
 * inode back references, we may have to unlink inodes from directories.
 *
 * This is a helper function to do the unlink of a specific directory
 * item
 */
static noinline int drop_one_dir_item(struct walk_control *wc,
				      struct btrfs_inode *dir,
				      struct btrfs_dir_item *di)
{
	struct btrfs_root *root = dir->root;
	struct btrfs_inode *inode;
	struct fscrypt_str name;
	struct extent_buffer *leaf = wc->subvol_path->nodes[0];
	struct btrfs_key location;
	int ret;

	btrfs_dir_item_key_to_cpu(leaf, di, &location);
	ret = read_alloc_one_name(leaf, di + 1, btrfs_dir_name_len(leaf, di), &name);
	if (ret) {
		btrfs_abort_log_replay(wc, ret,
				       "failed to allocate name for dir %llu root %llu",
				       btrfs_ino(dir), btrfs_root_id(root));
		return ret;
	}

	btrfs_release_path(wc->subvol_path);

	inode = btrfs_iget_logging(location.objectid, root);
	if (IS_ERR(inode)) {
		ret = PTR_ERR(inode);
		btrfs_abort_log_replay(wc, ret,
				       "failed to open inode %llu parent dir %llu name %.*s root %llu",
				       location.objectid, btrfs_ino(dir),
				       name.len, name.name, btrfs_root_id(root));
		inode = NULL;
		goto out;
	}

	ret = link_to_fixup_dir(wc, location.objectid);
	if (ret)
		goto out;

	ret = unlink_inode_for_log_replay(wc, dir, inode, &name);
out:
	kfree(name.name);
	if (inode)
		iput(&inode->vfs_inode);
	return ret;
}

/*
 * See if a given name and sequence number found in an inode back reference are
 * already in a directory and correctly point to this inode.
 *
 * Returns: < 0 on error, 0 if the directory entry does not exist and 1 if it
 * exists.
 */
static noinline int inode_in_dir(struct btrfs_root *root,
				 struct btrfs_path *path,
				 u64 dirid, u64 objectid, u64 index,
				 struct fscrypt_str *name)
{
	struct btrfs_dir_item *di;
	struct btrfs_key location;
	int ret = 0;

	di = btrfs_lookup_dir_index_item(NULL, root, path, dirid, index, name, 0);
	if (IS_ERR(di)) {
		ret = PTR_ERR(di);
		goto out;
	} else if (di) {
		btrfs_dir_item_key_to_cpu(path->nodes[0], di, &location);
		if (location.objectid != objectid)
			goto out;
	} else {
		goto out;
	}

	btrfs_release_path(path);
	di = btrfs_lookup_dir_item(NULL, root, path, dirid, name, 0);
	if (IS_ERR(di)) {
		ret = PTR_ERR(di);
		goto out;
	} else if (di) {
		btrfs_dir_item_key_to_cpu(path->nodes[0], di, &location);
		if (location.objectid == objectid)
			ret = 1;
	}
out:
	btrfs_release_path(path);
	return ret;
}
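
/*
 * Illustrative call pattern only (a sketch of how replay code can consume the
 * helper above, following its documented return convention):
 *
 *	ret = inode_in_dir(root, path, btrfs_ino(dir), btrfs_ino(inode),
 *			   ref_index, &name);
 *	if (ret < 0)
 *		return ret;	// lookup error
 *	if (ret == 0)
 *		;		// entry missing or pointing elsewhere, must be (re)added
 */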
*/ static noinline int backref_in_log(struct btrfs_root *log, struct btrfs_key *key, u64 ref_objectid, const struct fscrypt_str *name) { BTRFS_PATH_AUTO_FREE(path); int ret; path = btrfs_alloc_path(); if (!path) return -ENOMEM; ret = btrfs_search_slot(NULL, log, key, path, 0, 0); if (ret < 0) return ret; if (ret == 1) return 0; if (key->type == BTRFS_INODE_EXTREF_KEY) ret = !!btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0], ref_objectid, name); else ret = !!btrfs_find_name_in_backref(path->nodes[0], path->slots[0], name); return ret; } static int unlink_refs_not_in_log(struct walk_control *wc, struct btrfs_key *search_key, struct btrfs_inode *dir, struct btrfs_inode *inode) { struct extent_buffer *leaf = wc->subvol_path->nodes[0]; unsigned long ptr; unsigned long ptr_end; /* * Check all the names in this back reference to see if they are in the * log. If so, we allow them to stay otherwise they must be unlinked as * a conflict. */ ptr = btrfs_item_ptr_offset(leaf, wc->subvol_path->slots[0]); ptr_end = ptr + btrfs_item_size(leaf, wc->subvol_path->slots[0]); while (ptr < ptr_end) { struct fscrypt_str victim_name; struct btrfs_inode_ref *victim_ref; int ret; victim_ref = (struct btrfs_inode_ref *)ptr; ret = read_alloc_one_name(leaf, (victim_ref + 1), btrfs_inode_ref_name_len(leaf, victim_ref), &victim_name); if (ret) { btrfs_abort_log_replay(wc, ret, "failed to allocate name for inode %llu parent dir %llu root %llu", btrfs_ino(inode), btrfs_ino(dir), btrfs_root_id(inode->root)); return ret; } ret = backref_in_log(wc->log, search_key, btrfs_ino(dir), &victim_name); if (ret) { if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to check if backref is in log tree for inode %llu parent dir %llu name %.*s root %llu", btrfs_ino(inode), btrfs_ino(dir), victim_name.len, victim_name.name, btrfs_root_id(inode->root)); kfree(victim_name.name); return ret; } kfree(victim_name.name); ptr = (unsigned long)(victim_ref + 1) + victim_name.len; continue; } inc_nlink(&inode->vfs_inode); btrfs_release_path(wc->subvol_path); ret = unlink_inode_for_log_replay(wc, dir, inode, &victim_name); kfree(victim_name.name); if (ret) return ret; return -EAGAIN; } return 0; } static int unlink_extrefs_not_in_log(struct walk_control *wc, struct btrfs_key *search_key, struct btrfs_inode *dir, struct btrfs_inode *inode) { struct extent_buffer *leaf = wc->subvol_path->nodes[0]; const unsigned long base = btrfs_item_ptr_offset(leaf, wc->subvol_path->slots[0]); const u32 item_size = btrfs_item_size(leaf, wc->subvol_path->slots[0]); u32 cur_offset = 0; while (cur_offset < item_size) { struct btrfs_root *log_root = wc->log; struct btrfs_inode_extref *extref; struct fscrypt_str victim_name; int ret; extref = (struct btrfs_inode_extref *)(base + cur_offset); victim_name.len = btrfs_inode_extref_name_len(leaf, extref); if (btrfs_inode_extref_parent(leaf, extref) != btrfs_ino(dir)) goto next; ret = read_alloc_one_name(leaf, &extref->name, victim_name.len, &victim_name); if (ret) { btrfs_abort_log_replay(wc, ret, "failed to allocate name for inode %llu parent dir %llu root %llu", btrfs_ino(inode), btrfs_ino(dir), btrfs_root_id(inode->root)); return ret; } search_key->objectid = btrfs_ino(inode); search_key->type = BTRFS_INODE_EXTREF_KEY; search_key->offset = btrfs_extref_hash(btrfs_ino(dir), victim_name.name, victim_name.len); ret = backref_in_log(log_root, search_key, btrfs_ino(dir), &victim_name); if (ret) { if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to check if backref is in log tree for inode %llu parent 
dir %llu name %.*s root %llu", btrfs_ino(inode), btrfs_ino(dir), victim_name.len, victim_name.name, btrfs_root_id(inode->root)); kfree(victim_name.name); return ret; } kfree(victim_name.name); next: cur_offset += victim_name.len + sizeof(*extref); continue; } inc_nlink(&inode->vfs_inode); btrfs_release_path(wc->subvol_path); ret = unlink_inode_for_log_replay(wc, dir, inode, &victim_name); kfree(victim_name.name); if (ret) return ret; return -EAGAIN; } return 0; } static inline int __add_inode_ref(struct walk_control *wc, struct btrfs_inode *dir, struct btrfs_inode *inode, u64 ref_index, struct fscrypt_str *name) { int ret; struct btrfs_trans_handle *trans = wc->trans; struct btrfs_root *root = wc->root; struct btrfs_dir_item *di; struct btrfs_key search_key; struct btrfs_inode_extref *extref; again: /* Search old style refs */ search_key.objectid = btrfs_ino(inode); search_key.type = BTRFS_INODE_REF_KEY; search_key.offset = btrfs_ino(dir); ret = btrfs_search_slot(NULL, root, &search_key, wc->subvol_path, 0, 0); if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to search subvolume tree for key " BTRFS_KEY_FMT " root %llu", BTRFS_KEY_FMT_VALUE(&search_key), btrfs_root_id(root)); return ret; } else if (ret == 0) { /* * Are we trying to overwrite a back ref for the root directory? * If so, we're done. */ if (search_key.objectid == search_key.offset) return 1; ret = unlink_refs_not_in_log(wc, &search_key, dir, inode); if (ret == -EAGAIN) goto again; else if (ret) return ret; } btrfs_release_path(wc->subvol_path); /* Same search but for extended refs */ extref = btrfs_lookup_inode_extref(root, wc->subvol_path, name, btrfs_ino(inode), btrfs_ino(dir)); if (IS_ERR(extref)) { return PTR_ERR(extref); } else if (extref) { ret = unlink_extrefs_not_in_log(wc, &search_key, dir, inode); if (ret == -EAGAIN) goto again; else if (ret) return ret; } btrfs_release_path(wc->subvol_path); /* look for a conflicting sequence number */ di = btrfs_lookup_dir_index_item(trans, root, wc->subvol_path, btrfs_ino(dir), ref_index, name, 0); if (IS_ERR(di)) { ret = PTR_ERR(di); btrfs_abort_log_replay(wc, ret, "failed to lookup dir index item for dir %llu ref_index %llu name %.*s root %llu", btrfs_ino(dir), ref_index, name->len, name->name, btrfs_root_id(root)); return ret; } else if (di) { ret = drop_one_dir_item(wc, dir, di); if (ret) return ret; } btrfs_release_path(wc->subvol_path); /* look for a conflicting name */ di = btrfs_lookup_dir_item(trans, root, wc->subvol_path, btrfs_ino(dir), name, 0); if (IS_ERR(di)) { ret = PTR_ERR(di); btrfs_abort_log_replay(wc, ret, "failed to lookup dir item for dir %llu name %.*s root %llu", btrfs_ino(dir), name->len, name->name, btrfs_root_id(root)); return ret; } else if (di) { ret = drop_one_dir_item(wc, dir, di); if (ret) return ret; } btrfs_release_path(wc->subvol_path); return 0; } static int extref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr, struct fscrypt_str *name, u64 *index, u64 *parent_objectid) { struct btrfs_inode_extref *extref; int ret; extref = (struct btrfs_inode_extref *)ref_ptr; ret = read_alloc_one_name(eb, &extref->name, btrfs_inode_extref_name_len(eb, extref), name); if (ret) return ret; if (index) *index = btrfs_inode_extref_index(eb, extref); if (parent_objectid) *parent_objectid = btrfs_inode_extref_parent(eb, extref); return 0; } static int ref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr, struct fscrypt_str *name, u64 *index) { struct btrfs_inode_ref *ref; int ret; ref = (struct btrfs_inode_ref *)ref_ptr; ret = 
read_alloc_one_name(eb, ref + 1, btrfs_inode_ref_name_len(eb, ref), name); if (ret) return ret; if (index) *index = btrfs_inode_ref_index(eb, ref); return 0; } /* * Take an inode reference item from the log tree and iterate all names from the * inode reference item in the subvolume tree with the same key (if it exists). * For any name that is not in the inode reference item from the log tree, do a * proper unlink of that name (that is, remove its entry from the inode * reference item and both dir index keys). */ static int unlink_old_inode_refs(struct walk_control *wc, struct btrfs_inode *inode) { struct btrfs_root *root = wc->root; int ret; unsigned long ref_ptr; unsigned long ref_end; struct extent_buffer *eb; again: btrfs_release_path(wc->subvol_path); ret = btrfs_search_slot(NULL, root, &wc->log_key, wc->subvol_path, 0, 0); if (ret > 0) { ret = 0; goto out; } if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to search subvolume tree for key " BTRFS_KEY_FMT " root %llu", BTRFS_KEY_FMT_VALUE(&wc->log_key), btrfs_root_id(root)); goto out; } eb = wc->subvol_path->nodes[0]; ref_ptr = btrfs_item_ptr_offset(eb, wc->subvol_path->slots[0]); ref_end = ref_ptr + btrfs_item_size(eb, wc->subvol_path->slots[0]); while (ref_ptr < ref_end) { struct fscrypt_str name; u64 parent_id; if (wc->log_key.type == BTRFS_INODE_EXTREF_KEY) { ret = extref_get_fields(eb, ref_ptr, &name, NULL, &parent_id); if (ret) { btrfs_abort_log_replay(wc, ret, "failed to get extref details for inode %llu root %llu", btrfs_ino(inode), btrfs_root_id(root)); goto out; } } else { parent_id = wc->log_key.offset; ret = ref_get_fields(eb, ref_ptr, &name, NULL); if (ret) { btrfs_abort_log_replay(wc, ret, "failed to get ref details for inode %llu parent_id %llu root %llu", btrfs_ino(inode), parent_id, btrfs_root_id(root)); goto out; } } if (wc->log_key.type == BTRFS_INODE_EXTREF_KEY) ret = !!btrfs_find_name_in_ext_backref(wc->log_leaf, wc->log_slot, parent_id, &name); else ret = !!btrfs_find_name_in_backref(wc->log_leaf, wc->log_slot, &name); if (!ret) { struct btrfs_inode *dir; btrfs_release_path(wc->subvol_path); dir = btrfs_iget_logging(parent_id, root); if (IS_ERR(dir)) { ret = PTR_ERR(dir); kfree(name.name); btrfs_abort_log_replay(wc, ret, "failed to lookup dir inode %llu root %llu", parent_id, btrfs_root_id(root)); goto out; } ret = unlink_inode_for_log_replay(wc, dir, inode, &name); kfree(name.name); iput(&dir->vfs_inode); if (ret) goto out; goto again; } kfree(name.name); ref_ptr += name.len; if (wc->log_key.type == BTRFS_INODE_EXTREF_KEY) ref_ptr += sizeof(struct btrfs_inode_extref); else ref_ptr += sizeof(struct btrfs_inode_ref); } ret = 0; out: btrfs_release_path(wc->subvol_path); return ret; } /* * Replay one inode back reference item found in the log tree. * Path is for temporary use by this function (it should be released on return). 
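 *
 * Rough flow (a summary of the code below, not a spec):
 *
 *   for each name packed in the log INODE_REF/INODE_EXTREF item:
 *     inode_in_dir()          - name and index already match? nothing to do
 *     __add_inode_ref()       - otherwise drop conflicting names/dir entries
 *     btrfs_add_link()        - and insert the new name + dir index
 *   unlink_old_inode_refs()   - drop names that exist only in the subvolume
 *   overwrite_item()          - finally copy the ref item from the log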
*/ static noinline int add_inode_ref(struct walk_control *wc) { struct btrfs_trans_handle *trans = wc->trans; struct btrfs_root *root = wc->root; struct btrfs_inode *dir = NULL; struct btrfs_inode *inode = NULL; unsigned long ref_ptr; unsigned long ref_end; struct fscrypt_str name = { 0 }; int ret; const bool is_extref_item = (wc->log_key.type == BTRFS_INODE_EXTREF_KEY); u64 parent_objectid; u64 inode_objectid; u64 ref_index = 0; int ref_struct_size; ref_ptr = btrfs_item_ptr_offset(wc->log_leaf, wc->log_slot); ref_end = ref_ptr + btrfs_item_size(wc->log_leaf, wc->log_slot); if (is_extref_item) { struct btrfs_inode_extref *r; ref_struct_size = sizeof(struct btrfs_inode_extref); r = (struct btrfs_inode_extref *)ref_ptr; parent_objectid = btrfs_inode_extref_parent(wc->log_leaf, r); } else { ref_struct_size = sizeof(struct btrfs_inode_ref); parent_objectid = wc->log_key.offset; } inode_objectid = wc->log_key.objectid; /* * it is possible that we didn't log all the parent directories * for a given inode. If we don't find the dir, just don't * copy the back ref in. The link count fixup code will take * care of the rest */ dir = btrfs_iget_logging(parent_objectid, root); if (IS_ERR(dir)) { ret = PTR_ERR(dir); if (ret == -ENOENT) ret = 0; else btrfs_abort_log_replay(wc, ret, "failed to lookup dir inode %llu root %llu", parent_objectid, btrfs_root_id(root)); dir = NULL; goto out; } inode = btrfs_iget_logging(inode_objectid, root); if (IS_ERR(inode)) { ret = PTR_ERR(inode); btrfs_abort_log_replay(wc, ret, "failed to lookup inode %llu root %llu", inode_objectid, btrfs_root_id(root)); inode = NULL; goto out; } while (ref_ptr < ref_end) { if (is_extref_item) { ret = extref_get_fields(wc->log_leaf, ref_ptr, &name, &ref_index, &parent_objectid); if (ret) { btrfs_abort_log_replay(wc, ret, "failed to get extref details for inode %llu root %llu", btrfs_ino(inode), btrfs_root_id(root)); goto out; } /* * parent object can change from one array * item to another. */ if (!dir) { dir = btrfs_iget_logging(parent_objectid, root); if (IS_ERR(dir)) { ret = PTR_ERR(dir); dir = NULL; /* * A new parent dir may have not been * logged and not exist in the subvolume * tree, see the comment above before * the loop when getting the first * parent dir. */ if (ret == -ENOENT) { /* * The next extref may refer to * another parent dir that * exists, so continue. */ ret = 0; goto next; } else { btrfs_abort_log_replay(wc, ret, "failed to lookup dir inode %llu root %llu", parent_objectid, btrfs_root_id(root)); } goto out; } } } else { ret = ref_get_fields(wc->log_leaf, ref_ptr, &name, &ref_index); if (ret) { btrfs_abort_log_replay(wc, ret, "failed to get ref details for inode %llu parent_objectid %llu root %llu", btrfs_ino(inode), parent_objectid, btrfs_root_id(root)); goto out; } } ret = inode_in_dir(root, wc->subvol_path, btrfs_ino(dir), btrfs_ino(inode), ref_index, &name); if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to check if inode %llu is in dir %llu ref_index %llu name %.*s root %llu", btrfs_ino(inode), btrfs_ino(dir), ref_index, name.len, name.name, btrfs_root_id(root)); goto out; } else if (ret == 0) { /* * look for a conflicting back reference in the * metadata. if we find one we have to unlink that name * of the file before we add our new link. Later on, we * overwrite any existing back reference, and we don't * want to create dangling pointers in the directory. 
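 *
 * Note: __add_inode_ref() returns 1 only for the special case of a back
 * reference of the root directory pointing to itself; there is nothing to
 * link in that case, so a return value of 1 below is turned into 0 and
 * processing of this item stops.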
*/ ret = __add_inode_ref(wc, dir, inode, ref_index, &name); if (ret) { if (ret == 1) ret = 0; goto out; } /* insert our name */ ret = btrfs_add_link(trans, dir, inode, &name, 0, ref_index); if (ret) { btrfs_abort_log_replay(wc, ret, "failed to add link for inode %llu in dir %llu ref_index %llu name %.*s root %llu", btrfs_ino(inode), btrfs_ino(dir), ref_index, name.len, name.name, btrfs_root_id(root)); goto out; } ret = btrfs_update_inode(trans, inode); if (ret) { btrfs_abort_log_replay(wc, ret, "failed to update inode %llu root %llu", btrfs_ino(inode), btrfs_root_id(root)); goto out; } } /* Else, ret == 1, we already have a perfect match, we're done. */ next: ref_ptr = (unsigned long)(ref_ptr + ref_struct_size) + name.len; kfree(name.name); name.name = NULL; if (is_extref_item && dir) { iput(&dir->vfs_inode); dir = NULL; } } /* * Before we overwrite the inode reference item in the subvolume tree * with the item from the log tree, we must unlink all names from the * parent directory that are in the subvolume's tree inode reference * item, otherwise we end up with an inconsistent subvolume tree where * dir index entries exist for a name but there is no inode reference * item with the same name. */ ret = unlink_old_inode_refs(wc, inode); if (ret) goto out; /* finally write the back reference in the inode */ ret = overwrite_item(wc); out: btrfs_release_path(wc->subvol_path); kfree(name.name); if (dir) iput(&dir->vfs_inode); if (inode) iput(&inode->vfs_inode); return ret; } static int count_inode_extrefs(struct btrfs_inode *inode, struct btrfs_path *path) { int ret = 0; int name_len; unsigned int nlink = 0; u32 item_size; u32 cur_offset = 0; u64 inode_objectid = btrfs_ino(inode); u64 offset = 0; unsigned long ptr; struct btrfs_inode_extref *extref; struct extent_buffer *leaf; while (1) { ret = btrfs_find_one_extref(inode->root, inode_objectid, offset, path, &extref, &offset); if (ret) break; leaf = path->nodes[0]; item_size = btrfs_item_size(leaf, path->slots[0]); ptr = btrfs_item_ptr_offset(leaf, path->slots[0]); cur_offset = 0; while (cur_offset < item_size) { extref = (struct btrfs_inode_extref *) (ptr + cur_offset); name_len = btrfs_inode_extref_name_len(leaf, extref); nlink++; cur_offset += name_len + sizeof(*extref); } offset++; btrfs_release_path(path); } btrfs_release_path(path); if (ret < 0 && ret != -ENOENT) return ret; return nlink; } static int count_inode_refs(struct btrfs_inode *inode, struct btrfs_path *path) { int ret; struct btrfs_key key; unsigned int nlink = 0; unsigned long ptr; unsigned long ptr_end; int name_len; u64 ino = btrfs_ino(inode); key.objectid = ino; key.type = BTRFS_INODE_REF_KEY; key.offset = (u64)-1; while (1) { ret = btrfs_search_slot(NULL, inode->root, &key, path, 0, 0); if (ret < 0) break; if (ret > 0) { if (path->slots[0] == 0) break; path->slots[0]--; } process_slot: btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); if (key.objectid != ino || key.type != BTRFS_INODE_REF_KEY) break; ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]); ptr_end = ptr + btrfs_item_size(path->nodes[0], path->slots[0]); while (ptr < ptr_end) { struct btrfs_inode_ref *ref; ref = (struct btrfs_inode_ref *)ptr; name_len = btrfs_inode_ref_name_len(path->nodes[0], ref); ptr = (unsigned long)(ref + 1) + name_len; nlink++; } if (key.offset == 0) break; if (path->slots[0] > 0) { path->slots[0]--; goto process_slot; } key.offset--; btrfs_release_path(path); } btrfs_release_path(path); return nlink; } /* * There are a few corners where the link count of the file can't 
* be properly maintained during replay. So, instead of adding * lots of complexity to the log code, we just scan the backrefs * for any file that has been through replay. * * The scan will update the link count on the inode to reflect the * number of back refs found. If it goes down to zero, the iput * will free the inode. */ static noinline int fixup_inode_link_count(struct walk_control *wc, struct btrfs_inode *inode) { struct btrfs_trans_handle *trans = wc->trans; struct btrfs_root *root = inode->root; int ret; u64 nlink = 0; const u64 ino = btrfs_ino(inode); ret = count_inode_refs(inode, wc->subvol_path); if (ret < 0) goto out; nlink = ret; ret = count_inode_extrefs(inode, wc->subvol_path); if (ret < 0) goto out; nlink += ret; ret = 0; if (nlink != inode->vfs_inode.i_nlink) { set_nlink(&inode->vfs_inode, nlink); ret = btrfs_update_inode(trans, inode); if (ret) goto out; } if (S_ISDIR(inode->vfs_inode.i_mode)) inode->index_cnt = (u64)-1; if (inode->vfs_inode.i_nlink == 0) { if (S_ISDIR(inode->vfs_inode.i_mode)) { ret = replay_dir_deletes(wc, ino, true); if (ret) goto out; } ret = btrfs_insert_orphan_item(trans, root, ino); if (ret == -EEXIST) ret = 0; } out: btrfs_release_path(wc->subvol_path); return ret; } static noinline int fixup_inode_link_counts(struct walk_control *wc) { int ret; struct btrfs_key key; key.objectid = BTRFS_TREE_LOG_FIXUP_OBJECTID; key.type = BTRFS_ORPHAN_ITEM_KEY; key.offset = (u64)-1; while (1) { struct btrfs_trans_handle *trans = wc->trans; struct btrfs_root *root = wc->root; struct btrfs_inode *inode; ret = btrfs_search_slot(trans, root, &key, wc->subvol_path, -1, 1); if (ret < 0) break; if (ret == 1) { ret = 0; if (wc->subvol_path->slots[0] == 0) break; wc->subvol_path->slots[0]--; } btrfs_item_key_to_cpu(wc->subvol_path->nodes[0], &key, wc->subvol_path->slots[0]); if (key.objectid != BTRFS_TREE_LOG_FIXUP_OBJECTID || key.type != BTRFS_ORPHAN_ITEM_KEY) break; ret = btrfs_del_item(trans, root, wc->subvol_path); if (ret) break; btrfs_release_path(wc->subvol_path); inode = btrfs_iget_logging(key.offset, root); if (IS_ERR(inode)) { ret = PTR_ERR(inode); break; } ret = fixup_inode_link_count(wc, inode); iput(&inode->vfs_inode); if (ret) break; /* * fixup on a directory may create new entries, * make sure we always look for the highest possible * offset */ key.offset = (u64)-1; } btrfs_release_path(wc->subvol_path); return ret; } /* * record a given inode in the fixup dir so we can check its link * count when replay is done. 
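 *
 * The record is an orphan item keyed at (BTRFS_TREE_LOG_FIXUP_OBJECTID,
 * BTRFS_ORPHAN_ITEM_KEY, <inode number>); fixup_inode_link_counts() above
 * finds and deletes these entries and recomputes the link count of each
 * recorded inode.
 *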
The link count is incremented here * so the inode won't go away until we check it */ static noinline int link_to_fixup_dir(struct walk_control *wc, u64 objectid) { struct btrfs_trans_handle *trans = wc->trans; struct btrfs_root *root = wc->root; struct btrfs_key key; int ret = 0; struct btrfs_inode *inode; struct inode *vfs_inode; inode = btrfs_iget_logging(objectid, root); if (IS_ERR(inode)) { ret = PTR_ERR(inode); btrfs_abort_log_replay(wc, ret, "failed to lookup inode %llu root %llu", objectid, btrfs_root_id(root)); return ret; } vfs_inode = &inode->vfs_inode; key.objectid = BTRFS_TREE_LOG_FIXUP_OBJECTID; key.type = BTRFS_ORPHAN_ITEM_KEY; key.offset = objectid; ret = btrfs_insert_empty_item(trans, root, wc->subvol_path, &key, 0); btrfs_release_path(wc->subvol_path); if (ret == 0) { if (!vfs_inode->i_nlink) set_nlink(vfs_inode, 1); else inc_nlink(vfs_inode); ret = btrfs_update_inode(trans, inode); if (ret) btrfs_abort_log_replay(wc, ret, "failed to update inode %llu root %llu", objectid, btrfs_root_id(root)); } else if (ret == -EEXIST) { ret = 0; } else { btrfs_abort_log_replay(wc, ret, "failed to insert fixup item for inode %llu root %llu", objectid, btrfs_root_id(root)); } iput(vfs_inode); return ret; } /* * when replaying the log for a directory, we only insert names * for inodes that actually exist. This means an fsync on a directory * does not implicitly fsync all the new files in it */ static noinline int insert_one_name(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 dirid, u64 index, const struct fscrypt_str *name, struct btrfs_key *location) { struct btrfs_inode *inode; struct btrfs_inode *dir; int ret; inode = btrfs_iget_logging(location->objectid, root); if (IS_ERR(inode)) return PTR_ERR(inode); dir = btrfs_iget_logging(dirid, root); if (IS_ERR(dir)) { iput(&inode->vfs_inode); return PTR_ERR(dir); } ret = btrfs_add_link(trans, dir, inode, name, 1, index); /* FIXME, put inode into FIXUP list */ iput(&inode->vfs_inode); iput(&dir->vfs_inode); return ret; } static int delete_conflicting_dir_entry(struct walk_control *wc, struct btrfs_inode *dir, struct btrfs_dir_item *dst_di, const struct btrfs_key *log_key, u8 log_flags, bool exists) { struct btrfs_key found_key; btrfs_dir_item_key_to_cpu(wc->subvol_path->nodes[0], dst_di, &found_key); /* The existing dentry points to the same inode, don't delete it. */ if (found_key.objectid == log_key->objectid && found_key.type == log_key->type && found_key.offset == log_key->offset && btrfs_dir_flags(wc->subvol_path->nodes[0], dst_di) == log_flags) return 1; /* * Don't drop the conflicting directory entry if the inode for the new * entry doesn't exist. */ if (!exists) return 0; return drop_one_dir_item(wc, dir, dst_di); } /* * take a single entry in a log directory item and replay it into * the subvolume. * * if a conflicting item exists in the subdirectory already, * the inode it points to is unlinked and put into the link count * fix up tree. * * If a name from the log points to a file or directory that does * not exist in the FS, it is skipped. fsyncs on directories * do not force down inodes inside that directory, just changes to the * names or unlinks in a directory. * * Returns < 0 on error, 0 if the name wasn't replayed (dentry points to a * non-existing inode) and 1 if the name was replayed. 
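 *
 * For example (values invented), replaying the log item
 *
 *   key (256 DIR_INDEX 3) -> name "foo", location (258 INODE_ITEM 0)
 *
 * first resolves any conflicting dir item or dir index entry for that
 * name/index in dir 256, then inserts the name pointing at inode 258,
 * unless the matching INODE_REF/INODE_EXTREF is also in the log, in which
 * case the name is left to be added when that ref item is replayed.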
*/ static noinline int replay_one_name(struct walk_control *wc, struct btrfs_dir_item *di) { struct btrfs_trans_handle *trans = wc->trans; struct btrfs_root *root = wc->root; struct fscrypt_str name = { 0 }; struct btrfs_dir_item *dir_dst_di; struct btrfs_dir_item *index_dst_di; bool dir_dst_matches = false; bool index_dst_matches = false; struct btrfs_key log_key; struct btrfs_key search_key; struct btrfs_inode *dir; u8 log_flags; bool exists; int ret; bool update_size = true; bool name_added = false; dir = btrfs_iget_logging(wc->log_key.objectid, root); if (IS_ERR(dir)) { ret = PTR_ERR(dir); btrfs_abort_log_replay(wc, ret, "failed to lookup dir inode %llu root %llu", wc->log_key.objectid, btrfs_root_id(root)); return ret; } ret = read_alloc_one_name(wc->log_leaf, di + 1, btrfs_dir_name_len(wc->log_leaf, di), &name); if (ret) { btrfs_abort_log_replay(wc, ret, "failed to allocate name for dir %llu root %llu", btrfs_ino(dir), btrfs_root_id(root)); goto out; } log_flags = btrfs_dir_flags(wc->log_leaf, di); btrfs_dir_item_key_to_cpu(wc->log_leaf, di, &log_key); ret = btrfs_lookup_inode(trans, root, wc->subvol_path, &log_key, 0); btrfs_release_path(wc->subvol_path); if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to lookup inode %llu root %llu", log_key.objectid, btrfs_root_id(root)); goto out; } exists = (ret == 0); ret = 0; dir_dst_di = btrfs_lookup_dir_item(trans, root, wc->subvol_path, wc->log_key.objectid, &name, 1); if (IS_ERR(dir_dst_di)) { ret = PTR_ERR(dir_dst_di); btrfs_abort_log_replay(wc, ret, "failed to lookup dir item for dir %llu name %.*s root %llu", wc->log_key.objectid, name.len, name.name, btrfs_root_id(root)); goto out; } else if (dir_dst_di) { ret = delete_conflicting_dir_entry(wc, dir, dir_dst_di, &log_key, log_flags, exists); if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to delete conflicting entry for dir %llu name %.*s root %llu", btrfs_ino(dir), name.len, name.name, btrfs_root_id(root)); goto out; } dir_dst_matches = (ret == 1); } btrfs_release_path(wc->subvol_path); index_dst_di = btrfs_lookup_dir_index_item(trans, root, wc->subvol_path, wc->log_key.objectid, wc->log_key.offset, &name, 1); if (IS_ERR(index_dst_di)) { ret = PTR_ERR(index_dst_di); btrfs_abort_log_replay(wc, ret, "failed to lookup dir index item for dir %llu name %.*s root %llu", wc->log_key.objectid, name.len, name.name, btrfs_root_id(root)); goto out; } else if (index_dst_di) { ret = delete_conflicting_dir_entry(wc, dir, index_dst_di, &log_key, log_flags, exists); if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to delete conflicting entry for dir %llu name %.*s root %llu", btrfs_ino(dir), name.len, name.name, btrfs_root_id(root)); goto out; } index_dst_matches = (ret == 1); } btrfs_release_path(wc->subvol_path); if (dir_dst_matches && index_dst_matches) { ret = 0; update_size = false; goto out; } /* * Check if the inode reference exists in the log for the given name, * inode and parent inode */ search_key.objectid = log_key.objectid; search_key.type = BTRFS_INODE_REF_KEY; search_key.offset = wc->log_key.objectid; ret = backref_in_log(root->log_root, &search_key, 0, &name); if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to check if ref item is logged for inode %llu dir %llu name %.*s root %llu", search_key.objectid, btrfs_ino(dir), name.len, name.name, btrfs_root_id(root)); goto out; } else if (ret) { /* The dentry will be added later. 
*/ ret = 0; update_size = false; goto out; } search_key.objectid = log_key.objectid; search_key.type = BTRFS_INODE_EXTREF_KEY; search_key.offset = btrfs_extref_hash(wc->log_key.objectid, name.name, name.len); ret = backref_in_log(root->log_root, &search_key, wc->log_key.objectid, &name); if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to check if extref item is logged for inode %llu dir %llu name %.*s root %llu", search_key.objectid, btrfs_ino(dir), name.len, name.name, btrfs_root_id(root)); goto out; } else if (ret) { /* The dentry will be added later. */ ret = 0; update_size = false; goto out; } ret = insert_one_name(trans, root, wc->log_key.objectid, wc->log_key.offset, &name, &log_key); if (ret && ret != -ENOENT && ret != -EEXIST) { btrfs_abort_log_replay(wc, ret, "failed to insert name %.*s for inode %llu dir %llu root %llu", name.len, name.name, log_key.objectid, btrfs_ino(dir), btrfs_root_id(root)); goto out; } if (!ret) name_added = true; update_size = false; ret = 0; out: if (!ret && update_size) { btrfs_i_size_write(dir, dir->vfs_inode.i_size + name.len * 2); ret = btrfs_update_inode(trans, dir); if (ret) btrfs_abort_log_replay(wc, ret, "failed to update dir inode %llu root %llu", btrfs_ino(dir), btrfs_root_id(root)); } kfree(name.name); iput(&dir->vfs_inode); if (!ret && name_added) ret = 1; return ret; } /* Replay one dir item from a BTRFS_DIR_INDEX_KEY key. */ static noinline int replay_one_dir_item(struct walk_control *wc) { int ret; struct btrfs_dir_item *di; /* We only log dir index keys, which only contain a single dir item. */ ASSERT(wc->log_key.type == BTRFS_DIR_INDEX_KEY, "wc->log_key.type=%u", wc->log_key.type); di = btrfs_item_ptr(wc->log_leaf, wc->log_slot, struct btrfs_dir_item); ret = replay_one_name(wc, di); if (ret < 0) return ret; /* * If this entry refers to a non-directory (directories can not have a * link count > 1) and it was added in the transaction that was not * committed, make sure we fixup the link count of the inode the entry * points to. Otherwise something like the following would result in a * directory pointing to an inode with a wrong link that does not account * for this dir entry: * * mkdir testdir * touch testdir/foo * touch testdir/bar * sync * * ln testdir/bar testdir/bar_link * ln testdir/foo testdir/foo_link * xfs_io -c "fsync" testdir/bar * * <power failure> * * mount fs, log replay happens * * File foo would remain with a link count of 1 when it has two entries * pointing to it in the directory testdir. This would make it impossible * to ever delete the parent directory has it would result in stale * dentries that can never be deleted. */ if (ret == 1 && btrfs_dir_ftype(wc->log_leaf, di) != BTRFS_FT_DIR) { struct btrfs_key di_key; btrfs_dir_item_key_to_cpu(wc->log_leaf, di, &di_key); ret = link_to_fixup_dir(wc, di_key.objectid); } return ret; } /* * directory replay has two parts. There are the standard directory * items in the log copied from the subvolume, and range items * created in the log while the subvolume was logged. * * The range items tell us which parts of the key space the log * is authoritative for. During replay, if a key in the subvolume * directory is in a logged range item, but not actually in the log * that means it was deleted from the directory before the fsync * and should be removed. 
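 *
 * For example (values invented), a range item
 *
 *   key (256 DIR_LOG_INDEX 2) with dir_log_end 7
 *
 * says the log is authoritative for dir 256 index keys in [2, 7]: any
 * DIR_INDEX key of dir 256 in that range found in the subvolume but not in
 * the log is unlinked by check_item_in_log().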
*/ static noinline int find_dir_range(struct btrfs_root *root, struct btrfs_path *path, u64 dirid, u64 *start_ret, u64 *end_ret) { struct btrfs_key key; u64 found_end; struct btrfs_dir_log_item *item; int ret; int nritems; if (*start_ret == (u64)-1) return 1; key.objectid = dirid; key.type = BTRFS_DIR_LOG_INDEX_KEY; key.offset = *start_ret; ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret < 0) goto out; if (ret > 0) { if (path->slots[0] == 0) goto out; path->slots[0]--; } if (ret != 0) btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); if (key.type != BTRFS_DIR_LOG_INDEX_KEY || key.objectid != dirid) { ret = 1; goto next; } item = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_dir_log_item); found_end = btrfs_dir_log_end(path->nodes[0], item); if (*start_ret >= key.offset && *start_ret <= found_end) { ret = 0; *start_ret = key.offset; *end_ret = found_end; goto out; } ret = 1; next: /* check the next slot in the tree to see if it is a valid item */ nritems = btrfs_header_nritems(path->nodes[0]); path->slots[0]++; if (path->slots[0] >= nritems) { ret = btrfs_next_leaf(root, path); if (ret) goto out; } btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); if (key.type != BTRFS_DIR_LOG_INDEX_KEY || key.objectid != dirid) { ret = 1; goto out; } item = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_dir_log_item); found_end = btrfs_dir_log_end(path->nodes[0], item); *start_ret = key.offset; *end_ret = found_end; ret = 0; out: btrfs_release_path(path); return ret; } /* * this looks for a given directory item in the log. If the directory * item is not in the log, the item is removed and the inode it points * to is unlinked */ static noinline int check_item_in_log(struct walk_control *wc, struct btrfs_path *log_path, struct btrfs_inode *dir, struct btrfs_key *dir_key, bool force_remove) { struct btrfs_trans_handle *trans = wc->trans; struct btrfs_root *root = dir->root; int ret; struct extent_buffer *eb; int slot; struct btrfs_dir_item *di; struct fscrypt_str name = { 0 }; struct btrfs_inode *inode = NULL; struct btrfs_key location; /* * Currently we only log dir index keys. Even if we replay a log created * by an older kernel that logged both dir index and dir item keys, all * we need to do is process the dir index keys, we (and our caller) can * safely ignore dir item keys (key type BTRFS_DIR_ITEM_KEY). */ ASSERT(dir_key->type == BTRFS_DIR_INDEX_KEY, "dir_key->type=%u", dir_key->type); eb = wc->subvol_path->nodes[0]; slot = wc->subvol_path->slots[0]; di = btrfs_item_ptr(eb, slot, struct btrfs_dir_item); ret = read_alloc_one_name(eb, di + 1, btrfs_dir_name_len(eb, di), &name); if (ret) { btrfs_abort_log_replay(wc, ret, "failed to allocate name for dir %llu index %llu root %llu", btrfs_ino(dir), dir_key->offset, btrfs_root_id(root)); goto out; } if (!force_remove) { struct btrfs_dir_item *log_di; log_di = btrfs_lookup_dir_index_item(trans, wc->log, log_path, dir_key->objectid, dir_key->offset, &name, 0); if (IS_ERR(log_di)) { ret = PTR_ERR(log_di); btrfs_abort_log_replay(wc, ret, "failed to lookup dir index item for dir %llu index %llu name %.*s root %llu", btrfs_ino(dir), dir_key->offset, name.len, name.name, btrfs_root_id(root)); goto out; } else if (log_di) { /* The dentry exists in the log, we have nothing to do. 
*/ ret = 0; goto out; } } btrfs_dir_item_key_to_cpu(eb, di, &location); btrfs_release_path(wc->subvol_path); btrfs_release_path(log_path); inode = btrfs_iget_logging(location.objectid, root); if (IS_ERR(inode)) { ret = PTR_ERR(inode); inode = NULL; btrfs_abort_log_replay(wc, ret, "failed to lookup inode %llu root %llu", location.objectid, btrfs_root_id(root)); goto out; } ret = link_to_fixup_dir(wc, location.objectid); if (ret) goto out; inc_nlink(&inode->vfs_inode); ret = unlink_inode_for_log_replay(wc, dir, inode, &name); /* * Unlike dir item keys, dir index keys can only have one name (entry) in * them, as there are no key collisions since each key has a unique offset * (an index number), so we're done. */ out: btrfs_release_path(wc->subvol_path); btrfs_release_path(log_path); kfree(name.name); if (inode) iput(&inode->vfs_inode); return ret; } static int replay_xattr_deletes(struct walk_control *wc) { struct btrfs_trans_handle *trans = wc->trans; struct btrfs_root *root = wc->root; struct btrfs_root *log = wc->log; struct btrfs_key search_key; BTRFS_PATH_AUTO_FREE(log_path); const u64 ino = wc->log_key.objectid; int nritems; int ret; log_path = btrfs_alloc_path(); if (!log_path) { btrfs_abort_log_replay(wc, -ENOMEM, "failed to allocate path"); return -ENOMEM; } search_key.objectid = ino; search_key.type = BTRFS_XATTR_ITEM_KEY; search_key.offset = 0; again: ret = btrfs_search_slot(NULL, root, &search_key, wc->subvol_path, 0, 0); if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to search xattrs for inode %llu root %llu", ino, btrfs_root_id(root)); goto out; } process_leaf: nritems = btrfs_header_nritems(wc->subvol_path->nodes[0]); for (int i = wc->subvol_path->slots[0]; i < nritems; i++) { struct btrfs_key key; struct btrfs_dir_item *di; struct btrfs_dir_item *log_di; u32 total_size; u32 cur; btrfs_item_key_to_cpu(wc->subvol_path->nodes[0], &key, i); if (key.objectid != ino || key.type != BTRFS_XATTR_ITEM_KEY) { ret = 0; goto out; } di = btrfs_item_ptr(wc->subvol_path->nodes[0], i, struct btrfs_dir_item); total_size = btrfs_item_size(wc->subvol_path->nodes[0], i); cur = 0; while (cur < total_size) { u16 name_len = btrfs_dir_name_len(wc->subvol_path->nodes[0], di); u16 data_len = btrfs_dir_data_len(wc->subvol_path->nodes[0], di); u32 this_len = sizeof(*di) + name_len + data_len; char *name; name = kmalloc(name_len, GFP_NOFS); if (!name) { ret = -ENOMEM; btrfs_abort_log_replay(wc, ret, "failed to allocate memory for name of length %u", name_len); goto out; } read_extent_buffer(wc->subvol_path->nodes[0], name, (unsigned long)(di + 1), name_len); log_di = btrfs_lookup_xattr(NULL, log, log_path, ino, name, name_len, 0); btrfs_release_path(log_path); if (!log_di) { /* Doesn't exist in log tree, so delete it. 
*/ btrfs_release_path(wc->subvol_path); di = btrfs_lookup_xattr(trans, root, wc->subvol_path, ino, name, name_len, -1); if (IS_ERR(di)) { ret = PTR_ERR(di); btrfs_abort_log_replay(wc, ret, "failed to lookup xattr with name %.*s for inode %llu root %llu", name_len, name, ino, btrfs_root_id(root)); kfree(name); goto out; } ASSERT(di); ret = btrfs_delete_one_dir_name(trans, root, wc->subvol_path, di); if (ret) { btrfs_abort_log_replay(wc, ret, "failed to delete xattr with name %.*s for inode %llu root %llu", name_len, name, ino, btrfs_root_id(root)); kfree(name); goto out; } btrfs_release_path(wc->subvol_path); kfree(name); search_key = key; goto again; } if (IS_ERR(log_di)) { ret = PTR_ERR(log_di); btrfs_abort_log_replay(wc, ret, "failed to lookup xattr in log tree with name %.*s for inode %llu root %llu", name_len, name, ino, btrfs_root_id(root)); kfree(name); goto out; } kfree(name); cur += this_len; di = (struct btrfs_dir_item *)((char *)di + this_len); } } ret = btrfs_next_leaf(root, wc->subvol_path); if (ret > 0) ret = 0; else if (ret == 0) goto process_leaf; else btrfs_abort_log_replay(wc, ret, "failed to get next leaf in subvolume root %llu", btrfs_root_id(root)); out: btrfs_release_path(wc->subvol_path); return ret; } /* * deletion replay happens before we copy any new directory items * out of the log or out of backreferences from inodes. It * scans the log to find ranges of keys that log is authoritative for, * and then scans the directory to find items in those ranges that are * not present in the log. * * Anything we don't find in the log is unlinked and removed from the * directory. */ static noinline int replay_dir_deletes(struct walk_control *wc, u64 dirid, bool del_all) { struct btrfs_root *root = wc->root; struct btrfs_root *log = (del_all ? NULL : wc->log); u64 range_start; u64 range_end; int ret = 0; struct btrfs_key dir_key; struct btrfs_key found_key; BTRFS_PATH_AUTO_FREE(log_path); struct btrfs_inode *dir; dir_key.objectid = dirid; dir_key.type = BTRFS_DIR_INDEX_KEY; log_path = btrfs_alloc_path(); if (!log_path) { btrfs_abort_log_replay(wc, -ENOMEM, "failed to allocate path"); return -ENOMEM; } dir = btrfs_iget_logging(dirid, root); /* * It isn't an error if the inode isn't there, that can happen because * we replay the deletes before we copy in the inode item from the log. 
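 *
 * (In replay_one_buffer() the dir deletes run before overwrite_item()
 * copies the directory's inode item in, so a directory that is brand new
 * in the log may legitimately not exist in the subvolume yet at this
 * point.)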
*/ if (IS_ERR(dir)) { ret = PTR_ERR(dir); if (ret == -ENOENT) ret = 0; else btrfs_abort_log_replay(wc, ret, "failed to lookup dir inode %llu root %llu", dirid, btrfs_root_id(root)); return ret; } range_start = 0; range_end = 0; while (1) { if (del_all) range_end = (u64)-1; else { ret = find_dir_range(log, wc->subvol_path, dirid, &range_start, &range_end); if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to find range for dir %llu in log tree root %llu", dirid, btrfs_root_id(root)); goto out; } else if (ret > 0) { break; } } dir_key.offset = range_start; while (1) { int nritems; ret = btrfs_search_slot(NULL, root, &dir_key, wc->subvol_path, 0, 0); if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to search root %llu for key " BTRFS_KEY_FMT, btrfs_root_id(root), BTRFS_KEY_FMT_VALUE(&dir_key)); goto out; } nritems = btrfs_header_nritems(wc->subvol_path->nodes[0]); if (wc->subvol_path->slots[0] >= nritems) { ret = btrfs_next_leaf(root, wc->subvol_path); if (ret == 1) { break; } else if (ret < 0) { btrfs_abort_log_replay(wc, ret, "failed to get next leaf in subvolume root %llu", btrfs_root_id(root)); goto out; } } btrfs_item_key_to_cpu(wc->subvol_path->nodes[0], &found_key, wc->subvol_path->slots[0]); if (found_key.objectid != dirid || found_key.type != dir_key.type) { ret = 0; goto out; } if (found_key.offset > range_end) break; ret = check_item_in_log(wc, log_path, dir, &found_key, del_all); if (ret) goto out; if (found_key.offset == (u64)-1) break; dir_key.offset = found_key.offset + 1; } btrfs_release_path(wc->subvol_path); if (range_end == (u64)-1) break; range_start = range_end + 1; } ret = 0; out: btrfs_release_path(wc->subvol_path); iput(&dir->vfs_inode); return ret; } /* * the process_func used to replay items from the log tree. This * gets called in two different stages. The first stage just looks * for inodes and makes sure they are all copied into the subvolume. * * The second stage copies all the other item types from the log into * the subvolume. The two stage approach is slower, but gets rid of * lots of complexity around inodes referencing other inodes that exist * only in the log (references come from either directory items or inode * back refs). */ static int replay_one_buffer(struct extent_buffer *eb, struct walk_control *wc, u64 gen, int level) { int nritems; struct btrfs_tree_parent_check check = { .transid = gen, .level = level }; struct btrfs_root *root = wc->root; struct btrfs_trans_handle *trans = wc->trans; int ret; if (level != 0) return 0; /* * Set to NULL since it was not yet read and in case we abort log replay * on error, we have no valid log tree leaf to dump. 
*/ wc->log_leaf = NULL; ret = btrfs_read_extent_buffer(eb, &check); if (ret) { btrfs_abort_log_replay(wc, ret, "failed to read log tree leaf %llu for root %llu", eb->start, btrfs_root_id(root)); return ret; } ASSERT(wc->subvol_path == NULL); wc->subvol_path = btrfs_alloc_path(); if (!wc->subvol_path) { btrfs_abort_log_replay(wc, -ENOMEM, "failed to allocate path"); return -ENOMEM; } wc->log_leaf = eb; nritems = btrfs_header_nritems(eb); for (wc->log_slot = 0; wc->log_slot < nritems; wc->log_slot++) { struct btrfs_inode_item *inode_item = NULL; btrfs_item_key_to_cpu(eb, &wc->log_key, wc->log_slot); if (wc->log_key.type == BTRFS_INODE_ITEM_KEY) { inode_item = btrfs_item_ptr(eb, wc->log_slot, struct btrfs_inode_item); /* * An inode with no links is either: * * 1) A tmpfile (O_TMPFILE) that got fsync'ed and never * got linked before the fsync, skip it, as replaying * it is pointless since it would be deleted later. * We skip logging tmpfiles, but it's always possible * we are replaying a log created with a kernel that * used to log tmpfiles; * * 2) A non-tmpfile which got its last link deleted * while holding an open fd on it and later got * fsynced through that fd. We always log the * parent inodes when inode->last_unlink_trans is * set to the current transaction, so ignore all the * inode items for this inode. We will delete the * inode when processing the parent directory with * replay_dir_deletes(). */ if (btrfs_inode_nlink(eb, inode_item) == 0) { wc->ignore_cur_inode = true; continue; } else { wc->ignore_cur_inode = false; } } /* Inode keys are done during the first stage. */ if (wc->log_key.type == BTRFS_INODE_ITEM_KEY && wc->stage == LOG_WALK_REPLAY_INODES) { u32 mode; ret = replay_xattr_deletes(wc); if (ret) break; mode = btrfs_inode_mode(eb, inode_item); if (S_ISDIR(mode)) { ret = replay_dir_deletes(wc, wc->log_key.objectid, false); if (ret) break; } ret = overwrite_item(wc); if (ret) break; /* * Before replaying extents, truncate the inode to its * size. We need to do it now and not after log replay * because before an fsync we can have prealloc extents * added beyond the inode's i_size. If we did it after, * through orphan cleanup for example, we would drop * those prealloc extents just after replaying them. */ if (S_ISREG(mode)) { struct btrfs_drop_extents_args drop_args = { 0 }; struct btrfs_inode *inode; u64 from; inode = btrfs_iget_logging(wc->log_key.objectid, root); if (IS_ERR(inode)) { ret = PTR_ERR(inode); btrfs_abort_log_replay(wc, ret, "failed to lookup inode %llu root %llu", wc->log_key.objectid, btrfs_root_id(root)); break; } from = ALIGN(i_size_read(&inode->vfs_inode), root->fs_info->sectorsize); drop_args.start = from; drop_args.end = (u64)-1; drop_args.drop_cache = true; drop_args.path = wc->subvol_path; ret = btrfs_drop_extents(trans, root, inode, &drop_args); if (ret) { btrfs_abort_log_replay(wc, ret, "failed to drop extents for inode %llu root %llu offset %llu", btrfs_ino(inode), btrfs_root_id(root), from); } else { inode_sub_bytes(&inode->vfs_inode, drop_args.bytes_found); /* Update the inode's nbytes. 
*/ ret = btrfs_update_inode(trans, inode); if (ret) btrfs_abort_log_replay(wc, ret, "failed to update inode %llu root %llu", btrfs_ino(inode), btrfs_root_id(root)); } iput(&inode->vfs_inode); if (ret) break; } ret = link_to_fixup_dir(wc, wc->log_key.objectid); if (ret) break; } if (wc->ignore_cur_inode) continue; if (wc->log_key.type == BTRFS_DIR_INDEX_KEY && wc->stage == LOG_WALK_REPLAY_DIR_INDEX) { ret = replay_one_dir_item(wc); if (ret) break; } if (wc->stage < LOG_WALK_REPLAY_ALL) continue; /* these keys are simply copied */ if (wc->log_key.type == BTRFS_XATTR_ITEM_KEY) { ret = overwrite_item(wc); if (ret) break; } else if (wc->log_key.type == BTRFS_INODE_REF_KEY || wc->log_key.type == BTRFS_INODE_EXTREF_KEY) { ret = add_inode_ref(wc); if (ret) break; } else if (wc->log_key.type == BTRFS_EXTENT_DATA_KEY) { ret = replay_one_extent(wc); if (ret) break; } /* * We don't log BTRFS_DIR_ITEM_KEY keys anymore, only the * BTRFS_DIR_INDEX_KEY items which we use to derive the * BTRFS_DIR_ITEM_KEY items. If we are replaying a log from an * older kernel with such keys, ignore them. */ } btrfs_free_path(wc->subvol_path); wc->subvol_path = NULL; return ret; } static int clean_log_buffer(struct btrfs_trans_handle *trans, struct extent_buffer *eb) { struct btrfs_fs_info *fs_info = eb->fs_info; struct btrfs_block_group *bg; btrfs_tree_lock(eb); btrfs_clear_buffer_dirty(trans, eb); wait_on_extent_buffer_writeback(eb); btrfs_tree_unlock(eb); if (trans) { int ret; ret = btrfs_pin_reserved_extent(trans, eb); if (ret) btrfs_abort_transaction(trans, ret); return ret; } bg = btrfs_lookup_block_group(fs_info, eb->start); if (!bg) { btrfs_err(fs_info, "unable to find block group for %llu", eb->start); btrfs_handle_fs_error(fs_info, -ENOENT, NULL); return -ENOENT; } spin_lock(&bg->space_info->lock); spin_lock(&bg->lock); bg->reserved -= fs_info->nodesize; bg->space_info->bytes_reserved -= fs_info->nodesize; spin_unlock(&bg->lock); spin_unlock(&bg->space_info->lock); btrfs_put_block_group(bg); return 0; } static noinline int walk_down_log_tree(struct btrfs_path *path, int *level, struct walk_control *wc) { struct btrfs_trans_handle *trans = wc->trans; struct btrfs_fs_info *fs_info = wc->log->fs_info; u64 bytenr; u64 ptr_gen; struct extent_buffer *next; struct extent_buffer *cur; int ret = 0; while (*level > 0) { struct btrfs_tree_parent_check check = { 0 }; cur = path->nodes[*level]; WARN_ON(btrfs_header_level(cur) != *level); if (path->slots[*level] >= btrfs_header_nritems(cur)) break; bytenr = btrfs_node_blockptr(cur, path->slots[*level]); ptr_gen = btrfs_node_ptr_generation(cur, path->slots[*level]); check.transid = ptr_gen; check.level = *level - 1; check.has_first_key = true; btrfs_node_key_to_cpu(cur, &check.first_key, path->slots[*level]); next = btrfs_find_create_tree_block(fs_info, bytenr, btrfs_header_owner(cur), *level - 1); if (IS_ERR(next)) { ret = PTR_ERR(next); if (trans) btrfs_abort_transaction(trans, ret); else btrfs_handle_fs_error(fs_info, ret, NULL); return ret; } if (*level == 1) { ret = wc->process_func(next, wc, ptr_gen, *level - 1); if (ret) { free_extent_buffer(next); return ret; } path->slots[*level]++; if (wc->free) { ret = btrfs_read_extent_buffer(next, &check); if (ret) { free_extent_buffer(next); if (trans) btrfs_abort_transaction(trans, ret); else btrfs_handle_fs_error(fs_info, ret, NULL); return ret; } ret = clean_log_buffer(trans, next); if (ret) { free_extent_buffer(next); return ret; } } free_extent_buffer(next); continue; } ret = btrfs_read_extent_buffer(next, &check); if (ret) 
{
			free_extent_buffer(next);
			if (trans)
				btrfs_abort_transaction(trans, ret);
			else
				btrfs_handle_fs_error(fs_info, ret, NULL);
			return ret;
		}

		if (path->nodes[*level-1])
			free_extent_buffer(path->nodes[*level-1]);
		path->nodes[*level-1] = next;
		*level = btrfs_header_level(next);
		path->slots[*level] = 0;
		cond_resched();
	}
	path->slots[*level] = btrfs_header_nritems(path->nodes[*level]);

	cond_resched();
	return 0;
}

static noinline int walk_up_log_tree(struct btrfs_path *path, int *level,
				     struct walk_control *wc)
{
	int i;
	int slot;
	int ret;

	for (i = *level; i < BTRFS_MAX_LEVEL - 1 && path->nodes[i]; i++) {
		slot = path->slots[i];
		if (slot + 1 < btrfs_header_nritems(path->nodes[i])) {
			path->slots[i]++;
			*level = i;
			WARN_ON(*level == 0);
			return 0;
		} else {
			ret = wc->process_func(path->nodes[*level], wc,
				 btrfs_header_generation(path->nodes[*level]),
				 *level);
			if (ret)
				return ret;

			if (wc->free) {
				ret = clean_log_buffer(wc->trans, path->nodes[*level]);
				if (ret)
					return ret;
			}
			free_extent_buffer(path->nodes[*level]);
			path->nodes[*level] = NULL;
			*level = i + 1;
		}
	}
	return 1;
}

/*
 * Walk the log tree given by wc->log, calling wc->process_func() on its
 * buffers and, when wc->free is set, cleaning up the log tree blocks after
 * they have been processed.
 */
static int walk_log_tree(struct walk_control *wc)
{
	struct btrfs_root *log = wc->log;
	int ret = 0;
	int wret;
	int level;
	BTRFS_PATH_AUTO_FREE(path);
	int orig_level;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	level = btrfs_header_level(log->node);
	orig_level = level;
	path->nodes[level] = log->node;
	refcount_inc(&log->node->refs);
	path->slots[level] = 0;

	while (1) {
		wret = walk_down_log_tree(path, &level, wc);
		if (wret > 0)
			break;
		if (wret < 0)
			return wret;

		wret = walk_up_log_tree(path, &level, wc);
		if (wret > 0)
			break;
		if (wret < 0)
			return wret;
	}

	/* was the root node processed?
if not, catch it here */ if (path->nodes[orig_level]) { ret = wc->process_func(path->nodes[orig_level], wc, btrfs_header_generation(path->nodes[orig_level]), orig_level); if (ret) return ret; if (wc->free) ret = clean_log_buffer(wc->trans, path->nodes[orig_level]); } return ret; } /* * helper function to update the item for a given subvolumes log root * in the tree of log roots */ static int update_log_root(struct btrfs_trans_handle *trans, struct btrfs_root *log, struct btrfs_root_item *root_item) { struct btrfs_fs_info *fs_info = log->fs_info; int ret; if (log->log_transid == 1) { /* insert root item on the first sync */ ret = btrfs_insert_root(trans, fs_info->log_root_tree, &log->root_key, root_item); } else { ret = btrfs_update_root(trans, fs_info->log_root_tree, &log->root_key, root_item); } return ret; } static void wait_log_commit(struct btrfs_root *root, int transid) { DEFINE_WAIT(wait); int index = transid % 2; /* * we only allow two pending log transactions at a time, * so we know that if ours is more than 2 older than the * current transaction, we're done */ for (;;) { prepare_to_wait(&root->log_commit_wait[index], &wait, TASK_UNINTERRUPTIBLE); if (!(root->log_transid_committed < transid && atomic_read(&root->log_commit[index]))) break; mutex_unlock(&root->log_mutex); schedule(); mutex_lock(&root->log_mutex); } finish_wait(&root->log_commit_wait[index], &wait); } static void wait_for_writer(struct btrfs_root *root) { DEFINE_WAIT(wait); for (;;) { prepare_to_wait(&root->log_writer_wait, &wait, TASK_UNINTERRUPTIBLE); if (!atomic_read(&root->log_writers)) break; mutex_unlock(&root->log_mutex); schedule(); mutex_lock(&root->log_mutex); } finish_wait(&root->log_writer_wait, &wait); } void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx, struct btrfs_inode *inode) { ctx->log_ret = 0; ctx->log_transid = 0; ctx->log_new_dentries = false; ctx->logging_new_name = false; ctx->logging_new_delayed_dentries = false; ctx->logged_before = false; ctx->inode = inode; INIT_LIST_HEAD(&ctx->list); INIT_LIST_HEAD(&ctx->ordered_extents); INIT_LIST_HEAD(&ctx->conflict_inodes); ctx->num_conflict_inodes = 0; ctx->logging_conflict_inodes = false; ctx->scratch_eb = NULL; } void btrfs_init_log_ctx_scratch_eb(struct btrfs_log_ctx *ctx) { struct btrfs_inode *inode = ctx->inode; if (!test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) && !test_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags)) return; /* * Don't care about allocation failure. This is just for optimization, * if we fail to allocate here, we will try again later if needed. */ ctx->scratch_eb = alloc_dummy_extent_buffer(inode->root->fs_info, 0); } void btrfs_release_log_ctx_extents(struct btrfs_log_ctx *ctx) { struct btrfs_ordered_extent *ordered; struct btrfs_ordered_extent *tmp; btrfs_assert_inode_locked(ctx->inode); list_for_each_entry_safe(ordered, tmp, &ctx->ordered_extents, log_list) { list_del_init(&ordered->log_list); btrfs_put_ordered_extent(ordered); } } static inline void btrfs_remove_log_ctx(struct btrfs_root *root, struct btrfs_log_ctx *ctx) { mutex_lock(&root->log_mutex); list_del_init(&ctx->list); mutex_unlock(&root->log_mutex); } /* * Invoked in log mutex context, or be sure there is no other task which * can access the list. 
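 *
 * Each ctx on the list belongs to a task waiting in btrfs_sync_log() for
 * this log transaction to commit; storing @error in ctx->log_ret is how
 * those waiters learn the outcome of the commit they attached to.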
*/ static inline void btrfs_remove_all_log_ctxs(struct btrfs_root *root, int index, int error) { struct btrfs_log_ctx *ctx; struct btrfs_log_ctx *safe; list_for_each_entry_safe(ctx, safe, &root->log_ctxs[index], list) { list_del_init(&ctx->list); ctx->log_ret = error; } } /* * Sends a given tree log down to the disk and updates the super blocks to * record it. When this call is done, you know that any inodes previously * logged are safely on disk only if it returns 0. * * Any other return value means you need to call btrfs_commit_transaction. * Some of the edge cases for fsyncing directories that have had unlinks * or renames done in the past mean that sometimes the only safe * fsync is to commit the whole FS. When btrfs_sync_log returns -EAGAIN, * that has happened. */ int btrfs_sync_log(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_log_ctx *ctx) { int index1; int index2; int mark; int ret; struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_root *log = root->log_root; struct btrfs_root *log_root_tree = fs_info->log_root_tree; struct btrfs_root_item new_root_item; int log_transid = 0; struct btrfs_log_ctx root_log_ctx; struct blk_plug plug; u64 log_root_start; u64 log_root_level; mutex_lock(&root->log_mutex); log_transid = ctx->log_transid; if (root->log_transid_committed >= log_transid) { mutex_unlock(&root->log_mutex); return ctx->log_ret; } index1 = log_transid % 2; if (atomic_read(&root->log_commit[index1])) { wait_log_commit(root, log_transid); mutex_unlock(&root->log_mutex); return ctx->log_ret; } ASSERT(log_transid == root->log_transid, "log_transid=%d root->log_transid=%d", log_transid, root->log_transid); atomic_set(&root->log_commit[index1], 1); /* wait for previous tree log sync to complete */ if (atomic_read(&root->log_commit[(index1 + 1) % 2])) wait_log_commit(root, log_transid - 1); while (1) { int batch = atomic_read(&root->log_batch); /* when we're on an ssd, just kick the log commit out */ if (!btrfs_test_opt(fs_info, SSD) && test_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state)) { mutex_unlock(&root->log_mutex); schedule_timeout_uninterruptible(1); mutex_lock(&root->log_mutex); } wait_for_writer(root); if (batch == atomic_read(&root->log_batch)) break; } /* bail out if we need to do a full commit */ if (btrfs_need_log_full_commit(trans)) { ret = BTRFS_LOG_FORCE_COMMIT; mutex_unlock(&root->log_mutex); goto out; } if (log_transid % 2 == 0) mark = EXTENT_DIRTY_LOG1; else mark = EXTENT_DIRTY_LOG2; /* we start IO on all the marked extents here, but we don't actually * wait for them until later. */ blk_start_plug(&plug); ret = btrfs_write_marked_extents(fs_info, &log->dirty_log_pages, mark); /* * -EAGAIN happens when someone, e.g., a concurrent transaction * commit, writes a dirty extent in this tree-log commit. This * concurrent write will create a hole writing out the extents, * and we cannot proceed on a zoned filesystem, requiring * sequential writing. While we can bail out to a full commit * here, but we can continue hoping the concurrent writing fills * the hole. */ if (ret == -EAGAIN && btrfs_is_zoned(fs_info)) ret = 0; if (ret) { blk_finish_plug(&plug); btrfs_set_log_full_commit(trans); mutex_unlock(&root->log_mutex); goto out; } /* * We _must_ update under the root->log_mutex in order to make sure we * have a consistent view of the log root we are trying to commit at * this moment. * * We _must_ copy this into a local copy, because we are not holding the * log_root_tree->log_mutex yet. 
This is important because when we * commit the log_root_tree we must have a consistent view of the * log_root_tree when we update the super block to point at the * log_root_tree bytenr. If we update the log_root_tree here we'll race * with the commit and possibly point at the new block which we may not * have written out. */ btrfs_set_root_node(&log->root_item, log->node); memcpy(&new_root_item, &log->root_item, sizeof(new_root_item)); btrfs_set_root_log_transid(root, root->log_transid + 1); log->log_transid = root->log_transid; root->log_start_pid = 0; /* * IO has been started, blocks of the log tree have WRITTEN flag set * in their headers. new modifications of the log will be written to * new positions. so it's safe to allow log writers to go in. */ mutex_unlock(&root->log_mutex); if (btrfs_is_zoned(fs_info)) { mutex_lock(&fs_info->tree_root->log_mutex); if (!log_root_tree->node) { ret = btrfs_alloc_log_tree_node(trans, log_root_tree); if (ret) { mutex_unlock(&fs_info->tree_root->log_mutex); blk_finish_plug(&plug); goto out; } } mutex_unlock(&fs_info->tree_root->log_mutex); } btrfs_init_log_ctx(&root_log_ctx, NULL); mutex_lock(&log_root_tree->log_mutex); index2 = log_root_tree->log_transid % 2; list_add_tail(&root_log_ctx.list, &log_root_tree->log_ctxs[index2]); root_log_ctx.log_transid = log_root_tree->log_transid; /* * Now we are safe to update the log_root_tree because we're under the * log_mutex, and we're a current writer so we're holding the commit * open until we drop the log_mutex. */ ret = update_log_root(trans, log, &new_root_item); if (ret) { list_del_init(&root_log_ctx.list); blk_finish_plug(&plug); btrfs_set_log_full_commit(trans); if (ret != -ENOSPC) btrfs_err(fs_info, "failed to update log for root %llu ret %d", btrfs_root_id(root), ret); btrfs_wait_tree_log_extents(log, mark); mutex_unlock(&log_root_tree->log_mutex); goto out; } if (log_root_tree->log_transid_committed >= root_log_ctx.log_transid) { blk_finish_plug(&plug); list_del_init(&root_log_ctx.list); mutex_unlock(&log_root_tree->log_mutex); ret = root_log_ctx.log_ret; goto out; } if (atomic_read(&log_root_tree->log_commit[index2])) { blk_finish_plug(&plug); ret = btrfs_wait_tree_log_extents(log, mark); wait_log_commit(log_root_tree, root_log_ctx.log_transid); mutex_unlock(&log_root_tree->log_mutex); if (!ret) ret = root_log_ctx.log_ret; goto out; } ASSERT(root_log_ctx.log_transid == log_root_tree->log_transid, "root_log_ctx.log_transid=%d log_root_tree->log_transid=%d", root_log_ctx.log_transid, log_root_tree->log_transid); atomic_set(&log_root_tree->log_commit[index2], 1); if (atomic_read(&log_root_tree->log_commit[(index2 + 1) % 2])) { wait_log_commit(log_root_tree, root_log_ctx.log_transid - 1); } /* * now that we've moved on to the tree of log tree roots, * check the full commit flag again */ if (btrfs_need_log_full_commit(trans)) { blk_finish_plug(&plug); btrfs_wait_tree_log_extents(log, mark); mutex_unlock(&log_root_tree->log_mutex); ret = BTRFS_LOG_FORCE_COMMIT; goto out_wake_log_root; } ret = btrfs_write_marked_extents(fs_info, &log_root_tree->dirty_log_pages, EXTENT_DIRTY_LOG1 | EXTENT_DIRTY_LOG2); blk_finish_plug(&plug); /* * As described above, -EAGAIN indicates a hole in the extents. We * cannot wait for these write outs since the waiting cause a * deadlock. Bail out to the full commit instead. 
*/ if (ret == -EAGAIN && btrfs_is_zoned(fs_info)) { btrfs_set_log_full_commit(trans); btrfs_wait_tree_log_extents(log, mark); mutex_unlock(&log_root_tree->log_mutex); goto out_wake_log_root; } else if (ret) { btrfs_set_log_full_commit(trans); mutex_unlock(&log_root_tree->log_mutex); goto out_wake_log_root; } ret = btrfs_wait_tree_log_extents(log, mark); if (!ret) ret = btrfs_wait_tree_log_extents(log_root_tree, EXTENT_DIRTY_LOG1 | EXTENT_DIRTY_LOG2); if (ret) { btrfs_set_log_full_commit(trans); mutex_unlock(&log_root_tree->log_mutex); goto out_wake_log_root; } log_root_start = log_root_tree->node->start; log_root_level = btrfs_header_level(log_root_tree->node); log_root_tree->log_transid++; mutex_unlock(&log_root_tree->log_mutex); /* * Here we are guaranteed that nobody is going to write the superblock * for the current transaction before us and that neither we do write * our superblock before the previous transaction finishes its commit * and writes its superblock, because: * * 1) We are holding a handle on the current transaction, so no body * can commit it until we release the handle; * * 2) Before writing our superblock we acquire the tree_log_mutex, so * if the previous transaction is still committing, and hasn't yet * written its superblock, we wait for it to do it, because a * transaction commit acquires the tree_log_mutex when the commit * begins and releases it only after writing its superblock. */ mutex_lock(&fs_info->tree_log_mutex); /* * The previous transaction writeout phase could have failed, and thus * marked the fs in an error state. We must not commit here, as we * could have updated our generation in the super_for_commit and * writing the super here would result in transid mismatches. If there * is an error here just bail. */ if (BTRFS_FS_ERROR(fs_info)) { ret = -EIO; btrfs_set_log_full_commit(trans); btrfs_abort_transaction(trans, ret); mutex_unlock(&fs_info->tree_log_mutex); goto out_wake_log_root; } btrfs_set_super_log_root(fs_info->super_for_commit, log_root_start); btrfs_set_super_log_root_level(fs_info->super_for_commit, log_root_level); ret = write_all_supers(fs_info, 1); mutex_unlock(&fs_info->tree_log_mutex); if (unlikely(ret)) { btrfs_set_log_full_commit(trans); btrfs_abort_transaction(trans, ret); goto out_wake_log_root; } /* * We know there can only be one task here, since we have not yet set * root->log_commit[index1] to 0 and any task attempting to sync the * log must wait for the previous log transaction to commit if it's * still in progress or wait for the current log transaction commit if * someone else already started it. We use <= and not < because the * first log transaction has an ID of 0. */ ASSERT(btrfs_get_root_last_log_commit(root) <= log_transid, "last_log_commit(root)=%d log_transid=%d", btrfs_get_root_last_log_commit(root), log_transid); btrfs_set_root_last_log_commit(root, log_transid); out_wake_log_root: mutex_lock(&log_root_tree->log_mutex); btrfs_remove_all_log_ctxs(log_root_tree, index2, ret); log_root_tree->log_transid_committed++; atomic_set(&log_root_tree->log_commit[index2], 0); mutex_unlock(&log_root_tree->log_mutex); /* * The barrier before waitqueue_active (in cond_wake_up) is needed so * all the updates above are seen by the woken threads. It might not be * necessary, but proving that seems to be hard. 
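 *
 * As a minimal illustration of the pattern relied on here (a sketch only; a
 * hypothetical "cond" stands in for the log_commit/log_transid_committed
 * state updated above, and "wq" for the commit wait queue):
 *
 *	// waker
 *	WRITE_ONCE(cond, true);
 *	smp_mb();			// order the store vs. the waitqueue check
 *	if (waitqueue_active(&wq))
 *		wake_up(&wq);
 *
 *	// waiter
 *	wait_event(wq, READ_ONCE(cond));
 *
 * Without the barrier the waker could observe an empty waitqueue before the
 * waiter's enqueue is visible, while the waiter still observes the old state,
 * and the wakeup would be lost.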
*/ cond_wake_up(&log_root_tree->log_commit_wait[index2]); out: mutex_lock(&root->log_mutex); btrfs_remove_all_log_ctxs(root, index1, ret); root->log_transid_committed++; atomic_set(&root->log_commit[index1], 0); mutex_unlock(&root->log_mutex); /* * The barrier before waitqueue_active (in cond_wake_up) is needed so * all the updates above are seen by the woken threads. It might not be * necessary, but proving that seems to be hard. */ cond_wake_up(&root->log_commit_wait[index1]); return ret; } static void free_log_tree(struct btrfs_trans_handle *trans, struct btrfs_root *log) { int ret; struct walk_control wc = { .free = true, .process_func = process_one_buffer, .log = log, .trans = trans, }; if (log->node) { ret = walk_log_tree(&wc); if (ret) { /* * We weren't able to traverse the entire log tree, the * typical scenario is getting an -EIO when reading an * extent buffer of the tree, due to a previous writeback * failure of it. */ set_bit(BTRFS_FS_STATE_LOG_CLEANUP_ERROR, &log->fs_info->fs_state); /* * Some extent buffers of the log tree may still be dirty * and not yet written back to storage, because we may * have updates to a log tree without syncing a log tree, * such as during rename and link operations. So flush * them out and wait for their writeback to complete, so * that we properly cleanup their state and pages. */ btrfs_write_marked_extents(log->fs_info, &log->dirty_log_pages, EXTENT_DIRTY_LOG1 | EXTENT_DIRTY_LOG2); btrfs_wait_tree_log_extents(log, EXTENT_DIRTY_LOG1 | EXTENT_DIRTY_LOG2); if (trans) btrfs_abort_transaction(trans, ret); else btrfs_handle_fs_error(log->fs_info, ret, NULL); } } btrfs_extent_io_tree_release(&log->dirty_log_pages); btrfs_extent_io_tree_release(&log->log_csum_range); btrfs_put_root(log); } /* * free all the extents used by the tree log. This should be called * at commit time of the full transaction */ int btrfs_free_log(struct btrfs_trans_handle *trans, struct btrfs_root *root) { if (root->log_root) { free_log_tree(trans, root->log_root); root->log_root = NULL; clear_bit(BTRFS_ROOT_HAS_LOG_TREE, &root->state); } return 0; } int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info) { if (fs_info->log_root_tree) { free_log_tree(trans, fs_info->log_root_tree); fs_info->log_root_tree = NULL; clear_bit(BTRFS_ROOT_HAS_LOG_TREE, &fs_info->tree_root->state); } return 0; } static bool mark_inode_as_not_logged(const struct btrfs_trans_handle *trans, struct btrfs_inode *inode) { bool ret = false; /* * Do this only if ->logged_trans is still 0 to prevent races with * concurrent logging as we may see the inode not logged when * inode_logged() is called but it gets logged after inode_logged() did * not find it in the log tree and we end up setting ->logged_trans to a * value less than trans->transid after the concurrent logging task has * set it to trans->transid. As a consequence, subsequent rename, unlink * and link operations may end up not logging new names and removing old * names from the log. */ spin_lock(&inode->lock); if (inode->logged_trans == 0) inode->logged_trans = trans->transid - 1; else if (inode->logged_trans == trans->transid) ret = true; spin_unlock(&inode->lock); return ret; } /* * Check if an inode was logged in the current transaction. This correctly deals * with the case where the inode was logged but has a logged_trans of 0, which * happens if the inode is evicted and loaded again, as logged_trans is an in * memory only field (not persisted). 
* * Returns 1 if the inode was logged before in the transaction, 0 if it was not, * and < 0 on error. */ static int inode_logged(const struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *path_in) { struct btrfs_path *path = path_in; struct btrfs_key key; int ret; /* * Quick lockless call, since once ->logged_trans is set to the current * transaction, we never set it to a lower value anywhere else. */ if (data_race(inode->logged_trans) == trans->transid) return 1; /* * If logged_trans is not 0 and not trans->transid, then we know the * inode was not logged in this transaction, so we can return false * right away. We take the lock to avoid a race caused by load/store * tearing with a concurrent btrfs_log_inode() call or a concurrent task * in this function further below - an update to trans->transid can be * teared into two 32 bits updates for example, in which case we could * see a positive value that is not trans->transid and assume the inode * was not logged when it was. */ spin_lock(&inode->lock); if (inode->logged_trans == trans->transid) { spin_unlock(&inode->lock); return 1; } else if (inode->logged_trans > 0) { spin_unlock(&inode->lock); return 0; } spin_unlock(&inode->lock); /* * If no log tree was created for this root in this transaction, then * the inode can not have been logged in this transaction. In that case * set logged_trans to anything greater than 0 and less than the current * transaction's ID, to avoid the search below in a future call in case * a log tree gets created after this. */ if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &inode->root->state)) return mark_inode_as_not_logged(trans, inode); /* * We have a log tree and the inode's logged_trans is 0. We can't tell * for sure if the inode was logged before in this transaction by looking * only at logged_trans. We could be pessimistic and assume it was, but * that can lead to unnecessarily logging an inode during rename and link * operations, and then further updating the log in followup rename and * link operations, specially if it's a directory, which adds latency * visible to applications doing a series of rename or link operations. * * A logged_trans of 0 here can mean several things: * * 1) The inode was never logged since the filesystem was mounted, and may * or may have not been evicted and loaded again; * * 2) The inode was logged in a previous transaction, then evicted and * then loaded again; * * 3) The inode was logged in the current transaction, then evicted and * then loaded again. * * For cases 1) and 2) we don't want to return true, but we need to detect * case 3) and return true. So we do a search in the log root for the inode * item. */ key.objectid = btrfs_ino(inode); key.type = BTRFS_INODE_ITEM_KEY; key.offset = 0; if (!path) { path = btrfs_alloc_path(); if (!path) return -ENOMEM; } ret = btrfs_search_slot(NULL, inode->root->log_root, &key, path, 0, 0); if (path_in) btrfs_release_path(path); else btrfs_free_path(path); /* * Logging an inode always results in logging its inode item. So if we * did not find the item we know the inode was not logged for sure. */ if (ret < 0) { return ret; } else if (ret > 0) { /* * Set logged_trans to a value greater than 0 and less then the * current transaction to avoid doing the search in future calls. */ return mark_inode_as_not_logged(trans, inode); } /* * The inode was previously logged and then evicted, set logged_trans to * the current transaction's ID, to avoid future tree searches as long as * the inode is not evicted again. 
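 *
 * Summarizing the decision as a sketch (illustrative only; locking and the
 * log tree search above are omitted and the return names are hypothetical):
 *
 *	if (inode->logged_trans == trans->transid)
 *		return LOGGED;		// logged in the current transaction
 *	if (inode->logged_trans != 0)
 *		return NOT_LOGGED;	// last logged in an older transaction
 *	return UNKNOWN;			// never logged, or evicted since being
 *					// logged: must search the log tree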
*/ spin_lock(&inode->lock); inode->logged_trans = trans->transid; spin_unlock(&inode->lock); return 1; } /* * Delete a directory entry from the log if it exists. * * Returns < 0 on error * 1 if the entry does not exists * 0 if the entry existed and was successfully deleted */ static int del_logged_dentry(struct btrfs_trans_handle *trans, struct btrfs_root *log, struct btrfs_path *path, u64 dir_ino, const struct fscrypt_str *name, u64 index) { struct btrfs_dir_item *di; /* * We only log dir index items of a directory, so we don't need to look * for dir item keys. */ di = btrfs_lookup_dir_index_item(trans, log, path, dir_ino, index, name, -1); if (IS_ERR(di)) return PTR_ERR(di); else if (!di) return 1; /* * We do not need to update the size field of the directory's * inode item because on log replay we update the field to reflect * all existing entries in the directory (see overwrite_item()). */ return btrfs_del_item(trans, log, path); } /* * If both a file and directory are logged, and unlinks or renames are * mixed in, we have a few interesting corners: * * create file X in dir Y * link file X to X.link in dir Y * fsync file X * unlink file X but leave X.link * fsync dir Y * * After a crash we would expect only X.link to exist. But file X * didn't get fsync'd again so the log has back refs for X and X.link. * * We solve this by removing directory entries and inode backrefs from the * log when a file that was logged in the current transaction is * unlinked. Any later fsync will include the updated log entries, and * we'll be able to reconstruct the proper directory items from backrefs. * * This optimizations allows us to avoid relogging the entire inode * or the entire directory. */ void btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans, const struct fscrypt_str *name, struct btrfs_inode *dir, u64 index) { struct btrfs_root *root = dir->root; BTRFS_PATH_AUTO_FREE(path); int ret; ret = inode_logged(trans, dir, NULL); if (ret == 0) return; if (ret < 0) { btrfs_set_log_full_commit(trans); return; } path = btrfs_alloc_path(); if (!path) { btrfs_set_log_full_commit(trans); return; } ret = join_running_log_trans(root); ASSERT(ret == 0, "join_running_log_trans() ret=%d", ret); if (WARN_ON(ret)) return; mutex_lock(&dir->log_mutex); ret = del_logged_dentry(trans, root->log_root, path, btrfs_ino(dir), name, index); mutex_unlock(&dir->log_mutex); if (ret < 0) btrfs_set_log_full_commit(trans); btrfs_end_log_trans(root); } /* see comments for btrfs_del_dir_entries_in_log */ void btrfs_del_inode_ref_in_log(struct btrfs_trans_handle *trans, const struct fscrypt_str *name, struct btrfs_inode *inode, struct btrfs_inode *dir) { struct btrfs_root *root = dir->root; int ret; ret = inode_logged(trans, inode, NULL); if (ret == 0) return; else if (ret < 0) { btrfs_set_log_full_commit(trans); return; } ret = join_running_log_trans(root); ASSERT(ret == 0, "join_running_log_trans() ret=%d", ret); if (WARN_ON(ret)) return; mutex_lock(&inode->log_mutex); ret = btrfs_del_inode_ref(trans, root->log_root, name, btrfs_ino(inode), btrfs_ino(dir), NULL); mutex_unlock(&inode->log_mutex); if (ret < 0 && ret != -ENOENT) btrfs_set_log_full_commit(trans); btrfs_end_log_trans(root); } /* * creates a range item in the log for 'dirid'. first_offset and * last_offset tell us which parts of the key space the log should * be considered authoritative for. 
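 *
 * A sketch of how a replayed range item is meant to be interpreted
 * (illustrative, hypothetical helper - the actual logic lives in the log
 * replay paths):
 *
 *	// A BTRFS_DIR_LOG_INDEX_KEY item with key offset 'start' and a
 *	// dir_log_end of 'end' means: for index values in [start, end] the
 *	// log is authoritative.
 *	static bool replay_should_delete(u64 index, u64 start, u64 end,
 *					 bool found_in_log)
 *	{
 *		if (index < start || index > end)
 *			return false;	// outside any logged range: keep it
 *		return !found_in_log;	// covered but absent from the log:
 *					// it was deleted, remove it on replay
 *	}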
*/ static noinline int insert_dir_log_key(struct btrfs_trans_handle *trans, struct btrfs_root *log, struct btrfs_path *path, u64 dirid, u64 first_offset, u64 last_offset) { int ret; struct btrfs_key key; struct btrfs_dir_log_item *item; key.objectid = dirid; key.type = BTRFS_DIR_LOG_INDEX_KEY; key.offset = first_offset; ret = btrfs_insert_empty_item(trans, log, path, &key, sizeof(*item)); /* * -EEXIST is fine and can happen sporadically when we are logging a * directory and have concurrent insertions in the subvolume's tree for * items from other inodes and that result in pushing off some dir items * from one leaf to another in order to accommodate for the new items. * This results in logging the same dir index range key. */ if (ret && ret != -EEXIST) return ret; item = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_dir_log_item); if (ret == -EEXIST) { const u64 curr_end = btrfs_dir_log_end(path->nodes[0], item); /* * btrfs_del_dir_entries_in_log() might have been called during * an unlink between the initial insertion of this key and the * current update, or we might be logging a single entry deletion * during a rename, so set the new last_offset to the max value. */ last_offset = max(last_offset, curr_end); } btrfs_set_dir_log_end(path->nodes[0], item, last_offset); btrfs_release_path(path); return 0; } static int flush_dir_items_batch(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct extent_buffer *src, struct btrfs_path *dst_path, int start_slot, int count) { struct btrfs_root *log = inode->root->log_root; char AUTO_KFREE(ins_data); struct btrfs_item_batch batch; struct extent_buffer *dst; unsigned long src_offset; unsigned long dst_offset; u64 last_index; struct btrfs_key key; u32 item_size; int ret; int i; ASSERT(count > 0, "count=%d", count); batch.nr = count; if (count == 1) { btrfs_item_key_to_cpu(src, &key, start_slot); item_size = btrfs_item_size(src, start_slot); batch.keys = &key; batch.data_sizes = &item_size; batch.total_data_size = item_size; } else { struct btrfs_key *ins_keys; u32 *ins_sizes; ins_data = kmalloc_array(count, sizeof(u32) + sizeof(struct btrfs_key), GFP_NOFS); if (!ins_data) return -ENOMEM; ins_sizes = (u32 *)ins_data; ins_keys = (struct btrfs_key *)(ins_data + count * sizeof(u32)); batch.keys = ins_keys; batch.data_sizes = ins_sizes; batch.total_data_size = 0; for (i = 0; i < count; i++) { const int slot = start_slot + i; btrfs_item_key_to_cpu(src, &ins_keys[i], slot); ins_sizes[i] = btrfs_item_size(src, slot); batch.total_data_size += ins_sizes[i]; } } ret = btrfs_insert_empty_items(trans, log, dst_path, &batch); if (ret) return ret; dst = dst_path->nodes[0]; /* * Copy all the items in bulk, in a single copy operation. Item data is * organized such that it's placed at the end of a leaf and from right * to left. For example, the data for the second item ends at an offset * that matches the offset where the data for the first item starts, the * data for the third item ends at an offset that matches the offset * where the data of the second items starts, and so on. * Therefore our source and destination start offsets for copy match the * offsets of the last items (highest slots). 
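 *
 * A small worked example (hypothetical sizes, ignoring the leaf header and
 * the item array at the front of the leaf): for a batch of three items with
 * data sizes 100, 50 and 30 in a 4096 byte leaf, the data is packed right to
 * left:
 *
 *	slot 0 data: [3996, 4096)
 *	slot 1 data: [3946, 3996)
 *	slot 2 data: [3916, 3946)
 *
 * The three data blobs form the single contiguous range [3916, 4096), so one
 * copy of total_data_size (180) bytes starting at the data offset of the
 * last slot moves the whole batch at once.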
*/ dst_offset = btrfs_item_ptr_offset(dst, dst_path->slots[0] + count - 1); src_offset = btrfs_item_ptr_offset(src, start_slot + count - 1); copy_extent_buffer(dst, src, dst_offset, src_offset, batch.total_data_size); btrfs_release_path(dst_path); last_index = batch.keys[count - 1].offset; ASSERT(last_index > inode->last_dir_index_offset, "last_index=%llu inode->last_dir_index_offset=%llu", last_index, inode->last_dir_index_offset); /* * If for some unexpected reason the last item's index is not greater * than the last index we logged, warn and force a transaction commit. */ if (WARN_ON(last_index <= inode->last_dir_index_offset)) ret = BTRFS_LOG_FORCE_COMMIT; else inode->last_dir_index_offset = last_index; if (btrfs_get_first_dir_index_to_log(inode) == 0) btrfs_set_first_dir_index_to_log(inode, batch.keys[0].offset); return ret; } static int clone_leaf(struct btrfs_path *path, struct btrfs_log_ctx *ctx) { const int slot = path->slots[0]; if (ctx->scratch_eb) { copy_extent_buffer_full(ctx->scratch_eb, path->nodes[0]); } else { ctx->scratch_eb = btrfs_clone_extent_buffer(path->nodes[0]); if (!ctx->scratch_eb) return -ENOMEM; } btrfs_release_path(path); path->nodes[0] = ctx->scratch_eb; path->slots[0] = slot; /* * Add extra ref to scratch eb so that it is not freed when callers * release the path, so we can reuse it later if needed. */ refcount_inc(&ctx->scratch_eb->refs); return 0; } static int process_dir_items_leaf(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *path, struct btrfs_path *dst_path, struct btrfs_log_ctx *ctx, u64 *last_old_dentry_offset) { struct btrfs_root *log = inode->root->log_root; struct extent_buffer *src; const int nritems = btrfs_header_nritems(path->nodes[0]); const u64 ino = btrfs_ino(inode); bool last_found = false; int batch_start = 0; int batch_size = 0; int ret; /* * We need to clone the leaf, release the read lock on it, and use the * clone before modifying the log tree. See the comment at copy_items() * about why we need to do this. */ ret = clone_leaf(path, ctx); if (ret < 0) return ret; src = path->nodes[0]; for (int i = path->slots[0]; i < nritems; i++) { struct btrfs_dir_item *di; struct btrfs_key key; btrfs_item_key_to_cpu(src, &key, i); if (key.objectid != ino || key.type != BTRFS_DIR_INDEX_KEY) { last_found = true; break; } di = btrfs_item_ptr(src, i, struct btrfs_dir_item); /* * Skip ranges of items that consist only of dir item keys created * in past transactions. However if we find a gap, we must log a * dir index range item for that gap, so that index keys in that * gap are deleted during log replay. */ if (btrfs_dir_transid(src, di) < trans->transid) { if (key.offset > *last_old_dentry_offset + 1) { ret = insert_dir_log_key(trans, log, dst_path, ino, *last_old_dentry_offset + 1, key.offset - 1); if (ret < 0) return ret; } *last_old_dentry_offset = key.offset; continue; } /* If we logged this dir index item before, we can skip it. */ if (key.offset <= inode->last_dir_index_offset) continue; /* * We must make sure that when we log a directory entry, the * corresponding inode, after log replay, has a matching link * count. For example: * * touch foo * mkdir mydir * sync * ln foo mydir/bar * xfs_io -c "fsync" mydir * <crash> * <mount fs and log replay> * * Would result in a fsync log that when replayed, our file inode * would have a link count of 1, but we get two directory entries * pointing to the same inode. 
After removing one of the names, * it would not be possible to remove the other name, which * resulted always in stale file handle errors, and would not be * possible to rmdir the parent directory, since its i_size could * never be decremented to the value BTRFS_EMPTY_DIR_SIZE, * resulting in -ENOTEMPTY errors. */ if (!ctx->log_new_dentries) { struct btrfs_key di_key; btrfs_dir_item_key_to_cpu(src, di, &di_key); if (di_key.type != BTRFS_ROOT_ITEM_KEY) ctx->log_new_dentries = true; } if (batch_size == 0) batch_start = i; batch_size++; } if (batch_size > 0) { ret = flush_dir_items_batch(trans, inode, src, dst_path, batch_start, batch_size); if (ret < 0) return ret; } return last_found ? 1 : 0; } /* * log all the items included in the current transaction for a given * directory. This also creates the range items in the log tree required * to replay anything deleted before the fsync */ static noinline int log_dir_items(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *path, struct btrfs_path *dst_path, struct btrfs_log_ctx *ctx, u64 min_offset, u64 *last_offset_ret) { struct btrfs_key min_key; struct btrfs_root *root = inode->root; struct btrfs_root *log = root->log_root; int ret; u64 last_old_dentry_offset = min_offset - 1; u64 last_offset = (u64)-1; u64 ino = btrfs_ino(inode); min_key.objectid = ino; min_key.type = BTRFS_DIR_INDEX_KEY; min_key.offset = min_offset; ret = btrfs_search_forward(root, &min_key, path, trans->transid); /* * we didn't find anything from this transaction, see if there * is anything at all */ if (ret != 0 || min_key.objectid != ino || min_key.type != BTRFS_DIR_INDEX_KEY) { min_key.objectid = ino; min_key.type = BTRFS_DIR_INDEX_KEY; min_key.offset = (u64)-1; btrfs_release_path(path); ret = btrfs_search_slot(NULL, root, &min_key, path, 0, 0); if (ret < 0) { btrfs_release_path(path); return ret; } ret = btrfs_previous_item(root, path, ino, BTRFS_DIR_INDEX_KEY); /* if ret == 0 there are items for this type, * create a range to tell us the last key of this type. * otherwise, there are no items in this directory after * *min_offset, and we create a range to indicate that. */ if (ret == 0) { struct btrfs_key tmp; btrfs_item_key_to_cpu(path->nodes[0], &tmp, path->slots[0]); if (tmp.type == BTRFS_DIR_INDEX_KEY) last_old_dentry_offset = tmp.offset; } else if (ret > 0) { ret = 0; } goto done; } /* go backward to find any previous key */ ret = btrfs_previous_item(root, path, ino, BTRFS_DIR_INDEX_KEY); if (ret == 0) { struct btrfs_key tmp; btrfs_item_key_to_cpu(path->nodes[0], &tmp, path->slots[0]); /* * The dir index key before the first one we found that needs to * be logged might be in a previous leaf, and there might be a * gap between these keys, meaning that we had deletions that * happened. So the key range item we log (key type * BTRFS_DIR_LOG_INDEX_KEY) must cover a range that starts at the * previous key's offset plus 1, so that those deletes are replayed. */ if (tmp.type == BTRFS_DIR_INDEX_KEY) last_old_dentry_offset = tmp.offset; } else if (ret < 0) { goto done; } btrfs_release_path(path); /* * Find the first key from this transaction again or the one we were at * in the loop below in case we had to reschedule. We may be logging the * directory without holding its VFS lock, which happen when logging new * dentries (through log_new_dir_dentries()) or in some cases when we * need to log the parent directory of an inode. This means a dir index * key might be deleted from the inode's root, and therefore we may not * find it anymore. 
If we can't find it, just move to the next key. We * can not bail out and ignore, because if we do that we will simply * not log dir index keys that come after the one that was just deleted * and we can end up logging a dir index range that ends at (u64)-1 * (@last_offset is initialized to that), resulting in removing dir * entries we should not remove at log replay time. */ search: ret = btrfs_search_slot(NULL, root, &min_key, path, 0, 0); if (ret > 0) { ret = btrfs_next_item(root, path); if (ret > 0) { /* There are no more keys in the inode's root. */ ret = 0; goto done; } } if (ret < 0) goto done; /* * we have a block from this transaction, log every item in it * from our directory */ while (1) { ret = process_dir_items_leaf(trans, inode, path, dst_path, ctx, &last_old_dentry_offset); if (ret != 0) { if (ret > 0) ret = 0; goto done; } path->slots[0] = btrfs_header_nritems(path->nodes[0]); /* * look ahead to the next item and see if it is also * from this directory and from this transaction */ ret = btrfs_next_leaf(root, path); if (ret) { if (ret == 1) { last_offset = (u64)-1; ret = 0; } goto done; } btrfs_item_key_to_cpu(path->nodes[0], &min_key, path->slots[0]); if (min_key.objectid != ino || min_key.type != BTRFS_DIR_INDEX_KEY) { last_offset = (u64)-1; goto done; } if (btrfs_header_generation(path->nodes[0]) != trans->transid) { /* * The next leaf was not changed in the current transaction * and has at least one dir index key. * We check for the next key because there might have been * one or more deletions between the last key we logged and * that next key. So the key range item we log (key type * BTRFS_DIR_LOG_INDEX_KEY) must end at the next key's * offset minus 1, so that those deletes are replayed. */ last_offset = min_key.offset - 1; goto done; } if (need_resched()) { btrfs_release_path(path); cond_resched(); goto search; } } done: btrfs_release_path(path); btrfs_release_path(dst_path); if (ret == 0) { *last_offset_ret = last_offset; /* * In case the leaf was changed in the current transaction but * all its dir items are from a past transaction, the last item * in the leaf is a dir item and there's no gap between that last * dir item and the first one on the next leaf (which did not * change in the current transaction), then we don't need to log * a range, last_old_dentry_offset is == to last_offset. */ ASSERT(last_old_dentry_offset <= last_offset, "last_old_dentry_offset=%llu last_offset=%llu", last_old_dentry_offset, last_offset); if (last_old_dentry_offset < last_offset) ret = insert_dir_log_key(trans, log, path, ino, last_old_dentry_offset + 1, last_offset); } return ret; } /* * If the inode was logged before and it was evicted, then its * last_dir_index_offset is 0, so we don't know the value of the last index * key offset. If that's the case, search for it and update the inode. This * is to avoid lookups in the log tree every time we try to insert a dir index * key from a leaf changed in the current transaction, and to allow us to always * do batch insertions of dir index keys. 
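 *
 * In outline, the lookup below is the usual "search for the largest possible
 * key, then look one slot back" btree pattern (a sketch, edge cases omitted):
 *
 *	key.objectid = ino;
 *	key.type = BTRFS_DIR_INDEX_KEY;
 *	key.offset = (u64)-1;		// sorts after any real index key
 *	btrfs_search_slot(NULL, log_root, &key, path, 0, 0);
 *	// the slot is one past the last key <= (ino, INDEX, -1), so look back
 *	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
 *	if (key.objectid == ino && key.type == BTRFS_DIR_INDEX_KEY)
 *		last_dir_index_offset = key.offset;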
*/ static int update_last_dir_index_offset(struct btrfs_inode *inode, struct btrfs_path *path, const struct btrfs_log_ctx *ctx) { const u64 ino = btrfs_ino(inode); struct btrfs_key key; int ret; lockdep_assert_held(&inode->log_mutex); if (inode->last_dir_index_offset != 0) return 0; if (!ctx->logged_before) { inode->last_dir_index_offset = BTRFS_DIR_START_INDEX - 1; return 0; } key.objectid = ino; key.type = BTRFS_DIR_INDEX_KEY; key.offset = (u64)-1; ret = btrfs_search_slot(NULL, inode->root->log_root, &key, path, 0, 0); /* * An error happened or we actually have an index key with an offset * value of (u64)-1. Bail out, we're done. */ if (ret <= 0) goto out; ret = 0; inode->last_dir_index_offset = BTRFS_DIR_START_INDEX - 1; /* * No dir index items, bail out and leave last_dir_index_offset with * the value right before the first valid index value. */ if (path->slots[0] == 0) goto out; /* * btrfs_search_slot() left us at one slot beyond the slot with the last * index key, or beyond the last key of the directory that is not an * index key. If we have an index key before, set last_dir_index_offset * to its offset value, otherwise leave it with a value right before the * first valid index value, as it means we have an empty directory. */ btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1); if (key.objectid == ino && key.type == BTRFS_DIR_INDEX_KEY) inode->last_dir_index_offset = key.offset; out: btrfs_release_path(path); return ret; } /* * logging directories is very similar to logging inodes, We find all the items * from the current transaction and write them to the log. * * The recovery code scans the directory in the subvolume, and if it finds a * key in the range logged that is not present in the log tree, then it means * that dir entry was unlinked during the transaction. * * In order for that scan to work, we must include one key smaller than * the smallest logged by this transaction and one key larger than the largest * key logged by this transaction. */ static noinline int log_directory_changes(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *path, struct btrfs_path *dst_path, struct btrfs_log_ctx *ctx) { u64 min_key; u64 max_key; int ret; ret = update_last_dir_index_offset(inode, path, ctx); if (ret) return ret; min_key = BTRFS_DIR_START_INDEX; max_key = 0; while (1) { ret = log_dir_items(trans, inode, path, dst_path, ctx, min_key, &max_key); if (ret) return ret; if (max_key == (u64)-1) break; min_key = max_key + 1; } return 0; } /* * a helper function to drop items from the log before we relog an * inode. max_key_type indicates the highest item type to remove. * This cannot be run for file data extents because it does not * free the extents they point to. 
*/ static int drop_inode_items(struct btrfs_trans_handle *trans, struct btrfs_root *log, struct btrfs_path *path, struct btrfs_inode *inode, int max_key_type) { int ret; struct btrfs_key key; struct btrfs_key found_key; int start_slot; key.objectid = btrfs_ino(inode); key.type = max_key_type; key.offset = (u64)-1; while (1) { ret = btrfs_search_slot(trans, log, &key, path, -1, 1); if (ret < 0) { break; } else if (ret > 0) { if (path->slots[0] == 0) break; path->slots[0]--; } btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]); if (found_key.objectid != key.objectid) break; found_key.offset = 0; found_key.type = 0; ret = btrfs_bin_search(path->nodes[0], 0, &found_key, &start_slot); if (ret < 0) break; ret = btrfs_del_items(trans, log, path, start_slot, path->slots[0] - start_slot + 1); /* * If start slot isn't 0 then we don't need to re-search, we've * found the last guy with the objectid in this tree. */ if (ret || start_slot != 0) break; btrfs_release_path(path); } btrfs_release_path(path); if (ret > 0) ret = 0; return ret; } static int truncate_inode_items(struct btrfs_trans_handle *trans, struct btrfs_root *log_root, struct btrfs_inode *inode, u64 new_size, u32 min_type) { struct btrfs_truncate_control control = { .new_size = new_size, .ino = btrfs_ino(inode), .min_type = min_type, .skip_ref_updates = true, }; return btrfs_truncate_inode_items(trans, log_root, &control); } static void fill_inode_item(struct btrfs_trans_handle *trans, struct extent_buffer *leaf, struct btrfs_inode_item *item, struct inode *inode, bool log_inode_only, u64 logged_isize) { u64 flags; if (log_inode_only) { /* set the generation to zero so the recover code * can tell the difference between an logging * just to say 'this inode exists' and a logging * to say 'update this inode with these values' */ btrfs_set_inode_generation(leaf, item, 0); btrfs_set_inode_size(leaf, item, logged_isize); } else { btrfs_set_inode_generation(leaf, item, BTRFS_I(inode)->generation); btrfs_set_inode_size(leaf, item, inode->i_size); } btrfs_set_inode_uid(leaf, item, i_uid_read(inode)); btrfs_set_inode_gid(leaf, item, i_gid_read(inode)); btrfs_set_inode_mode(leaf, item, inode->i_mode); btrfs_set_inode_nlink(leaf, item, inode->i_nlink); btrfs_set_timespec_sec(leaf, &item->atime, inode_get_atime_sec(inode)); btrfs_set_timespec_nsec(leaf, &item->atime, inode_get_atime_nsec(inode)); btrfs_set_timespec_sec(leaf, &item->mtime, inode_get_mtime_sec(inode)); btrfs_set_timespec_nsec(leaf, &item->mtime, inode_get_mtime_nsec(inode)); btrfs_set_timespec_sec(leaf, &item->ctime, inode_get_ctime_sec(inode)); btrfs_set_timespec_nsec(leaf, &item->ctime, inode_get_ctime_nsec(inode)); btrfs_set_timespec_sec(leaf, &item->otime, BTRFS_I(inode)->i_otime_sec); btrfs_set_timespec_nsec(leaf, &item->otime, BTRFS_I(inode)->i_otime_nsec); /* * We do not need to set the nbytes field, in fact during a fast fsync * its value may not even be correct, since a fast fsync does not wait * for ordered extent completion, which is where we update nbytes, it * only waits for writeback to complete. During log replay as we find * file extent items and replay them, we adjust the nbytes field of the * inode item in subvolume tree as needed (see overwrite_item()). 
*/ btrfs_set_inode_sequence(leaf, item, inode_peek_iversion(inode)); btrfs_set_inode_transid(leaf, item, trans->transid); btrfs_set_inode_rdev(leaf, item, inode->i_rdev); flags = btrfs_inode_combine_flags(BTRFS_I(inode)->flags, BTRFS_I(inode)->ro_flags); btrfs_set_inode_flags(leaf, item, flags); btrfs_set_inode_block_group(leaf, item, 0); } static int log_inode_item(struct btrfs_trans_handle *trans, struct btrfs_root *log, struct btrfs_path *path, struct btrfs_inode *inode, bool inode_item_dropped) { struct btrfs_inode_item *inode_item; struct btrfs_key key; int ret; btrfs_get_inode_key(inode, &key); /* * If we are doing a fast fsync and the inode was logged before in the * current transaction, then we know the inode was previously logged and * it exists in the log tree. For performance reasons, in this case use * btrfs_search_slot() directly with ins_len set to 0 so that we never * attempt a write lock on the leaf's parent, which adds unnecessary lock * contention in case there are concurrent fsyncs for other inodes of the * same subvolume. Using btrfs_insert_empty_item() when the inode item * already exists can also result in unnecessarily splitting a leaf. */ if (!inode_item_dropped && inode->logged_trans == trans->transid) { ret = btrfs_search_slot(trans, log, &key, path, 0, 1); ASSERT(ret <= 0); if (ret > 0) ret = -ENOENT; } else { /* * This means it is the first fsync in the current transaction, * so the inode item is not in the log and we need to insert it. * We can never get -EEXIST because we are only called for a fast * fsync and in case an inode eviction happens after the inode was * logged before in the current transaction, when we load again * the inode, we set BTRFS_INODE_NEEDS_FULL_SYNC on its runtime * flags and set ->logged_trans to 0. */ ret = btrfs_insert_empty_item(trans, log, path, &key, sizeof(*inode_item)); ASSERT(ret != -EEXIST); } if (ret) return ret; inode_item = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_inode_item); fill_inode_item(trans, path->nodes[0], inode_item, &inode->vfs_inode, false, 0); btrfs_release_path(path); return 0; } static int log_csums(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_root *log_root, struct btrfs_ordered_sum *sums) { const u64 lock_end = sums->logical + sums->len - 1; struct extent_state *cached_state = NULL; int ret; /* * If this inode was not used for reflink operations in the current * transaction with new extents, then do the fast path, no need to * worry about logging checksum items with overlapping ranges. */ if (inode->last_reflink_trans < trans->transid) return btrfs_csum_file_blocks(trans, log_root, sums); /* * Serialize logging for checksums. This is to avoid racing with the * same checksum being logged by another task that is logging another * file which happens to refer to the same extent as well. Such races * can leave checksum items in the log with overlapping ranges. */ ret = btrfs_lock_extent(&log_root->log_csum_range, sums->logical, lock_end, &cached_state); if (ret) return ret; /* * Due to extent cloning, we might have logged a csum item that covers a * subrange of a cloned extent, and later we can end up logging a csum * item for a larger subrange of the same extent or the entire range. * This would leave csum items in the log tree that cover the same range * and break the searches for checksums in the log tree, resulting in * some checksums missing in the fs/subvolume tree. So just delete (or * trim and adjust) any existing csum items in the log for this range. 
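 *
 * Small worked example (illustrative byte ranges): suppose the log already
 * has a csum item for a cloned subrange [1M, 1M + 64K) and we now log
 * checksums for the larger range [1M, 1M + 256K) of the same extent.
 * Inserting directly would leave two items whose ranges overlap in
 * [1M, 1M + 64K), and a later search could stop at the stale, shorter item
 * and miss checksums. Deleting/trimming the old [1M, 1M + 64K) item first
 * and then inserting [1M, 1M + 256K) keeps the csum items disjoint.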
*/ ret = btrfs_del_csums(trans, log_root, sums->logical, sums->len); if (!ret) ret = btrfs_csum_file_blocks(trans, log_root, sums); btrfs_unlock_extent(&log_root->log_csum_range, sums->logical, lock_end, &cached_state); return ret; } static noinline int copy_items(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *dst_path, struct btrfs_path *src_path, int start_slot, int nr, int inode_only, u64 logged_isize, struct btrfs_log_ctx *ctx) { struct btrfs_root *log = inode->root->log_root; struct btrfs_file_extent_item *extent; struct extent_buffer *src; int ret; struct btrfs_key *ins_keys; u32 *ins_sizes; struct btrfs_item_batch batch; char AUTO_KFREE(ins_data); int dst_index; const bool skip_csum = (inode->flags & BTRFS_INODE_NODATASUM); const u64 i_size = i_size_read(&inode->vfs_inode); /* * To keep lockdep happy and avoid deadlocks, clone the source leaf and * use the clone. This is because otherwise we would be changing the log * tree, to insert items from the subvolume tree or insert csum items, * while holding a read lock on a leaf from the subvolume tree, which * creates a nasty lock dependency when COWing log tree nodes/leaves: * * 1) Modifying the log tree triggers an extent buffer allocation while * holding a write lock on a parent extent buffer from the log tree. * Allocating the pages for an extent buffer, or the extent buffer * struct, can trigger inode eviction and finally the inode eviction * will trigger a release/remove of a delayed node, which requires * taking the delayed node's mutex; * * 2) Allocating a metadata extent for a log tree can trigger the async * reclaim thread and make us wait for it to release enough space and * unblock our reservation ticket. The reclaim thread can start * flushing delayed items, and that in turn results in the need to * lock delayed node mutexes and in the need to write lock extent * buffers of a subvolume tree - all this while holding a write lock * on the parent extent buffer in the log tree. * * So one task in scenario 1) running in parallel with another task in * scenario 2) could lead to a deadlock, one wanting to lock a delayed * node mutex while having a read lock on a leaf from the subvolume, * while the other is holding the delayed node's mutex and wants to * write lock the same subvolume leaf for flushing delayed items. */ ret = clone_leaf(src_path, ctx); if (ret < 0) return ret; src = src_path->nodes[0]; ins_data = kmalloc_array(nr, sizeof(struct btrfs_key) + sizeof(u32), GFP_NOFS); if (!ins_data) return -ENOMEM; ins_sizes = (u32 *)ins_data; ins_keys = (struct btrfs_key *)(ins_data + nr * sizeof(u32)); batch.keys = ins_keys; batch.data_sizes = ins_sizes; batch.total_data_size = 0; batch.nr = 0; dst_index = 0; for (int i = 0; i < nr; i++) { const int src_slot = start_slot + i; struct btrfs_root *csum_root; struct btrfs_ordered_sum *sums; struct btrfs_ordered_sum *sums_next; LIST_HEAD(ordered_sums); u64 disk_bytenr; u64 disk_num_bytes; u64 extent_offset; u64 extent_num_bytes; bool is_old_extent; btrfs_item_key_to_cpu(src, &ins_keys[dst_index], src_slot); if (ins_keys[dst_index].type != BTRFS_EXTENT_DATA_KEY) goto add_to_batch; extent = btrfs_item_ptr(src, src_slot, struct btrfs_file_extent_item); is_old_extent = (btrfs_file_extent_generation(src, extent) < trans->transid); /* * Don't copy extents from past generations. 
That would make us * log a lot more metadata for common cases like doing only a * few random writes into a file and then fsync it for the first * time or after the full sync flag is set on the inode. We can * get leaves full of extent items, most of which are from past * generations, so we can skip them - as long as the inode has * not been the target of a reflink operation in this transaction, * as in that case it might have had file extent items with old * generations copied into it. We also must always log prealloc * extents that start at or beyond eof, otherwise we would lose * them on log replay. */ if (is_old_extent && ins_keys[dst_index].offset < i_size && inode->last_reflink_trans < trans->transid) continue; if (skip_csum) goto add_to_batch; /* Only regular extents have checksums. */ if (btrfs_file_extent_type(src, extent) != BTRFS_FILE_EXTENT_REG) goto add_to_batch; /* * If it's an extent created in a past transaction, then its * checksums are already accessible from the committed csum tree, * no need to log them. */ if (is_old_extent) goto add_to_batch; disk_bytenr = btrfs_file_extent_disk_bytenr(src, extent); /* If it's an explicit hole, there are no checksums. */ if (disk_bytenr == 0) goto add_to_batch; disk_num_bytes = btrfs_file_extent_disk_num_bytes(src, extent); if (btrfs_file_extent_compression(src, extent)) { extent_offset = 0; extent_num_bytes = disk_num_bytes; } else { extent_offset = btrfs_file_extent_offset(src, extent); extent_num_bytes = btrfs_file_extent_num_bytes(src, extent); } csum_root = btrfs_csum_root(trans->fs_info, disk_bytenr); disk_bytenr += extent_offset; ret = btrfs_lookup_csums_list(csum_root, disk_bytenr, disk_bytenr + extent_num_bytes - 1, &ordered_sums, false); if (ret < 0) return ret; ret = 0; list_for_each_entry_safe(sums, sums_next, &ordered_sums, list) { if (!ret) ret = log_csums(trans, inode, log, sums); list_del(&sums->list); kfree(sums); } if (ret) return ret; add_to_batch: ins_sizes[dst_index] = btrfs_item_size(src, src_slot); batch.total_data_size += ins_sizes[dst_index]; batch.nr++; dst_index++; } /* * We have a leaf full of old extent items that don't need to be logged, * so we don't need to do anything. */ if (batch.nr == 0) return 0; ret = btrfs_insert_empty_items(trans, log, dst_path, &batch); if (ret) return ret; dst_index = 0; for (int i = 0; i < nr; i++) { const int src_slot = start_slot + i; const int dst_slot = dst_path->slots[0] + dst_index; struct btrfs_key key; unsigned long src_offset; unsigned long dst_offset; /* * We're done, all the remaining items in the source leaf * correspond to old file extent items. */ if (dst_index >= batch.nr) break; btrfs_item_key_to_cpu(src, &key, src_slot); if (key.type != BTRFS_EXTENT_DATA_KEY) goto copy_item; extent = btrfs_item_ptr(src, src_slot, struct btrfs_file_extent_item); /* See the comment in the previous loop, same logic. 
*/ if (btrfs_file_extent_generation(src, extent) < trans->transid && key.offset < i_size && inode->last_reflink_trans < trans->transid) continue; copy_item: dst_offset = btrfs_item_ptr_offset(dst_path->nodes[0], dst_slot); src_offset = btrfs_item_ptr_offset(src, src_slot); if (key.type == BTRFS_INODE_ITEM_KEY) { struct btrfs_inode_item *inode_item; inode_item = btrfs_item_ptr(dst_path->nodes[0], dst_slot, struct btrfs_inode_item); fill_inode_item(trans, dst_path->nodes[0], inode_item, &inode->vfs_inode, inode_only == LOG_INODE_EXISTS, logged_isize); } else { copy_extent_buffer(dst_path->nodes[0], src, dst_offset, src_offset, ins_sizes[dst_index]); } dst_index++; } btrfs_release_path(dst_path); return ret; } static int extent_cmp(void *priv, const struct list_head *a, const struct list_head *b) { const struct extent_map *em1, *em2; em1 = list_entry(a, struct extent_map, list); em2 = list_entry(b, struct extent_map, list); if (em1->start < em2->start) return -1; else if (em1->start > em2->start) return 1; return 0; } static int log_extent_csums(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_root *log_root, const struct extent_map *em, struct btrfs_log_ctx *ctx) { struct btrfs_ordered_extent *ordered; struct btrfs_root *csum_root; u64 block_start; u64 csum_offset; u64 csum_len; u64 mod_start = em->start; u64 mod_len = em->len; LIST_HEAD(ordered_sums); int ret = 0; if (inode->flags & BTRFS_INODE_NODATASUM || (em->flags & EXTENT_FLAG_PREALLOC) || em->disk_bytenr == EXTENT_MAP_HOLE) return 0; list_for_each_entry(ordered, &ctx->ordered_extents, log_list) { const u64 ordered_end = ordered->file_offset + ordered->num_bytes; const u64 mod_end = mod_start + mod_len; struct btrfs_ordered_sum *sums; if (mod_len == 0) break; if (ordered_end <= mod_start) continue; if (mod_end <= ordered->file_offset) break; /* * We are going to copy all the csums on this ordered extent, so * go ahead and adjust mod_start and mod_len in case this ordered * extent has already been logged. */ if (ordered->file_offset > mod_start) { if (ordered_end >= mod_end) mod_len = ordered->file_offset - mod_start; /* * If we have this case * * |--------- logged extent ---------| * |----- ordered extent ----| * * Just don't mess with mod_start and mod_len, we'll * just end up logging more csums than we need and it * will be ok. */ } else { if (ordered_end < mod_end) { mod_len = mod_end - ordered_end; mod_start = ordered_end; } else { mod_len = 0; } } /* * To keep us from looping for the above case of an ordered * extent that falls inside of the logged extent. */ if (test_and_set_bit(BTRFS_ORDERED_LOGGED_CSUM, &ordered->flags)) continue; list_for_each_entry(sums, &ordered->list, list) { ret = log_csums(trans, inode, log_root, sums); if (ret) return ret; } } /* We're done, found all csums in the ordered extents. */ if (mod_len == 0) return 0; /* If we're compressed we have to save the entire range of csums. */ if (btrfs_extent_map_is_compressed(em)) { csum_offset = 0; csum_len = em->disk_num_bytes; } else { csum_offset = mod_start - em->start; csum_len = mod_len; } /* block start is already adjusted for the file extent offset. 
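 *
 * Worked example of the mod_start/mod_len trimming done in the loop above
 * (illustrative offsets): logging an extent map for [0, 1M) with two ordered
 * extents in the ctx list, [0, 256K) and [768K, 1M), whose csums were copied:
 *
 *	initial:            mod_start = 0,    mod_len = 1M
 *	after [0, 256K):    mod_start = 256K, mod_len = 768K
 *	after [768K, 1M):   mod_start = 256K, mod_len = 512K
 *
 * leaving only [256K, 768K) to be looked up in the csum tree below.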
*/ block_start = btrfs_extent_map_block_start(em); csum_root = btrfs_csum_root(trans->fs_info, block_start); ret = btrfs_lookup_csums_list(csum_root, block_start + csum_offset, block_start + csum_offset + csum_len - 1, &ordered_sums, false); if (ret < 0) return ret; ret = 0; while (!list_empty(&ordered_sums)) { struct btrfs_ordered_sum *sums = list_first_entry(&ordered_sums, struct btrfs_ordered_sum, list); if (!ret) ret = log_csums(trans, inode, log_root, sums); list_del(&sums->list); kfree(sums); } return ret; } static int log_one_extent(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, const struct extent_map *em, struct btrfs_path *path, struct btrfs_log_ctx *ctx) { struct btrfs_drop_extents_args drop_args = { 0 }; struct btrfs_root *log = inode->root->log_root; struct btrfs_file_extent_item fi = { 0 }; struct extent_buffer *leaf; struct btrfs_key key; enum btrfs_compression_type compress_type; u64 extent_offset = em->offset; u64 block_start = btrfs_extent_map_block_start(em); u64 block_len; int ret; btrfs_set_stack_file_extent_generation(&fi, trans->transid); if (em->flags & EXTENT_FLAG_PREALLOC) btrfs_set_stack_file_extent_type(&fi, BTRFS_FILE_EXTENT_PREALLOC); else btrfs_set_stack_file_extent_type(&fi, BTRFS_FILE_EXTENT_REG); block_len = em->disk_num_bytes; compress_type = btrfs_extent_map_compression(em); if (compress_type != BTRFS_COMPRESS_NONE) { btrfs_set_stack_file_extent_disk_bytenr(&fi, block_start); btrfs_set_stack_file_extent_disk_num_bytes(&fi, block_len); } else if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) { btrfs_set_stack_file_extent_disk_bytenr(&fi, block_start - extent_offset); btrfs_set_stack_file_extent_disk_num_bytes(&fi, block_len); } btrfs_set_stack_file_extent_offset(&fi, extent_offset); btrfs_set_stack_file_extent_num_bytes(&fi, em->len); btrfs_set_stack_file_extent_ram_bytes(&fi, em->ram_bytes); btrfs_set_stack_file_extent_compression(&fi, compress_type); ret = log_extent_csums(trans, inode, log, em, ctx); if (ret) return ret; /* * If this is the first time we are logging the inode in the current * transaction, we can avoid btrfs_drop_extents(), which is expensive * because it does a deletion search, which always acquires write locks * for extent buffers at levels 2, 1 and 0. This not only wastes time * but also adds significant contention in a log tree, since log trees * are small, with a root at level 2 or 3 at most, due to their short * life span. */ if (ctx->logged_before) { drop_args.path = path; drop_args.start = em->start; drop_args.end = em->start + em->len; drop_args.replace_extent = true; drop_args.extent_item_size = sizeof(fi); ret = btrfs_drop_extents(trans, log, inode, &drop_args); if (ret) return ret; } if (!drop_args.extent_inserted) { key.objectid = btrfs_ino(inode); key.type = BTRFS_EXTENT_DATA_KEY; key.offset = em->start; ret = btrfs_insert_empty_item(trans, log, path, &key, sizeof(fi)); if (ret) return ret; } leaf = path->nodes[0]; write_extent_buffer(leaf, &fi, btrfs_item_ptr_offset(leaf, path->slots[0]), sizeof(fi)); btrfs_release_path(path); return ret; } /* * Log all prealloc extents beyond the inode's i_size to make sure we do not * lose them after doing a full/fast fsync and replaying the log. We scan the * subvolume's root instead of iterating the inode's extent map tree because * otherwise we can log incorrect extent items based on extent map conversion. * That can happen due to the fact that extent maps are merged when they * are not in the extent map tree's list of modified extents. 
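 *
 * Example of the kind of extent this covers (illustrative command sequence,
 * hypothetical file and sizes):
 *
 *	xfs_io -f -c "pwrite 0 256K" /mnt/foo
 *	xfs_io -c "falloc -k 256K 1M" /mnt/foo      # prealloc beyond eof
 *	xfs_io -c "fsync" /mnt/foo
 *	<power fail>
 *	<mount fs and replay log>
 *
 * The preallocated range [256K, 256K + 1M) starts at/beyond i_size (256K);
 * if it were not logged here it would be lost after replay, and the
 * preallocation would be silently gone.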
*/ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *path, struct btrfs_log_ctx *ctx) { struct btrfs_root *root = inode->root; struct btrfs_key key; const u64 i_size = i_size_read(&inode->vfs_inode); const u64 ino = btrfs_ino(inode); BTRFS_PATH_AUTO_FREE(dst_path); bool dropped_extents = false; u64 truncate_offset = i_size; struct extent_buffer *leaf; int slot; int ins_nr = 0; int start_slot = 0; int ret; if (!(inode->flags & BTRFS_INODE_PREALLOC)) return 0; key.objectid = ino; key.type = BTRFS_EXTENT_DATA_KEY; key.offset = i_size; ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret < 0) goto out; /* * We must check if there is a prealloc extent that starts before the * i_size and crosses the i_size boundary. This is to ensure later we * truncate down to the end of that extent and not to the i_size, as * otherwise we end up losing part of the prealloc extent after a log * replay and with an implicit hole if there is another prealloc extent * that starts at an offset beyond i_size. */ ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY); if (ret < 0) goto out; if (ret == 0) { struct btrfs_file_extent_item *ei; leaf = path->nodes[0]; slot = path->slots[0]; ei = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item); if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_PREALLOC) { u64 extent_end; btrfs_item_key_to_cpu(leaf, &key, slot); extent_end = key.offset + btrfs_file_extent_num_bytes(leaf, ei); if (extent_end > i_size) truncate_offset = extent_end; } } else { ret = 0; } while (true) { leaf = path->nodes[0]; slot = path->slots[0]; if (slot >= btrfs_header_nritems(leaf)) { if (ins_nr > 0) { ret = copy_items(trans, inode, dst_path, path, start_slot, ins_nr, 1, 0, ctx); if (ret < 0) goto out; ins_nr = 0; } ret = btrfs_next_leaf(root, path); if (ret < 0) goto out; if (ret > 0) { ret = 0; break; } continue; } btrfs_item_key_to_cpu(leaf, &key, slot); if (key.objectid > ino) break; if (WARN_ON_ONCE(key.objectid < ino) || key.type < BTRFS_EXTENT_DATA_KEY || key.offset < i_size) { path->slots[0]++; continue; } /* * Avoid overlapping items in the log tree. The first time we * get here, get rid of everything from a past fsync. After * that, if the current extent starts before the end of the last * extent we copied, truncate the last one. This can happen if * an ordered extent completion modifies the subvolume tree * while btrfs_next_leaf() has the tree unlocked. 
*/ if (!dropped_extents || key.offset < truncate_offset) { ret = truncate_inode_items(trans, root->log_root, inode, min(key.offset, truncate_offset), BTRFS_EXTENT_DATA_KEY); if (ret) goto out; dropped_extents = true; } truncate_offset = btrfs_file_extent_end(path); if (ins_nr == 0) start_slot = slot; ins_nr++; path->slots[0]++; if (!dst_path) { dst_path = btrfs_alloc_path(); if (!dst_path) { ret = -ENOMEM; goto out; } } } if (ins_nr > 0) ret = copy_items(trans, inode, dst_path, path, start_slot, ins_nr, 1, 0, ctx); out: btrfs_release_path(path); return ret; } static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *path, struct btrfs_log_ctx *ctx) { struct btrfs_ordered_extent *ordered; struct btrfs_ordered_extent *tmp; struct extent_map *em, *n; LIST_HEAD(extents); struct extent_map_tree *tree = &inode->extent_tree; int ret = 0; int num = 0; write_lock(&tree->lock); list_for_each_entry_safe(em, n, &tree->modified_extents, list) { list_del_init(&em->list); /* * Just an arbitrary number, this can be really CPU intensive * once we start getting a lot of extents, and really once we * have a bunch of extents we just want to commit since it will * be faster. */ if (++num > 32768) { list_del_init(&tree->modified_extents); ret = -EFBIG; goto process; } if (em->generation < trans->transid) continue; /* We log prealloc extents beyond eof later. */ if ((em->flags & EXTENT_FLAG_PREALLOC) && em->start >= i_size_read(&inode->vfs_inode)) continue; /* Need a ref to keep it from getting evicted from cache */ refcount_inc(&em->refs); em->flags |= EXTENT_FLAG_LOGGING; list_add_tail(&em->list, &extents); num++; } list_sort(NULL, &extents, extent_cmp); process: while (!list_empty(&extents)) { em = list_first_entry(&extents, struct extent_map, list); list_del_init(&em->list); /* * If we had an error we just need to delete everybody from our * private list. */ if (ret) { btrfs_clear_em_logging(inode, em); btrfs_free_extent_map(em); continue; } write_unlock(&tree->lock); ret = log_one_extent(trans, inode, em, path, ctx); write_lock(&tree->lock); btrfs_clear_em_logging(inode, em); btrfs_free_extent_map(em); } WARN_ON(!list_empty(&extents)); write_unlock(&tree->lock); if (!ret) ret = btrfs_log_prealloc_extents(trans, inode, path, ctx); if (ret) return ret; /* * We have logged all extents successfully, now make sure the commit of * the current transaction waits for the ordered extents to complete * before it commits and wipes out the log trees, otherwise we would * lose data if an ordered extents completes after the transaction * commits and a power failure happens after the transaction commit. 
*/ list_for_each_entry_safe(ordered, tmp, &ctx->ordered_extents, log_list) { list_del_init(&ordered->log_list); set_bit(BTRFS_ORDERED_LOGGED, &ordered->flags); if (!test_bit(BTRFS_ORDERED_COMPLETE, &ordered->flags)) { spin_lock(&inode->ordered_tree_lock); if (!test_bit(BTRFS_ORDERED_COMPLETE, &ordered->flags)) { set_bit(BTRFS_ORDERED_PENDING, &ordered->flags); atomic_inc(&trans->transaction->pending_ordered); } spin_unlock(&inode->ordered_tree_lock); } btrfs_put_ordered_extent(ordered); } return 0; } static int logged_inode_size(struct btrfs_root *log, struct btrfs_inode *inode, struct btrfs_path *path, u64 *size_ret) { struct btrfs_key key; int ret; key.objectid = btrfs_ino(inode); key.type = BTRFS_INODE_ITEM_KEY; key.offset = 0; ret = btrfs_search_slot(NULL, log, &key, path, 0, 0); if (ret < 0) { return ret; } else if (ret > 0) { *size_ret = 0; } else { struct btrfs_inode_item *item; item = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_inode_item); *size_ret = btrfs_inode_size(path->nodes[0], item); /* * If the in-memory inode's i_size is smaller than the inode * size stored in the btree, return the inode's i_size, so * that we get a correct inode size after replaying the log * when, before a power failure, we had a shrinking truncate * followed by addition of a new name (rename / new hard link). * Otherwise return the inode size from the btree, to avoid * data loss when replaying a log due to previously doing a * write that expands the inode's size and logging a new name * immediately after. */ if (*size_ret > inode->vfs_inode.i_size) *size_ret = inode->vfs_inode.i_size; } btrfs_release_path(path); return 0; } /* * At the moment we always log all xattrs. This is to figure out at log replay * time which xattrs must have their deletion replayed. If an xattr is missing * in the log tree and exists in the fs/subvol tree, we delete it. This is * because if an xattr is deleted, the inode is fsynced and a power failure * happens, then when the log is replayed the next time the fs is mounted * we want the xattr to not exist anymore (same behaviour as other filesystems * with a journal, ext3/4, xfs, f2fs, etc).
*/ static int btrfs_log_all_xattrs(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *path, struct btrfs_path *dst_path, struct btrfs_log_ctx *ctx) { struct btrfs_root *root = inode->root; int ret; struct btrfs_key key; const u64 ino = btrfs_ino(inode); int ins_nr = 0; int start_slot = 0; bool found_xattrs = false; if (test_bit(BTRFS_INODE_NO_XATTRS, &inode->runtime_flags)) return 0; key.objectid = ino; key.type = BTRFS_XATTR_ITEM_KEY; key.offset = 0; ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret < 0) return ret; while (true) { int slot = path->slots[0]; struct extent_buffer *leaf = path->nodes[0]; int nritems = btrfs_header_nritems(leaf); if (slot >= nritems) { if (ins_nr > 0) { ret = copy_items(trans, inode, dst_path, path, start_slot, ins_nr, 1, 0, ctx); if (ret < 0) return ret; ins_nr = 0; } ret = btrfs_next_leaf(root, path); if (ret < 0) return ret; else if (ret > 0) break; continue; } btrfs_item_key_to_cpu(leaf, &key, slot); if (key.objectid != ino || key.type != BTRFS_XATTR_ITEM_KEY) break; if (ins_nr == 0) start_slot = slot; ins_nr++; path->slots[0]++; found_xattrs = true; cond_resched(); } if (ins_nr > 0) { ret = copy_items(trans, inode, dst_path, path, start_slot, ins_nr, 1, 0, ctx); if (ret < 0) return ret; } if (!found_xattrs) set_bit(BTRFS_INODE_NO_XATTRS, &inode->runtime_flags); return 0; } /* * When using the NO_HOLES feature if we punched a hole that causes the * deletion of entire leafs or all the extent items of the first leaf (the one * that contains the inode item and references) we may end up not processing * any extents, because there are no leafs with a generation matching the * current transaction that have extent items for our inode. So we need to find * if any holes exist and then log them. We also need to log holes after any * truncate operation that changes the inode's size. */ static int btrfs_log_holes(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *path) { struct btrfs_root *root = inode->root; struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_key key; const u64 ino = btrfs_ino(inode); const u64 i_size = i_size_read(&inode->vfs_inode); u64 prev_extent_end = 0; int ret; if (!btrfs_fs_incompat(fs_info, NO_HOLES) || i_size == 0) return 0; key.objectid = ino; key.type = BTRFS_EXTENT_DATA_KEY; key.offset = 0; ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret < 0) return ret; while (true) { struct extent_buffer *leaf = path->nodes[0]; if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) { ret = btrfs_next_leaf(root, path); if (ret < 0) return ret; if (ret > 0) { ret = 0; break; } leaf = path->nodes[0]; } btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) break; /* We have a hole, log it. */ if (prev_extent_end < key.offset) { const u64 hole_len = key.offset - prev_extent_end; /* * Release the path to avoid deadlocks with other code * paths that search the root while holding locks on * leafs from the log root. */ btrfs_release_path(path); ret = btrfs_insert_hole_extent(trans, root->log_root, ino, prev_extent_end, hole_len); if (ret < 0) return ret; /* * Search for the same key again in the root. Since it's * an extent item and we are holding the inode lock, the * key must still exist. If it doesn't just emit warning * and return an error to fall back to a transaction * commit. 
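 *
 * To recap the hole detection performed by this loop (hypothetical layout,
 * 4K sectorsize): an inode with i_size = 1M whose only extent items cover
 * [0, 256K) and [512K, 768K) produces, while walking the sorted items:
 *
 *	item [0, 256K):    prev_extent_end 0 -> no hole, prev becomes 256K
 *	item [512K, 768K): 256K < 512K  -> log hole [256K, 512K), prev 768K
 *	end of items:      768K < 1M    -> log tail hole [768K, 1M)
 *
 * so replay recreates exactly the two holes even though, with NO_HOLES, the
 * subvolume tree stores no explicit hole extent items.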
*/ ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret < 0) return ret; if (WARN_ON(ret > 0)) return -ENOENT; leaf = path->nodes[0]; } prev_extent_end = btrfs_file_extent_end(path); path->slots[0]++; cond_resched(); } if (prev_extent_end < i_size) { u64 hole_len; btrfs_release_path(path); hole_len = ALIGN(i_size - prev_extent_end, fs_info->sectorsize); ret = btrfs_insert_hole_extent(trans, root->log_root, ino, prev_extent_end, hole_len); if (ret < 0) return ret; } return 0; } /* * When we are logging a new inode X, check if it doesn't have a reference that * matches the reference from some other inode Y created in a past transaction * and that was renamed in the current transaction. If we don't do this, then at * log replay time we can lose inode Y (and all its files if it's a directory): * * mkdir /mnt/x * echo "hello world" > /mnt/x/foobar * sync * mv /mnt/x /mnt/y * mkdir /mnt/x # or touch /mnt/x * xfs_io -c fsync /mnt/x * <power fail> * mount fs, trigger log replay * * After the log replay procedure, we would lose the first directory and all its * files (file foobar). * For the case where inode Y is not a directory we simply end up losing it: * * echo "123" > /mnt/foo * sync * mv /mnt/foo /mnt/bar * echo "abc" > /mnt/foo * xfs_io -c fsync /mnt/foo * <power fail> * * We also need this for cases where a snapshot entry is replaced by some other * entry (file or directory) otherwise we end up with an unreplayable log due to * attempts to delete the snapshot entry (entry of type BTRFS_ROOT_ITEM_KEY) as * if it were a regular entry: * * mkdir /mnt/x * btrfs subvolume snapshot /mnt /mnt/x/snap * btrfs subvolume delete /mnt/x/snap * rmdir /mnt/x * mkdir /mnt/x * fsync /mnt/x or fsync some new file inside it * <power fail> * * The snapshot delete, rmdir of x, mkdir of a new x and the fsync all happen in * the same transaction. 
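 *
 * A short summary of the return values of the function below:
 *
 *   0        no conflicting name was found (or the name found points to the
 *            inode we are logging);
 *   1        a dir entry with the same name exists and points to a different
 *            inode - *other_ino and *other_parent are set in this case;
 *   -EAGAIN  the name resolves to something that is not an inode item (for
 *            example a root item of a subvolume/snapshot), so the caller
 *            falls back to a transaction commit;
 *   <0       any other error.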
*/ static int btrfs_check_ref_name_override(struct extent_buffer *eb, const int slot, const struct btrfs_key *key, struct btrfs_inode *inode, u64 *other_ino, u64 *other_parent) { BTRFS_PATH_AUTO_FREE(search_path); char AUTO_KFREE(name); u32 name_len = 0; u32 item_size = btrfs_item_size(eb, slot); u32 cur_offset = 0; unsigned long ptr = btrfs_item_ptr_offset(eb, slot); search_path = btrfs_alloc_path(); if (!search_path) return -ENOMEM; search_path->search_commit_root = true; search_path->skip_locking = true; while (cur_offset < item_size) { u64 parent; u32 this_name_len; u32 this_len; unsigned long name_ptr; struct btrfs_dir_item *di; struct fscrypt_str name_str; if (key->type == BTRFS_INODE_REF_KEY) { struct btrfs_inode_ref *iref; iref = (struct btrfs_inode_ref *)(ptr + cur_offset); parent = key->offset; this_name_len = btrfs_inode_ref_name_len(eb, iref); name_ptr = (unsigned long)(iref + 1); this_len = sizeof(*iref) + this_name_len; } else { struct btrfs_inode_extref *extref; extref = (struct btrfs_inode_extref *)(ptr + cur_offset); parent = btrfs_inode_extref_parent(eb, extref); this_name_len = btrfs_inode_extref_name_len(eb, extref); name_ptr = (unsigned long)&extref->name; this_len = sizeof(*extref) + this_name_len; } if (this_name_len > name_len) { char *new_name; new_name = krealloc(name, this_name_len, GFP_NOFS); if (!new_name) return -ENOMEM; name_len = this_name_len; name = new_name; } read_extent_buffer(eb, name, name_ptr, this_name_len); name_str.name = name; name_str.len = this_name_len; di = btrfs_lookup_dir_item(NULL, inode->root, search_path, parent, &name_str, 0); if (di && !IS_ERR(di)) { struct btrfs_key di_key; btrfs_dir_item_key_to_cpu(search_path->nodes[0], di, &di_key); if (di_key.type == BTRFS_INODE_ITEM_KEY) { if (di_key.objectid != key->objectid) { *other_ino = di_key.objectid; *other_parent = parent; return 1; } else { return 0; } } else { return -EAGAIN; } } else if (IS_ERR(di)) { return PTR_ERR(di); } btrfs_release_path(search_path); cur_offset += this_len; } return 0; } /* * Check if we need to log an inode. This is used in contexts where while * logging an inode we need to log another inode (either that it exists or in * full mode). This is used instead of btrfs_inode_in_log() because the later * requires the inode to be in the log and have the log transaction committed, * while here we do not care if the log transaction was already committed - our * caller will commit the log later - and we want to avoid logging an inode * multiple times when multiple tasks have joined the same log transaction. */ static bool need_log_inode(const struct btrfs_trans_handle *trans, struct btrfs_inode *inode) { /* * If a directory was not modified, no dentries added or removed, we can * and should avoid logging it. */ if (S_ISDIR(inode->vfs_inode.i_mode) && inode->last_trans < trans->transid) return false; /* * If this inode does not have new/updated/deleted xattrs since the last * time it was logged and is flagged as logged in the current transaction, * we can skip logging it. As for new/deleted names, those are updated in * the log by link/unlink/rename operations. * In case the inode was logged and then evicted and reloaded, its * logged_trans will be 0, in which case we have to fully log it since * logged_trans is a transient field, not persisted. 
*/ if (inode_logged(trans, inode, NULL) == 1 && !test_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags)) return false; return true; } struct btrfs_dir_list { u64 ino; struct list_head list; }; /* * Log the inodes of the new dentries of a directory. * See process_dir_items_leaf() for details about why it is needed. * This is a recursive operation - if an existing dentry corresponds to a * directory, that directory's new entries are logged too (same behaviour as * ext3/4, xfs, f2fs, nilfs2). Note that when logging the inodes * the dentries point to we do not acquire their VFS lock, otherwise lockdep * complains about the following circular lock dependency / possible deadlock: * * CPU0 CPU1 * ---- ---- * lock(&type->i_mutex_dir_key#3/2); * lock(sb_internal#2); * lock(&type->i_mutex_dir_key#3/2); * lock(&sb->s_type->i_mutex_key#14); * * Where sb_internal is the lock (a counter that works as a lock) acquired by * sb_start_intwrite() in btrfs_start_transaction(). * Not acquiring the VFS lock of the inodes is still safe because: * * 1) For regular files we log with a mode of LOG_INODE_EXISTS. It's possible * that while logging the inode new references (names) are added or removed * from the inode, leaving the logged inode item with a link count that does * not match the number of logged inode reference items. This is fine because * at log replay time we compute the real number of links and correct the * link count in the inode item (see replay_one_buffer() and * link_to_fixup_dir()); * * 2) For directories we log with a mode of LOG_INODE_ALL. It's possible that * while logging the inode's items new index items (key type * BTRFS_DIR_INDEX_KEY) are added to fs/subvol tree and the logged inode item * has a size that doesn't match the sum of the lengths of all the logged * names - this is ok, not a problem, because at log replay time we set the * directory's i_size to the correct value (see replay_one_name() and * overwrite_item()). */ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, struct btrfs_inode *start_inode, struct btrfs_log_ctx *ctx) { struct btrfs_root *root = start_inode->root; struct btrfs_path *path; LIST_HEAD(dir_list); struct btrfs_dir_list *dir_elem; u64 ino = btrfs_ino(start_inode); struct btrfs_inode *curr_inode = start_inode; int ret = 0; path = btrfs_alloc_path(); if (!path) return -ENOMEM; /* Pairs with btrfs_add_delayed_iput below. 
*/ ihold(&curr_inode->vfs_inode); while (true) { struct btrfs_key key; struct btrfs_key found_key; u64 next_index; bool continue_curr_inode = true; int iter_ret; key.objectid = ino; key.type = BTRFS_DIR_INDEX_KEY; key.offset = btrfs_get_first_dir_index_to_log(curr_inode); next_index = key.offset; again: btrfs_for_each_slot(root->log_root, &key, &found_key, path, iter_ret) { struct extent_buffer *leaf = path->nodes[0]; struct btrfs_dir_item *di; struct btrfs_key di_key; struct btrfs_inode *di_inode; int log_mode = LOG_INODE_EXISTS; int type; if (found_key.objectid != ino || found_key.type != BTRFS_DIR_INDEX_KEY) { continue_curr_inode = false; break; } next_index = found_key.offset + 1; di = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dir_item); type = btrfs_dir_ftype(leaf, di); if (btrfs_dir_transid(leaf, di) < trans->transid) continue; btrfs_dir_item_key_to_cpu(leaf, di, &di_key); if (di_key.type == BTRFS_ROOT_ITEM_KEY) continue; btrfs_release_path(path); di_inode = btrfs_iget_logging(di_key.objectid, root); if (IS_ERR(di_inode)) { ret = PTR_ERR(di_inode); goto out; } if (!need_log_inode(trans, di_inode)) { btrfs_add_delayed_iput(di_inode); break; } ctx->log_new_dentries = false; if (type == BTRFS_FT_DIR) log_mode = LOG_INODE_ALL; ret = btrfs_log_inode(trans, di_inode, log_mode, ctx); btrfs_add_delayed_iput(di_inode); if (ret) goto out; if (ctx->log_new_dentries) { dir_elem = kmalloc(sizeof(*dir_elem), GFP_NOFS); if (!dir_elem) { ret = -ENOMEM; goto out; } dir_elem->ino = di_key.objectid; list_add_tail(&dir_elem->list, &dir_list); } break; } btrfs_release_path(path); if (iter_ret < 0) { ret = iter_ret; goto out; } else if (iter_ret > 0) { continue_curr_inode = false; } else { key = found_key; } if (continue_curr_inode && key.offset < (u64)-1) { key.offset++; goto again; } btrfs_set_first_dir_index_to_log(curr_inode, next_index); if (list_empty(&dir_list)) break; dir_elem = list_first_entry(&dir_list, struct btrfs_dir_list, list); ino = dir_elem->ino; list_del(&dir_elem->list); kfree(dir_elem); btrfs_add_delayed_iput(curr_inode); curr_inode = btrfs_iget_logging(ino, root); if (IS_ERR(curr_inode)) { ret = PTR_ERR(curr_inode); curr_inode = NULL; break; } } out: btrfs_free_path(path); if (curr_inode) btrfs_add_delayed_iput(curr_inode); if (ret) { struct btrfs_dir_list *next; list_for_each_entry_safe(dir_elem, next, &dir_list, list) kfree(dir_elem); } return ret; } struct btrfs_ino_list { u64 ino; u64 parent; struct list_head list; }; static void free_conflicting_inodes(struct btrfs_log_ctx *ctx) { struct btrfs_ino_list *curr; struct btrfs_ino_list *next; list_for_each_entry_safe(curr, next, &ctx->conflict_inodes, list) { list_del(&curr->list); kfree(curr); } } static int conflicting_inode_is_dir(struct btrfs_root *root, u64 ino, struct btrfs_path *path) { struct btrfs_key key; int ret; key.objectid = ino; key.type = BTRFS_INODE_ITEM_KEY; key.offset = 0; path->search_commit_root = true; path->skip_locking = true; ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (WARN_ON_ONCE(ret > 0)) { /* * We have previously found the inode through the commit root * so this should not happen. If it does, just error out and * fallback to a transaction commit. 
*/ ret = -ENOENT; } else if (ret == 0) { struct btrfs_inode_item *item; item = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_inode_item); if (S_ISDIR(btrfs_inode_mode(path->nodes[0], item))) ret = 1; } btrfs_release_path(path); path->search_commit_root = false; path->skip_locking = false; return ret; } static bool can_log_conflicting_inode(const struct btrfs_trans_handle *trans, const struct btrfs_inode *inode) { if (!S_ISDIR(inode->vfs_inode.i_mode)) return true; if (inode->last_unlink_trans < trans->transid) return true; /* * If this is a directory and its unlink_trans is not from a past * transaction then we must fallback to a transaction commit in order * to avoid getting a directory with 2 hard links after log replay. * * This happens if a directory A is renamed, moved from one parent * directory to another one, a new file is created in the old parent * directory with the old name of our directory A, the new file is * fsynced, then we moved the new file to some other parent directory * and fsync again the new file. This results in a log tree where we * logged that directory A existed, with the INODE_REF item for the * new location but without having logged its old parent inode, so * that on log replay we add a new link for the new location but the * old link remains, resulting in a link count of 2. */ return false; } static int add_conflicting_inode(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_path *path, u64 ino, u64 parent, struct btrfs_log_ctx *ctx) { struct btrfs_ino_list *ino_elem; struct btrfs_inode *inode; /* * It's rare to have a lot of conflicting inodes, in practice it is not * common to have more than 1 or 2. We don't want to collect too many, * as we could end up logging too many inodes (even if only in * LOG_INODE_EXISTS mode) and slow down other fsyncs or transaction * commits. */ if (ctx->num_conflict_inodes >= MAX_CONFLICT_INODES) return BTRFS_LOG_FORCE_COMMIT; inode = btrfs_iget_logging(ino, root); /* * If the other inode that had a conflicting dir entry was deleted in * the current transaction then we either: * * 1) Log the parent directory (later after adding it to the list) if * the inode is a directory. This is because it may be a deleted * subvolume/snapshot or it may be a regular directory that had * deleted subvolumes/snapshots (or subdirectories that had them), * and at the moment we can't deal with dropping subvolumes/snapshots * during log replay. So we just log the parent, which will result in * a fallback to a transaction commit if we are dealing with those * cases (last_unlink_trans will match the current transaction); * * 2) Do nothing if it's not a directory. During log replay we simply * unlink the conflicting dentry from the parent directory and then * add the dentry for our inode. Like this we can avoid logging the * parent directory (and maybe fallback to a transaction commit in * case it has a last_unlink_trans == trans->transid, due to moving * some inode from it to some other directory). */ if (IS_ERR(inode)) { int ret = PTR_ERR(inode); if (ret != -ENOENT) return ret; ret = conflicting_inode_is_dir(root, ino, path); /* Not a directory or we got an error. */ if (ret <= 0) return ret; /* Conflicting inode is a directory, so we'll log its parent. 
*/ ino_elem = kmalloc(sizeof(*ino_elem), GFP_NOFS); if (!ino_elem) return -ENOMEM; ino_elem->ino = ino; ino_elem->parent = parent; list_add_tail(&ino_elem->list, &ctx->conflict_inodes); ctx->num_conflict_inodes++; return 0; } /* * If the inode was already logged skip it - otherwise we can hit an * infinite loop. Example: * * From the commit root (previous transaction) we have the following * inodes: * * inode 257 a directory * inode 258 with references "zz" and "zz_link" on inode 257 * inode 259 with reference "a" on inode 257 * * And in the current (uncommitted) transaction we have: * * inode 257 a directory, unchanged * inode 258 with references "a" and "a2" on inode 257 * inode 259 with reference "zz_link" on inode 257 * inode 261 with reference "zz" on inode 257 * * When logging inode 261 the following infinite loop could * happen if we don't skip already logged inodes: * * - we detect inode 258 as a conflicting inode, with inode 261 * on reference "zz", and log it; * * - we detect inode 259 as a conflicting inode, with inode 258 * on reference "a", and log it; * * - we detect inode 258 as a conflicting inode, with inode 259 * on reference "zz_link", and log it - again! After this we * repeat the above steps forever. * * Here we can use need_log_inode() because we only need to log the * inode in LOG_INODE_EXISTS mode and rename operations update the log, * so that the log ends up with the new name and without the old name. */ if (!need_log_inode(trans, inode)) { btrfs_add_delayed_iput(inode); return 0; } if (!can_log_conflicting_inode(trans, inode)) { btrfs_add_delayed_iput(inode); return BTRFS_LOG_FORCE_COMMIT; } btrfs_add_delayed_iput(inode); ino_elem = kmalloc(sizeof(*ino_elem), GFP_NOFS); if (!ino_elem) return -ENOMEM; ino_elem->ino = ino; ino_elem->parent = parent; list_add_tail(&ino_elem->list, &ctx->conflict_inodes); ctx->num_conflict_inodes++; return 0; } static int log_conflicting_inodes(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_log_ctx *ctx) { int ret = 0; /* * Conflicting inodes are logged by the first call to btrfs_log_inode(), * otherwise we could have unbounded recursion of btrfs_log_inode() * calls. This check guarantees we can have only 1 level of recursion. */ if (ctx->logging_conflict_inodes) return 0; ctx->logging_conflict_inodes = true; /* * New conflicting inodes may be found and added to the list while we * are logging a conflicting inode, so keep iterating while the list is * not empty. */ while (!list_empty(&ctx->conflict_inodes)) { struct btrfs_ino_list *curr; struct btrfs_inode *inode; u64 ino; u64 parent; curr = list_first_entry(&ctx->conflict_inodes, struct btrfs_ino_list, list); ino = curr->ino; parent = curr->parent; list_del(&curr->list); kfree(curr); inode = btrfs_iget_logging(ino, root); /* * If the other inode that had a conflicting dir entry was * deleted in the current transaction, we need to log its parent * directory. See the comment at add_conflicting_inode(). */ if (IS_ERR(inode)) { ret = PTR_ERR(inode); if (ret != -ENOENT) break; inode = btrfs_iget_logging(parent, root); if (IS_ERR(inode)) { ret = PTR_ERR(inode); break; } if (!can_log_conflicting_inode(trans, inode)) { btrfs_add_delayed_iput(inode); ret = BTRFS_LOG_FORCE_COMMIT; break; } /* * Always log the directory, we cannot make this * conditional on need_log_inode() because the directory * might have been logged in LOG_INODE_EXISTS mode or * the dir index of the conflicting inode is not in a * dir index key range logged for the directory. 
So we * must make sure the deletion is recorded. */ ret = btrfs_log_inode(trans, inode, LOG_INODE_ALL, ctx); btrfs_add_delayed_iput(inode); if (ret) break; continue; } /* * Here we can use need_log_inode() because we only need to log * the inode in LOG_INODE_EXISTS mode and rename operations * update the log, so that the log ends up with the new name and * without the old name. * * We did this check at add_conflicting_inode(), but here we do * it again because if some other task logged the inode after * that, we can avoid doing it again. */ if (!need_log_inode(trans, inode)) { btrfs_add_delayed_iput(inode); continue; } /* * We are safe logging the other inode without acquiring its * lock as long as we log with the LOG_INODE_EXISTS mode. We * are safe against concurrent renames of the other inode as * well because during a rename we pin the log and update the * log with the new name before we unpin it. */ ret = btrfs_log_inode(trans, inode, LOG_INODE_EXISTS, ctx); btrfs_add_delayed_iput(inode); if (ret) break; } ctx->logging_conflict_inodes = false; if (ret) free_conflicting_inodes(ctx); return ret; } static int copy_inode_items_to_log(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_key *min_key, const struct btrfs_key *max_key, struct btrfs_path *path, struct btrfs_path *dst_path, const u64 logged_isize, const int inode_only, struct btrfs_log_ctx *ctx, bool *need_log_inode_item) { const u64 i_size = i_size_read(&inode->vfs_inode); struct btrfs_root *root = inode->root; int ins_start_slot = 0; int ins_nr = 0; int ret; while (1) { ret = btrfs_search_forward(root, min_key, path, trans->transid); if (ret < 0) return ret; if (ret > 0) { ret = 0; break; } again: /* Note, ins_nr might be > 0 here, cleanup outside the loop */ if (min_key->objectid != max_key->objectid) break; if (min_key->type > max_key->type) break; if (min_key->type == BTRFS_INODE_ITEM_KEY) { *need_log_inode_item = false; } else if (min_key->type == BTRFS_EXTENT_DATA_KEY && min_key->offset >= i_size) { /* * Extents at and beyond eof are logged with * btrfs_log_prealloc_extents(). * Only regular files have BTRFS_EXTENT_DATA_KEY keys, * and no keys greater than that, so bail out. 
*/ break; } else if (min_key->type == BTRFS_INODE_REF_KEY || min_key->type == BTRFS_INODE_EXTREF_KEY) { u64 other_ino = 0; u64 other_parent = 0; ret = btrfs_check_ref_name_override(path->nodes[0], path->slots[0], min_key, inode, &other_ino, &other_parent); if (ret < 0) { return ret; } else if (ret > 0 && other_ino != btrfs_ino(ctx->inode)) { if (ins_nr > 0) { ins_nr++; } else { ins_nr = 1; ins_start_slot = path->slots[0]; } ret = copy_items(trans, inode, dst_path, path, ins_start_slot, ins_nr, inode_only, logged_isize, ctx); if (ret < 0) return ret; ins_nr = 0; btrfs_release_path(path); ret = add_conflicting_inode(trans, root, path, other_ino, other_parent, ctx); if (ret) return ret; goto next_key; } } else if (min_key->type == BTRFS_XATTR_ITEM_KEY) { /* Skip xattrs, logged later with btrfs_log_all_xattrs() */ if (ins_nr == 0) goto next_slot; ret = copy_items(trans, inode, dst_path, path, ins_start_slot, ins_nr, inode_only, logged_isize, ctx); if (ret < 0) return ret; ins_nr = 0; goto next_slot; } if (ins_nr && ins_start_slot + ins_nr == path->slots[0]) { ins_nr++; goto next_slot; } else if (!ins_nr) { ins_start_slot = path->slots[0]; ins_nr = 1; goto next_slot; } ret = copy_items(trans, inode, dst_path, path, ins_start_slot, ins_nr, inode_only, logged_isize, ctx); if (ret < 0) return ret; ins_nr = 1; ins_start_slot = path->slots[0]; next_slot: path->slots[0]++; if (path->slots[0] < btrfs_header_nritems(path->nodes[0])) { btrfs_item_key_to_cpu(path->nodes[0], min_key, path->slots[0]); goto again; } if (ins_nr) { ret = copy_items(trans, inode, dst_path, path, ins_start_slot, ins_nr, inode_only, logged_isize, ctx); if (ret < 0) return ret; ins_nr = 0; } btrfs_release_path(path); next_key: if (min_key->offset < (u64)-1) { min_key->offset++; } else if (min_key->type < max_key->type) { min_key->type++; min_key->offset = 0; } else { break; } /* * We may process many leaves full of items for our inode, so * avoid monopolizing a cpu for too long by rescheduling while * not holding locks on any tree. */ cond_resched(); } if (ins_nr) { ret = copy_items(trans, inode, dst_path, path, ins_start_slot, ins_nr, inode_only, logged_isize, ctx); if (ret) return ret; } if (inode_only == LOG_INODE_ALL && S_ISREG(inode->vfs_inode.i_mode)) { /* * Release the path because otherwise we might attempt to double * lock the same leaf with btrfs_log_prealloc_extents() below. */ btrfs_release_path(path); ret = btrfs_log_prealloc_extents(trans, inode, dst_path, ctx); } return ret; } static int insert_delayed_items_batch(struct btrfs_trans_handle *trans, struct btrfs_root *log, struct btrfs_path *path, const struct btrfs_item_batch *batch, const struct btrfs_delayed_item *first_item) { const struct btrfs_delayed_item *curr = first_item; int ret; ret = btrfs_insert_empty_items(trans, log, path, batch); if (ret) return ret; for (int i = 0; i < batch->nr; i++) { char *data_ptr; data_ptr = btrfs_item_ptr(path->nodes[0], path->slots[0], char); write_extent_buffer(path->nodes[0], &curr->data, (unsigned long)data_ptr, curr->data_len); curr = list_next_entry(curr, log_list); path->slots[0]++; } btrfs_release_path(path); return 0; } static int log_delayed_insertion_items(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *path, const struct list_head *delayed_ins_list, struct btrfs_log_ctx *ctx) { /* 195 (4095 bytes of keys and sizes) fits in a single 4K page. 
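 * The arithmetic behind that constant: for each batch element we store an
 * in-memory struct btrfs_key (17 bytes, the structure is packed) plus a u32
 * data size (4 bytes), so 195 * (17 + 4) = 4095 bytes, which is the largest
 * multiple of 21 that still fits in a single 4K page for the ins_data
 * allocation done below.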
*/ const int max_batch_size = 195; const int leaf_data_size = BTRFS_LEAF_DATA_SIZE(trans->fs_info); const u64 ino = btrfs_ino(inode); struct btrfs_root *log = inode->root->log_root; struct btrfs_item_batch batch = { .nr = 0, .total_data_size = 0, }; const struct btrfs_delayed_item *first = NULL; const struct btrfs_delayed_item *curr; char *ins_data; struct btrfs_key *ins_keys; u32 *ins_sizes; u64 curr_batch_size = 0; int batch_idx = 0; int ret; /* We are adding dir index items to the log tree. */ lockdep_assert_held(&inode->log_mutex); /* * We collect delayed items before copying index keys from the subvolume * to the log tree. However just after we collected them, they may have * been flushed (all of them or just some of them), and therefore we * could have copied them from the subvolume tree to the log tree. * So find the first delayed item that was not yet logged (they are * sorted by index number). */ list_for_each_entry(curr, delayed_ins_list, log_list) { if (curr->index > inode->last_dir_index_offset) { first = curr; break; } } /* Empty list or all delayed items were already logged. */ if (!first) return 0; ins_data = kmalloc_array(max_batch_size, sizeof(u32) + sizeof(struct btrfs_key), GFP_NOFS); if (!ins_data) return -ENOMEM; ins_sizes = (u32 *)ins_data; batch.data_sizes = ins_sizes; ins_keys = (struct btrfs_key *)(ins_data + max_batch_size * sizeof(u32)); batch.keys = ins_keys; curr = first; while (!list_entry_is_head(curr, delayed_ins_list, log_list)) { const u32 curr_size = curr->data_len + sizeof(struct btrfs_item); if (curr_batch_size + curr_size > leaf_data_size || batch.nr == max_batch_size) { ret = insert_delayed_items_batch(trans, log, path, &batch, first); if (ret) goto out; batch_idx = 0; batch.nr = 0; batch.total_data_size = 0; curr_batch_size = 0; first = curr; } ins_sizes[batch_idx] = curr->data_len; ins_keys[batch_idx].objectid = ino; ins_keys[batch_idx].type = BTRFS_DIR_INDEX_KEY; ins_keys[batch_idx].offset = curr->index; curr_batch_size += curr_size; batch.total_data_size += curr->data_len; batch.nr++; batch_idx++; curr = list_next_entry(curr, log_list); } ASSERT(batch.nr >= 1, "batch.nr=%d", batch.nr); ret = insert_delayed_items_batch(trans, log, path, &batch, first); curr = list_last_entry(delayed_ins_list, struct btrfs_delayed_item, log_list); inode->last_dir_index_offset = curr->index; out: kfree(ins_data); return ret; } static int log_delayed_deletions_full(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *path, const struct list_head *delayed_del_list, struct btrfs_log_ctx *ctx) { const u64 ino = btrfs_ino(inode); const struct btrfs_delayed_item *curr; curr = list_first_entry(delayed_del_list, struct btrfs_delayed_item, log_list); while (!list_entry_is_head(curr, delayed_del_list, log_list)) { u64 first_dir_index = curr->index; u64 last_dir_index; const struct btrfs_delayed_item *next; int ret; /* * Find a range of consecutive dir index items to delete. Like * this we log a single dir range item spanning several contiguous * dir items instead of logging one range item per dir index item. 
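 * For example (illustrative index numbers): if the delayed deletions are
 * for the dir indexes 10, 11, 12, 15 and 16, we log only two dir range
 * items, one covering indexes [10, 12] and another covering [15, 16],
 * instead of five individual range items.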
*/ next = list_next_entry(curr, log_list); while (!list_entry_is_head(next, delayed_del_list, log_list)) { if (next->index != curr->index + 1) break; curr = next; next = list_next_entry(next, log_list); } last_dir_index = curr->index; ASSERT(last_dir_index >= first_dir_index, "last_dir_index=%llu first_dir_index=%llu", last_dir_index, first_dir_index); ret = insert_dir_log_key(trans, inode->root->log_root, path, ino, first_dir_index, last_dir_index); if (ret) return ret; curr = list_next_entry(curr, log_list); } return 0; } static int batch_delete_dir_index_items(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *path, const struct list_head *delayed_del_list, const struct btrfs_delayed_item *first, const struct btrfs_delayed_item **last_ret) { const struct btrfs_delayed_item *next; struct extent_buffer *leaf = path->nodes[0]; const int last_slot = btrfs_header_nritems(leaf) - 1; int slot = path->slots[0] + 1; const u64 ino = btrfs_ino(inode); next = list_next_entry(first, log_list); while (slot < last_slot && !list_entry_is_head(next, delayed_del_list, log_list)) { struct btrfs_key key; btrfs_item_key_to_cpu(leaf, &key, slot); if (key.objectid != ino || key.type != BTRFS_DIR_INDEX_KEY || key.offset != next->index) break; slot++; *last_ret = next; next = list_next_entry(next, log_list); } return btrfs_del_items(trans, inode->root->log_root, path, path->slots[0], slot - path->slots[0]); } static int log_delayed_deletions_incremental(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *path, const struct list_head *delayed_del_list, struct btrfs_log_ctx *ctx) { struct btrfs_root *log = inode->root->log_root; const struct btrfs_delayed_item *curr; u64 last_range_start = 0; u64 last_range_end = 0; struct btrfs_key key; key.objectid = btrfs_ino(inode); key.type = BTRFS_DIR_INDEX_KEY; curr = list_first_entry(delayed_del_list, struct btrfs_delayed_item, log_list); while (!list_entry_is_head(curr, delayed_del_list, log_list)) { const struct btrfs_delayed_item *last = curr; u64 first_dir_index = curr->index; u64 last_dir_index; bool deleted_items = false; int ret; key.offset = curr->index; ret = btrfs_search_slot(trans, log, &key, path, -1, 1); if (ret < 0) { return ret; } else if (ret == 0) { ret = batch_delete_dir_index_items(trans, inode, path, delayed_del_list, curr, &last); if (ret) return ret; deleted_items = true; } btrfs_release_path(path); /* * If we deleted items from the leaf, it means we have a range * item logging their range, so no need to add one or update an * existing one. Otherwise we have to log a dir range item. */ if (deleted_items) goto next_batch; last_dir_index = last->index; ASSERT(last_dir_index >= first_dir_index, "last_dir_index=%llu first_dir_index=%llu", last_dir_index, first_dir_index); /* * If this range starts right after where the previous one ends, * then we want to reuse the previous range item and change its * end offset to the end of this range. This is just to minimize * leaf space usage, by avoiding adding a new range item. 
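 * For example (illustrative index numbers): if the previous iteration
 * logged a range item covering indexes [2, 5] and the current batch covers
 * indexes [6, 9], we reuse index 2 as the start and log a range item
 * covering [2, 9], updating the existing item instead of inserting a
 * second one.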
*/ if (last_range_end != 0 && first_dir_index == last_range_end + 1) first_dir_index = last_range_start; ret = insert_dir_log_key(trans, log, path, key.objectid, first_dir_index, last_dir_index); if (ret) return ret; last_range_start = first_dir_index; last_range_end = last_dir_index; next_batch: curr = list_next_entry(last, log_list); } return 0; } static int log_delayed_deletion_items(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_path *path, const struct list_head *delayed_del_list, struct btrfs_log_ctx *ctx) { /* * We are deleting dir index items from the log tree or adding range * items to it. */ lockdep_assert_held(&inode->log_mutex); if (list_empty(delayed_del_list)) return 0; if (ctx->logged_before) return log_delayed_deletions_incremental(trans, inode, path, delayed_del_list, ctx); return log_delayed_deletions_full(trans, inode, path, delayed_del_list, ctx); } /* * Similar logic as for log_new_dir_dentries(), but it iterates over the delayed * items instead of the subvolume tree. */ static int log_new_delayed_dentries(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, const struct list_head *delayed_ins_list, struct btrfs_log_ctx *ctx) { const bool orig_log_new_dentries = ctx->log_new_dentries; struct btrfs_delayed_item *item; int ret = 0; /* * No need for the log mutex, plus to avoid potential deadlocks or * lockdep annotations due to nesting of delayed inode mutexes and log * mutexes. */ lockdep_assert_not_held(&inode->log_mutex); ASSERT(!ctx->logging_new_delayed_dentries, "ctx->logging_new_delayed_dentries=%d", ctx->logging_new_delayed_dentries); ctx->logging_new_delayed_dentries = true; list_for_each_entry(item, delayed_ins_list, log_list) { struct btrfs_dir_item *dir_item; struct btrfs_inode *di_inode; struct btrfs_key key; int log_mode = LOG_INODE_EXISTS; dir_item = (struct btrfs_dir_item *)item->data; btrfs_disk_key_to_cpu(&key, &dir_item->location); if (key.type == BTRFS_ROOT_ITEM_KEY) continue; di_inode = btrfs_iget_logging(key.objectid, inode->root); if (IS_ERR(di_inode)) { ret = PTR_ERR(di_inode); break; } if (!need_log_inode(trans, di_inode)) { btrfs_add_delayed_iput(di_inode); continue; } if (btrfs_stack_dir_ftype(dir_item) == BTRFS_FT_DIR) log_mode = LOG_INODE_ALL; ctx->log_new_dentries = false; ret = btrfs_log_inode(trans, di_inode, log_mode, ctx); if (!ret && ctx->log_new_dentries) ret = log_new_dir_dentries(trans, di_inode, ctx); btrfs_add_delayed_iput(di_inode); if (ret) break; } ctx->log_new_dentries = orig_log_new_dentries; ctx->logging_new_delayed_dentries = false; return ret; } /* log a single inode in the tree log. * At least one parent directory for this inode must exist in the tree * or be logged already. * * Any items from this inode changed by the current transaction are copied * to the log tree. An extra reference is taken on any extents in this * file, allowing us to avoid a whole pile of corner cases around logging * blocks that have been removed from the tree. * * See LOG_INODE_ALL and related defines for a description of what inode_only * does. * * This handles both files and directories. 
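 *
 * Returns 0 on success, a negative error on failure, or the special value
 * BTRFS_LOG_FORCE_COMMIT when the inode can not be logged safely and the
 * caller must fall back to committing the current transaction.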
*/ static int btrfs_log_inode(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, int inode_only, struct btrfs_log_ctx *ctx) { struct btrfs_path *path; struct btrfs_path *dst_path; struct btrfs_key min_key; struct btrfs_key max_key; struct btrfs_root *log = inode->root->log_root; int ret; bool fast_search = false; u64 ino = btrfs_ino(inode); struct extent_map_tree *em_tree = &inode->extent_tree; u64 logged_isize = 0; bool need_log_inode_item = true; bool xattrs_logged = false; bool inode_item_dropped = true; bool full_dir_logging = false; LIST_HEAD(delayed_ins_list); LIST_HEAD(delayed_del_list); path = btrfs_alloc_path(); if (!path) return -ENOMEM; dst_path = btrfs_alloc_path(); if (!dst_path) { btrfs_free_path(path); return -ENOMEM; } min_key.objectid = ino; min_key.type = BTRFS_INODE_ITEM_KEY; min_key.offset = 0; max_key.objectid = ino; /* today the code can only do partial logging of directories */ if (S_ISDIR(inode->vfs_inode.i_mode) || (!test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) && inode_only >= LOG_INODE_EXISTS)) max_key.type = BTRFS_XATTR_ITEM_KEY; else max_key.type = (u8)-1; max_key.offset = (u64)-1; if (S_ISDIR(inode->vfs_inode.i_mode) && inode_only == LOG_INODE_ALL) full_dir_logging = true; /* * If we are logging a directory while we are logging dentries of the * delayed items of some other inode, then we need to flush the delayed * items of this directory and not log the delayed items directly. This * is to prevent more than one level of recursion into btrfs_log_inode() * by having something like this: * * $ mkdir -p a/b/c/d/e/f/g/h/... * $ xfs_io -c "fsync" a * * Where all directories in the path did not exist before and are * created in the current transaction. * So in such a case we directly log the delayed items of the main * directory ("a") without flushing them first, while for each of its * subdirectories we flush their delayed items before logging them. * This prevents a potential unbounded recursion like this: * * btrfs_log_inode() * log_new_delayed_dentries() * btrfs_log_inode() * log_new_delayed_dentries() * btrfs_log_inode() * log_new_delayed_dentries() * (...) * * We have thresholds for the maximum number of delayed items to have in * memory, and once they are hit, the items are flushed asynchronously. * However the limit is quite high, so lets prevent deep levels of * recursion to happen by limiting the maximum depth to be 1. */ if (full_dir_logging && ctx->logging_new_delayed_dentries) { ret = btrfs_commit_inode_delayed_items(trans, inode); if (ret) goto out; } mutex_lock(&inode->log_mutex); /* * For symlinks, we must always log their content, which is stored in an * inline extent, otherwise we could end up with an empty symlink after * log replay, which is invalid on linux (symlink(2) returns -ENOENT if * one attempts to create an empty symlink). * We don't need to worry about flushing delalloc, because when we create * the inline extent when the symlink is created (we never have delalloc * for symlinks). */ if (S_ISLNK(inode->vfs_inode.i_mode)) inode_only = LOG_INODE_ALL; /* * Before logging the inode item, cache the value returned by * inode_logged(), because after that we have the need to figure out if * the inode was previously logged in this transaction. */ ret = inode_logged(trans, inode, path); if (ret < 0) goto out_unlock; ctx->logged_before = (ret == 1); ret = 0; /* * This is for cases where logging a directory could result in losing a * a file after replaying the log. 
For example, if we move a file from a * directory A to a directory B, then fsync directory A, we have no way * to known the file was moved from A to B, so logging just A would * result in losing the file after a log replay. */ if (full_dir_logging && inode->last_unlink_trans >= trans->transid) { ret = BTRFS_LOG_FORCE_COMMIT; goto out_unlock; } /* * a brute force approach to making sure we get the most uptodate * copies of everything. */ if (S_ISDIR(inode->vfs_inode.i_mode)) { clear_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags); if (ctx->logged_before) ret = drop_inode_items(trans, log, path, inode, BTRFS_XATTR_ITEM_KEY); } else { if (inode_only == LOG_INODE_EXISTS && ctx->logged_before) { /* * Make sure the new inode item we write to the log has * the same isize as the current one (if it exists). * This is necessary to prevent data loss after log * replay, and also to prevent doing a wrong expanding * truncate - for e.g. create file, write 4K into offset * 0, fsync, write 4K into offset 4096, add hard link, * fsync some other file (to sync log), power fail - if * we use the inode's current i_size, after log replay * we get a 8Kb file, with the last 4Kb extent as a hole * (zeroes), as if an expanding truncate happened, * instead of getting a file of 4Kb only. */ ret = logged_inode_size(log, inode, path, &logged_isize); if (ret) goto out_unlock; } if (test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags)) { if (inode_only == LOG_INODE_EXISTS) { max_key.type = BTRFS_XATTR_ITEM_KEY; if (ctx->logged_before) ret = drop_inode_items(trans, log, path, inode, max_key.type); } else { clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags); clear_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags); if (ctx->logged_before) ret = truncate_inode_items(trans, log, inode, 0, 0); } } else if (test_and_clear_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags) || inode_only == LOG_INODE_EXISTS) { if (inode_only == LOG_INODE_ALL) fast_search = true; max_key.type = BTRFS_XATTR_ITEM_KEY; if (ctx->logged_before) ret = drop_inode_items(trans, log, path, inode, max_key.type); } else { if (inode_only == LOG_INODE_ALL) fast_search = true; inode_item_dropped = false; goto log_extents; } } if (ret) goto out_unlock; /* * If we are logging a directory in full mode, collect the delayed items * before iterating the subvolume tree, so that we don't miss any new * dir index items in case they get flushed while or right after we are * iterating the subvolume tree. */ if (full_dir_logging && !ctx->logging_new_delayed_dentries) btrfs_log_get_delayed_items(inode, &delayed_ins_list, &delayed_del_list); /* * If we are fsyncing a file with 0 hard links, then commit the delayed * inode because the last inode ref (or extref) item may still be in the * subvolume tree and if we log it the file will still exist after a log * replay. So commit the delayed inode to delete that last ref and we * skip logging it. 
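 * This can happen when an application unlinks a file while still holding an
 * open file descriptor for it and then calls fsync on that descriptor.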
*/ if (inode->vfs_inode.i_nlink == 0) { ret = btrfs_commit_inode_delayed_inode(inode); if (ret) goto out_unlock; } ret = copy_inode_items_to_log(trans, inode, &min_key, &max_key, path, dst_path, logged_isize, inode_only, ctx, &need_log_inode_item); if (ret) goto out_unlock; btrfs_release_path(path); btrfs_release_path(dst_path); ret = btrfs_log_all_xattrs(trans, inode, path, dst_path, ctx); if (ret) goto out_unlock; xattrs_logged = true; if (max_key.type >= BTRFS_EXTENT_DATA_KEY && !fast_search) { btrfs_release_path(path); btrfs_release_path(dst_path); ret = btrfs_log_holes(trans, inode, path); if (ret) goto out_unlock; } log_extents: btrfs_release_path(path); btrfs_release_path(dst_path); if (need_log_inode_item) { ret = log_inode_item(trans, log, dst_path, inode, inode_item_dropped); if (ret) goto out_unlock; /* * If we are doing a fast fsync and the inode was logged before * in this transaction, we don't need to log the xattrs because * they were logged before. If xattrs were added, changed or * deleted since the last time we logged the inode, then we have * already logged them because the inode had the runtime flag * BTRFS_INODE_COPY_EVERYTHING set. */ if (!xattrs_logged && inode->logged_trans < trans->transid) { ret = btrfs_log_all_xattrs(trans, inode, path, dst_path, ctx); if (ret) goto out_unlock; btrfs_release_path(path); } } if (fast_search) { ret = btrfs_log_changed_extents(trans, inode, dst_path, ctx); if (ret) goto out_unlock; } else if (inode_only == LOG_INODE_ALL) { struct extent_map *em, *n; write_lock(&em_tree->lock); list_for_each_entry_safe(em, n, &em_tree->modified_extents, list) list_del_init(&em->list); write_unlock(&em_tree->lock); } if (full_dir_logging) { ret = log_directory_changes(trans, inode, path, dst_path, ctx); if (ret) goto out_unlock; ret = log_delayed_insertion_items(trans, inode, path, &delayed_ins_list, ctx); if (ret) goto out_unlock; ret = log_delayed_deletion_items(trans, inode, path, &delayed_del_list, ctx); if (ret) goto out_unlock; } spin_lock(&inode->lock); inode->logged_trans = trans->transid; /* * Don't update last_log_commit if we logged that an inode exists. * We do this for three reasons: * * 1) We might have had buffered writes to this inode that were * flushed and had their ordered extents completed in this * transaction, but we did not previously log the inode with * LOG_INODE_ALL. Later the inode was evicted and after that * it was loaded again and this LOG_INODE_EXISTS log operation * happened. We must make sure that if an explicit fsync against * the inode is performed later, it logs the new extents, an * updated inode item, etc, and syncs the log. The same logic * applies to direct IO writes instead of buffered writes. * * 2) When we log the inode with LOG_INODE_EXISTS, its inode item * is logged with an i_size of 0 or whatever value was logged * before. If later the i_size of the inode is increased by a * truncate operation, the log is synced through an fsync of * some other inode and then finally an explicit fsync against * this inode is made, we must make sure this fsync logs the * inode with the new i_size, the hole between old i_size and * the new i_size, and syncs the log. 
* * 3) If we are logging that an ancestor inode exists as part of * logging a new name from a link or rename operation, don't update * its last_log_commit - otherwise if an explicit fsync is made * against an ancestor, the fsync considers the inode in the log * and doesn't sync the log, resulting in the ancestor missing after * a power failure unless the log was synced as part of an fsync * against any other unrelated inode. */ if (!ctx->logging_new_name && inode_only != LOG_INODE_EXISTS) inode->last_log_commit = inode->last_sub_trans; spin_unlock(&inode->lock); /* * Reset the last_reflink_trans so that the next fsync does not need to * go through the slower path when logging extents and their checksums. */ if (inode_only == LOG_INODE_ALL) inode->last_reflink_trans = 0; out_unlock: mutex_unlock(&inode->log_mutex); out: btrfs_free_path(path); btrfs_free_path(dst_path); if (ret) free_conflicting_inodes(ctx); else ret = log_conflicting_inodes(trans, inode->root, ctx); if (full_dir_logging && !ctx->logging_new_delayed_dentries) { if (!ret) ret = log_new_delayed_dentries(trans, inode, &delayed_ins_list, ctx); btrfs_log_put_delayed_items(inode, &delayed_ins_list, &delayed_del_list); } return ret; } static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_log_ctx *ctx) { int ret; BTRFS_PATH_AUTO_FREE(path); struct btrfs_key key; struct btrfs_root *root = inode->root; const u64 ino = btrfs_ino(inode); path = btrfs_alloc_path(); if (!path) return -ENOMEM; path->skip_locking = true; path->search_commit_root = true; key.objectid = ino; key.type = BTRFS_INODE_REF_KEY; key.offset = 0; ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret < 0) return ret; while (true) { struct extent_buffer *leaf = path->nodes[0]; int slot = path->slots[0]; u32 cur_offset = 0; u32 item_size; unsigned long ptr; if (slot >= btrfs_header_nritems(leaf)) { ret = btrfs_next_leaf(root, path); if (ret < 0) return ret; if (ret > 0) break; continue; } btrfs_item_key_to_cpu(leaf, &key, slot); /* BTRFS_INODE_EXTREF_KEY is BTRFS_INODE_REF_KEY + 1 */ if (key.objectid != ino || key.type > BTRFS_INODE_EXTREF_KEY) break; item_size = btrfs_item_size(leaf, slot); ptr = btrfs_item_ptr_offset(leaf, slot); while (cur_offset < item_size) { u64 dir_id; struct btrfs_inode *dir_inode; if (key.type == BTRFS_INODE_EXTREF_KEY) { struct btrfs_inode_extref *extref; extref = (struct btrfs_inode_extref *) (ptr + cur_offset); dir_id = btrfs_inode_extref_parent(leaf, extref); cur_offset += sizeof(*extref); cur_offset += btrfs_inode_extref_name_len(leaf, extref); } else { dir_id = key.offset; cur_offset = item_size; } dir_inode = btrfs_iget_logging(dir_id, root); /* * If the parent inode was deleted, return an error to * fallback to a transaction commit. This is to prevent * getting an inode that was moved from one parent A to * a parent B, got its former parent A deleted and then * it got fsync'ed, from existing at both parents after * a log replay (and the old parent still existing). * Example: * * mkdir /mnt/A * mkdir /mnt/B * touch /mnt/B/bar * sync * mv /mnt/B/bar /mnt/A/bar * mv -T /mnt/A /mnt/B * fsync /mnt/B/bar * <power fail> * * If we ignore the old parent B which got deleted, * after a log replay we would have file bar linked * at both parents and the old parent B would still * exist. 
*/ if (IS_ERR(dir_inode)) return PTR_ERR(dir_inode); if (!need_log_inode(trans, dir_inode)) { btrfs_add_delayed_iput(dir_inode); continue; } ctx->log_new_dentries = false; ret = btrfs_log_inode(trans, dir_inode, LOG_INODE_ALL, ctx); if (!ret && ctx->log_new_dentries) ret = log_new_dir_dentries(trans, dir_inode, ctx); btrfs_add_delayed_iput(dir_inode); if (ret) return ret; } path->slots[0]++; } return 0; } static int log_new_ancestors(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_path *path, struct btrfs_log_ctx *ctx) { struct btrfs_key found_key; btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]); while (true) { struct extent_buffer *leaf; int slot; struct btrfs_key search_key; struct btrfs_inode *inode; u64 ino; int ret = 0; btrfs_release_path(path); ino = found_key.offset; search_key.objectid = found_key.offset; search_key.type = BTRFS_INODE_ITEM_KEY; search_key.offset = 0; inode = btrfs_iget_logging(ino, root); if (IS_ERR(inode)) return PTR_ERR(inode); if (inode->generation >= trans->transid && need_log_inode(trans, inode)) ret = btrfs_log_inode(trans, inode, LOG_INODE_EXISTS, ctx); btrfs_add_delayed_iput(inode); if (ret) return ret; if (search_key.objectid == BTRFS_FIRST_FREE_OBJECTID) break; search_key.type = BTRFS_INODE_REF_KEY; ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0); if (ret < 0) return ret; leaf = path->nodes[0]; slot = path->slots[0]; if (slot >= btrfs_header_nritems(leaf)) { ret = btrfs_next_leaf(root, path); if (ret < 0) return ret; else if (ret > 0) return -ENOENT; leaf = path->nodes[0]; slot = path->slots[0]; } btrfs_item_key_to_cpu(leaf, &found_key, slot); if (found_key.objectid != search_key.objectid || found_key.type != BTRFS_INODE_REF_KEY) return -ENOENT; } return 0; } static int log_new_ancestors_fast(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct dentry *parent, struct btrfs_log_ctx *ctx) { struct btrfs_root *root = inode->root; struct dentry *old_parent = NULL; struct super_block *sb = inode->vfs_inode.i_sb; int ret = 0; while (true) { if (!parent || d_really_is_negative(parent) || sb != parent->d_sb) break; inode = BTRFS_I(d_inode(parent)); if (root != inode->root) break; if (inode->generation >= trans->transid && need_log_inode(trans, inode)) { ret = btrfs_log_inode(trans, inode, LOG_INODE_EXISTS, ctx); if (ret) break; } if (IS_ROOT(parent)) break; parent = dget_parent(parent); dput(old_parent); old_parent = parent; } dput(old_parent); return ret; } static int log_all_new_ancestors(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct dentry *parent, struct btrfs_log_ctx *ctx) { struct btrfs_root *root = inode->root; const u64 ino = btrfs_ino(inode); BTRFS_PATH_AUTO_FREE(path); struct btrfs_key search_key; int ret; /* * For a single hard link case, go through a fast path that does not * need to iterate the fs/subvolume tree. 
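 * The fast path walks up the in-memory dentry chain (through dget_parent())
 * instead of searching the subvolume tree for inode reference items, which
 * is enough when the inode has at most one hard link and therefore a single
 * parent chain.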
*/ if (inode->vfs_inode.i_nlink < 2) return log_new_ancestors_fast(trans, inode, parent, ctx); path = btrfs_alloc_path(); if (!path) return -ENOMEM; search_key.objectid = ino; search_key.type = BTRFS_INODE_REF_KEY; search_key.offset = 0; again: ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0); if (ret < 0) return ret; if (ret == 0) path->slots[0]++; while (true) { struct extent_buffer *leaf = path->nodes[0]; int slot = path->slots[0]; struct btrfs_key found_key; if (slot >= btrfs_header_nritems(leaf)) { ret = btrfs_next_leaf(root, path); if (ret < 0) return ret; if (ret > 0) break; continue; } btrfs_item_key_to_cpu(leaf, &found_key, slot); if (found_key.objectid != ino || found_key.type > BTRFS_INODE_EXTREF_KEY) break; /* * Don't deal with extended references because they are rare * cases and too complex to deal with (we would need to keep * track of which subitem we are processing for each item in * this loop, etc). So just return some error to fallback to * a transaction commit. */ if (found_key.type == BTRFS_INODE_EXTREF_KEY) return -EMLINK; /* * Logging ancestors needs to do more searches on the fs/subvol * tree, so it releases the path as needed to avoid deadlocks. * Keep track of the last inode ref key and resume from that key * after logging all new ancestors for the current hard link. */ memcpy(&search_key, &found_key, sizeof(search_key)); ret = log_new_ancestors(trans, root, path, ctx); if (ret) return ret; btrfs_release_path(path); goto again; } return 0; } /* * helper function around btrfs_log_inode to make sure newly created * parent directories also end up in the log. A minimal inode and backref * only logging is done of any parent directories that are older than * the last committed transaction */ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct dentry *parent, int inode_only, struct btrfs_log_ctx *ctx) { struct btrfs_root *root = inode->root; struct btrfs_fs_info *fs_info = root->fs_info; int ret = 0; bool log_dentries; if (btrfs_test_opt(fs_info, NOTREELOG)) return BTRFS_LOG_FORCE_COMMIT; if (btrfs_root_refs(&root->root_item) == 0) return BTRFS_LOG_FORCE_COMMIT; /* * If we're logging an inode from a subvolume created in the current * transaction we must force a commit since the root is not persisted. */ if (btrfs_root_generation(&root->root_item) == trans->transid) return BTRFS_LOG_FORCE_COMMIT; /* Skip already logged inodes and without new extents. */ if (btrfs_inode_in_log(inode, trans->transid) && list_empty(&ctx->ordered_extents)) return BTRFS_NO_LOG_SYNC; ret = start_log_trans(trans, root, ctx); if (ret) return ret; ret = btrfs_log_inode(trans, inode, inode_only, ctx); if (ret) goto end_trans; /* * for regular files, if its inode is already on disk, we don't * have to worry about the parents at all. This is because * we can use the last_unlink_trans field to record renames * and other fun in this file. */ if (S_ISREG(inode->vfs_inode.i_mode) && inode->generation < trans->transid && inode->last_unlink_trans < trans->transid) { ret = 0; goto end_trans; } /* * Track if we need to log dentries because ctx->log_new_dentries can * be modified in the call chains below. */ log_dentries = ctx->log_new_dentries; /* * On unlink we must make sure all our current and old parent directory * inodes are fully logged. This is to prevent leaving dangling * directory index entries in directories that were our parents but are * not anymore. 
Not doing this results in old parent directory being * impossible to delete after log replay (rmdir will always fail with * error -ENOTEMPTY). * * Example 1: * * mkdir testdir * touch testdir/foo * ln testdir/foo testdir/bar * sync * unlink testdir/bar * xfs_io -c fsync testdir/foo * <power failure> * mount fs, triggers log replay * * If we don't log the parent directory (testdir), after log replay the * directory still has an entry pointing to the file inode using the bar * name, but a matching BTRFS_INODE_[REF|EXTREF]_KEY does not exist and * the file inode has a link count of 1. * * Example 2: * * mkdir testdir * touch foo * ln foo testdir/foo2 * ln foo testdir/foo3 * sync * unlink testdir/foo3 * xfs_io -c fsync foo * <power failure> * mount fs, triggers log replay * * Similar as the first example, after log replay the parent directory * testdir still has an entry pointing to the inode file with name foo3 * but the file inode does not have a matching BTRFS_INODE_REF_KEY item * and has a link count of 2. */ if (inode->last_unlink_trans >= trans->transid) { ret = btrfs_log_all_parents(trans, inode, ctx); if (ret) goto end_trans; } ret = log_all_new_ancestors(trans, inode, parent, ctx); if (ret) goto end_trans; if (log_dentries) ret = log_new_dir_dentries(trans, inode, ctx); end_trans: if (ret < 0) { btrfs_set_log_full_commit(trans); ret = BTRFS_LOG_FORCE_COMMIT; } if (ret) btrfs_remove_log_ctx(root, ctx); btrfs_end_log_trans(root); return ret; } /* * it is not safe to log dentry if the chunk root has added new * chunks. This returns 0 if the dentry was logged, and 1 otherwise. * If this returns 1, you must commit the transaction to safely get your * data on disk. */ int btrfs_log_dentry_safe(struct btrfs_trans_handle *trans, struct dentry *dentry, struct btrfs_log_ctx *ctx) { struct dentry *parent = dget_parent(dentry); int ret; ret = btrfs_log_inode_parent(trans, BTRFS_I(d_inode(dentry)), parent, LOG_INODE_ALL, ctx); dput(parent); return ret; } /* * should be called during mount to recover any replay any log trees * from the FS */ int btrfs_recover_log_trees(struct btrfs_root *log_root_tree) { int ret; struct btrfs_path *path; struct btrfs_trans_handle *trans; struct btrfs_key key; struct btrfs_fs_info *fs_info = log_root_tree->fs_info; struct walk_control wc = { .process_func = process_one_buffer, .stage = LOG_WALK_PIN_ONLY, }; path = btrfs_alloc_path(); if (!path) return -ENOMEM; set_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags); trans = btrfs_start_transaction(fs_info->tree_root, 0); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto error; } wc.trans = trans; wc.pin = true; wc.log = log_root_tree; ret = walk_log_tree(&wc); wc.log = NULL; if (unlikely(ret)) { btrfs_abort_transaction(trans, ret); goto error; } again: key.objectid = BTRFS_TREE_LOG_OBJECTID; key.type = BTRFS_ROOT_ITEM_KEY; key.offset = (u64)-1; while (1) { struct btrfs_key found_key; ret = btrfs_search_slot(NULL, log_root_tree, &key, path, 0, 0); if (unlikely(ret < 0)) { btrfs_abort_transaction(trans, ret); goto error; } if (ret > 0) { if (path->slots[0] == 0) break; path->slots[0]--; } btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]); btrfs_release_path(path); if (found_key.objectid != BTRFS_TREE_LOG_OBJECTID) break; wc.log = btrfs_read_tree_root(log_root_tree, &found_key); if (IS_ERR(wc.log)) { ret = PTR_ERR(wc.log); wc.log = NULL; btrfs_abort_transaction(trans, ret); goto error; } wc.root = btrfs_get_fs_root(fs_info, found_key.offset, true); if (IS_ERR(wc.root)) { ret = PTR_ERR(wc.root); wc.root = NULL; 
if (unlikely(ret != -ENOENT)) { btrfs_abort_transaction(trans, ret); goto error; } /* * We didn't find the subvol, likely because it was * deleted. This is ok, simply skip this log and go to * the next one. * * We need to exclude the root because we can't have * other log replays overwriting this log as we'll read * it back in a few more times. This will keep our * block from being modified, and we'll just bail for * each subsequent pass. */ ret = btrfs_pin_extent_for_log_replay(trans, wc.log->node); if (unlikely(ret)) { btrfs_abort_transaction(trans, ret); goto error; } goto next; } wc.root->log_root = wc.log; ret = btrfs_record_root_in_trans(trans, wc.root); if (unlikely(ret)) { btrfs_abort_transaction(trans, ret); goto next; } ret = walk_log_tree(&wc); if (unlikely(ret)) { btrfs_abort_transaction(trans, ret); goto next; } if (wc.stage == LOG_WALK_REPLAY_ALL) { struct btrfs_root *root = wc.root; wc.subvol_path = path; ret = fixup_inode_link_counts(&wc); wc.subvol_path = NULL; if (unlikely(ret)) { btrfs_abort_transaction(trans, ret); goto next; } /* * We have just replayed everything, and the highest * objectid of fs roots probably has changed in case * some inode_item's got replayed. * * root->objectid_mutex is not acquired as log replay * could only happen during mount. */ ret = btrfs_init_root_free_objectid(root); if (unlikely(ret)) { btrfs_abort_transaction(trans, ret); goto next; } } next: if (wc.root) { wc.root->log_root = NULL; btrfs_put_root(wc.root); } btrfs_put_root(wc.log); wc.log = NULL; if (ret) goto error; if (found_key.offset == 0) break; key.offset = found_key.offset - 1; } btrfs_release_path(path); /* step one is to pin it all, step two is to replay just inodes */ if (wc.pin) { wc.pin = false; wc.process_func = replay_one_buffer; wc.stage = LOG_WALK_REPLAY_INODES; goto again; } /* step three is to replay everything */ if (wc.stage < LOG_WALK_REPLAY_ALL) { wc.stage++; goto again; } btrfs_free_path(path); /* step 4: commit the transaction, which also unpins the blocks */ ret = btrfs_commit_transaction(trans); if (ret) return ret; clear_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags); return 0; error: if (wc.trans) btrfs_end_transaction(wc.trans); btrfs_put_root(wc.log); clear_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags); btrfs_free_path(path); return ret; } /* * there are some corner cases where we want to force a full * commit instead of allowing a directory to be logged. * * They revolve around files there were unlinked from the directory, and * this function updates the parent directory so that a full commit is * properly done if it is fsync'd later after the unlinks are done. * * Must be called before the unlink operations (updates to the subvolume tree, * inodes, etc) are done. */ void btrfs_record_unlink_dir(struct btrfs_trans_handle *trans, struct btrfs_inode *dir, struct btrfs_inode *inode, bool for_rename) { /* * when we're logging a file, if it hasn't been renamed * or unlinked, and its inode is fully committed on disk, * we don't have to worry about walking up the directory chain * to log its parents. * * So, we use the last_unlink_trans field to put this transid * into the file. When the file is logged we check it and * don't log the parents if the file is fully on disk. 
*/ mutex_lock(&inode->log_mutex); inode->last_unlink_trans = trans->transid; mutex_unlock(&inode->log_mutex); if (!for_rename) return; /* * If this directory was already logged, any new names will be logged * with btrfs_log_new_name() and old names will be deleted from the log * tree with btrfs_del_dir_entries_in_log() or with * btrfs_del_inode_ref_in_log(). */ if (inode_logged(trans, dir, NULL) == 1) return; /* * If the inode we're about to unlink was logged before, the log will be * properly updated with the new name with btrfs_log_new_name() and the * old name removed with btrfs_del_dir_entries_in_log() or with * btrfs_del_inode_ref_in_log(). */ if (inode_logged(trans, inode, NULL) == 1) return; /* * when renaming files across directories, if the directory * there we're unlinking from gets fsync'd later on, there's * no way to find the destination directory later and fsync it * properly. So, we have to be conservative and force commits * so the new name gets discovered. */ mutex_lock(&dir->log_mutex); dir->last_unlink_trans = trans->transid; mutex_unlock(&dir->log_mutex); } /* * Make sure that if someone attempts to fsync the parent directory of a deleted * snapshot, it ends up triggering a transaction commit. This is to guarantee * that after replaying the log tree of the parent directory's root we will not * see the snapshot anymore and at log replay time we will not see any log tree * corresponding to the deleted snapshot's root, which could lead to replaying * it after replaying the log tree of the parent directory (which would replay * the snapshot delete operation). * * Must be called before the actual snapshot destroy operation (updates to the * parent root and tree of tree roots trees, etc) are done. */ void btrfs_record_snapshot_destroy(struct btrfs_trans_handle *trans, struct btrfs_inode *dir) { mutex_lock(&dir->log_mutex); dir->last_unlink_trans = trans->transid; mutex_unlock(&dir->log_mutex); } /* * Call this when creating a subvolume in a directory. * Because we don't commit a transaction when creating a subvolume, we can't * allow the directory pointing to the subvolume to be logged with an entry that * points to an unpersisted root if we are still in the transaction used to * create the subvolume, so make any attempt to log the directory to result in a * full log sync. * Also we don't need to worry with renames, since btrfs_rename() marks the log * for full commit when renaming a subvolume. * * Must be called before creating the subvolume entry in its parent directory. */ void btrfs_record_new_subvolume(const struct btrfs_trans_handle *trans, struct btrfs_inode *dir) { mutex_lock(&dir->log_mutex); dir->last_unlink_trans = trans->transid; mutex_unlock(&dir->log_mutex); } /* * Update the log after adding a new name for an inode. * * @trans: Transaction handle. * @old_dentry: The dentry associated with the old name and the old * parent directory. * @old_dir: The inode of the previous parent directory for the case * of a rename. For a link operation, it must be NULL. * @old_dir_index: The index number associated with the old name, meaningful * only for rename operations (when @old_dir is not NULL). * Ignored for link operations. * @parent: The dentry associated with the directory under which the * new name is located. * * Call this after adding a new name for an inode, as a result of a link or * rename operation, and it will properly update the log to reflect the new name. 
*/ void btrfs_log_new_name(struct btrfs_trans_handle *trans, struct dentry *old_dentry, struct btrfs_inode *old_dir, u64 old_dir_index, struct dentry *parent) { struct btrfs_inode *inode = BTRFS_I(d_inode(old_dentry)); struct btrfs_root *root = inode->root; struct btrfs_log_ctx ctx; bool log_pinned = false; int ret; /* The inode has a new name (ref/extref), so make sure we log it. */ set_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags); btrfs_init_log_ctx(&ctx, inode); ctx.logging_new_name = true; /* * this will force the logging code to walk the dentry chain * up for the file */ if (!S_ISDIR(inode->vfs_inode.i_mode)) inode->last_unlink_trans = trans->transid; /* * if this inode hasn't been logged and directory we're renaming it * from hasn't been logged, we don't need to log it */ ret = inode_logged(trans, inode, NULL); if (ret < 0) { goto out; } else if (ret == 0) { if (!old_dir) return; /* * If the inode was not logged and we are doing a rename (old_dir is not * NULL), check if old_dir was logged - if it was not we can return and * do nothing. */ ret = inode_logged(trans, old_dir, NULL); if (ret < 0) goto out; else if (ret == 0) return; } ret = 0; /* * Now that we know we need to update the log, allocate the scratch eb * for the context before joining a log transaction below, as this can * take time and therefore we could delay log commits from other tasks. */ btrfs_init_log_ctx_scratch_eb(&ctx); /* * If we are doing a rename (old_dir is not NULL) from a directory that * was previously logged, make sure that on log replay we get the old * dir entry deleted. This is needed because we will also log the new * name of the renamed inode, so we need to make sure that after log * replay we don't end up with both the new and old dir entries existing. */ if (old_dir && old_dir->logged_trans == trans->transid) { struct btrfs_root *log = old_dir->root->log_root; struct btrfs_path *path; struct fscrypt_name fname; ASSERT(old_dir_index >= BTRFS_DIR_START_INDEX, "old_dir_index=%llu", old_dir_index); ret = fscrypt_setup_filename(&old_dir->vfs_inode, &old_dentry->d_name, 0, &fname); if (ret) goto out; path = btrfs_alloc_path(); if (!path) { ret = -ENOMEM; fscrypt_free_filename(&fname); goto out; } /* * We have two inodes to update in the log, the old directory and * the inode that got renamed, so we must pin the log to prevent * anyone from syncing the log until we have updated both inodes * in the log. */ ret = join_running_log_trans(root); /* * At least one of the inodes was logged before, so this should * not fail, but if it does, it's not serious, just bail out and * mark the log for a full commit. */ if (WARN_ON_ONCE(ret < 0)) { btrfs_free_path(path); fscrypt_free_filename(&fname); goto out; } log_pinned = true; /* * Other concurrent task might be logging the old directory, * as it can be triggered when logging other inode that had or * still has a dentry in the old directory. We lock the old * directory's log_mutex to ensure the deletion of the old * name is persisted, because during directory logging we * delete all BTRFS_DIR_LOG_INDEX_KEY keys and the deletion of * the old name's dir index item is in the delayed items, so * it could be missed by an in progress directory logging. */ mutex_lock(&old_dir->log_mutex); ret = del_logged_dentry(trans, log, path, btrfs_ino(old_dir), &fname.disk_name, old_dir_index); if (ret > 0) { /* * The dentry does not exist in the log, so record its * deletion. 
*/ btrfs_release_path(path); ret = insert_dir_log_key(trans, log, path, btrfs_ino(old_dir), old_dir_index, old_dir_index); } mutex_unlock(&old_dir->log_mutex); btrfs_free_path(path); fscrypt_free_filename(&fname); if (ret < 0) goto out; } /* * We don't care about the return value. If we fail to log the new name * then we know the next attempt to sync the log will fallback to a full * transaction commit (due to a call to btrfs_set_log_full_commit()), so * we don't need to worry about getting a log committed that has an * inconsistent state after a rename operation. */ btrfs_log_inode_parent(trans, inode, parent, LOG_INODE_EXISTS, &ctx); ASSERT(list_empty(&ctx.conflict_inodes)); out: /* * If an error happened mark the log for a full commit because it's not * consistent and up to date or we couldn't find out if one of the * inodes was logged before in this transaction. Do it before unpinning * the log, to avoid any races with someone else trying to commit it. */ if (ret < 0) btrfs_set_log_full_commit(trans); if (log_pinned) btrfs_end_log_trans(root); free_extent_buffer(ctx.scratch_eb); } |
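The hard-link scenarios in the comment above btrfs_log_all_parents() can be reproduced from user space. Below is a minimal sketch of "Example 1" using plain POSIX calls; the paths and the lack of error handling are illustrative only, and the power-failure step obviously cannot be scripted.

/* Illustrative user-space repro of "Example 1" (hypothetical paths, no error checks). */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        int fd;

        mkdir("testdir", 0755);                                 /* mkdir testdir */
        fd = open("testdir/foo", O_CREAT | O_RDWR, 0644);       /* touch testdir/foo */
        if (fd < 0)
                return 1;
        link("testdir/foo", "testdir/bar");                     /* ln testdir/foo testdir/bar */
        sync();                                                 /* sync */
        unlink("testdir/bar");                                  /* unlink testdir/bar */
        fsync(fd);                                              /* xfs_io -c fsync testdir/foo */
        close(fd);
        /*
         * A power failure at this point, followed by a mount that triggers
         * log replay, is the case the parent-directory logging above guards
         * against.
         */
        return 0;
}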
// SPDX-License-Identifier: GPL-2.0
/*
 * Lockless hierarchical page accounting & limiting
 *
 * Copyright (C) 2014 Red Hat, Inc., Johannes Weiner
 */

#include <linux/page_counter.h>
#include <linux/atomic.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/sched.h>
#include <linux/bug.h>
#include <asm/page.h>

static bool track_protection(struct page_counter *c)
{
        return c->protection_support;
}

static void propagate_protected_usage(struct page_counter *c,
                                      unsigned long usage)
{
        unsigned long protected, old_protected;
        long delta;

        if (!c->parent)
                return;

        protected = min(usage, READ_ONCE(c->min));
        old_protected = atomic_long_read(&c->min_usage);
        if (protected != old_protected) {
                old_protected = atomic_long_xchg(&c->min_usage, protected);
                delta = protected - old_protected;
                if (delta)
                        atomic_long_add(delta, &c->parent->children_min_usage);
        }

        protected = min(usage, READ_ONCE(c->low));
        old_protected = atomic_long_read(&c->low_usage);
        if (protected != old_protected) {
                old_protected = atomic_long_xchg(&c->low_usage, protected);
                delta = protected - old_protected;
                if (delta)
                        atomic_long_add(delta, &c->parent->children_low_usage);
        }
}

/**
 * page_counter_cancel - take pages out of the local counter
 * @counter: counter
 * @nr_pages: number of pages to cancel
 */
void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
{
        long new;

        new = atomic_long_sub_return(nr_pages, &counter->usage);
        /* More uncharges than charges?
*/ if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n", new, nr_pages)) { new = 0; atomic_long_set(&counter->usage, new); } if (track_protection(counter)) propagate_protected_usage(counter, new); } /** * page_counter_charge - hierarchically charge pages * @counter: counter * @nr_pages: number of pages to charge * * NOTE: This does not consider any configured counter limits. */ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages) { struct page_counter *c; bool protection = track_protection(counter); for (c = counter; c; c = c->parent) { long new; new = atomic_long_add_return(nr_pages, &c->usage); if (protection) propagate_protected_usage(c, new); /* * This is indeed racy, but we can live with some * inaccuracy in the watermark. * * Notably, we have two watermarks to allow for both a globally * visible peak and one that can be reset at a smaller scope. * * Since we reset both watermarks when the global reset occurs, * we can guarantee that watermark >= local_watermark, so we * don't need to do both comparisons every time. * * On systems with branch predictors, the inner condition should * be almost free. */ if (new > READ_ONCE(c->local_watermark)) { WRITE_ONCE(c->local_watermark, new); if (new > READ_ONCE(c->watermark)) WRITE_ONCE(c->watermark, new); } } } /** * page_counter_try_charge - try to hierarchically charge pages * @counter: counter * @nr_pages: number of pages to charge * @fail: points first counter to hit its limit, if any * * Returns %true on success, or %false and @fail if the counter or one * of its ancestors has hit its configured limit. */ bool page_counter_try_charge(struct page_counter *counter, unsigned long nr_pages, struct page_counter **fail) { struct page_counter *c; bool protection = track_protection(counter); bool track_failcnt = counter->track_failcnt; for (c = counter; c; c = c->parent) { long new; /* * Charge speculatively to avoid an expensive CAS. If * a bigger charge fails, it might falsely lock out a * racing smaller charge and send it into reclaim * early, but the error is limited to the difference * between the two sizes, which is less than 2M/4M in * case of a THP locking out a regular page charge. * * The atomic_long_add_return() implies a full memory * barrier between incrementing the count and reading * the limit. When racing with page_counter_set_max(), * we either see the new limit or the setter sees the * counter has changed and retries. */ new = atomic_long_add_return(nr_pages, &c->usage); if (new > c->max) { atomic_long_sub(nr_pages, &c->usage); /* * This is racy, but we can live with some * inaccuracy in the failcnt which is only used * to report stats. 
*/ if (track_failcnt) data_race(c->failcnt++); *fail = c; goto failed; } if (protection) propagate_protected_usage(c, new); /* see comment on page_counter_charge */ if (new > READ_ONCE(c->local_watermark)) { WRITE_ONCE(c->local_watermark, new); if (new > READ_ONCE(c->watermark)) WRITE_ONCE(c->watermark, new); } } return true; failed: for (c = counter; c != *fail; c = c->parent) page_counter_cancel(c, nr_pages); return false; } /** * page_counter_uncharge - hierarchically uncharge pages * @counter: counter * @nr_pages: number of pages to uncharge */ void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages) { struct page_counter *c; for (c = counter; c; c = c->parent) page_counter_cancel(c, nr_pages); } /** * page_counter_set_max - set the maximum number of pages allowed * @counter: counter * @nr_pages: limit to set * * Returns 0 on success, -EBUSY if the current number of pages on the * counter already exceeds the specified limit. * * The caller must serialize invocations on the same counter. */ int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages) { for (;;) { unsigned long old; long usage; /* * Update the limit while making sure that it's not * below the concurrently-changing counter value. * * The xchg implies two full memory barriers before * and after, so the read-swap-read is ordered and * ensures coherency with page_counter_try_charge(): * that function modifies the count before checking * the limit, so if it sees the old limit, we see the * modified counter and retry. */ usage = page_counter_read(counter); if (usage > nr_pages) return -EBUSY; old = xchg(&counter->max, nr_pages); if (page_counter_read(counter) <= usage || nr_pages >= old) return 0; counter->max = old; cond_resched(); } } /** * page_counter_set_min - set the amount of protected memory * @counter: counter * @nr_pages: value to set * * The caller must serialize invocations on the same counter. */ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages) { struct page_counter *c; WRITE_ONCE(counter->min, nr_pages); for (c = counter; c; c = c->parent) propagate_protected_usage(c, atomic_long_read(&c->usage)); } /** * page_counter_set_low - set the amount of protected memory * @counter: counter * @nr_pages: value to set * * The caller must serialize invocations on the same counter. */ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages) { struct page_counter *c; WRITE_ONCE(counter->low, nr_pages); for (c = counter; c; c = c->parent) propagate_protected_usage(c, atomic_long_read(&c->usage)); } /** * page_counter_memparse - memparse() for page counter limits * @buf: string to parse * @max: string meaning maximum possible value * @nr_pages: returns the result in number of pages * * Returns -EINVAL, or 0 and @nr_pages on success. @nr_pages will be * limited to %PAGE_COUNTER_MAX. */ int page_counter_memparse(const char *buf, const char *max, unsigned long *nr_pages) { char *end; u64 bytes; if (!strcmp(buf, max)) { *nr_pages = PAGE_COUNTER_MAX; return 0; } bytes = memparse(buf, &end); if (*end != '\0') return -EINVAL; *nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX); return 0; } #if IS_ENABLED(CONFIG_MEMCG) || IS_ENABLED(CONFIG_CGROUP_DMEM) /* * This function calculates an individual page counter's effective * protection which is derived from its own memory.min/low, its * parent's and siblings' settings, as well as the actual memory * distribution in the tree. 
* * The following rules apply to the effective protection values: * * 1. At the first level of reclaim, effective protection is equal to * the declared protection in memory.min and memory.low. * * 2. To enable safe delegation of the protection configuration, at * subsequent levels the effective protection is capped to the * parent's effective protection. * * 3. To make complex and dynamic subtrees easier to configure, the * user is allowed to overcommit the declared protection at a given * level. If that is the case, the parent's effective protection is * distributed to the children in proportion to how much protection * they have declared and how much of it they are utilizing. * * This makes distribution proportional, but also work-conserving: * if one counter claims much more protection than it uses memory, * the unused remainder is available to its siblings. * * 4. Conversely, when the declared protection is undercommitted at a * given level, the distribution of the larger parental protection * budget is NOT proportional. A counter's protection from a sibling * is capped to its own memory.min/low setting. * * 5. However, to allow protecting recursive subtrees from each other * without having to declare each individual counter's fixed share * of the ancestor's claim to protection, any unutilized - * "floating" - protection from up the tree is distributed in * proportion to each counter's *usage*. This makes the protection * neutral wrt sibling cgroups and lets them compete freely over * the shared parental protection budget, but it protects the * subtree as a whole from neighboring subtrees. * * Note that 4. and 5. are not in conflict: 4. is about protecting * against immediate siblings whereas 5. is about protecting against * neighboring subtrees. */ static unsigned long effective_protection(unsigned long usage, unsigned long parent_usage, unsigned long setting, unsigned long parent_effective, unsigned long siblings_protected, bool recursive_protection) { unsigned long protected; unsigned long ep; protected = min(usage, setting); /* * If all cgroups at this level combined claim and use more * protection than what the parent affords them, distribute * shares in proportion to utilization. * * We are using actual utilization rather than the statically * claimed protection in order to be work-conserving: claimed * but unused protection is available to siblings that would * otherwise get a smaller chunk than what they claimed. */ if (siblings_protected > parent_effective) return protected * parent_effective / siblings_protected; /* * Ok, utilized protection of all children is within what the * parent affords them, so we know whatever this child claims * and utilizes is effectively protected. * * If there is unprotected usage beyond this value, reclaim * will apply pressure in proportion to that amount. * * If there is unutilized protection, the cgroup will be fully * shielded from reclaim, but we do return a smaller value for * protection than what the group could enjoy in theory. This * is okay. With the overcommit distribution above, effective * protection is always dependent on how memory is actually * consumed among the siblings anyway. */ ep = protected; /* * If the children aren't claiming (all of) the protection * afforded to them by the parent, distribute the remainder in * proportion to the (unprotected) memory of each cgroup. 
That * way, cgroups that aren't explicitly prioritized wrt each * other compete freely over the allowance, but they are * collectively protected from neighboring trees. * * We're using unprotected memory for the weight so that if * some cgroups DO claim explicit protection, we don't protect * the same bytes twice. * * Check both usage and parent_usage against the respective * protected values. One should imply the other, but they * aren't read atomically - make sure the division is sane. */ if (!recursive_protection) return ep; if (parent_effective > siblings_protected && parent_usage > siblings_protected && usage > protected) { unsigned long unclaimed; unclaimed = parent_effective - siblings_protected; unclaimed *= usage - protected; unclaimed /= parent_usage - siblings_protected; ep += unclaimed; } return ep; } /** * page_counter_calculate_protection - check if memory consumption is in the normal range * @root: the top ancestor of the sub-tree being checked * @counter: the page_counter the counter to update * @recursive_protection: Whether to use memory_recursiveprot behavior. * * Calculates elow/emin thresholds for given page_counter. * * WARNING: This function is not stateless! It can only be used as part * of a top-down tree iteration, not for isolated queries. */ void page_counter_calculate_protection(struct page_counter *root, struct page_counter *counter, bool recursive_protection) { unsigned long usage, parent_usage; struct page_counter *parent = counter->parent; /* * Effective values of the reclaim targets are ignored so they * can be stale. Have a look at mem_cgroup_protection for more * details. * TODO: calculation should be more robust so that we do not need * that special casing. */ if (root == counter) return; usage = page_counter_read(counter); if (!usage) return; if (parent == root) { counter->emin = READ_ONCE(counter->min); counter->elow = READ_ONCE(counter->low); return; } parent_usage = page_counter_read(parent); WRITE_ONCE(counter->emin, effective_protection(usage, parent_usage, READ_ONCE(counter->min), READ_ONCE(parent->emin), atomic_long_read(&parent->children_min_usage), recursive_protection)); WRITE_ONCE(counter->elow, effective_protection(usage, parent_usage, READ_ONCE(counter->low), READ_ONCE(parent->elow), atomic_long_read(&parent->children_low_usage), recursive_protection)); } #endif /* CONFIG_MEMCG || CONFIG_CGROUP_DMEM */ |
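As a concrete illustration of the overcommit rule (rule 3 in the comment above effective_protection()), here is a worked example with made-up numbers: if a parent affords 100 pages of effective protection and its two children each claim and use 80 pages, the children collectively overcommit (160 > 100) and each ends up with 80 * 100 / 160 = 50 effectively protected pages. The sketch below only evaluates that expression; it is not part of the kernel code.

#include <stdio.h>

int main(void)
{
        unsigned long parent_effective = 100;   /* parent's effective protection (pages) */
        unsigned long usage = 80, setting = 80; /* this child uses 80 and claims 80 */
        unsigned long siblings_protected = 160; /* min(usage, protection) summed over children */
        unsigned long protected = usage < setting ? usage : setting;

        /* Children overcommit the parent (160 > 100), so shares are scaled down. */
        printf("effective protection = %lu pages\n",
               protected * parent_effective / siblings_protected);     /* prints 50 */
        return 0;
}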
/* SPDX-License-Identifier: GPL-2.0+ */
/*
 * NILFS segment usage file.
 *
 * Copyright (C) 2006-2008 Nippon Telegraph and Telephone Corporation.
 *
 * Written by Koji Sato.
 */

#ifndef _NILFS_SUFILE_H
#define _NILFS_SUFILE_H

#include <linux/fs.h>
#include <linux/buffer_head.h>
#include "mdt.h"

static inline unsigned long nilfs_sufile_get_nsegments(struct inode *sufile)
{
        return ((struct the_nilfs *)sufile->i_sb->s_fs_info)->ns_nsegments;
}

unsigned long nilfs_sufile_get_ncleansegs(struct inode *sufile);

int nilfs_sufile_set_alloc_range(struct inode *sufile, __u64 start, __u64 end);
int nilfs_sufile_alloc(struct inode *, __u64 *);
int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum);
int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
                                   unsigned long nblocks, time64_t modtime);
int nilfs_sufile_get_stat(struct inode *, struct nilfs_sustat *);
ssize_t nilfs_sufile_get_suinfo(struct inode *, __u64, void *, unsigned int,
                                size_t);
ssize_t nilfs_sufile_set_suinfo(struct inode *, void *, unsigned int, size_t);
int nilfs_sufile_updatev(struct inode *, __u64 *, size_t, int, size_t *,
                         void (*dofunc)(struct inode *, __u64,
                                        struct buffer_head *,
                                        struct buffer_head *));
int nilfs_sufile_update(struct inode *, __u64, int,
                        void (*dofunc)(struct inode *, __u64,
                                       struct buffer_head *,
                                       struct buffer_head *));
void nilfs_sufile_do_scrap(struct inode *, __u64, struct buffer_head *,
                           struct buffer_head *);
void nilfs_sufile_do_free(struct inode *, __u64, struct buffer_head *,
                          struct buffer_head *);
void nilfs_sufile_do_cancel_free(struct inode *, __u64, struct buffer_head *,
                                 struct buffer_head *);
void nilfs_sufile_do_set_error(struct inode *, __u64, struct buffer_head *,
                               struct buffer_head *);

int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
int nilfs_sufile_read(struct super_block *sb, size_t susize,
                      struct nilfs_inode *raw_inode, struct inode **inodep);
int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);

/**
 * nilfs_sufile_scrap - make a segment garbage
 * @sufile: inode of segment usage file
 * @segnum: segment number to be freed
 *
 * Return: 0 on success, or a negative error code on failure.
 */
static inline int nilfs_sufile_scrap(struct inode *sufile, __u64 segnum)
{
        return nilfs_sufile_update(sufile, segnum, 1, nilfs_sufile_do_scrap);
}

/**
 * nilfs_sufile_free - free segment
 * @sufile: inode of segment usage file
 * @segnum: segment number to be freed
 *
 * Return: 0 on success, or a negative error code on failure.
 */
static inline int nilfs_sufile_free(struct inode *sufile, __u64 segnum)
{
        return nilfs_sufile_update(sufile, segnum, 0, nilfs_sufile_do_free);
}

/**
 * nilfs_sufile_freev - free segments
 * @sufile: inode of segment usage file
 * @segnumv: array of segment numbers
 * @nsegs: size of @segnumv array
 * @ndone: place to store the number of freed segments
 *
 * Return: 0 on success, or a negative error code on failure.
 */
static inline int nilfs_sufile_freev(struct inode *sufile, __u64 *segnumv,
                                     size_t nsegs, size_t *ndone)
{
        return nilfs_sufile_updatev(sufile, segnumv, nsegs, 0, ndone,
                                    nilfs_sufile_do_free);
}

/**
 * nilfs_sufile_cancel_freev - reallocate freeing segments
 * @sufile: inode of segment usage file
 * @segnumv: array of segment numbers
 * @nsegs: size of @segnumv array
 * @ndone: place to store the number of cancelled segments
 *
 * Return: 0 on success, or a negative error code on failure.
 */
static inline int nilfs_sufile_cancel_freev(struct inode *sufile,
                                            __u64 *segnumv, size_t nsegs,
                                            size_t *ndone)
{
        return nilfs_sufile_updatev(sufile, segnumv, nsegs, 0, ndone,
                                    nilfs_sufile_do_cancel_free);
}

/**
 * nilfs_sufile_set_error - mark a segment as erroneous
 * @sufile: inode of segment usage file
 * @segnum: segment number
 *
 * Description: nilfs_sufile_set_error() marks the segment specified by
 * @segnum as erroneous. The error segment will never be used again.
 *
 * Return: 0 on success, or one of the following negative error codes on
 * failure:
 * * %-EINVAL	- Invalid segment usage number.
 * * %-EIO	- I/O error (including metadata corruption).
 * * %-ENOMEM	- Insufficient memory available.
 */
static inline int nilfs_sufile_set_error(struct inode *sufile, __u64 segnum)
{
        return nilfs_sufile_update(sufile, segnum, 0,
                                   nilfs_sufile_do_set_error);
}

#endif	/* _NILFS_SUFILE_H */
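A hypothetical caller of the batch helpers above might look like the sketch below; example_free_segments() is not part of NILFS, it only shows how @ndone reports partial progress when nilfs_sufile_freev() fails part way through (kernel context and the usual printk/sufile headers are assumed).

static int example_free_segments(struct inode *sufile, __u64 *segnumv,
                                 size_t nsegs)
{
        size_t ndone = 0;
        int err;

        /* Try to free the whole batch; ndone counts segments actually done. */
        err = nilfs_sufile_freev(sufile, segnumv, nsegs, &ndone);
        if (err)
                pr_warn("NILFS: freed only %zu of %zu segments (err=%d)\n",
                        ndone, nsegs, err);
        return err;
}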
// SPDX-License-Identifier: GPL-2.0
#include <linux/ptrace.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/export.h>
#include <asm/syscall.h>

static int collect_syscall(struct task_struct *target, struct syscall_info *info)
{
        unsigned long args[6] = { };
        struct pt_regs *regs;

        if (!try_get_task_stack(target)) {
                /* Task has no stack, so the task isn't in a syscall. */
                memset(info, 0, sizeof(*info));
                info->data.nr = -1;
                return 0;
        }

        regs = task_pt_regs(target);
        if (unlikely(!regs)) {
                put_task_stack(target);
                return -EAGAIN;
        }

        info->sp = user_stack_pointer(regs);
        info->data.instruction_pointer = instruction_pointer(regs);

        info->data.nr = syscall_get_nr(target, regs);
        if (info->data.nr != -1L)
                syscall_get_arguments(target, regs, args);

        info->data.args[0] = args[0];
        info->data.args[1] = args[1];
        info->data.args[2] = args[2];
        info->data.args[3] = args[3];
        info->data.args[4] = args[4];
        info->data.args[5] = args[5];

        put_task_stack(target);
        return 0;
}

/**
 * task_current_syscall - Discover what a blocked task is doing.
 * @target:	thread to examine
 * @info:	structure with the following fields:
 *		 .sp        - filled with user stack pointer
 *		 .data.nr   - filled with system call number or -1
 *		 .data.args - filled with @maxargs system call arguments
 *		 .data.instruction_pointer - filled with user PC
 *
 * If @target is blocked in a system call, returns zero with @info.data.nr
 * set to the call's number and @info.data.args filled in with its
 * arguments. Registers not used for system call arguments may not be available
 * and it is not kosher to use &struct user_regset calls while the system
 * call is still in progress. Note we may get this result if @target
 * has finished its system call but not yet returned to user mode, such
 * as when it's stopped for signal handling or syscall exit tracing.
 *
 * If @target is blocked in the kernel during a fault or exception,
 * returns zero with *@info.data.nr set to -1 and does not fill in
 * @info.data.args. If so, it's now safe to examine @target using
 * &struct user_regset get() calls as long as we're sure @target won't return
 * to user mode.
 *
 * Returns -%EAGAIN if @target does not remain blocked.
 */
int task_current_syscall(struct task_struct *target, struct syscall_info *info)
{
        unsigned long ncsw;
        unsigned int state;

        if (target == current)
                return collect_syscall(target, info);

        state = READ_ONCE(target->__state);
        if (unlikely(!state))
                return -EAGAIN;

        ncsw = wait_task_inactive(target, state);
        if (unlikely(!ncsw) ||
            unlikely(collect_syscall(target, info)) ||
            unlikely(wait_task_inactive(target, state) != ncsw))
                return -EAGAIN;

        return 0;
}
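For illustration, a hypothetical in-kernel user of task_current_syscall() could report what a blocked task is doing as sketched below; example_report_syscall() is not an existing function, and it assumes the caller already holds a reference on the task.

static void example_report_syscall(struct task_struct *task)
{
        struct syscall_info info;

        if (task_current_syscall(task, &info)) {
                pr_info("%s[%d] did not stay blocked\n", task->comm, task->pid);
                return;
        }
        if (info.data.nr == -1)
                pr_info("%s[%d] is blocked in a fault or exception\n",
                        task->comm, task->pid);
        else
                pr_info("%s[%d] is blocked in syscall %d (arg0=0x%llx)\n",
                        task->comm, task->pid, info.data.nr,
                        (unsigned long long)info.data.args[0]);
}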
// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (c) 2000-2006 Silicon Graphics, Inc.
 * Copyright (c) 2016-2018 Christoph Hellwig.
 * All Rights Reserved.
 */
#include "xfs.h"
#include "xfs_fs.h"
#include "xfs_shared.h"
#include "xfs_format.h"
#include "xfs_log_format.h"
#include "xfs_trans_resv.h"
#include "xfs_mount.h"
#include "xfs_inode.h"
#include "xfs_btree.h"
#include "xfs_bmap_btree.h"
#include "xfs_bmap.h"
#include "xfs_bmap_util.h"
#include "xfs_errortag.h"
#include "xfs_error.h"
#include "xfs_trans.h"
#include "xfs_trans_space.h"
#include "xfs_inode_item.h"
#include "xfs_iomap.h"
#include "xfs_trace.h"
#include "xfs_quota.h"
#include "xfs_rtgroup.h"
#include "xfs_dquot_item.h"
#include "xfs_dquot.h"
#include "xfs_reflink.h"
#include "xfs_health.h"
#include "xfs_rtbitmap.h"
#include "xfs_icache.h"
#include "xfs_zone_alloc.h"

#define XFS_ALLOC_ALIGN(mp, off) \
        (((off) >> mp->m_allocsize_log) << mp->m_allocsize_log)

static int
xfs_alert_fsblock_zero(
        xfs_inode_t     *ip,
        xfs_bmbt_irec_t *imap)
{
        xfs_alert_tag(ip->i_mount, XFS_PTAG_FSBLOCK_ZERO,
                        "Access to block zero in inode %llu "
                        "start_block: %llx start_off: %llx "
                        "blkcnt: %llx extent-state: %x",
                (unsigned long long)ip->i_ino,
                (unsigned long long)imap->br_startblock,
                (unsigned long long)imap->br_startoff,
                (unsigned long long)imap->br_blockcount,
                imap->br_state);
        xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
        return -EFSCORRUPTED;
}

u64
xfs_iomap_inode_sequence(
        struct xfs_inode        *ip,
        u16                     iomap_flags)
{
        u64                     cookie = 0;

        if (iomap_flags & IOMAP_F_XATTR)
                return READ_ONCE(ip->i_af.if_seq);
        if ((iomap_flags & IOMAP_F_SHARED) && ip->i_cowfp)
                cookie = (u64)READ_ONCE(ip->i_cowfp->if_seq) << 32;
        return cookie | READ_ONCE(ip->i_df.if_seq);
}

/*
 * Check that the iomap passed to us is still valid for the given offset and
 * length.
*/ static bool xfs_iomap_valid( struct inode *inode, const struct iomap *iomap) { struct xfs_inode *ip = XFS_I(inode); if (iomap->type == IOMAP_HOLE) return true; if (iomap->validity_cookie != xfs_iomap_inode_sequence(ip, iomap->flags)) { trace_xfs_iomap_invalid(ip, iomap); return false; } XFS_ERRORTAG_DELAY(ip->i_mount, XFS_ERRTAG_WRITE_DELAY_MS); return true; } const struct iomap_write_ops xfs_iomap_write_ops = { .iomap_valid = xfs_iomap_valid, }; int xfs_bmbt_to_iomap( struct xfs_inode *ip, struct iomap *iomap, struct xfs_bmbt_irec *imap, unsigned int mapping_flags, u16 iomap_flags, u64 sequence_cookie) { struct xfs_mount *mp = ip->i_mount; struct xfs_buftarg *target = xfs_inode_buftarg(ip); if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock))) { xfs_bmap_mark_sick(ip, XFS_DATA_FORK); return xfs_alert_fsblock_zero(ip, imap); } if (imap->br_startblock == HOLESTARTBLOCK) { iomap->addr = IOMAP_NULL_ADDR; iomap->type = IOMAP_HOLE; } else if (imap->br_startblock == DELAYSTARTBLOCK || isnullstartblock(imap->br_startblock)) { iomap->addr = IOMAP_NULL_ADDR; iomap->type = IOMAP_DELALLOC; } else { xfs_daddr_t daddr = xfs_fsb_to_db(ip, imap->br_startblock); iomap->addr = BBTOB(daddr); if (mapping_flags & IOMAP_DAX) iomap->addr += target->bt_dax_part_off; if (imap->br_state == XFS_EXT_UNWRITTEN) iomap->type = IOMAP_UNWRITTEN; else iomap->type = IOMAP_MAPPED; /* * Mark iomaps starting at the first sector of a RTG as merge * boundary so that each I/O completions is contained to a * single RTG. */ if (XFS_IS_REALTIME_INODE(ip) && xfs_has_rtgroups(mp) && xfs_rtbno_is_group_start(mp, imap->br_startblock)) iomap->flags |= IOMAP_F_BOUNDARY; } iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff); iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount); if (mapping_flags & IOMAP_DAX) iomap->dax_dev = target->bt_daxdev; else iomap->bdev = target->bt_bdev; iomap->flags = iomap_flags; /* * If the inode is dirty for datasync purposes, let iomap know so it * doesn't elide the IO completion journal flushes on O_DSYNC IO. */ if (ip->i_itemp) { struct xfs_inode_log_item *iip = ip->i_itemp; spin_lock(&iip->ili_lock); if (iip->ili_datasync_seq) iomap->flags |= IOMAP_F_DIRTY; spin_unlock(&iip->ili_lock); } iomap->validity_cookie = sequence_cookie; return 0; } static void xfs_hole_to_iomap( struct xfs_inode *ip, struct iomap *iomap, xfs_fileoff_t offset_fsb, xfs_fileoff_t end_fsb) { struct xfs_buftarg *target = xfs_inode_buftarg(ip); iomap->addr = IOMAP_NULL_ADDR; iomap->type = IOMAP_HOLE; iomap->offset = XFS_FSB_TO_B(ip->i_mount, offset_fsb); iomap->length = XFS_FSB_TO_B(ip->i_mount, end_fsb - offset_fsb); iomap->bdev = target->bt_bdev; iomap->dax_dev = target->bt_daxdev; } static inline xfs_fileoff_t xfs_iomap_end_fsb( struct xfs_mount *mp, loff_t offset, loff_t count) { ASSERT(offset <= mp->m_super->s_maxbytes); return min(XFS_B_TO_FSB(mp, offset + count), XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes)); } static xfs_extlen_t xfs_eof_alignment( struct xfs_inode *ip) { struct xfs_mount *mp = ip->i_mount; xfs_extlen_t align = 0; if (!XFS_IS_REALTIME_INODE(ip)) { /* * Round up the allocation request to a stripe unit * (m_dalign) boundary if the file size is >= stripe unit * size, and we are allocating past the allocation eof. * * If mounted with the "-o swalloc" option the alignment is * increased from the strip unit size to the stripe width. 
*/ if (mp->m_swidth && xfs_has_swalloc(mp)) align = mp->m_swidth; else if (mp->m_dalign) align = mp->m_dalign; if (align && XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, align)) align = 0; } return align; } /* * Check if last_fsb is outside the last extent, and if so grow it to the next * stripe unit boundary. */ xfs_fileoff_t xfs_iomap_eof_align_last_fsb( struct xfs_inode *ip, xfs_fileoff_t end_fsb) { struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK); xfs_extlen_t extsz = xfs_get_extsz_hint(ip); xfs_extlen_t align = xfs_eof_alignment(ip); struct xfs_bmbt_irec irec; struct xfs_iext_cursor icur; ASSERT(!xfs_need_iread_extents(ifp)); /* * Always round up the allocation request to the extent hint boundary. */ if (extsz) { if (align) align = roundup_64(align, extsz); else align = extsz; } if (align) { xfs_fileoff_t aligned_end_fsb = roundup_64(end_fsb, align); xfs_iext_last(ifp, &icur); if (!xfs_iext_get_extent(ifp, &icur, &irec) || aligned_end_fsb >= irec.br_startoff + irec.br_blockcount) return aligned_end_fsb; } return end_fsb; } int xfs_iomap_write_direct( struct xfs_inode *ip, xfs_fileoff_t offset_fsb, xfs_fileoff_t count_fsb, unsigned int flags, struct xfs_bmbt_irec *imap, u64 *seq) { struct xfs_mount *mp = ip->i_mount; struct xfs_trans *tp; xfs_filblks_t resaligned; int nimaps; unsigned int dblocks, rblocks; bool force = false; int error; int bmapi_flags = XFS_BMAPI_PREALLOC; int nr_exts = XFS_IEXT_ADD_NOSPLIT_CNT; ASSERT(count_fsb > 0); resaligned = xfs_aligned_fsb_count(offset_fsb, count_fsb, xfs_get_extsz_hint(ip)); if (unlikely(XFS_IS_REALTIME_INODE(ip))) { dblocks = XFS_DIOSTRAT_SPACE_RES(mp, 0); rblocks = resaligned; } else { dblocks = XFS_DIOSTRAT_SPACE_RES(mp, resaligned); rblocks = 0; } error = xfs_qm_dqattach(ip); if (error) return error; /* * For DAX, we do not allocate unwritten extents, but instead we zero * the block before we commit the transaction. Ideally we'd like to do * this outside the transaction context, but if we commit and then crash * we may not have zeroed the blocks and this will be exposed on * recovery of the allocation. Hence we must zero before commit. * * Further, if we are mapping unwritten extents here, we need to zero * and convert them to written so that we don't need an unwritten extent * callback for DAX. This also means that we need to be able to dip into * the reserve block pool for bmbt block allocation if there is no space * left but we need to do unwritten extent conversion. */ if (flags & IOMAP_DAX) { bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO; if (imap->br_state == XFS_EXT_UNWRITTEN) { force = true; nr_exts = XFS_IEXT_WRITE_UNWRITTEN_CNT; dblocks = XFS_DIOSTRAT_SPACE_RES(mp, 0) << 1; } } error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks, rblocks, force, &tp); if (error) return error; error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK, nr_exts); if (error) goto out_trans_cancel; /* * From this point onwards we overwrite the imap pointer that the * caller gave to us. 
*/ nimaps = 1; error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb, bmapi_flags, 0, imap, &nimaps); if (error) goto out_trans_cancel; /* * Complete the transaction */ error = xfs_trans_commit(tp); if (error) goto out_unlock; if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock))) { xfs_bmap_mark_sick(ip, XFS_DATA_FORK); error = xfs_alert_fsblock_zero(ip, imap); } out_unlock: *seq = xfs_iomap_inode_sequence(ip, 0); xfs_iunlock(ip, XFS_ILOCK_EXCL); return error; out_trans_cancel: xfs_trans_cancel(tp); goto out_unlock; } STATIC bool xfs_quota_need_throttle( struct xfs_inode *ip, xfs_dqtype_t type, xfs_fsblock_t alloc_blocks) { struct xfs_dquot *dq = xfs_inode_dquot(ip, type); struct xfs_dquot_res *res; struct xfs_dquot_pre *pre; if (!dq || !xfs_this_quota_on(ip->i_mount, type)) return false; if (XFS_IS_REALTIME_INODE(ip)) { res = &dq->q_rtb; pre = &dq->q_rtb_prealloc; } else { res = &dq->q_blk; pre = &dq->q_blk_prealloc; } /* no hi watermark, no throttle */ if (!pre->q_prealloc_hi_wmark) return false; /* under the lo watermark, no throttle */ if (res->reserved + alloc_blocks < pre->q_prealloc_lo_wmark) return false; return true; } STATIC void xfs_quota_calc_throttle( struct xfs_inode *ip, xfs_dqtype_t type, xfs_fsblock_t *qblocks, int *qshift, int64_t *qfreesp) { struct xfs_dquot *dq = xfs_inode_dquot(ip, type); struct xfs_dquot_res *res; struct xfs_dquot_pre *pre; int64_t freesp; int shift = 0; if (!dq) { res = NULL; pre = NULL; } else if (XFS_IS_REALTIME_INODE(ip)) { res = &dq->q_rtb; pre = &dq->q_rtb_prealloc; } else { res = &dq->q_blk; pre = &dq->q_blk_prealloc; } /* no dq, or over hi wmark, squash the prealloc completely */ if (!res || res->reserved >= pre->q_prealloc_hi_wmark) { *qblocks = 0; *qfreesp = 0; return; } freesp = pre->q_prealloc_hi_wmark - res->reserved; if (freesp < pre->q_low_space[XFS_QLOWSP_5_PCNT]) { shift = 2; if (freesp < pre->q_low_space[XFS_QLOWSP_3_PCNT]) shift += 2; if (freesp < pre->q_low_space[XFS_QLOWSP_1_PCNT]) shift += 2; } if (freesp < *qfreesp) *qfreesp = freesp; /* only overwrite the throttle values if we are more aggressive */ if ((freesp >> shift) < (*qblocks >> *qshift)) { *qblocks = freesp; *qshift = shift; } } static int64_t xfs_iomap_freesp( struct xfs_mount *mp, unsigned int idx, uint64_t low_space[XFS_LOWSP_MAX], int *shift) { int64_t freesp; freesp = xfs_estimate_freecounter(mp, idx); if (freesp < low_space[XFS_LOWSP_5_PCNT]) { *shift = 2; if (freesp < low_space[XFS_LOWSP_4_PCNT]) (*shift)++; if (freesp < low_space[XFS_LOWSP_3_PCNT]) (*shift)++; if (freesp < low_space[XFS_LOWSP_2_PCNT]) (*shift)++; if (freesp < low_space[XFS_LOWSP_1_PCNT]) (*shift)++; } return freesp; } /* * If we don't have a user specified preallocation size, dynamically increase * the preallocation size as the size of the file grows. Cap the maximum size * at a single extent or less if the filesystem is near full. The closer the * filesystem is to being full, the smaller the maximum preallocation. 
*/ STATIC xfs_fsblock_t xfs_iomap_prealloc_size( struct xfs_inode *ip, int whichfork, loff_t offset, loff_t count, struct xfs_iext_cursor *icur) { struct xfs_iext_cursor ncur = *icur; struct xfs_bmbt_irec prev, got; struct xfs_mount *mp = ip->i_mount; struct xfs_ifork *ifp = xfs_ifork_ptr(ip, whichfork); xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); int64_t freesp; xfs_fsblock_t qblocks; xfs_fsblock_t alloc_blocks = 0; xfs_extlen_t plen; int shift = 0; int qshift = 0; /* * As an exception we don't do any preallocation at all if the file is * smaller than the minimum preallocation and we are using the default * dynamic preallocation scheme, as it is likely this is the only write * to the file that is going to be done. */ if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_allocsize_blocks)) return 0; /* * Use the minimum preallocation size for small files or if we are * writing right after a hole. */ if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) || !xfs_iext_prev_extent(ifp, &ncur, &prev) || prev.br_startoff + prev.br_blockcount < offset_fsb) return mp->m_allocsize_blocks; /* * Take the size of the preceding data extents as the basis for the * preallocation size. Note that we don't care if the previous extents * are written or not. */ plen = prev.br_blockcount; while (xfs_iext_prev_extent(ifp, &ncur, &got)) { if (plen > XFS_MAX_BMBT_EXTLEN / 2 || isnullstartblock(got.br_startblock) || got.br_startoff + got.br_blockcount != prev.br_startoff || got.br_startblock + got.br_blockcount != prev.br_startblock) break; plen += got.br_blockcount; prev = got; } /* * If the size of the extents is greater than half the maximum extent * length, then use the current offset as the basis. This ensures that * for large files the preallocation size always extends to * XFS_BMBT_MAX_EXTLEN rather than falling short due to things like stripe * unit/width alignment of real extents. */ alloc_blocks = plen * 2; if (alloc_blocks > XFS_MAX_BMBT_EXTLEN) alloc_blocks = XFS_B_TO_FSB(mp, offset); qblocks = alloc_blocks; /* * XFS_BMBT_MAX_EXTLEN is not a power of two value but we round the prealloc * down to the nearest power of two value after throttling. To prevent * the round down from unconditionally reducing the maximum supported * prealloc size, we round up first, apply appropriate throttling, round * down and cap the value to XFS_BMBT_MAX_EXTLEN. */ alloc_blocks = XFS_FILEOFF_MIN(roundup_pow_of_two(XFS_MAX_BMBT_EXTLEN), alloc_blocks); if (unlikely(XFS_IS_REALTIME_INODE(ip))) freesp = xfs_rtbxlen_to_blen(mp, xfs_iomap_freesp(mp, XC_FREE_RTEXTENTS, mp->m_low_rtexts, &shift)); else freesp = xfs_iomap_freesp(mp, XC_FREE_BLOCKS, mp->m_low_space, &shift); /* * Check each quota to cap the prealloc size, provide a shift value to * throttle with and adjust amount of available space. */ if (xfs_quota_need_throttle(ip, XFS_DQTYPE_USER, alloc_blocks)) xfs_quota_calc_throttle(ip, XFS_DQTYPE_USER, &qblocks, &qshift, &freesp); if (xfs_quota_need_throttle(ip, XFS_DQTYPE_GROUP, alloc_blocks)) xfs_quota_calc_throttle(ip, XFS_DQTYPE_GROUP, &qblocks, &qshift, &freesp); if (xfs_quota_need_throttle(ip, XFS_DQTYPE_PROJ, alloc_blocks)) xfs_quota_calc_throttle(ip, XFS_DQTYPE_PROJ, &qblocks, &qshift, &freesp); /* * The final prealloc size is set to the minimum of free space available * in each of the quotas and the overall filesystem. * * The shift throttle value is set to the maximum value as determined by * the global low free space values and per-quota low free space values. 
*/ alloc_blocks = min(alloc_blocks, qblocks); shift = max(shift, qshift); if (shift) alloc_blocks >>= shift; /* * rounddown_pow_of_two() returns an undefined result if we pass in * alloc_blocks = 0. */ if (alloc_blocks) alloc_blocks = rounddown_pow_of_two(alloc_blocks); if (alloc_blocks > XFS_MAX_BMBT_EXTLEN) alloc_blocks = XFS_MAX_BMBT_EXTLEN; /* * If we are still trying to allocate more space than is * available, squash the prealloc hard. This can happen if we * have a large file on a small filesystem and the above * lowspace thresholds are smaller than XFS_BMBT_MAX_EXTLEN. */ while (alloc_blocks && alloc_blocks >= freesp) alloc_blocks >>= 4; if (alloc_blocks < mp->m_allocsize_blocks) alloc_blocks = mp->m_allocsize_blocks; trace_xfs_iomap_prealloc_size(ip, alloc_blocks, shift, mp->m_allocsize_blocks); return alloc_blocks; } int xfs_iomap_write_unwritten( xfs_inode_t *ip, xfs_off_t offset, xfs_off_t count, bool update_isize) { xfs_mount_t *mp = ip->i_mount; xfs_fileoff_t offset_fsb; xfs_filblks_t count_fsb; xfs_filblks_t numblks_fsb; int nimaps; xfs_trans_t *tp; xfs_bmbt_irec_t imap; struct inode *inode = VFS_I(ip); xfs_fsize_t i_size; uint resblks; int error; trace_xfs_unwritten_convert(ip, offset, count); offset_fsb = XFS_B_TO_FSBT(mp, offset); count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count); count_fsb = (xfs_filblks_t)(count_fsb - offset_fsb); /* * Reserve enough blocks in this transaction for two complete extent * btree splits. We may be converting the middle part of an unwritten * extent and in this case we will insert two new extents in the btree * each of which could cause a full split. * * This reservation amount will be used in the first call to * xfs_bmbt_split() to select an AG with enough space to satisfy the * rest of the operation. */ resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0) << 1; /* Attach dquots so that bmbt splits are accounted correctly. */ error = xfs_qm_dqattach(ip); if (error) return error; do { /* * Set up a transaction to convert the range of extents * from unwritten to real. Do allocations in a loop until * we have covered the range passed in. * * Note that we can't risk to recursing back into the filesystem * here as we might be asked to write out the same inode that we * complete here and might deadlock on the iolock. */ error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, resblks, 0, true, &tp); if (error) return error; error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK, XFS_IEXT_WRITE_UNWRITTEN_CNT); if (error) goto error_on_bmapi_transaction; /* * Modify the unwritten extent state of the buffer. */ nimaps = 1; error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb, XFS_BMAPI_CONVERT, resblks, &imap, &nimaps); if (error) goto error_on_bmapi_transaction; /* * Log the updated inode size as we go. We have to be careful * to only log it up to the actual write offset if it is * halfway into a block. 
		 */
		i_size = XFS_FSB_TO_B(mp, offset_fsb + count_fsb);
		if (i_size > offset + count)
			i_size = offset + count;
		if (update_isize && i_size > i_size_read(inode))
			i_size_write(inode, i_size);
		i_size = xfs_new_eof(ip, i_size);
		if (i_size) {
			ip->i_disk_size = i_size;
			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
		}

		error = xfs_trans_commit(tp);
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		if (error)
			return error;

		if (unlikely(!xfs_valid_startblock(ip, imap.br_startblock))) {
			xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
			return xfs_alert_fsblock_zero(ip, &imap);
		}

		if ((numblks_fsb = imap.br_blockcount) == 0) {
			/*
			 * The numblks_fsb value should always get
			 * smaller, otherwise the loop is stuck.
			 */
			ASSERT(imap.br_blockcount);
			break;
		}
		offset_fsb += numblks_fsb;
		count_fsb -= numblks_fsb;
	} while (count_fsb > 0);

	return 0;

error_on_bmapi_transaction:
	xfs_trans_cancel(tp);
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
	return error;
}

static inline bool
imap_needs_alloc(
	struct inode		*inode,
	unsigned		flags,
	struct xfs_bmbt_irec	*imap,
	int			nimaps)
{
	/* don't allocate blocks when just zeroing */
	if (flags & IOMAP_ZERO)
		return false;
	if (!nimaps ||
	    imap->br_startblock == HOLESTARTBLOCK ||
	    imap->br_startblock == DELAYSTARTBLOCK)
		return true;
	/* we convert unwritten extents before copying the data for DAX */
	if ((flags & IOMAP_DAX) && imap->br_state == XFS_EXT_UNWRITTEN)
		return true;
	return false;
}

static inline bool
imap_needs_cow(
	struct xfs_inode	*ip,
	unsigned int		flags,
	struct xfs_bmbt_irec	*imap,
	int			nimaps)
{
	if (!xfs_is_cow_inode(ip))
		return false;

	/* when zeroing we don't have to COW holes or unwritten extents */
	if (flags & (IOMAP_UNSHARE | IOMAP_ZERO)) {
		if (!nimaps ||
		    imap->br_startblock == HOLESTARTBLOCK ||
		    imap->br_state == XFS_EXT_UNWRITTEN)
			return false;
	}

	return true;
}

/*
 * Extents not yet cached require exclusive access, don't block for
 * IOMAP_NOWAIT.
 *
 * This is basically an open-coded xfs_ilock_data_map_shared() call, but with
 * support for IOMAP_NOWAIT.
 */
static int
xfs_ilock_for_iomap(
	struct xfs_inode	*ip,
	unsigned		flags,
	unsigned		*lockmode)
{
	if (flags & IOMAP_NOWAIT) {
		if (xfs_need_iread_extents(&ip->i_df))
			return -EAGAIN;
		if (!xfs_ilock_nowait(ip, *lockmode))
			return -EAGAIN;
	} else {
		if (xfs_need_iread_extents(&ip->i_df))
			*lockmode = XFS_ILOCK_EXCL;
		xfs_ilock(ip, *lockmode);
	}

	return 0;
}

/*
 * Check that the imap we are going to return to the caller spans the entire
 * range that the caller requested for the IO.
 */
static bool
imap_spans_range(
	struct xfs_bmbt_irec	*imap,
	xfs_fileoff_t		offset_fsb,
	xfs_fileoff_t		end_fsb)
{
	if (imap->br_startoff > offset_fsb)
		return false;
	if (imap->br_startoff + imap->br_blockcount < end_fsb)
		return false;
	return true;
}

static bool
xfs_bmap_hw_atomic_write_possible(
	struct xfs_inode	*ip,
	struct xfs_bmbt_irec	*imap,
	xfs_fileoff_t		offset_fsb,
	xfs_fileoff_t		end_fsb)
{
	struct xfs_mount	*mp = ip->i_mount;
	xfs_fsize_t		len = XFS_FSB_TO_B(mp, end_fsb - offset_fsb);

	/*
	 * Atomic writes are required to be naturally aligned for disk blocks,
	 * which ensures that we adhere to block layer rules that we won't
	 * straddle any boundary or violate write alignment requirements.
	 */
	if (!IS_ALIGNED(imap->br_startblock, imap->br_blockcount))
		return false;

	/*
	 * Spanning multiple extents would mean that multiple BIOs would be
	 * issued, and so would lose the atomicity required for
	 * REQ_ATOMIC-based writes.
	 */
	if (!imap_spans_range(imap, offset_fsb, end_fsb))
		return false;

	/*
	 * The ->iomap_begin caller should ensure this, but check anyway.
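	 *
	 * Illustrative example (hypothetical values, not from the original
	 * comment): with 4k blocks and a reported bt_awu_max of 64k, a 32k
	 * atomic write maps to 8 blocks, so it is only possible here if the
	 * extent's start block is a multiple of 8 and that single extent
	 * covers the whole 8-block range.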
	 */
	return len <= xfs_inode_buftarg(ip)->bt_awu_max;
}

static int
xfs_direct_write_iomap_begin(
	struct inode		*inode,
	loff_t			offset,
	loff_t			length,
	unsigned		flags,
	struct iomap		*iomap,
	struct iomap		*srcmap)
{
	struct xfs_inode	*ip = XFS_I(inode);
	struct xfs_mount	*mp = ip->i_mount;
	struct xfs_bmbt_irec	imap, cmap;
	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
	xfs_fileoff_t		orig_end_fsb = end_fsb;
	int			nimaps = 1, error = 0;
	bool			shared = false;
	u16			iomap_flags = 0;
	bool			needs_alloc;
	unsigned int		lockmode;
	u64			seq;

	ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));

	if (xfs_is_shutdown(mp))
		return -EIO;

	/*
	 * Writes that span EOF might trigger an IO size update on completion,
	 * so consider them to be dirty for the purposes of O_DSYNC even if
	 * no other metadata changes are pending or have been made here.
	 */
	if (offset + length > i_size_read(inode))
		iomap_flags |= IOMAP_F_DIRTY;

	/* HW-offload atomics are always used in this path */
	if (flags & IOMAP_ATOMIC)
		iomap_flags |= IOMAP_F_ATOMIC_BIO;

	/*
	 * COW writes may allocate delalloc space or convert unwritten COW
	 * extents, so we need to make sure to take the lock exclusively here.
	 */
	if (xfs_is_cow_inode(ip))
		lockmode = XFS_ILOCK_EXCL;
	else
		lockmode = XFS_ILOCK_SHARED;

relock:
	error = xfs_ilock_for_iomap(ip, flags, &lockmode);
	if (error)
		return error;

	/*
	 * The reflink iflag could have changed since the earlier unlocked
	 * check, so check it again and relock if needed.
	 */
	if (xfs_is_cow_inode(ip) && lockmode == XFS_ILOCK_SHARED) {
		xfs_iunlock(ip, lockmode);
		lockmode = XFS_ILOCK_EXCL;
		goto relock;
	}

	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
			       &nimaps, 0);
	if (error)
		goto out_unlock;

	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
		error = -EAGAIN;
		if (flags & IOMAP_NOWAIT)
			goto out_unlock;

		/* may drop and re-acquire the ilock */
		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
				&lockmode,
				(flags & IOMAP_DIRECT) || IS_DAX(inode));
		if (error)
			goto out_unlock;
		if (shared) {
			if ((flags & IOMAP_ATOMIC) &&
			    !xfs_bmap_hw_atomic_write_possible(ip, &cmap,
					offset_fsb, end_fsb)) {
				error = -ENOPROTOOPT;
				goto out_unlock;
			}
			goto out_found_cow;
		}
		end_fsb = imap.br_startoff + imap.br_blockcount;
		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
	}

	needs_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);

	if (flags & IOMAP_ATOMIC) {
		error = -ENOPROTOOPT;
		/*
		 * If we allocate less than what is required for the write
		 * then we may end up with multiple extents, which means that
		 * REQ_ATOMIC-based writes cannot be used, so avoid this
		 * possibility.
		 */
		if (needs_alloc && orig_end_fsb - offset_fsb > 1)
			goto out_unlock;

		if (!xfs_bmap_hw_atomic_write_possible(ip, &imap, offset_fsb,
				orig_end_fsb))
			goto out_unlock;
	}

	if (needs_alloc)
		goto allocate_blocks;

	/*
	 * NOWAIT and OVERWRITE I/O need to span the entire requested I/O with
	 * a single map so that we avoid partial IO failures due to the rest of
	 * the I/O range not covered by this map triggering an EAGAIN condition
	 * when it is subsequently mapped and aborting the I/O.
	 */
	if (flags & (IOMAP_NOWAIT | IOMAP_OVERWRITE_ONLY)) {
		error = -EAGAIN;
		if (!imap_spans_range(&imap, offset_fsb, end_fsb))
			goto out_unlock;
	}

	/*
	 * For overwrite only I/O, we cannot convert unwritten extents without
	 * requiring sub-block zeroing. This can only be done under an
	 * exclusive IOLOCK, hence return -EAGAIN if this is not a written
	 * extent to tell the caller to try again.
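	 *
	 * For example (hypothetical numbers, not from the original comment):
	 * with 4k blocks, an overwrite-only write of 2048 bytes at offset
	 * 1024 into an unwritten extent would need sub-block zeroing and so
	 * bails out with -EAGAIN below, while a block-aligned 4k overwrite or
	 * any overwrite of an already written extent can proceed.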
	 */
	if (flags & IOMAP_OVERWRITE_ONLY) {
		error = -EAGAIN;
		if (imap.br_state != XFS_EXT_NORM &&
		    ((offset | length) & mp->m_blockmask))
			goto out_unlock;
	}

	seq = xfs_iomap_inode_sequence(ip, iomap_flags);
	xfs_iunlock(ip, lockmode);
	trace_xfs_iomap_found(ip, offset, length, XFS_DATA_FORK, &imap);
	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags, seq);

allocate_blocks:
	error = -EAGAIN;
	if (flags & (IOMAP_NOWAIT | IOMAP_OVERWRITE_ONLY))
		goto out_unlock;

	/*
	 * We cap the maximum length we map to a sane size to keep the chunks
	 * of work done here somewhat symmetric with the work writeback does.
	 * This is a completely arbitrary number pulled out of thin air as a
	 * best guess for initial testing.
	 *
	 * Note that the value needs to be less than 32 bits wide until the
	 * lower level functions are updated.
	 */
	length = min_t(loff_t, length, 1024 * PAGE_SIZE);
	end_fsb = xfs_iomap_end_fsb(mp, offset, length);

	if (offset + length > XFS_ISIZE(ip))
		end_fsb = xfs_iomap_eof_align_last_fsb(ip, end_fsb);
	else if (nimaps && imap.br_startblock == HOLESTARTBLOCK)
		end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount);
	xfs_iunlock(ip, lockmode);

	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
			flags, &imap, &seq);
	if (error)
		return error;

	trace_xfs_iomap_alloc(ip, offset, length, XFS_DATA_FORK, &imap);
	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
				 iomap_flags | IOMAP_F_NEW, seq);

out_found_cow:
	length = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount);
	trace_xfs_iomap_found(ip, offset, length - offset, XFS_COW_FORK, &cmap);
	if (imap.br_startblock != HOLESTARTBLOCK) {
		seq = xfs_iomap_inode_sequence(ip, 0);
		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
		if (error)
			goto out_unlock;
	}
	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
	xfs_iunlock(ip, lockmode);
	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);

out_unlock:
	if (lockmode)
		xfs_iunlock(ip, lockmode);
	return error;
}

const struct iomap_ops xfs_direct_write_iomap_ops = {
	.iomap_begin		= xfs_direct_write_iomap_begin,
};

#ifdef CONFIG_XFS_RT
/*
 * This is really simple. The space has already been reserved before taking the
 * IOLOCK, the actual block allocation is done just before submitting the bio
 * and only recorded in the extent map on I/O completion.
 */
static int
xfs_zoned_direct_write_iomap_begin(
	struct inode		*inode,
	loff_t			offset,
	loff_t			length,
	unsigned		flags,
	struct iomap		*iomap,
	struct iomap		*srcmap)
{
	struct xfs_inode	*ip = XFS_I(inode);
	int			error;

	ASSERT(!(flags & IOMAP_OVERWRITE_ONLY));

	/*
	 * Needs to be pushed down into the allocator so that only writes into
	 * a single zone can be supported.
	 */
	if (flags & IOMAP_NOWAIT)
		return -EAGAIN;

	/*
	 * Ensure the extent list is in memory so that we don't have to read
	 * it from the I/O completion handler.
	 */
	if (xfs_need_iread_extents(&ip->i_df)) {
		xfs_ilock(ip, XFS_ILOCK_EXCL);
		error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		if (error)
			return error;
	}

	iomap->type = IOMAP_MAPPED;
	iomap->flags = IOMAP_F_DIRTY;
	iomap->bdev = ip->i_mount->m_rtdev_targp->bt_bdev;
	iomap->offset = offset;
	iomap->length = length;
	iomap->flags |= IOMAP_F_ANON_WRITE;
	return 0;
}

const struct iomap_ops xfs_zoned_direct_write_iomap_ops = {
	.iomap_begin		= xfs_zoned_direct_write_iomap_begin,
};
#endif /* CONFIG_XFS_RT */

#ifdef DEBUG
static void
xfs_check_atomic_cow_conversion(
	struct xfs_inode	*ip,
	xfs_fileoff_t		offset_fsb,
	xfs_filblks_t		count_fsb,
	const struct xfs_bmbt_irec *cmap)
{
	struct xfs_iext_cursor	icur;
	struct xfs_bmbt_irec	cmap2 = { };

	if (xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, &cmap2))
		xfs_trim_extent(&cmap2, offset_fsb, count_fsb);

	ASSERT(cmap2.br_startoff == cmap->br_startoff);
	ASSERT(cmap2.br_blockcount == cmap->br_blockcount);
	ASSERT(cmap2.br_startblock == cmap->br_startblock);
	ASSERT(cmap2.br_state == cmap->br_state);
}
#else
# define xfs_check_atomic_cow_conversion(...)	((void)0)
#endif

static int
xfs_atomic_write_cow_iomap_begin(
	struct inode		*inode,
	loff_t			offset,
	loff_t			length,
	unsigned		flags,
	struct iomap		*iomap,
	struct iomap		*srcmap)
{
	struct xfs_inode	*ip = XFS_I(inode);
	struct xfs_mount	*mp = ip->i_mount;
	const xfs_fileoff_t	offset_fsb = XFS_B_TO_FSBT(mp, offset);
	const xfs_fileoff_t	end_fsb = XFS_B_TO_FSB(mp, offset + length);
	const xfs_filblks_t	count_fsb = end_fsb - offset_fsb;
	xfs_filblks_t		hole_count_fsb;
	int			nmaps = 1;
	xfs_filblks_t		resaligned;
	struct xfs_bmbt_irec	cmap;
	struct xfs_iext_cursor	icur;
	struct xfs_trans	*tp;
	unsigned int		dblocks = 0, rblocks = 0;
	int			error;
	u64			seq;

	ASSERT(flags & IOMAP_WRITE);
	ASSERT(flags & IOMAP_DIRECT);

	if (xfs_is_shutdown(mp))
		return -EIO;

	if (!xfs_can_sw_atomic_write(mp)) {
		ASSERT(xfs_can_sw_atomic_write(mp));
		return -EINVAL;
	}

	/* blocks are always allocated in this path */
	if (flags & IOMAP_NOWAIT)
		return -EAGAIN;

	trace_xfs_iomap_atomic_write_cow(ip, offset, length);

retry:
	xfs_ilock(ip, XFS_ILOCK_EXCL);
	if (!ip->i_cowfp) {
		ASSERT(!xfs_is_reflink_inode(ip));
		xfs_ifork_init_cow(ip);
	}

	if (!xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, &cmap))
		cmap.br_startoff = end_fsb;
	if (cmap.br_startoff <= offset_fsb) {
		if (isnullstartblock(cmap.br_startblock))
			goto convert_delay;

		/*
		 * cmap could extend outside the write range due to previous
		 * speculative preallocations. We must trim cmap to the write
		 * range because the cow fork treats written mappings as
		 * meaning "write in progress".
		 */
		xfs_trim_extent(&cmap, offset_fsb, count_fsb);
		goto found;
	}

	hole_count_fsb = cmap.br_startoff - offset_fsb;

	resaligned = xfs_aligned_fsb_count(offset_fsb, hole_count_fsb,
			xfs_get_cowextsz_hint(ip));
	xfs_iunlock(ip, XFS_ILOCK_EXCL);

	if (XFS_IS_REALTIME_INODE(ip)) {
		dblocks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
		rblocks = resaligned;
	} else {
		dblocks = XFS_DIOSTRAT_SPACE_RES(mp, resaligned);
		rblocks = 0;
	}

	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
			rblocks, false, &tp);
	if (error)
		return error;

	/* extent layout could have changed since the unlock, so check again */
	if (!xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, &cmap))
		cmap.br_startoff = end_fsb;
	if (cmap.br_startoff <= offset_fsb) {
		xfs_trans_cancel(tp);
		if (isnullstartblock(cmap.br_startblock))
			goto convert_delay;
		xfs_trim_extent(&cmap, offset_fsb, count_fsb);
		goto found;
	}

	/*
	 * Allocate the entire reservation as unwritten blocks.
	 *
	 * Use XFS_BMAPI_EXTSZALIGN to hint at aligning new extents according
	 * to extszhint, such that there will be a greater chance that future
	 * atomic writes to that same range will be aligned (and don't require
	 * this COW-based method).
	 */
	error = xfs_bmapi_write(tp, ip, offset_fsb, hole_count_fsb,
			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC |
			XFS_BMAPI_EXTSZALIGN, 0, &cmap, &nmaps);
	if (error) {
		xfs_trans_cancel(tp);
		goto out_unlock;
	}

	xfs_inode_set_cowblocks_tag(ip);
	error = xfs_trans_commit(tp);
	if (error)
		goto out_unlock;

	/*
	 * cmap could map more blocks than the range we passed into bmapi_write
	 * because of EXTSZALIGN or adjacent pre-existing unwritten mappings
	 * that were merged. Trim cmap to the original write range so that we
	 * don't convert more than we were asked to do for this write.
	 */
	xfs_trim_extent(&cmap, offset_fsb, count_fsb);

found:
	if (cmap.br_state != XFS_EXT_NORM) {
		error = xfs_reflink_convert_cow_locked(ip, cmap.br_startoff,
				cmap.br_blockcount);
		if (error)
			goto out_unlock;
		cmap.br_state = XFS_EXT_NORM;
		xfs_check_atomic_cow_conversion(ip, offset_fsb, count_fsb,
				&cmap);
	}

	trace_xfs_iomap_found(ip, offset, length, XFS_COW_FORK, &cmap);
	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);

convert_delay:
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
	error = xfs_bmapi_convert_delalloc(ip, XFS_COW_FORK, offset, iomap,
			NULL);
	if (error)
		return error;

	/*
	 * Try the lookup again, because the delalloc conversion might have
	 * turned the COW mapping into unwritten, but we need it to be in
	 * written state.
	 */
	goto retry;
out_unlock:
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
	return error;
}

const struct iomap_ops xfs_atomic_write_cow_iomap_ops = {
	.iomap_begin		= xfs_atomic_write_cow_iomap_begin,
};

static int
xfs_dax_write_iomap_end(
	struct inode		*inode,
	loff_t			pos,
	loff_t			length,
	ssize_t			written,
	unsigned		flags,
	struct iomap		*iomap)
{
	struct xfs_inode	*ip = XFS_I(inode);

	if (!xfs_is_cow_inode(ip))
		return 0;

	if (!written)
		return xfs_reflink_cancel_cow_range(ip, pos, length, true);

	return xfs_reflink_end_cow(ip, pos, written);
}

const struct iomap_ops xfs_dax_write_iomap_ops = {
	.iomap_begin	= xfs_direct_write_iomap_begin,
	.iomap_end	= xfs_dax_write_iomap_end,
};

/*
 * Convert a hole to a delayed allocation.
 */
static void
xfs_bmap_add_extent_hole_delay(
	struct xfs_inode	*ip,	/* incore inode pointer */
	int			whichfork,
	struct xfs_iext_cursor	*icur,
	struct xfs_bmbt_irec	*new)	/* new data to add to file extents */
{
	struct xfs_ifork	*ifp;	/* inode fork pointer */
	xfs_bmbt_irec_t		left;	/* left neighbor extent entry */
	xfs_filblks_t		newlen = 0;	/* new indirect size */
	xfs_filblks_t		oldlen = 0;	/* old indirect size */
	xfs_bmbt_irec_t		right;	/* right neighbor extent entry */
	uint32_t		state = xfs_bmap_fork_to_state(whichfork);
	xfs_filblks_t		temp;	/* temp for indirect calculations */

	ifp = xfs_ifork_ptr(ip, whichfork);
	ASSERT(isnullstartblock(new->br_startblock));

	/*
	 * Check and set flags if this segment has a left neighbor.
	 */
	if (xfs_iext_peek_prev_extent(ifp, icur, &left)) {
		state |= BMAP_LEFT_VALID;
		if (isnullstartblock(left.br_startblock))
			state |= BMAP_LEFT_DELAY;
	}

	/*
	 * Check and set flags if the current (right) segment exists.
	 * If it doesn't exist, we're converting the hole at end-of-file.
	 */
	if (xfs_iext_get_extent(ifp, icur, &right)) {
		state |= BMAP_RIGHT_VALID;
		if (isnullstartblock(right.br_startblock))
			state |= BMAP_RIGHT_DELAY;
	}

	/*
	 * Set contiguity flags on the left and right neighbors.
	 * Don't let extents get too large, even if the pieces are contiguous.
	 */
	if ((state & BMAP_LEFT_VALID) && (state & BMAP_LEFT_DELAY) &&
	    left.br_startoff + left.br_blockcount == new->br_startoff &&
	    left.br_blockcount + new->br_blockcount <= XFS_MAX_BMBT_EXTLEN)
		state |= BMAP_LEFT_CONTIG;

	if ((state & BMAP_RIGHT_VALID) && (state & BMAP_RIGHT_DELAY) &&
	    new->br_startoff + new->br_blockcount == right.br_startoff &&
	    new->br_blockcount + right.br_blockcount <= XFS_MAX_BMBT_EXTLEN &&
	    (!(state & BMAP_LEFT_CONTIG) ||
	     (left.br_blockcount + new->br_blockcount +
	      right.br_blockcount <= XFS_MAX_BMBT_EXTLEN)))
		state |= BMAP_RIGHT_CONTIG;

	/*
	 * Switch out based on the contiguity flags.
	 */
	switch (state & (BMAP_LEFT_CONTIG | BMAP_RIGHT_CONTIG)) {
	case BMAP_LEFT_CONTIG | BMAP_RIGHT_CONTIG:
		/*
		 * New allocation is contiguous with delayed allocations
		 * on the left and on the right.
		 * Merge all three into a single extent record.
		 */
		temp = left.br_blockcount + new->br_blockcount +
			right.br_blockcount;

		oldlen = startblockval(left.br_startblock) +
			 startblockval(new->br_startblock) +
			 startblockval(right.br_startblock);
		newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
					 oldlen);
		left.br_startblock = nullstartblock(newlen);
		left.br_blockcount = temp;

		xfs_iext_remove(ip, icur, state);
		xfs_iext_prev(ifp, icur);
		xfs_iext_update_extent(ip, state, icur, &left);
		break;

	case BMAP_LEFT_CONTIG:
		/*
		 * New allocation is contiguous with a delayed allocation
		 * on the left.
		 * Merge the new allocation with the left neighbor.
		 */
		temp = left.br_blockcount + new->br_blockcount;

		oldlen = startblockval(left.br_startblock) +
			 startblockval(new->br_startblock);
		newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
					 oldlen);
		left.br_blockcount = temp;
		left.br_startblock = nullstartblock(newlen);

		xfs_iext_prev(ifp, icur);
		xfs_iext_update_extent(ip, state, icur, &left);
		break;

	case BMAP_RIGHT_CONTIG:
		/*
		 * New allocation is contiguous with a delayed allocation
		 * on the right.
		 * Merge the new allocation with the right neighbor.
		 */
		temp = new->br_blockcount + right.br_blockcount;
		oldlen = startblockval(new->br_startblock) +
			 startblockval(right.br_startblock);
		newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
					 oldlen);
		right.br_startoff = new->br_startoff;
		right.br_startblock = nullstartblock(newlen);
		right.br_blockcount = temp;
		xfs_iext_update_extent(ip, state, icur, &right);
		break;

	case 0:
		/*
		 * New allocation is not contiguous with another
		 * delayed allocation.
		 * Insert a new entry.
		 */
		oldlen = newlen = 0;
		xfs_iext_insert(ip, icur, new, state);
		break;
	}
	if (oldlen != newlen) {
		ASSERT(oldlen > newlen);
		xfs_add_fdblocks(ip->i_mount, oldlen - newlen);

		/*
		 * Nothing to do for disk quota accounting here.
		 */
		xfs_mod_delalloc(ip, 0, (int64_t)newlen - oldlen);
	}
}

/*
 * Add a delayed allocation extent to an inode. Blocks are reserved from the
 * global pool and the extent inserted into the inode in-core extent tree.
 *
 * On entry, got refers to the first extent beyond the offset of the extent to
 * allocate or eof is specified if no such extent exists. On return, got refers
 * to the extent record that was inserted to the inode fork.
 *
 * Note that the allocated extent may have been merged with contiguous extents
 * during insertion into the inode fork. Thus, got does not reflect the current
 * state of the inode fork on return. If necessary, the caller can use icur to
 * look up the updated record in the inode fork.
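 *
 * As a rough usage sketch (hypothetical values, not from the original
 * comment): reserving 8 blocks at offset 100 directly after an existing
 * delalloc extent covering offsets 92-99 returns got describing the 8 new
 * blocks, while the fork now holds a single merged delalloc record for
 * offsets 92-107.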
 */
static int
xfs_bmapi_reserve_delalloc(
	struct xfs_inode	*ip,
	int			whichfork,
	xfs_fileoff_t		off,
	xfs_filblks_t		len,
	xfs_filblks_t		prealloc,
	struct xfs_bmbt_irec	*got,
	struct xfs_iext_cursor	*icur,
	int			eof)
{
	struct xfs_mount	*mp = ip->i_mount;
	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, whichfork);
	xfs_extlen_t		alen;
	xfs_extlen_t		indlen;
	uint64_t		fdblocks;
	int			error;
	xfs_fileoff_t		aoff;
	bool			use_cowextszhint =
					whichfork == XFS_COW_FORK && !prealloc;

retry:
	/*
	 * Cap the alloc length. Keep track of prealloc so we know whether to
	 * tag the inode before we return.
	 */
	aoff = off;
	alen = XFS_FILBLKS_MIN(len + prealloc, XFS_MAX_BMBT_EXTLEN);
	if (!eof)
		alen = XFS_FILBLKS_MIN(alen, got->br_startoff - aoff);
	if (prealloc && alen >= len)
		prealloc = alen - len;

	/*
	 * If we're targeting the COW fork but aren't creating a speculative
	 * posteof preallocation, try to expand the reservation to align with
	 * the COW extent size hint if there's sufficient free space.
	 *
	 * Unlike the data fork, the CoW cancellation functions will free all
	 * the reservations at inactivation, so we don't require that every
	 * delalloc reservation have a dirty pagecache.
	 */
	if (use_cowextszhint) {
		struct xfs_bmbt_irec	prev;
		xfs_extlen_t		extsz = xfs_get_cowextsz_hint(ip);

		if (!xfs_iext_peek_prev_extent(ifp, icur, &prev))
			prev.br_startoff = NULLFILEOFF;

		error = xfs_bmap_extsize_align(mp, got, &prev, extsz, 0, eof,
					       1, 0, &aoff, &alen);
		ASSERT(!error);
	}

	/*
	 * Make a transaction-less quota reservation for delayed allocation
	 * blocks. This number gets adjusted later. We return if we haven't
	 * allocated blocks already inside this loop.
	 */
	error = xfs_quota_reserve_blkres(ip, alen);
	if (error)
		goto out;

	/*
	 * Split changing sb for alen and indlen since they could be coming
	 * from different places.
	 */
	indlen = (xfs_extlen_t)xfs_bmap_worst_indlen(ip, alen);
	ASSERT(indlen > 0);

	fdblocks = indlen;
	if (XFS_IS_REALTIME_INODE(ip)) {
		ASSERT(!xfs_is_zoned_inode(ip));
		error = xfs_dec_frextents(mp, xfs_blen_to_rtbxlen(mp, alen));
		if (error)
			goto out_unreserve_quota;
	} else {
		fdblocks += alen;
	}

	error = xfs_dec_fdblocks(mp, fdblocks, false);
	if (error)
		goto out_unreserve_frextents;

	ip->i_delayed_blks += alen;
	xfs_mod_delalloc(ip, alen, indlen);

	got->br_startoff = aoff;
	got->br_startblock = nullstartblock(indlen);
	got->br_blockcount = alen;
	got->br_state = XFS_EXT_NORM;

	xfs_bmap_add_extent_hole_delay(ip, whichfork, icur, got);

	/*
	 * Tag the inode if blocks were preallocated. Note that COW fork
	 * preallocation can occur at the start or end of the extent, even
	 * when prealloc == 0, so we must also check the aligned offset and
	 * length.
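	 *
	 * For example (hypothetical values, not from the original comment):
	 * with a cowextsize hint of 32 blocks, a 4 block COW reservation
	 * requested at offset 10 may be expanded by the alignment above to
	 * aoff = 0 and alen = 32, so aoff < off and alen > len and the inode
	 * gets the cowblocks tag even though prealloc was passed as 0.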
	 */
	if (whichfork == XFS_DATA_FORK && prealloc)
		xfs_inode_set_eofblocks_tag(ip);
	if (whichfork == XFS_COW_FORK && (prealloc || aoff < off || alen > len))
		xfs_inode_set_cowblocks_tag(ip);

	return 0;

out_unreserve_frextents:
	if (XFS_IS_REALTIME_INODE(ip))
		xfs_add_frextents(mp, xfs_blen_to_rtbxlen(mp, alen));
out_unreserve_quota:
	if (XFS_IS_QUOTA_ON(mp))
		xfs_quota_unreserve_blkres(ip, alen);
out:
	if (error == -ENOSPC || error == -EDQUOT) {
		trace_xfs_delalloc_enospc(ip, off, len);

		if (prealloc || use_cowextszhint) {
			/* retry without any preallocation */
			use_cowextszhint = false;
			prealloc = 0;
			goto retry;
		}
	}
	return error;
}

static int
xfs_zoned_buffered_write_iomap_begin(
	struct inode		*inode,
	loff_t			offset,
	loff_t			count,
	unsigned		flags,
	struct iomap		*iomap,
	struct iomap		*srcmap)
{
	struct iomap_iter	*iter =
		container_of(iomap, struct iomap_iter, iomap);
	struct xfs_zone_alloc_ctx *ac = iter->private;
	struct xfs_inode	*ip = XFS_I(inode);
	struct xfs_mount	*mp = ip->i_mount;
	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, count);
	u16			iomap_flags = IOMAP_F_SHARED;
	unsigned int		lockmode = XFS_ILOCK_EXCL;
	xfs_filblks_t		count_fsb;
	xfs_extlen_t		indlen;
	struct xfs_bmbt_irec	got;
	struct xfs_iext_cursor	icur;
	int			error = 0;

	ASSERT(!xfs_get_extsz_hint(ip));
	ASSERT(!(flags & IOMAP_UNSHARE));
	ASSERT(ac);

	if (xfs_is_shutdown(mp))
		return -EIO;

	error = xfs_qm_dqattach(ip);
	if (error)
		return error;

	error = xfs_ilock_for_iomap(ip, flags, &lockmode);
	if (error)
		return error;

	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
	    XFS_TEST_ERROR(mp, XFS_ERRTAG_BMAPIFORMAT)) {
		xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
		error = -EFSCORRUPTED;
		goto out_unlock;
	}

	XFS_STATS_INC(mp, xs_blk_mapw);

	error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
	if (error)
		goto out_unlock;

	/*
	 * For zeroing operations check if there is any data to zero first.
	 *
	 * For regular writes we always need to allocate new blocks, but need
	 * to provide the source mapping when the range is unaligned to
	 * support read-modify-write of the whole block in the page cache.
	 *
	 * In either case we need to limit the reported range to the
	 * boundaries of the source map in the data fork.
	 */
	if (!IS_ALIGNED(offset, mp->m_sb.sb_blocksize) ||
	    !IS_ALIGNED(offset + count, mp->m_sb.sb_blocksize) ||
	    (flags & IOMAP_ZERO)) {
		struct xfs_bmbt_irec	smap;
		struct xfs_iext_cursor	scur;

		if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &scur,
				&smap))
			smap.br_startoff = end_fsb; /* fake hole until EOF */
		if (smap.br_startoff > offset_fsb) {
			/*
			 * We never need to allocate blocks for zeroing a
			 * hole.
			 */
			if (flags & IOMAP_ZERO) {
				xfs_hole_to_iomap(ip, iomap, offset_fsb,
						smap.br_startoff);
				goto out_unlock;
			}
			end_fsb = min(end_fsb, smap.br_startoff);
		} else {
			end_fsb = min(end_fsb,
				smap.br_startoff + smap.br_blockcount);
			xfs_trim_extent(&smap, offset_fsb,
					end_fsb - offset_fsb);
			error = xfs_bmbt_to_iomap(ip, srcmap, &smap, flags, 0,
					xfs_iomap_inode_sequence(ip, 0));
			if (error)
				goto out_unlock;
		}
	}

	if (!ip->i_cowfp)
		xfs_ifork_init_cow(ip);

	if (!xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, &got))
		got.br_startoff = end_fsb;
	if (got.br_startoff <= offset_fsb) {
		trace_xfs_reflink_cow_found(ip, &got);
		goto done;
	}

	/*
	 * Cap the maximum length to keep the chunks of work done here
	 * somewhat symmetric with the work writeback does.
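	 *
	 * Illustrative numbers (hypothetical, not from the original comment):
	 * with 4k blocks and 4k pages the 1024 * PAGE_SIZE cap below limits
	 * a single iteration to 1024 blocks (4MiB), even if both the hole in
	 * the COW fork and the remaining write are much larger.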
	 */
	end_fsb = min(end_fsb, got.br_startoff);
	count_fsb = min3(end_fsb - offset_fsb, XFS_MAX_BMBT_EXTLEN,
			 XFS_B_TO_FSB(mp, 1024 * PAGE_SIZE));

	/*
	 * The block reservation is supposed to cover all blocks that the
	 * operation could possibly write, but there is a nasty corner case
	 * where blocks could be stolen from underneath us:
	 *
	 *  1) while this thread iterates over a larger buffered write,
	 *  2) another thread is causing a write fault that calls into
	 *     ->page_mkwrite in a range this thread writes to, using up the
	 *     delalloc reservation created by a previous call to this
	 *     function.
	 *  3) another thread does direct I/O on the range that the write
	 *     fault happened on, which causes writeback of the dirty data.
	 *  4) this then sets the stale flag, which cuts the current iomap
	 *     iteration short, causing the new call to ->iomap_begin that
	 *     gets us here again, but now without a sufficient reservation.
	 *
	 * This is a very unusual I/O pattern, and nothing but generic/095 is
	 * known to hit it. There's not really much we can do here, so turn
	 * this into a short write.
	 */
	if (count_fsb > ac->reserved_blocks) {
		xfs_warn_ratelimited(mp,
"Short write on ino 0x%llx comm %.20s due to three-way race with write fault and direct I/O",
			ip->i_ino, current->comm);
		count_fsb = ac->reserved_blocks;
		if (!count_fsb) {
			error = -EIO;
			goto out_unlock;
		}
	}

	error = xfs_quota_reserve_blkres(ip, count_fsb);
	if (error)
		goto out_unlock;

	indlen = xfs_bmap_worst_indlen(ip, count_fsb);
	error = xfs_dec_fdblocks(mp, indlen, false);
	if (error)
		goto out_unlock;

	ip->i_delayed_blks += count_fsb;
	xfs_mod_delalloc(ip, count_fsb, indlen);

	got.br_startoff = offset_fsb;
	got.br_startblock = nullstartblock(indlen);
	got.br_blockcount = count_fsb;
	got.br_state = XFS_EXT_NORM;
	xfs_bmap_add_extent_hole_delay(ip, XFS_COW_FORK, &icur, &got);
	ac->reserved_blocks -= count_fsb;
	iomap_flags |= IOMAP_F_NEW;

	trace_xfs_iomap_alloc(ip, offset, XFS_FSB_TO_B(mp, count_fsb),
			XFS_COW_FORK, &got);
done:
	error = xfs_bmbt_to_iomap(ip, iomap, &got, flags, iomap_flags,
			xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED));
out_unlock:
	xfs_iunlock(ip, lockmode);
	return error;
}

static int
xfs_buffered_write_iomap_begin(
	struct inode		*inode,
	loff_t			offset,
	loff_t			count,
	unsigned		flags,
	struct iomap		*iomap,
	struct iomap		*srcmap)
{
	struct iomap_iter	*iter =
		container_of(iomap, struct iomap_iter, iomap);
	struct xfs_inode	*ip = XFS_I(inode);
	struct xfs_mount	*mp = ip->i_mount;
	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, count);
	struct xfs_bmbt_irec	imap, cmap;
	struct xfs_iext_cursor	icur, ccur;
	xfs_fsblock_t		prealloc_blocks = 0;
	bool			eof = false, cow_eof = false, shared = false;
	int			allocfork = XFS_DATA_FORK;
	int			error = 0;
	unsigned int		lockmode = XFS_ILOCK_EXCL;
	unsigned int		iomap_flags = 0;
	u64			seq;

	if (xfs_is_shutdown(mp))
		return -EIO;

	if (xfs_is_zoned_inode(ip))
		return xfs_zoned_buffered_write_iomap_begin(inode, offset,
				count, flags, iomap, srcmap);

	/* we can't use delayed allocations when using extent size hints */
	if (xfs_get_extsz_hint(ip))
		return xfs_direct_write_iomap_begin(inode, offset, count,
				flags, iomap, srcmap);

	error = xfs_qm_dqattach(ip);
	if (error)
		return error;

	error = xfs_ilock_for_iomap(ip, flags, &lockmode);
	if (error)
		return error;

	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
	    XFS_TEST_ERROR(mp, XFS_ERRTAG_BMAPIFORMAT)) {
		xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
		error = -EFSCORRUPTED;
		goto out_unlock;
	}

	XFS_STATS_INC(mp, xs_blk_mapw);

	error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
	if (error)
		goto out_unlock;

	/*
	 * Search the data fork first to look up our source mapping. We
	 * always need the data fork map, as we have to return it to the
	 * iomap code so that the higher level write code can read data in to
	 * perform read-modify-write cycles for unaligned writes.
	 */
	eof = !xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap);
	if (eof)
		imap.br_startoff = end_fsb; /* fake hole until the end */

	/* We never need to allocate blocks for zeroing or unsharing a hole. */
	if ((flags & (IOMAP_UNSHARE | IOMAP_ZERO)) &&
	    imap.br_startoff > offset_fsb) {
		xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
		goto out_unlock;
	}

	/*
	 * For zeroing, trim extents that extend beyond the EOF block. If a
	 * delalloc extent starts beyond the EOF block, convert it to an
	 * unwritten extent.
	 */
	if (flags & IOMAP_ZERO) {
		xfs_fileoff_t	eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));

		if (isnullstartblock(imap.br_startblock) &&
		    offset_fsb >= eof_fsb)
			goto convert_delay;
		if (offset_fsb < eof_fsb && end_fsb > eof_fsb)
			end_fsb = eof_fsb;

		/*
		 * Look up dirty folios for unwritten mappings within EOF.
		 * Providing this bypasses the flush iomap uses to trigger
		 * extent conversion when unwritten mappings have dirty
		 * pagecache in need of zeroing.
		 *
		 * Trim the mapping to the end pos of the lookup, which in
		 * turn was trimmed to the end of the batch if it became full
		 * before the end of the mapping.
		 */
		if (imap.br_state == XFS_EXT_UNWRITTEN &&
		    offset_fsb < eof_fsb) {
			loff_t		foffset = offset, fend;

			fend = offset + min(co