| 326 260 263 262 262 263 262 324 45 45 143 141 125 125 125 123 125 125 19 19 1 19 19 19 19 19 238 238 238 238 238 236 238 237 286 236 232 238 238 238 288 288 236 190 269 157 158 160 159 157 159 156 159 123 123 46 123 123 123 123 123 123 123 123 123 123 123 123 46 46 46 46 46 45 237 158 295 243 295 123 123 123 123 123 123 13 13 13 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 | // SPDX-License-Identifier: GPL-2.0 /* * bus.c - bus driver management * * Copyright (c) 2002-3 Patrick Mochel * Copyright (c) 2002-3 Open Source Development Labs * Copyright (c) 2007 Greg Kroah-Hartman <gregkh@suse.de> * Copyright (c) 2007 Novell Inc. * Copyright (c) 2023 Greg Kroah-Hartman <gregkh@linuxfoundation.org> */ #include <linux/async.h> #include <linux/device/bus.h> #include <linux/device.h> #include <linux/module.h> #include <linux/errno.h> #include <linux/slab.h> #include <linux/init.h> #include <linux/string.h> #include <linux/mutex.h> #include <linux/sysfs.h> #include "base.h" #include "power/power.h" /* /sys/devices/system */ static struct kset *system_kset; /* /sys/bus */ static struct kset *bus_kset; #define to_bus_attr(_attr) container_of(_attr, struct bus_attribute, attr) /* * sysfs bindings for drivers */ #define to_drv_attr(_attr) container_of(_attr, struct driver_attribute, attr) #define DRIVER_ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) \ struct driver_attribute driver_attr_##_name = \ __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) static int __must_check bus_rescan_devices_helper(struct device *dev, void *data); /** * bus_to_subsys - Turn a struct bus_type into a struct subsys_private * * @bus: pointer to the struct bus_type to look up * * The driver core internals needs to work on the subsys_private structure, not * the external struct bus_type pointer. This function walks the list of * registered busses in the system and finds the matching one and returns the * internal struct subsys_private that relates to that bus. * * Note, the reference count of the return value is INCREMENTED if it is not * NULL. A call to subsys_put() must be done when finished with the pointer in * order for it to be properly freed. */ struct subsys_private *bus_to_subsys(const struct bus_type *bus) { struct subsys_private *sp = NULL; struct kobject *kobj; if (!bus || !bus_kset) return NULL; spin_lock(&bus_kset->list_lock); if (list_empty(&bus_kset->list)) goto done; list_for_each_entry(kobj, &bus_kset->list, entry) { struct kset *kset = container_of(kobj, struct kset, kobj); sp = container_of_const(kset, struct subsys_private, subsys); if (sp->bus == bus) goto done; } sp = NULL; done: sp = subsys_get(sp); spin_unlock(&bus_kset->list_lock); return sp; } static const struct bus_type *bus_get(const struct bus_type *bus) { struct subsys_private *sp = bus_to_subsys(bus); if (sp) return bus; return NULL; } static void bus_put(const struct bus_type *bus) { struct subsys_private *sp = bus_to_subsys(bus); /* two puts are required as the call to bus_to_subsys incremented it again */ subsys_put(sp); subsys_put(sp); } static ssize_t drv_attr_show(struct kobject *kobj, struct attribute *attr, char *buf) { struct driver_attribute *drv_attr = to_drv_attr(attr); struct driver_private *drv_priv = to_driver(kobj); ssize_t ret = -EIO; if (drv_attr->show) ret = drv_attr->show(drv_priv->driver, buf); return ret; } static ssize_t drv_attr_store(struct kobject *kobj, struct attribute *attr, const char *buf, size_t count) { struct driver_attribute *drv_attr = to_drv_attr(attr); struct driver_private *drv_priv = to_driver(kobj); ssize_t ret = -EIO; if (drv_attr->store) ret = drv_attr->store(drv_priv->driver, buf, count); return ret; } static const struct sysfs_ops driver_sysfs_ops = { .show = drv_attr_show, .store = drv_attr_store, }; static void driver_release(struct kobject *kobj) { struct driver_private *drv_priv = to_driver(kobj); pr_debug("driver: '%s': %s\n", kobject_name(kobj), __func__); kfree(drv_priv); } static const struct kobj_type driver_ktype = { .sysfs_ops = &driver_sysfs_ops, .release = driver_release, }; /* * sysfs bindings for buses */ static ssize_t bus_attr_show(struct kobject *kobj, struct attribute *attr, char *buf) { struct bus_attribute *bus_attr = to_bus_attr(attr); struct subsys_private *subsys_priv = to_subsys_private(kobj); /* return -EIO for reading a bus attribute without show() */ ssize_t ret = -EIO; if (bus_attr->show) ret = bus_attr->show(subsys_priv->bus, buf); return ret; } static ssize_t bus_attr_store(struct kobject *kobj, struct attribute *attr, const char *buf, size_t count) { struct bus_attribute *bus_attr = to_bus_attr(attr); struct subsys_private *subsys_priv = to_subsys_private(kobj); /* return -EIO for writing a bus attribute without store() */ ssize_t ret = -EIO; if (bus_attr->store) ret = bus_attr->store(subsys_priv->bus, buf, count); return ret; } static const struct sysfs_ops bus_sysfs_ops = { .show = bus_attr_show, .store = bus_attr_store, }; int bus_create_file(const struct bus_type *bus, struct bus_attribute *attr) { struct subsys_private *sp = bus_to_subsys(bus); int error; if (!sp) return -EINVAL; error = sysfs_create_file(&sp->subsys.kobj, &attr->attr); subsys_put(sp); return error; } EXPORT_SYMBOL_GPL(bus_create_file); void bus_remove_file(const struct bus_type *bus, struct bus_attribute *attr) { struct subsys_private *sp = bus_to_subsys(bus); if (!sp) return; sysfs_remove_file(&sp->subsys.kobj, &attr->attr); subsys_put(sp); } EXPORT_SYMBOL_GPL(bus_remove_file); static void bus_release(struct kobject *kobj) { struct subsys_private *priv = to_subsys_private(kobj); lockdep_unregister_key(&priv->lock_key); kfree(priv); } static const struct kobj_type bus_ktype = { .sysfs_ops = &bus_sysfs_ops, .release = bus_release, }; static int bus_uevent_filter(const struct kobject *kobj) { const struct kobj_type *ktype = get_ktype(kobj); if (ktype == &bus_ktype) return 1; return 0; } static const struct kset_uevent_ops bus_uevent_ops = { .filter = bus_uevent_filter, }; /* Manually detach a device from its associated driver. */ static ssize_t unbind_store(struct device_driver *drv, const char *buf, size_t count) { const struct bus_type *bus = bus_get(drv->bus); struct device *dev; int err = -ENODEV; dev = bus_find_device_by_name(bus, NULL, buf); if (dev && dev->driver == drv) { device_driver_detach(dev); err = count; } put_device(dev); bus_put(bus); return err; } static DRIVER_ATTR_IGNORE_LOCKDEP(unbind, 0200, NULL, unbind_store); /* * Manually attach a device to a driver. * Note: the driver must want to bind to the device, * it is not possible to override the driver's id table. */ static ssize_t bind_store(struct device_driver *drv, const char *buf, size_t count) { const struct bus_type *bus = bus_get(drv->bus); struct device *dev; int err = -ENODEV; dev = bus_find_device_by_name(bus, NULL, buf); if (dev && driver_match_device(drv, dev)) { err = device_driver_attach(drv, dev); if (!err) { /* success */ err = count; } } put_device(dev); bus_put(bus); return err; } static DRIVER_ATTR_IGNORE_LOCKDEP(bind, 0200, NULL, bind_store); static ssize_t drivers_autoprobe_show(const struct bus_type *bus, char *buf) { struct subsys_private *sp = bus_to_subsys(bus); int ret; if (!sp) return -EINVAL; ret = sysfs_emit(buf, "%d\n", sp->drivers_autoprobe); subsys_put(sp); return ret; } static ssize_t drivers_autoprobe_store(const struct bus_type *bus, const char *buf, size_t count) { struct subsys_private *sp = bus_to_subsys(bus); if (!sp) return -EINVAL; if (buf[0] == '0') sp->drivers_autoprobe = 0; else sp->drivers_autoprobe = 1; subsys_put(sp); return count; } static ssize_t drivers_probe_store(const struct bus_type *bus, const char *buf, size_t count) { struct device *dev; int err = -EINVAL; dev = bus_find_device_by_name(bus, NULL, buf); if (!dev) return -ENODEV; if (bus_rescan_devices_helper(dev, NULL) == 0) err = count; put_device(dev); return err; } static struct device *next_device(struct klist_iter *i) { struct klist_node *n = klist_next(i); struct device *dev = NULL; struct device_private *dev_prv; if (n) { dev_prv = to_device_private_bus(n); dev = dev_prv->device; } return dev; } static struct device *prev_device(struct klist_iter *i) { struct klist_node *n = klist_prev(i); struct device *dev = NULL; struct device_private *dev_prv; if (n) { dev_prv = to_device_private_bus(n); dev = dev_prv->device; } return dev; } /** * bus_for_each_dev - device iterator. * @bus: bus type. * @start: device to start iterating from. * @data: data for the callback. * @fn: function to be called for each device. * * Iterate over @bus's list of devices, and call @fn for each, * passing it @data. If @start is not NULL, we use that device to * begin iterating from. * * We check the return of @fn each time. If it returns anything * other than 0, we break out and return that value. * * NOTE: The device that returns a non-zero value is not retained * in any way, nor is its refcount incremented. If the caller needs * to retain this data, it should do so, and increment the reference * count in the supplied callback. */ int bus_for_each_dev(const struct bus_type *bus, struct device *start, void *data, device_iter_t fn) { struct subsys_private *sp = bus_to_subsys(bus); struct klist_iter i; struct device *dev; int error = 0; if (!sp) return -EINVAL; klist_iter_init_node(&sp->klist_devices, &i, (start ? &start->p->knode_bus : NULL)); while (!error && (dev = next_device(&i))) error = fn(dev, data); klist_iter_exit(&i); subsys_put(sp); return error; } EXPORT_SYMBOL_GPL(bus_for_each_dev); /** * bus_find_device - device iterator for locating a particular device. * @bus: bus type * @start: Device to begin with * @data: Data to pass to match function * @match: Callback function to check device * * This is similar to the bus_for_each_dev() function above, but it * returns a reference to a device that is 'found' for later use, as * determined by the @match callback. * * The callback should return 0 if the device doesn't match and non-zero * if it does. If the callback returns non-zero, this function will * return to the caller and not iterate over any more devices. */ struct device *bus_find_device(const struct bus_type *bus, struct device *start, const void *data, device_match_t match) { struct subsys_private *sp = bus_to_subsys(bus); struct klist_iter i; struct device *dev; if (!sp) return NULL; klist_iter_init_node(&sp->klist_devices, &i, (start ? &start->p->knode_bus : NULL)); while ((dev = next_device(&i))) { if (match(dev, data)) { get_device(dev); break; } } klist_iter_exit(&i); subsys_put(sp); return dev; } EXPORT_SYMBOL_GPL(bus_find_device); struct device *bus_find_device_reverse(const struct bus_type *bus, struct device *start, const void *data, device_match_t match) { struct subsys_private *sp = bus_to_subsys(bus); struct klist_iter i; struct device *dev; if (!sp) return NULL; klist_iter_init_node(&sp->klist_devices, &i, (start ? &start->p->knode_bus : NULL)); while ((dev = prev_device(&i))) { if (match(dev, data)) { get_device(dev); break; } } klist_iter_exit(&i); subsys_put(sp); return dev; } EXPORT_SYMBOL_GPL(bus_find_device_reverse); static struct device_driver *next_driver(struct klist_iter *i) { struct klist_node *n = klist_next(i); struct driver_private *drv_priv; if (n) { drv_priv = container_of(n, struct driver_private, knode_bus); return drv_priv->driver; } return NULL; } /** * bus_for_each_drv - driver iterator * @bus: bus we're dealing with. * @start: driver to start iterating on. * @data: data to pass to the callback. * @fn: function to call for each driver. * * This is nearly identical to the device iterator above. * We iterate over each driver that belongs to @bus, and call * @fn for each. If @fn returns anything but 0, we break out * and return it. If @start is not NULL, we use it as the head * of the list. * * NOTE: we don't return the driver that returns a non-zero * value, nor do we leave the reference count incremented for that * driver. If the caller needs to know that info, it must set it * in the callback. It must also be sure to increment the refcount * so it doesn't disappear before returning to the caller. */ int bus_for_each_drv(const struct bus_type *bus, struct device_driver *start, void *data, int (*fn)(struct device_driver *, void *)) { struct subsys_private *sp = bus_to_subsys(bus); struct klist_iter i; struct device_driver *drv; int error = 0; if (!sp) return -EINVAL; klist_iter_init_node(&sp->klist_drivers, &i, start ? &start->p->knode_bus : NULL); while ((drv = next_driver(&i)) && !error) error = fn(drv, data); klist_iter_exit(&i); subsys_put(sp); return error; } EXPORT_SYMBOL_GPL(bus_for_each_drv); /** * bus_add_device - add device to bus * @dev: device being added * * - Add device's bus attributes. * - Create links to device's bus. * - Add the device to its bus's list of devices. */ int bus_add_device(struct device *dev) { struct subsys_private *sp = bus_to_subsys(dev->bus); int error; if (!sp) { /* * This is a normal operation for many devices that do not * have a bus assigned to them, just say that all went * well. */ return 0; } /* * Reference in sp is now incremented and will be dropped when * the device is removed from the bus */ pr_debug("bus: '%s': add device %s\n", sp->bus->name, dev_name(dev)); error = device_add_groups(dev, sp->bus->dev_groups); if (error) goto out_put; error = sysfs_create_link(&sp->devices_kset->kobj, &dev->kobj, dev_name(dev)); if (error) goto out_groups; error = sysfs_create_link(&dev->kobj, &sp->subsys.kobj, "subsystem"); if (error) goto out_subsys; klist_add_tail(&dev->p->knode_bus, &sp->klist_devices); return 0; out_subsys: sysfs_remove_link(&sp->devices_kset->kobj, dev_name(dev)); out_groups: device_remove_groups(dev, sp->bus->dev_groups); out_put: subsys_put(sp); return error; } /** * bus_probe_device - probe drivers for a new device * @dev: device to probe * * - Automatically probe for a driver if the bus allows it. */ void bus_probe_device(struct device *dev) { struct subsys_private *sp = bus_to_subsys(dev->bus); struct subsys_interface *sif; if (!sp) return; device_initial_probe(dev); mutex_lock(&sp->mutex); list_for_each_entry(sif, &sp->interfaces, node) if (sif->add_dev) sif->add_dev(dev, sif); mutex_unlock(&sp->mutex); subsys_put(sp); } /** * bus_remove_device - remove device from bus * @dev: device to be removed * * - Remove device from all interfaces. * - Remove symlink from bus' directory. * - Delete device from bus's list. * - Detach from its driver. * - Drop reference taken in bus_add_device(). */ void bus_remove_device(struct device *dev) { struct subsys_private *sp = bus_to_subsys(dev->bus); struct subsys_interface *sif; if (!sp) return; mutex_lock(&sp->mutex); list_for_each_entry(sif, &sp->interfaces, node) if (sif->remove_dev) sif->remove_dev(dev, sif); mutex_unlock(&sp->mutex); sysfs_remove_link(&dev->kobj, "subsystem"); sysfs_remove_link(&sp->devices_kset->kobj, dev_name(dev)); device_remove_groups(dev, dev->bus->dev_groups); if (klist_node_attached(&dev->p->knode_bus)) klist_del(&dev->p->knode_bus); pr_debug("bus: '%s': remove device %s\n", dev->bus->name, dev_name(dev)); device_release_driver(dev); /* * Decrement the reference count twice, once for the bus_to_subsys() * call in the start of this function, and the second one from the * reference increment in bus_add_device() */ subsys_put(sp); subsys_put(sp); } static int __must_check add_bind_files(struct device_driver *drv) { int ret; ret = driver_create_file(drv, &driver_attr_unbind); if (ret == 0) { ret = driver_create_file(drv, &driver_attr_bind); if (ret) driver_remove_file(drv, &driver_attr_unbind); } return ret; } static void remove_bind_files(struct device_driver *drv) { driver_remove_file(drv, &driver_attr_bind); driver_remove_file(drv, &driver_attr_unbind); } static BUS_ATTR_WO(drivers_probe); static BUS_ATTR_RW(drivers_autoprobe); static int add_probe_files(const struct bus_type *bus) { int retval; retval = bus_create_file(bus, &bus_attr_drivers_probe); if (retval) goto out; retval = bus_create_file(bus, &bus_attr_drivers_autoprobe); if (retval) bus_remove_file(bus, &bus_attr_drivers_probe); out: return retval; } static void remove_probe_files(const struct bus_type *bus) { bus_remove_file(bus, &bus_attr_drivers_autoprobe); bus_remove_file(bus, &bus_attr_drivers_probe); } static ssize_t uevent_store(struct device_driver *drv, const char *buf, size_t count) { int rc; rc = kobject_synth_uevent(&drv->p->kobj, buf, count); return rc ? rc : count; } static DRIVER_ATTR_WO(uevent); /** * bus_add_driver - Add a driver to the bus. * @drv: driver. */ int bus_add_driver(struct device_driver *drv) { struct subsys_private *sp = bus_to_subsys(drv->bus); struct driver_private *priv; int error = 0; if (!sp) return -EINVAL; /* * Reference in sp is now incremented and will be dropped when * the driver is removed from the bus */ pr_debug("bus: '%s': add driver %s\n", sp->bus->name, drv->name); priv = kzalloc(sizeof(*priv), GFP_KERNEL); if (!priv) { error = -ENOMEM; goto out_put_bus; } klist_init(&priv->klist_devices, NULL, NULL); priv->driver = drv; drv->p = priv; priv->kobj.kset = sp->drivers_kset; error = kobject_init_and_add(&priv->kobj, &driver_ktype, NULL, "%s", drv->name); if (error) goto out_unregister; klist_add_tail(&priv->knode_bus, &sp->klist_drivers); if (sp->drivers_autoprobe) { error = driver_attach(drv); if (error) goto out_del_list; } error = module_add_driver(drv->owner, drv); if (error) { printk(KERN_ERR "%s: failed to create module links for %s\n", __func__, drv->name); goto out_detach; } error = driver_create_file(drv, &driver_attr_uevent); if (error) { printk(KERN_ERR "%s: uevent attr (%s) failed\n", __func__, drv->name); } error = driver_add_groups(drv, sp->bus->drv_groups); if (error) { /* How the hell do we get out of this pickle? Give up */ printk(KERN_ERR "%s: driver_add_groups(%s) failed\n", __func__, drv->name); } if (!drv->suppress_bind_attrs) { error = add_bind_files(drv); if (error) { /* Ditto */ printk(KERN_ERR "%s: add_bind_files(%s) failed\n", __func__, drv->name); } } return 0; out_detach: driver_detach(drv); out_del_list: klist_del(&priv->knode_bus); out_unregister: kobject_put(&priv->kobj); /* drv->p is freed in driver_release() */ drv->p = NULL; out_put_bus: subsys_put(sp); return error; } /** * bus_remove_driver - delete driver from bus's knowledge. * @drv: driver. * * Detach the driver from the devices it controls, and remove * it from its bus's list of drivers. Finally, we drop the reference * to the bus we took in bus_add_driver(). */ void bus_remove_driver(struct device_driver *drv) { struct subsys_private *sp = bus_to_subsys(drv->bus); if (!sp) return; pr_debug("bus: '%s': remove driver %s\n", sp->bus->name, drv->name); if (!drv->suppress_bind_attrs) remove_bind_files(drv); driver_remove_groups(drv, sp->bus->drv_groups); driver_remove_file(drv, &driver_attr_uevent); klist_remove(&drv->p->knode_bus); driver_detach(drv); module_remove_driver(drv); kobject_put(&drv->p->kobj); /* * Decrement the reference count twice, once for the bus_to_subsys() * call in the start of this function, and the second one from the * reference increment in bus_add_driver() */ subsys_put(sp); subsys_put(sp); } /* Helper for bus_rescan_devices's iter */ static int __must_check bus_rescan_devices_helper(struct device *dev, void *data) { int ret = 0; if (!dev->driver) { if (dev->parent && dev->bus->need_parent_lock) device_lock(dev->parent); ret = device_attach(dev); if (dev->parent && dev->bus->need_parent_lock) device_unlock(dev->parent); } return ret < 0 ? ret : 0; } /** * bus_rescan_devices - rescan devices on the bus for possible drivers * @bus: the bus to scan. * * This function will look for devices on the bus with no driver * attached and rescan it against existing drivers to see if it matches * any by calling device_attach() for the unbound devices. */ int bus_rescan_devices(const struct bus_type *bus) { return bus_for_each_dev(bus, NULL, NULL, bus_rescan_devices_helper); } EXPORT_SYMBOL_GPL(bus_rescan_devices); /** * device_reprobe - remove driver for a device and probe for a new driver * @dev: the device to reprobe * * This function detaches the attached driver (if any) for the given * device and restarts the driver probing process. It is intended * to use if probing criteria changed during a devices lifetime and * driver attachment should change accordingly. */ int device_reprobe(struct device *dev) { if (dev->driver) device_driver_detach(dev); return bus_rescan_devices_helper(dev, NULL); } EXPORT_SYMBOL_GPL(device_reprobe); static void klist_devices_get(struct klist_node *n) { struct device_private *dev_prv = to_device_private_bus(n); struct device *dev = dev_prv->device; get_device(dev); } static void klist_devices_put(struct klist_node *n) { struct device_private *dev_prv = to_device_private_bus(n); struct device *dev = dev_prv->device; put_device(dev); } static ssize_t bus_uevent_store(const struct bus_type *bus, const char *buf, size_t count) { struct subsys_private *sp = bus_to_subsys(bus); int ret; if (!sp) return -EINVAL; ret = kobject_synth_uevent(&sp->subsys.kobj, buf, count); subsys_put(sp); if (ret) return ret; return count; } /* * "open code" the old BUS_ATTR() macro here. We want to use BUS_ATTR_WO() * here, but can not use it as earlier in the file we have * DEVICE_ATTR_WO(uevent), which would cause a clash with the with the store * function name. */ static struct bus_attribute bus_attr_uevent = __ATTR(uevent, 0200, NULL, bus_uevent_store); /** * bus_register - register a driver-core subsystem * @bus: bus to register * * Once we have that, we register the bus with the kobject * infrastructure, then register the children subsystems it has: * the devices and drivers that belong to the subsystem. */ int bus_register(const struct bus_type *bus) { int retval; struct subsys_private *priv; struct kobject *bus_kobj; struct lock_class_key *key; priv = kzalloc(sizeof(struct subsys_private), GFP_KERNEL); if (!priv) return -ENOMEM; priv->bus = bus; BLOCKING_INIT_NOTIFIER_HEAD(&priv->bus_notifier); bus_kobj = &priv->subsys.kobj; retval = kobject_set_name(bus_kobj, "%s", bus->name); if (retval) goto out; bus_kobj->kset = bus_kset; bus_kobj->ktype = &bus_ktype; priv->drivers_autoprobe = 1; retval = kset_register(&priv->subsys); if (retval) goto out; retval = bus_create_file(bus, &bus_attr_uevent); if (retval) goto bus_uevent_fail; priv->devices_kset = kset_create_and_add("devices", NULL, bus_kobj); if (!priv->devices_kset) { retval = -ENOMEM; goto bus_devices_fail; } priv->drivers_kset = kset_create_and_add("drivers", NULL, bus_kobj); if (!priv->drivers_kset) { retval = -ENOMEM; goto bus_drivers_fail; } INIT_LIST_HEAD(&priv->interfaces); key = &priv->lock_key; lockdep_register_key(key); __mutex_init(&priv->mutex, "subsys mutex", key); klist_init(&priv->klist_devices, klist_devices_get, klist_devices_put); klist_init(&priv->klist_drivers, NULL, NULL); retval = add_probe_files(bus); if (retval) goto bus_probe_files_fail; retval = sysfs_create_groups(bus_kobj, bus->bus_groups); if (retval) goto bus_groups_fail; pr_debug("bus: '%s': registered\n", bus->name); return 0; bus_groups_fail: remove_probe_files(bus); bus_probe_files_fail: kset_unregister(priv->drivers_kset); bus_drivers_fail: kset_unregister(priv->devices_kset); bus_devices_fail: bus_remove_file(bus, &bus_attr_uevent); bus_uevent_fail: kset_unregister(&priv->subsys); /* Above kset_unregister() will kfree @priv */ priv = NULL; out: kfree(priv); return retval; } EXPORT_SYMBOL_GPL(bus_register); /** * bus_unregister - remove a bus from the system * @bus: bus. * * Unregister the child subsystems and the bus itself. * Finally, we call bus_put() to release the refcount */ void bus_unregister(const struct bus_type *bus) { struct subsys_private *sp = bus_to_subsys(bus); struct kobject *bus_kobj; if (!sp) return; pr_debug("bus: '%s': unregistering\n", bus->name); if (sp->dev_root) device_unregister(sp->dev_root); bus_kobj = &sp->subsys.kobj; sysfs_remove_groups(bus_kobj, bus->bus_groups); remove_probe_files(bus); bus_remove_file(bus, &bus_attr_uevent); kset_unregister(sp->drivers_kset); kset_unregister(sp->devices_kset); kset_unregister(&sp->subsys); subsys_put(sp); } EXPORT_SYMBOL_GPL(bus_unregister); int bus_register_notifier(const struct bus_type *bus, struct notifier_block *nb) { struct subsys_private *sp = bus_to_subsys(bus); int retval; if (!sp) return -EINVAL; retval = blocking_notifier_chain_register(&sp->bus_notifier, nb); subsys_put(sp); return retval; } EXPORT_SYMBOL_GPL(bus_register_notifier); int bus_unregister_notifier(const struct bus_type *bus, struct notifier_block *nb) { struct subsys_private *sp = bus_to_subsys(bus); int retval; if (!sp) return -EINVAL; retval = blocking_notifier_chain_unregister(&sp->bus_notifier, nb); subsys_put(sp); return retval; } EXPORT_SYMBOL_GPL(bus_unregister_notifier); void bus_notify(struct device *dev, enum bus_notifier_event value) { struct subsys_private *sp = bus_to_subsys(dev->bus); if (!sp) return; blocking_notifier_call_chain(&sp->bus_notifier, value, dev); subsys_put(sp); } struct kset *bus_get_kset(const struct bus_type *bus) { struct subsys_private *sp = bus_to_subsys(bus); struct kset *kset; if (!sp) return NULL; kset = &sp->subsys; subsys_put(sp); return kset; } EXPORT_SYMBOL_GPL(bus_get_kset); /* * Yes, this forcibly breaks the klist abstraction temporarily. It * just wants to sort the klist, not change reference counts and * take/drop locks rapidly in the process. It does all this while * holding the lock for the list, so objects can't otherwise be * added/removed while we're swizzling. */ static void device_insertion_sort_klist(struct device *a, struct list_head *list, int (*compare)(const struct device *a, const struct device *b)) { struct klist_node *n; struct device_private *dev_prv; struct device *b; list_for_each_entry(n, list, n_node) { dev_prv = to_device_private_bus(n); b = dev_prv->device; if (compare(a, b) <= 0) { list_move_tail(&a->p->knode_bus.n_node, &b->p->knode_bus.n_node); return; } } list_move_tail(&a->p->knode_bus.n_node, list); } void bus_sort_breadthfirst(const struct bus_type *bus, int (*compare)(const struct device *a, const struct device *b)) { struct subsys_private *sp = bus_to_subsys(bus); LIST_HEAD(sorted_devices); struct klist_node *n, *tmp; struct device_private *dev_prv; struct device *dev; struct klist *device_klist; if (!sp) return; device_klist = &sp->klist_devices; spin_lock(&device_klist->k_lock); list_for_each_entry_safe(n, tmp, &device_klist->k_list, n_node) { dev_prv = to_device_private_bus(n); dev = dev_prv->device; device_insertion_sort_klist(dev, &sorted_devices, compare); } list_splice(&sorted_devices, &device_klist->k_list); spin_unlock(&device_klist->k_lock); subsys_put(sp); } EXPORT_SYMBOL_GPL(bus_sort_breadthfirst); struct subsys_dev_iter { struct klist_iter ki; const struct device_type *type; }; /** * subsys_dev_iter_init - initialize subsys device iterator * @iter: subsys iterator to initialize * @sp: the subsys private (i.e. bus) we wanna iterate over * @start: the device to start iterating from, if any * @type: device_type of the devices to iterate over, NULL for all * * Initialize subsys iterator @iter such that it iterates over devices * of @subsys. If @start is set, the list iteration will start there, * otherwise if it is NULL, the iteration starts at the beginning of * the list. */ static void subsys_dev_iter_init(struct subsys_dev_iter *iter, struct subsys_private *sp, struct device *start, const struct device_type *type) { struct klist_node *start_knode = NULL; if (start) start_knode = &start->p->knode_bus; klist_iter_init_node(&sp->klist_devices, &iter->ki, start_knode); iter->type = type; } /** * subsys_dev_iter_next - iterate to the next device * @iter: subsys iterator to proceed * * Proceed @iter to the next device and return it. Returns NULL if * iteration is complete. * * The returned device is referenced and won't be released till * iterator is proceed to the next device or exited. The caller is * free to do whatever it wants to do with the device including * calling back into subsys code. */ static struct device *subsys_dev_iter_next(struct subsys_dev_iter *iter) { struct klist_node *knode; struct device *dev; for (;;) { knode = klist_next(&iter->ki); if (!knode) return NULL; dev = to_device_private_bus(knode)->device; if (!iter->type || iter->type == dev->type) return dev; } } /** * subsys_dev_iter_exit - finish iteration * @iter: subsys iterator to finish * * Finish an iteration. Always call this function after iteration is * complete whether the iteration ran till the end or not. */ static void subsys_dev_iter_exit(struct subsys_dev_iter *iter) { klist_iter_exit(&iter->ki); } int subsys_interface_register(struct subsys_interface *sif) { struct subsys_private *sp; struct subsys_dev_iter iter; struct device *dev; if (!sif || !sif->subsys) return -ENODEV; sp = bus_to_subsys(sif->subsys); if (!sp) return -EINVAL; /* * Reference in sp is now incremented and will be dropped when * the interface is removed from the bus */ mutex_lock(&sp->mutex); list_add_tail(&sif->node, &sp->interfaces); if (sif->add_dev) { subsys_dev_iter_init(&iter, sp, NULL, NULL); while ((dev = subsys_dev_iter_next(&iter))) sif->add_dev(dev, sif); subsys_dev_iter_exit(&iter); } mutex_unlock(&sp->mutex); return 0; } EXPORT_SYMBOL_GPL(subsys_interface_register); void subsys_interface_unregister(struct subsys_interface *sif) { struct subsys_private *sp; struct subsys_dev_iter iter; struct device *dev; if (!sif || !sif->subsys) return; sp = bus_to_subsys(sif->subsys); if (!sp) return; mutex_lock(&sp->mutex); list_del_init(&sif->node); if (sif->remove_dev) { subsys_dev_iter_init(&iter, sp, NULL, NULL); while ((dev = subsys_dev_iter_next(&iter))) sif->remove_dev(dev, sif); subsys_dev_iter_exit(&iter); } mutex_unlock(&sp->mutex); /* * Decrement the reference count twice, once for the bus_to_subsys() * call in the start of this function, and the second one from the * reference increment in subsys_interface_register() */ subsys_put(sp); subsys_put(sp); } EXPORT_SYMBOL_GPL(subsys_interface_unregister); static void system_root_device_release(struct device *dev) { kfree(dev); } static int subsys_register(const struct bus_type *subsys, const struct attribute_group **groups, struct kobject *parent_of_root) { struct subsys_private *sp; struct device *dev; int err; err = bus_register(subsys); if (err < 0) return err; sp = bus_to_subsys(subsys); if (!sp) { err = -EINVAL; goto err_sp; } dev = kzalloc(sizeof(struct device), GFP_KERNEL); if (!dev) { err = -ENOMEM; goto err_dev; } err = dev_set_name(dev, "%s", subsys->name); if (err < 0) goto err_name; dev->kobj.parent = parent_of_root; dev->groups = groups; dev->release = system_root_device_release; err = device_register(dev); if (err < 0) goto err_dev_reg; sp->dev_root = dev; subsys_put(sp); return 0; err_dev_reg: put_device(dev); dev = NULL; err_name: kfree(dev); err_dev: subsys_put(sp); err_sp: bus_unregister(subsys); return err; } /** * subsys_system_register - register a subsystem at /sys/devices/system/ * @subsys: system subsystem * @groups: default attributes for the root device * * All 'system' subsystems have a /sys/devices/system/<name> root device * with the name of the subsystem. The root device can carry subsystem- * wide attributes. All registered devices are below this single root * device and are named after the subsystem with a simple enumeration * number appended. The registered devices are not explicitly named; * only 'id' in the device needs to be set. * * Do not use this interface for anything new, it exists for compatibility * with bad ideas only. New subsystems should use plain subsystems; and * add the subsystem-wide attributes should be added to the subsystem * directory itself and not some create fake root-device placed in * /sys/devices/system/<name>. */ int subsys_system_register(const struct bus_type *subsys, const struct attribute_group **groups) { return subsys_register(subsys, groups, &system_kset->kobj); } EXPORT_SYMBOL_GPL(subsys_system_register); /** * subsys_virtual_register - register a subsystem at /sys/devices/virtual/ * @subsys: virtual subsystem * @groups: default attributes for the root device * * All 'virtual' subsystems have a /sys/devices/system/<name> root device * with the name of the subsystem. The root device can carry subsystem-wide * attributes. All registered devices are below this single root device. * There's no restriction on device naming. This is for kernel software * constructs which need sysfs interface. */ int subsys_virtual_register(const struct bus_type *subsys, const struct attribute_group **groups) { struct kobject *virtual_dir; virtual_dir = virtual_device_parent(); if (!virtual_dir) return -ENOMEM; return subsys_register(subsys, groups, virtual_dir); } EXPORT_SYMBOL_GPL(subsys_virtual_register); /** * driver_find - locate driver on a bus by its name. * @name: name of the driver. * @bus: bus to scan for the driver. * * Call kset_find_obj() to iterate over list of drivers on * a bus to find driver by name. Return driver if found. * * This routine provides no locking to prevent the driver it returns * from being unregistered or unloaded while the caller is using it. * The caller is responsible for preventing this. */ struct device_driver *driver_find(const char *name, const struct bus_type *bus) { struct subsys_private *sp = bus_to_subsys(bus); struct kobject *k; struct driver_private *priv; if (!sp) return NULL; k = kset_find_obj(sp->drivers_kset, name); subsys_put(sp); if (!k) return NULL; priv = to_driver(k); /* Drop reference added by kset_find_obj() */ kobject_put(k); return priv->driver; } EXPORT_SYMBOL_GPL(driver_find); /* * Warning, the value could go to "removed" instantly after calling this function, so be very * careful when calling it... */ bool bus_is_registered(const struct bus_type *bus) { struct subsys_private *sp = bus_to_subsys(bus); bool is_initialized = false; if (sp) { is_initialized = true; subsys_put(sp); } return is_initialized; } /** * bus_get_dev_root - return a pointer to the "device root" of a bus * @bus: bus to return the device root of. * * If a bus has a "device root" structure, return it, WITH THE REFERENCE * COUNT INCREMENTED. * * Note, when finished with the device, a call to put_device() is required. * * If the device root is not present (or bus is not a valid pointer), NULL * will be returned. */ struct device *bus_get_dev_root(const struct bus_type *bus) { struct subsys_private *sp = bus_to_subsys(bus); struct device *dev_root; if (!sp) return NULL; dev_root = get_device(sp->dev_root); subsys_put(sp); return dev_root; } EXPORT_SYMBOL_GPL(bus_get_dev_root); int __init buses_init(void) { bus_kset = kset_create_and_add("bus", &bus_uevent_ops, NULL); if (!bus_kset) return -ENOMEM; system_kset = kset_create_and_add("system", NULL, &devices_kset->kobj); if (!system_kset) { /* Do error handling here as devices_init() do */ kset_unregister(bus_kset); bus_kset = NULL; pr_err("%s: failed to create and add kset 'bus'\n", __func__); return -ENOMEM; } return 0; } |
| 2702 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 | /* SPDX-License-Identifier: GPL-2.0 */ #undef TRACE_SYSTEM #define TRACE_SYSTEM ipi #if !defined(_TRACE_IPI_H) || defined(TRACE_HEADER_MULTI_READ) #define _TRACE_IPI_H #include <linux/tracepoint.h> TRACE_EVENT(ipi_send_cpu, TP_PROTO(const unsigned int cpu, unsigned long callsite, void *callback), TP_ARGS(cpu, callsite, callback), TP_STRUCT__entry( __field(unsigned int, cpu) __field(void *, callsite) __field(void *, callback) ), TP_fast_assign( __entry->cpu = cpu; __entry->callsite = (void *)callsite; __entry->callback = callback; ), TP_printk("cpu=%u callsite=%pS callback=%pS", __entry->cpu, __entry->callsite, __entry->callback) ); TRACE_EVENT(ipi_send_cpumask, TP_PROTO(const struct cpumask *cpumask, unsigned long callsite, void *callback), TP_ARGS(cpumask, callsite, callback), TP_STRUCT__entry( __cpumask(cpumask) __field(void *, callsite) __field(void *, callback) ), TP_fast_assign( __assign_cpumask(cpumask, cpumask_bits(cpumask)); __entry->callsite = (void *)callsite; __entry->callback = callback; ), TP_printk("cpumask=%s callsite=%pS callback=%pS", __get_cpumask(cpumask), __entry->callsite, __entry->callback) ); #ifdef CONFIG_HAVE_EXTRA_IPI_TRACEPOINTS /** * ipi_raise - called when a smp cross call is made * * @mask: mask of recipient CPUs for the IPI * @reason: string identifying the IPI purpose * * It is necessary for @reason to be a static string declared with * __tracepoint_string. */ TRACE_EVENT(ipi_raise, TP_PROTO(const struct cpumask *mask, const char *reason), TP_ARGS(mask, reason), TP_STRUCT__entry( __bitmask(target_cpus, nr_cpumask_bits) __field(const char *, reason) ), TP_fast_assign( __assign_bitmask(target_cpus, cpumask_bits(mask), nr_cpumask_bits); __entry->reason = reason; ), TP_printk("target_mask=%s (%s)", __get_bitmask(target_cpus), __entry->reason) ); DECLARE_EVENT_CLASS(ipi_handler, TP_PROTO(const char *reason), TP_ARGS(reason), TP_STRUCT__entry( __field(const char *, reason) ), TP_fast_assign( __entry->reason = reason; ), TP_printk("(%s)", __entry->reason) ); /** * ipi_entry - called immediately before the IPI handler * * @reason: string identifying the IPI purpose * * It is necessary for @reason to be a static string declared with * __tracepoint_string, ideally the same as used with trace_ipi_raise * for that IPI. */ DEFINE_EVENT(ipi_handler, ipi_entry, TP_PROTO(const char *reason), TP_ARGS(reason) ); /** * ipi_exit - called immediately after the IPI handler returns * * @reason: string identifying the IPI purpose * * It is necessary for @reason to be a static string declared with * __tracepoint_string, ideally the same as used with trace_ipi_raise for * that IPI. */ DEFINE_EVENT(ipi_handler, ipi_exit, TP_PROTO(const char *reason), TP_ARGS(reason) ); #endif /* CONFIG_HAVE_EXTRA_IPI_TRACEPOINTS */ #endif /* _TRACE_IPI_H */ /* This part must be outside protection */ #include <trace/define_trace.h> |
| 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 | // SPDX-License-Identifier: GPL-2.0-only /* * Creates audit record for dropped/accepted packets * * (C) 2010-2011 Thomas Graf <tgraf@redhat.com> * (C) 2010-2011 Red Hat, Inc. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/audit.h> #include <linux/module.h> #include <linux/skbuff.h> #include <linux/tcp.h> #include <linux/udp.h> #include <linux/if_arp.h> #include <linux/netfilter/x_tables.h> #include <linux/netfilter/xt_AUDIT.h> #include <linux/netfilter_bridge/ebtables.h> #include <net/ipv6.h> #include <net/ip.h> MODULE_LICENSE("GPL"); MODULE_AUTHOR("Thomas Graf <tgraf@redhat.com>"); MODULE_DESCRIPTION("Xtables: creates audit records for dropped/accepted packets"); MODULE_ALIAS("ipt_AUDIT"); MODULE_ALIAS("ip6t_AUDIT"); MODULE_ALIAS("ebt_AUDIT"); MODULE_ALIAS("arpt_AUDIT"); static bool audit_ip4(struct audit_buffer *ab, struct sk_buff *skb) { struct iphdr _iph; const struct iphdr *ih; ih = skb_header_pointer(skb, skb_network_offset(skb), sizeof(_iph), &_iph); if (!ih) return false; audit_log_format(ab, " saddr=%pI4 daddr=%pI4 proto=%hhu", &ih->saddr, &ih->daddr, ih->protocol); return true; } static bool audit_ip6(struct audit_buffer *ab, struct sk_buff *skb) { struct ipv6hdr _ip6h; const struct ipv6hdr *ih; u8 nexthdr; __be16 frag_off; ih = skb_header_pointer(skb, skb_network_offset(skb), sizeof(_ip6h), &_ip6h); if (!ih) return false; nexthdr = ih->nexthdr; ipv6_skip_exthdr(skb, skb_network_offset(skb) + sizeof(_ip6h), &nexthdr, &frag_off); audit_log_format(ab, " saddr=%pI6c daddr=%pI6c proto=%hhu", &ih->saddr, &ih->daddr, nexthdr); return true; } static unsigned int audit_tg(struct sk_buff *skb, const struct xt_action_param *par) { struct audit_buffer *ab; int fam = -1; if (audit_enabled == AUDIT_OFF) goto errout; ab = audit_log_start(NULL, GFP_ATOMIC, AUDIT_NETFILTER_PKT); if (ab == NULL) goto errout; audit_log_format(ab, "mark=%#x", skb->mark); switch (xt_family(par)) { case NFPROTO_BRIDGE: switch (eth_hdr(skb)->h_proto) { case htons(ETH_P_IP): fam = audit_ip4(ab, skb) ? NFPROTO_IPV4 : -1; break; case htons(ETH_P_IPV6): fam = audit_ip6(ab, skb) ? NFPROTO_IPV6 : -1; break; } break; case NFPROTO_IPV4: fam = audit_ip4(ab, skb) ? NFPROTO_IPV4 : -1; break; case NFPROTO_IPV6: fam = audit_ip6(ab, skb) ? NFPROTO_IPV6 : -1; break; } if (fam == -1) audit_log_format(ab, " saddr=? daddr=? proto=-1"); audit_log_end(ab); errout: return XT_CONTINUE; } static unsigned int audit_tg_ebt(struct sk_buff *skb, const struct xt_action_param *par) { audit_tg(skb, par); return EBT_CONTINUE; } static int audit_tg_check(const struct xt_tgchk_param *par) { const struct xt_audit_info *info = par->targinfo; if (info->type > XT_AUDIT_TYPE_MAX) { pr_info_ratelimited("Audit type out of range (valid range: 0..%u)\n", XT_AUDIT_TYPE_MAX); return -ERANGE; } return 0; } static struct xt_target audit_tg_reg[] __read_mostly = { { .name = "AUDIT", .family = NFPROTO_UNSPEC, .target = audit_tg, .targetsize = sizeof(struct xt_audit_info), .checkentry = audit_tg_check, .me = THIS_MODULE, }, { .name = "AUDIT", .family = NFPROTO_BRIDGE, .target = audit_tg_ebt, .targetsize = sizeof(struct xt_audit_info), .checkentry = audit_tg_check, .me = THIS_MODULE, }, }; static int __init audit_tg_init(void) { return xt_register_targets(audit_tg_reg, ARRAY_SIZE(audit_tg_reg)); } static void __exit audit_tg_exit(void) { xt_unregister_targets(audit_tg_reg, ARRAY_SIZE(audit_tg_reg)); } module_init(audit_tg_init); module_exit(audit_tg_exit); |
| 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 | // SPDX-License-Identifier: GPL-2.0+ /* * usbdux.c * Copyright (C) 2003-2014 Bernd Porr, mail@berndporr.me.uk */ /* * Driver: usbdux * Description: University of Stirling USB DAQ & INCITE Technology Limited * Devices: [ITL] USB-DUX (usbdux) * Author: Bernd Porr <mail@berndporr.me.uk> * Updated: 10 Oct 2014 * Status: Stable * * Connection scheme for the counter at the digital port: * 0=/CLK0, 1=UP/DOWN0, 2=RESET0, 4=/CLK1, 5=UP/DOWN1, 6=RESET1. * The sampling rate of the counter is approximately 500Hz. * * Note that under USB2.0 the length of the channel list determines * the max sampling rate. If you sample only one channel you get 8kHz * sampling rate. If you sample two channels you get 4kHz and so on. */ /* * I must give credit here to Chris Baugher who * wrote the driver for AT-MIO-16d. I used some parts of this * driver. I also must give credits to David Brownell * who supported me with the USB development. * * Bernd Porr * * * Revision history: * 0.94: D/A output should work now with any channel list combinations * 0.95: .owner commented out for kernel vers below 2.4.19 * sanity checks in ai/ao_cmd * 0.96: trying to get it working with 2.6, moved all memory alloc to comedi's * attach final USB IDs * moved memory allocation completely to the corresponding comedi * functions firmware upload is by fxload and no longer by comedi (due to * enumeration) * 0.97: USB IDs received, adjusted table * 0.98: SMP, locking, memory alloc: moved all usb memory alloc * to the usb subsystem and moved all comedi related memory * alloc to comedi. * | kernel | registration | usbdux-usb | usbdux-comedi | comedi | * 0.99: USB 2.0: changed protocol to isochronous transfer * IRQ transfer is too buggy and too risky in 2.0 * for the high speed ISO transfer is now a working version * available * 0.99b: Increased the iso transfer buffer for high sp.to 10 buffers. Some VIA * chipsets miss out IRQs. Deeper buffering is needed. * 1.00: full USB 2.0 support for the A/D converter. Now: max 8kHz sampling * rate. * Firmware vers 1.00 is needed for this. * Two 16 bit up/down/reset counter with a sampling rate of 1kHz * And loads of cleaning up, in particular streamlining the * bulk transfers. * 1.1: moved EP4 transfers to EP1 to make space for a PWM output on EP4 * 1.2: added PWM support via EP4 * 2.0: PWM seems to be stable and is not interfering with the other functions * 2.1: changed PWM API * 2.2: added firmware kernel request to fix an udev problem * 2.3: corrected a bug in bulk timeouts which were far too short * 2.4: fixed a bug which causes the driver to hang when it ran out of data. * Thanks to Jan-Matthias Braun and Ian to spot the bug and fix it. * */ #include <linux/kernel.h> #include <linux/module.h> #include <linux/slab.h> #include <linux/input.h> #include <linux/fcntl.h> #include <linux/compiler.h> #include <linux/comedi/comedi_usb.h> /* constants for firmware upload and download */ #define USBDUX_FIRMWARE "usbdux_firmware.bin" #define USBDUX_FIRMWARE_MAX_LEN 0x2000 #define USBDUX_FIRMWARE_CMD 0xa0 #define VENDOR_DIR_IN 0xc0 #define VENDOR_DIR_OUT 0x40 #define USBDUX_CPU_CS 0xe600 /* usbdux bulk transfer commands */ #define USBDUX_CMD_MULT_AI 0 #define USBDUX_CMD_AO 1 #define USBDUX_CMD_DIO_CFG 2 #define USBDUX_CMD_DIO_BITS 3 #define USBDUX_CMD_SINGLE_AI 4 #define USBDUX_CMD_TIMER_RD 5 #define USBDUX_CMD_TIMER_WR 6 #define USBDUX_CMD_PWM_ON 7 #define USBDUX_CMD_PWM_OFF 8 /* timeout for the USB-transfer in ms */ #define BULK_TIMEOUT 1000 /* 300Hz max frequ under PWM */ #define MIN_PWM_PERIOD ((long)(1E9 / 300)) /* Default PWM frequency */ #define PWM_DEFAULT_PERIOD ((long)(1E9 / 100)) /* Size of one A/D value */ #define SIZEADIN ((sizeof(u16))) /* * Size of the input-buffer IN BYTES * Always multiple of 8 for 8 microframes which is needed in the highspeed mode */ #define SIZEINBUF (8 * SIZEADIN) /* 16 bytes. */ #define SIZEINSNBUF 16 /* size of one value for the D/A converter: channel and value */ #define SIZEDAOUT ((sizeof(u8) + sizeof(u16))) /* * Size of the output-buffer in bytes * Actually only the first 4 triplets are used but for the * high speed mode we need to pad it to 8 (microframes). */ #define SIZEOUTBUF (8 * SIZEDAOUT) /* * Size of the buffer for the dux commands: just now max size is determined * by the analogue out + command byte + panic bytes... */ #define SIZEOFDUXBUFFER (8 * SIZEDAOUT + 2) /* Number of in-URBs which receive the data: min=2 */ #define NUMOFINBUFFERSFULL 5 /* Number of out-URBs which send the data: min=2 */ #define NUMOFOUTBUFFERSFULL 5 /* Number of in-URBs which receive the data: min=5 */ /* must have more buffers due to buggy USB ctr */ #define NUMOFINBUFFERSHIGH 10 /* Number of out-URBs which send the data: min=5 */ /* must have more buffers due to buggy USB ctr */ #define NUMOFOUTBUFFERSHIGH 10 /* number of retries to get the right dux command */ #define RETRIES 10 static const struct comedi_lrange range_usbdux_ai_range = { 4, { BIP_RANGE(4.096), BIP_RANGE(4.096 / 2), UNI_RANGE(4.096), UNI_RANGE(4.096 / 2) } }; static const struct comedi_lrange range_usbdux_ao_range = { 2, { BIP_RANGE(4.096), UNI_RANGE(4.096) } }; struct usbdux_private { /* actual number of in-buffers */ int n_ai_urbs; /* actual number of out-buffers */ int n_ao_urbs; /* ISO-transfer handling: buffers */ struct urb **ai_urbs; struct urb **ao_urbs; /* pwm-transfer handling */ struct urb *pwm_urb; /* PWM period */ unsigned int pwm_period; /* PWM internal delay for the GPIF in the FX2 */ u8 pwm_delay; /* size of the PWM buffer which holds the bit pattern */ int pwm_buf_sz; /* input buffer for the ISO-transfer */ __le16 *in_buf; /* input buffer for single insn */ __le16 *insn_buf; unsigned int high_speed:1; unsigned int ai_cmd_running:1; unsigned int ao_cmd_running:1; unsigned int pwm_cmd_running:1; /* time between samples in units of the timer */ unsigned int ai_timer; unsigned int ao_timer; /* counter between aquisitions */ unsigned int ai_counter; unsigned int ao_counter; /* interval in frames/uframes */ unsigned int ai_interval; /* commands */ u8 *dux_commands; struct mutex mut; }; static void usbdux_unlink_urbs(struct urb **urbs, int num_urbs) { int i; for (i = 0; i < num_urbs; i++) usb_kill_urb(urbs[i]); } static void usbdux_ai_stop(struct comedi_device *dev, int do_unlink) { struct usbdux_private *devpriv = dev->private; if (do_unlink && devpriv->ai_urbs) usbdux_unlink_urbs(devpriv->ai_urbs, devpriv->n_ai_urbs); devpriv->ai_cmd_running = 0; } static int usbdux_ai_cancel(struct comedi_device *dev, struct comedi_subdevice *s) { struct usbdux_private *devpriv = dev->private; /* prevent other CPUs from submitting new commands just now */ mutex_lock(&devpriv->mut); /* unlink only if the urb really has been submitted */ usbdux_ai_stop(dev, devpriv->ai_cmd_running); mutex_unlock(&devpriv->mut); return 0; } static void usbduxsub_ai_handle_urb(struct comedi_device *dev, struct comedi_subdevice *s, struct urb *urb) { struct usbdux_private *devpriv = dev->private; struct comedi_async *async = s->async; struct comedi_cmd *cmd = &async->cmd; int ret; int i; devpriv->ai_counter--; if (devpriv->ai_counter == 0) { devpriv->ai_counter = devpriv->ai_timer; /* get the data from the USB bus and hand it over to comedi */ for (i = 0; i < cmd->chanlist_len; i++) { unsigned int range = CR_RANGE(cmd->chanlist[i]); u16 val = le16_to_cpu(devpriv->in_buf[i]); /* bipolar data is two's-complement */ if (comedi_range_is_bipolar(s, range)) val = comedi_offset_munge(s, val); /* transfer data */ if (!comedi_buf_write_samples(s, &val, 1)) return; } if (cmd->stop_src == TRIG_COUNT && async->scans_done >= cmd->stop_arg) async->events |= COMEDI_CB_EOA; } /* if command is still running, resubmit urb */ if (!(async->events & COMEDI_CB_CANCEL_MASK)) { urb->dev = comedi_to_usb_dev(dev); ret = usb_submit_urb(urb, GFP_ATOMIC); if (ret < 0) { dev_err(dev->class_dev, "urb resubmit failed in int-context! err=%d\n", ret); if (ret == -EL2NSYNC) dev_err(dev->class_dev, "buggy USB host controller or bug in IRQ handler!\n"); async->events |= COMEDI_CB_ERROR; } } } static void usbduxsub_ai_isoc_irq(struct urb *urb) { struct comedi_device *dev = urb->context; struct comedi_subdevice *s = dev->read_subdev; struct comedi_async *async = s->async; struct usbdux_private *devpriv = dev->private; /* exit if not running a command, do not resubmit urb */ if (!devpriv->ai_cmd_running) return; switch (urb->status) { case 0: /* copy the result in the transfer buffer */ memcpy(devpriv->in_buf, urb->transfer_buffer, SIZEINBUF); usbduxsub_ai_handle_urb(dev, s, urb); break; case -EILSEQ: /* * error in the ISOchronous data * we don't copy the data into the transfer buffer * and recycle the last data byte */ dev_dbg(dev->class_dev, "CRC error in ISO IN stream\n"); usbduxsub_ai_handle_urb(dev, s, urb); break; case -ECONNRESET: case -ENOENT: case -ESHUTDOWN: case -ECONNABORTED: /* after an unlink command, unplug, ... etc */ async->events |= COMEDI_CB_ERROR; break; default: /* a real error */ dev_err(dev->class_dev, "Non-zero urb status received in ai intr context: %d\n", urb->status); async->events |= COMEDI_CB_ERROR; break; } /* * comedi_handle_events() cannot be used in this driver. The (*cancel) * operation would unlink the urb. */ if (async->events & COMEDI_CB_CANCEL_MASK) usbdux_ai_stop(dev, 0); comedi_event(dev, s); } static void usbdux_ao_stop(struct comedi_device *dev, int do_unlink) { struct usbdux_private *devpriv = dev->private; if (do_unlink && devpriv->ao_urbs) usbdux_unlink_urbs(devpriv->ao_urbs, devpriv->n_ao_urbs); devpriv->ao_cmd_running = 0; } static int usbdux_ao_cancel(struct comedi_device *dev, struct comedi_subdevice *s) { struct usbdux_private *devpriv = dev->private; /* prevent other CPUs from submitting a command just now */ mutex_lock(&devpriv->mut); /* unlink only if it is really running */ usbdux_ao_stop(dev, devpriv->ao_cmd_running); mutex_unlock(&devpriv->mut); return 0; } static void usbduxsub_ao_handle_urb(struct comedi_device *dev, struct comedi_subdevice *s, struct urb *urb) { struct usbdux_private *devpriv = dev->private; struct comedi_async *async = s->async; struct comedi_cmd *cmd = &async->cmd; u8 *datap; int ret; int i; devpriv->ao_counter--; if (devpriv->ao_counter == 0) { devpriv->ao_counter = devpriv->ao_timer; if (cmd->stop_src == TRIG_COUNT && async->scans_done >= cmd->stop_arg) { async->events |= COMEDI_CB_EOA; return; } /* transmit data to the USB bus */ datap = urb->transfer_buffer; *datap++ = cmd->chanlist_len; for (i = 0; i < cmd->chanlist_len; i++) { unsigned int chan = CR_CHAN(cmd->chanlist[i]); unsigned short val; if (!comedi_buf_read_samples(s, &val, 1)) { dev_err(dev->class_dev, "buffer underflow\n"); async->events |= COMEDI_CB_OVERFLOW; return; } /* pointer to the DA */ *datap++ = val & 0xff; *datap++ = (val >> 8) & 0xff; *datap++ = chan << 6; s->readback[chan] = val; } } /* if command is still running, resubmit urb for BULK transfer */ if (!(async->events & COMEDI_CB_CANCEL_MASK)) { urb->transfer_buffer_length = SIZEOUTBUF; urb->dev = comedi_to_usb_dev(dev); urb->status = 0; if (devpriv->high_speed) urb->interval = 8; /* uframes */ else urb->interval = 1; /* frames */ urb->number_of_packets = 1; urb->iso_frame_desc[0].offset = 0; urb->iso_frame_desc[0].length = SIZEOUTBUF; urb->iso_frame_desc[0].status = 0; ret = usb_submit_urb(urb, GFP_ATOMIC); if (ret < 0) { dev_err(dev->class_dev, "ao urb resubm failed in int-cont. ret=%d", ret); if (ret == -EL2NSYNC) dev_err(dev->class_dev, "buggy USB host controller or bug in IRQ handling!\n"); async->events |= COMEDI_CB_ERROR; } } } static void usbduxsub_ao_isoc_irq(struct urb *urb) { struct comedi_device *dev = urb->context; struct comedi_subdevice *s = dev->write_subdev; struct comedi_async *async = s->async; struct usbdux_private *devpriv = dev->private; /* exit if not running a command, do not resubmit urb */ if (!devpriv->ao_cmd_running) return; switch (urb->status) { case 0: usbduxsub_ao_handle_urb(dev, s, urb); break; case -ECONNRESET: case -ENOENT: case -ESHUTDOWN: case -ECONNABORTED: /* after an unlink command, unplug, ... etc */ async->events |= COMEDI_CB_ERROR; break; default: /* a real error */ dev_err(dev->class_dev, "Non-zero urb status received in ao intr context: %d\n", urb->status); async->events |= COMEDI_CB_ERROR; break; } /* * comedi_handle_events() cannot be used in this driver. The (*cancel) * operation would unlink the urb. */ if (async->events & COMEDI_CB_CANCEL_MASK) usbdux_ao_stop(dev, 0); comedi_event(dev, s); } static int usbdux_submit_urbs(struct comedi_device *dev, struct urb **urbs, int num_urbs, int input_urb) { struct usb_device *usb = comedi_to_usb_dev(dev); struct usbdux_private *devpriv = dev->private; struct urb *urb; int ret; int i; /* Submit all URBs and start the transfer on the bus */ for (i = 0; i < num_urbs; i++) { urb = urbs[i]; /* in case of a resubmission after an unlink... */ if (input_urb) urb->interval = devpriv->ai_interval; urb->context = dev; urb->dev = usb; urb->status = 0; urb->transfer_flags = URB_ISO_ASAP; ret = usb_submit_urb(urb, GFP_ATOMIC); if (ret) return ret; } return 0; } static int usbdux_ai_cmdtest(struct comedi_device *dev, struct comedi_subdevice *s, struct comedi_cmd *cmd) { struct usbdux_private *devpriv = dev->private; int err = 0; /* Step 1 : check if triggers are trivially valid */ err |= comedi_check_trigger_src(&cmd->start_src, TRIG_NOW | TRIG_INT); err |= comedi_check_trigger_src(&cmd->scan_begin_src, TRIG_TIMER); err |= comedi_check_trigger_src(&cmd->convert_src, TRIG_NOW); err |= comedi_check_trigger_src(&cmd->scan_end_src, TRIG_COUNT); err |= comedi_check_trigger_src(&cmd->stop_src, TRIG_COUNT | TRIG_NONE); if (err) return 1; /* Step 2a : make sure trigger sources are unique */ err |= comedi_check_trigger_is_unique(cmd->start_src); err |= comedi_check_trigger_is_unique(cmd->stop_src); /* Step 2b : and mutually compatible */ if (err) return 2; /* Step 3: check if arguments are trivially valid */ err |= comedi_check_trigger_arg_is(&cmd->start_arg, 0); if (cmd->scan_begin_src == TRIG_FOLLOW) /* internal trigger */ err |= comedi_check_trigger_arg_is(&cmd->scan_begin_arg, 0); if (cmd->scan_begin_src == TRIG_TIMER) { /* full speed does 1kHz scans every USB frame */ unsigned int arg = 1000000; unsigned int min_arg = arg; if (devpriv->high_speed) { /* * In high speed mode microframes are possible. * However, during one microframe we can roughly * sample one channel. Thus, the more channels * are in the channel list the more time we need. */ int i = 1; /* find a power of 2 for the number of channels */ while (i < cmd->chanlist_len) i = i * 2; arg /= 8; min_arg = arg * i; } err |= comedi_check_trigger_arg_min(&cmd->scan_begin_arg, min_arg); /* calc the real sampling rate with the rounding errors */ arg = (cmd->scan_begin_arg / arg) * arg; err |= comedi_check_trigger_arg_is(&cmd->scan_begin_arg, arg); } err |= comedi_check_trigger_arg_is(&cmd->scan_end_arg, cmd->chanlist_len); if (cmd->stop_src == TRIG_COUNT) err |= comedi_check_trigger_arg_min(&cmd->stop_arg, 1); else /* TRIG_NONE */ err |= comedi_check_trigger_arg_is(&cmd->stop_arg, 0); if (err) return 3; return 0; } /* * creates the ADC command for the MAX1271 * range is the range value from comedi */ static u8 create_adc_command(unsigned int chan, unsigned int range) { u8 p = (range <= 1); u8 r = ((range % 2) == 0); return (chan << 4) | ((p == 1) << 2) | ((r == 1) << 3); } static int send_dux_commands(struct comedi_device *dev, unsigned int cmd_type) { struct usb_device *usb = comedi_to_usb_dev(dev); struct usbdux_private *devpriv = dev->private; int nsent; devpriv->dux_commands[0] = cmd_type; return usb_bulk_msg(usb, usb_sndbulkpipe(usb, 1), devpriv->dux_commands, SIZEOFDUXBUFFER, &nsent, BULK_TIMEOUT); } static int receive_dux_commands(struct comedi_device *dev, unsigned int command) { struct usb_device *usb = comedi_to_usb_dev(dev); struct usbdux_private *devpriv = dev->private; int ret; int nrec; int i; for (i = 0; i < RETRIES; i++) { ret = usb_bulk_msg(usb, usb_rcvbulkpipe(usb, 8), devpriv->insn_buf, SIZEINSNBUF, &nrec, BULK_TIMEOUT); if (ret < 0) return ret; if (le16_to_cpu(devpriv->insn_buf[0]) == command) return ret; } /* command not received */ return -EFAULT; } static int usbdux_ai_inttrig(struct comedi_device *dev, struct comedi_subdevice *s, unsigned int trig_num) { struct usbdux_private *devpriv = dev->private; struct comedi_cmd *cmd = &s->async->cmd; int ret; if (trig_num != cmd->start_arg) return -EINVAL; mutex_lock(&devpriv->mut); if (!devpriv->ai_cmd_running) { devpriv->ai_cmd_running = 1; ret = usbdux_submit_urbs(dev, devpriv->ai_urbs, devpriv->n_ai_urbs, 1); if (ret < 0) { devpriv->ai_cmd_running = 0; goto ai_trig_exit; } s->async->inttrig = NULL; } else { ret = -EBUSY; } ai_trig_exit: mutex_unlock(&devpriv->mut); return ret; } static int usbdux_ai_cmd(struct comedi_device *dev, struct comedi_subdevice *s) { struct usbdux_private *devpriv = dev->private; struct comedi_cmd *cmd = &s->async->cmd; int len = cmd->chanlist_len; int ret = -EBUSY; int i; /* block other CPUs from starting an ai_cmd */ mutex_lock(&devpriv->mut); if (devpriv->ai_cmd_running) goto ai_cmd_exit; devpriv->dux_commands[1] = len; for (i = 0; i < len; ++i) { unsigned int chan = CR_CHAN(cmd->chanlist[i]); unsigned int range = CR_RANGE(cmd->chanlist[i]); devpriv->dux_commands[i + 2] = create_adc_command(chan, range); } ret = send_dux_commands(dev, USBDUX_CMD_MULT_AI); if (ret < 0) goto ai_cmd_exit; if (devpriv->high_speed) { /* * every channel gets a time window of 125us. Thus, if we * sample all 8 channels we need 1ms. If we sample only one * channel we need only 125us */ devpriv->ai_interval = 1; /* find a power of 2 for the interval */ while (devpriv->ai_interval < len) devpriv->ai_interval *= 2; devpriv->ai_timer = cmd->scan_begin_arg / (125000 * devpriv->ai_interval); } else { /* interval always 1ms */ devpriv->ai_interval = 1; devpriv->ai_timer = cmd->scan_begin_arg / 1000000; } if (devpriv->ai_timer < 1) { ret = -EINVAL; goto ai_cmd_exit; } devpriv->ai_counter = devpriv->ai_timer; if (cmd->start_src == TRIG_NOW) { /* enable this acquisition operation */ devpriv->ai_cmd_running = 1; ret = usbdux_submit_urbs(dev, devpriv->ai_urbs, devpriv->n_ai_urbs, 1); if (ret < 0) { devpriv->ai_cmd_running = 0; /* fixme: unlink here?? */ goto ai_cmd_exit; } s->async->inttrig = NULL; } else { /* TRIG_INT */ /* don't enable the acquision operation */ /* wait for an internal signal */ s->async->inttrig = usbdux_ai_inttrig; } ai_cmd_exit: mutex_unlock(&devpriv->mut); return ret; } /* Mode 0 is used to get a single conversion on demand */ static int usbdux_ai_insn_read(struct comedi_device *dev, struct comedi_subdevice *s, struct comedi_insn *insn, unsigned int *data) { struct usbdux_private *devpriv = dev->private; unsigned int chan = CR_CHAN(insn->chanspec); unsigned int range = CR_RANGE(insn->chanspec); unsigned int val; int ret = -EBUSY; int i; mutex_lock(&devpriv->mut); if (devpriv->ai_cmd_running) goto ai_read_exit; /* set command for the first channel */ devpriv->dux_commands[1] = create_adc_command(chan, range); /* adc commands */ ret = send_dux_commands(dev, USBDUX_CMD_SINGLE_AI); if (ret < 0) goto ai_read_exit; for (i = 0; i < insn->n; i++) { ret = receive_dux_commands(dev, USBDUX_CMD_SINGLE_AI); if (ret < 0) goto ai_read_exit; val = le16_to_cpu(devpriv->insn_buf[1]); /* bipolar data is two's-complement */ if (comedi_range_is_bipolar(s, range)) val = comedi_offset_munge(s, val); data[i] = val; } ai_read_exit: mutex_unlock(&devpriv->mut); return ret ? ret : insn->n; } static int usbdux_ao_insn_read(struct comedi_device *dev, struct comedi_subdevice *s, struct comedi_insn *insn, unsigned int *data) { struct usbdux_private *devpriv = dev->private; int ret; mutex_lock(&devpriv->mut); ret = comedi_readback_insn_read(dev, s, insn, data); mutex_unlock(&devpriv->mut); return ret; } static int usbdux_ao_insn_write(struct comedi_device *dev, struct comedi_subdevice *s, struct comedi_insn *insn, unsigned int *data) { struct usbdux_private *devpriv = dev->private; unsigned int chan = CR_CHAN(insn->chanspec); __le16 *p = (__le16 *)&devpriv->dux_commands[2]; int ret = -EBUSY; int i; mutex_lock(&devpriv->mut); if (devpriv->ao_cmd_running) goto ao_write_exit; /* number of channels: 1 */ devpriv->dux_commands[1] = 1; /* channel number */ devpriv->dux_commands[4] = chan << 6; for (i = 0; i < insn->n; i++) { unsigned int val = data[i]; /* one 16 bit value */ *p = cpu_to_le16(val); ret = send_dux_commands(dev, USBDUX_CMD_AO); if (ret < 0) goto ao_write_exit; s->readback[chan] = val; } ao_write_exit: mutex_unlock(&devpriv->mut); return ret ? ret : insn->n; } static int usbdux_ao_inttrig(struct comedi_device *dev, struct comedi_subdevice *s, unsigned int trig_num) { struct usbdux_private *devpriv = dev->private; struct comedi_cmd *cmd = &s->async->cmd; int ret; if (trig_num != cmd->start_arg) return -EINVAL; mutex_lock(&devpriv->mut); if (!devpriv->ao_cmd_running) { devpriv->ao_cmd_running = 1; ret = usbdux_submit_urbs(dev, devpriv->ao_urbs, devpriv->n_ao_urbs, 0); if (ret < 0) { devpriv->ao_cmd_running = 0; goto ao_trig_exit; } s->async->inttrig = NULL; } else { ret = -EBUSY; } ao_trig_exit: mutex_unlock(&devpriv->mut); return ret; } static int usbdux_ao_cmdtest(struct comedi_device *dev, struct comedi_subdevice *s, struct comedi_cmd *cmd) { int err = 0; unsigned int flags; /* Step 1 : check if triggers are trivially valid */ err |= comedi_check_trigger_src(&cmd->start_src, TRIG_NOW | TRIG_INT); if (0) { /* (devpriv->high_speed) */ /* the sampling rate is set by the coversion rate */ flags = TRIG_FOLLOW; } else { /* start a new scan (output at once) with a timer */ flags = TRIG_TIMER; } err |= comedi_check_trigger_src(&cmd->scan_begin_src, flags); if (0) { /* (devpriv->high_speed) */ /* * in usb-2.0 only one conversion it transmitted * but with 8kHz/n */ flags = TRIG_TIMER; } else { /* * all conversion events happen simultaneously with * a rate of 1kHz/n */ flags = TRIG_NOW; } err |= comedi_check_trigger_src(&cmd->convert_src, flags); err |= comedi_check_trigger_src(&cmd->scan_end_src, TRIG_COUNT); err |= comedi_check_trigger_src(&cmd->stop_src, TRIG_COUNT | TRIG_NONE); if (err) return 1; /* Step 2a : make sure trigger sources are unique */ err |= comedi_check_trigger_is_unique(cmd->start_src); err |= comedi_check_trigger_is_unique(cmd->stop_src); /* Step 2b : and mutually compatible */ if (err) return 2; /* Step 3: check if arguments are trivially valid */ err |= comedi_check_trigger_arg_is(&cmd->start_arg, 0); if (cmd->scan_begin_src == TRIG_FOLLOW) /* internal trigger */ err |= comedi_check_trigger_arg_is(&cmd->scan_begin_arg, 0); if (cmd->scan_begin_src == TRIG_TIMER) { err |= comedi_check_trigger_arg_min(&cmd->scan_begin_arg, 1000000); } /* not used now, is for later use */ if (cmd->convert_src == TRIG_TIMER) err |= comedi_check_trigger_arg_min(&cmd->convert_arg, 125000); err |= comedi_check_trigger_arg_is(&cmd->scan_end_arg, cmd->chanlist_len); if (cmd->stop_src == TRIG_COUNT) err |= comedi_check_trigger_arg_min(&cmd->stop_arg, 1); else /* TRIG_NONE */ err |= comedi_check_trigger_arg_is(&cmd->stop_arg, 0); if (err) return 3; return 0; } static int usbdux_ao_cmd(struct comedi_device *dev, struct comedi_subdevice *s) { struct usbdux_private *devpriv = dev->private; struct comedi_cmd *cmd = &s->async->cmd; int ret = -EBUSY; mutex_lock(&devpriv->mut); if (devpriv->ao_cmd_running) goto ao_cmd_exit; /* we count in steps of 1ms (125us) */ /* 125us mode not used yet */ if (0) { /* (devpriv->high_speed) */ /* 125us */ /* timing of the conversion itself: every 125 us */ devpriv->ao_timer = cmd->convert_arg / 125000; } else { /* 1ms */ /* timing of the scan: we get all channels at once */ devpriv->ao_timer = cmd->scan_begin_arg / 1000000; if (devpriv->ao_timer < 1) { ret = -EINVAL; goto ao_cmd_exit; } } devpriv->ao_counter = devpriv->ao_timer; if (cmd->start_src == TRIG_NOW) { /* enable this acquisition operation */ devpriv->ao_cmd_running = 1; ret = usbdux_submit_urbs(dev, devpriv->ao_urbs, devpriv->n_ao_urbs, 0); if (ret < 0) { devpriv->ao_cmd_running = 0; /* fixme: unlink here?? */ goto ao_cmd_exit; } s->async->inttrig = NULL; } else { /* TRIG_INT */ /* submit the urbs later */ /* wait for an internal signal */ s->async->inttrig = usbdux_ao_inttrig; } ao_cmd_exit: mutex_unlock(&devpriv->mut); return ret; } static int usbdux_dio_insn_config(struct comedi_device *dev, struct comedi_subdevice *s, struct comedi_insn *insn, unsigned int *data) { int ret; ret = comedi_dio_insn_config(dev, s, insn, data, 0); if (ret) return ret; /* * We don't tell the firmware here as it would take 8 frames * to submit the information. We do it in the insn_bits. */ return insn->n; } static int usbdux_dio_insn_bits(struct comedi_device *dev, struct comedi_subdevice *s, struct comedi_insn *insn, unsigned int *data) { struct usbdux_private *devpriv = dev->private; int ret; mutex_lock(&devpriv->mut); comedi_dio_update_state(s, data); /* Always update the hardware. See the (*insn_config). */ devpriv->dux_commands[1] = s->io_bits; devpriv->dux_commands[2] = s->state; /* * This command also tells the firmware to return * the digital input lines. */ ret = send_dux_commands(dev, USBDUX_CMD_DIO_BITS); if (ret < 0) goto dio_exit; ret = receive_dux_commands(dev, USBDUX_CMD_DIO_BITS); if (ret < 0) goto dio_exit; data[1] = le16_to_cpu(devpriv->insn_buf[1]); dio_exit: mutex_unlock(&devpriv->mut); return ret ? ret : insn->n; } static int usbdux_counter_read(struct comedi_device *dev, struct comedi_subdevice *s, struct comedi_insn *insn, unsigned int *data) { struct usbdux_private *devpriv = dev->private; unsigned int chan = CR_CHAN(insn->chanspec); int ret = 0; int i; mutex_lock(&devpriv->mut); for (i = 0; i < insn->n; i++) { ret = send_dux_commands(dev, USBDUX_CMD_TIMER_RD); if (ret < 0) goto counter_read_exit; ret = receive_dux_commands(dev, USBDUX_CMD_TIMER_RD); if (ret < 0) goto counter_read_exit; data[i] = le16_to_cpu(devpriv->insn_buf[chan + 1]); } counter_read_exit: mutex_unlock(&devpriv->mut); return ret ? ret : insn->n; } static int usbdux_counter_write(struct comedi_device *dev, struct comedi_subdevice *s, struct comedi_insn *insn, unsigned int *data) { struct usbdux_private *devpriv = dev->private; unsigned int chan = CR_CHAN(insn->chanspec); __le16 *p = (__le16 *)&devpriv->dux_commands[2]; int ret = 0; int i; mutex_lock(&devpriv->mut); devpriv->dux_commands[1] = chan; for (i = 0; i < insn->n; i++) { *p = cpu_to_le16(data[i]); ret = send_dux_commands(dev, USBDUX_CMD_TIMER_WR); if (ret < 0) break; } mutex_unlock(&devpriv->mut); return ret ? ret : insn->n; } static int usbdux_counter_config(struct comedi_device *dev, struct comedi_subdevice *s, struct comedi_insn *insn, unsigned int *data) { /* nothing to do so far */ return 2; } static void usbduxsub_unlink_pwm_urbs(struct comedi_device *dev) { struct usbdux_private *devpriv = dev->private; usb_kill_urb(devpriv->pwm_urb); } static void usbdux_pwm_stop(struct comedi_device *dev, int do_unlink) { struct usbdux_private *devpriv = dev->private; if (do_unlink) usbduxsub_unlink_pwm_urbs(dev); devpriv->pwm_cmd_running = 0; } static int usbdux_pwm_cancel(struct comedi_device *dev, struct comedi_subdevice *s) { struct usbdux_private *devpriv = dev->private; int ret; mutex_lock(&devpriv->mut); /* unlink only if it is really running */ usbdux_pwm_stop(dev, devpriv->pwm_cmd_running); ret = send_dux_commands(dev, USBDUX_CMD_PWM_OFF); mutex_unlock(&devpriv->mut); return ret; } static void usbduxsub_pwm_irq(struct urb *urb) { struct comedi_device *dev = urb->context; struct usbdux_private *devpriv = dev->private; int ret; switch (urb->status) { case 0: /* success */ break; case -ECONNRESET: case -ENOENT: case -ESHUTDOWN: case -ECONNABORTED: /* * after an unlink command, unplug, ... etc * no unlink needed here. Already shutting down. */ if (devpriv->pwm_cmd_running) usbdux_pwm_stop(dev, 0); return; default: /* a real error */ if (devpriv->pwm_cmd_running) { dev_err(dev->class_dev, "Non-zero urb status received in pwm intr context: %d\n", urb->status); usbdux_pwm_stop(dev, 0); } return; } /* are we actually running? */ if (!devpriv->pwm_cmd_running) return; urb->transfer_buffer_length = devpriv->pwm_buf_sz; urb->dev = comedi_to_usb_dev(dev); urb->status = 0; if (devpriv->pwm_cmd_running) { ret = usb_submit_urb(urb, GFP_ATOMIC); if (ret < 0) { dev_err(dev->class_dev, "pwm urb resubm failed in int-cont. ret=%d", ret); if (ret == -EL2NSYNC) dev_err(dev->class_dev, "buggy USB host controller or bug in IRQ handling!\n"); /* don't do an unlink here */ usbdux_pwm_stop(dev, 0); } } } static int usbduxsub_submit_pwm_urbs(struct comedi_device *dev) { struct usb_device *usb = comedi_to_usb_dev(dev); struct usbdux_private *devpriv = dev->private; struct urb *urb = devpriv->pwm_urb; /* in case of a resubmission after an unlink... */ usb_fill_bulk_urb(urb, usb, usb_sndbulkpipe(usb, 4), urb->transfer_buffer, devpriv->pwm_buf_sz, usbduxsub_pwm_irq, dev); return usb_submit_urb(urb, GFP_ATOMIC); } static int usbdux_pwm_period(struct comedi_device *dev, struct comedi_subdevice *s, unsigned int period) { struct usbdux_private *devpriv = dev->private; int fx2delay; if (period < MIN_PWM_PERIOD) return -EAGAIN; fx2delay = (period / (6 * 512 * 1000 / 33)) - 6; if (fx2delay > 255) return -EAGAIN; devpriv->pwm_delay = fx2delay; devpriv->pwm_period = period; return 0; } static int usbdux_pwm_start(struct comedi_device *dev, struct comedi_subdevice *s) { struct usbdux_private *devpriv = dev->private; int ret = 0; mutex_lock(&devpriv->mut); if (devpriv->pwm_cmd_running) goto pwm_start_exit; devpriv->dux_commands[1] = devpriv->pwm_delay; ret = send_dux_commands(dev, USBDUX_CMD_PWM_ON); if (ret < 0) goto pwm_start_exit; /* initialise the buffer */ memset(devpriv->pwm_urb->transfer_buffer, 0, devpriv->pwm_buf_sz); devpriv->pwm_cmd_running = 1; ret = usbduxsub_submit_pwm_urbs(dev); if (ret < 0) devpriv->pwm_cmd_running = 0; pwm_start_exit: mutex_unlock(&devpriv->mut); return ret; } static void usbdux_pwm_pattern(struct comedi_device *dev, struct comedi_subdevice *s, unsigned int chan, unsigned int value, unsigned int sign) { struct usbdux_private *devpriv = dev->private; char pwm_mask = (1 << chan); /* DIO bit for the PWM data */ char sgn_mask = (16 << chan); /* DIO bit for the sign */ char *buf = (char *)(devpriv->pwm_urb->transfer_buffer); int szbuf = devpriv->pwm_buf_sz; int i; for (i = 0; i < szbuf; i++) { char c = *buf; c &= ~pwm_mask; if (i < value) c |= pwm_mask; if (!sign) c &= ~sgn_mask; else c |= sgn_mask; *buf++ = c; } } static int usbdux_pwm_write(struct comedi_device *dev, struct comedi_subdevice *s, struct comedi_insn *insn, unsigned int *data) { unsigned int chan = CR_CHAN(insn->chanspec); /* * It doesn't make sense to support more than one value here * because it would just overwrite the PWM buffer. */ if (insn->n != 1) return -EINVAL; /* * The sign is set via a special INSN only, this gives us 8 bits * for normal operation, sign is 0 by default. */ usbdux_pwm_pattern(dev, s, chan, data[0], 0); return insn->n; } static int usbdux_pwm_config(struct comedi_device *dev, struct comedi_subdevice *s, struct comedi_insn *insn, unsigned int *data) { struct usbdux_private *devpriv = dev->private; unsigned int chan = CR_CHAN(insn->chanspec); switch (data[0]) { case INSN_CONFIG_ARM: /* * if not zero the PWM is limited to a certain time which is * not supported here */ if (data[1] != 0) return -EINVAL; return usbdux_pwm_start(dev, s); case INSN_CONFIG_DISARM: return usbdux_pwm_cancel(dev, s); case INSN_CONFIG_GET_PWM_STATUS: data[1] = devpriv->pwm_cmd_running; return 0; case INSN_CONFIG_PWM_SET_PERIOD: return usbdux_pwm_period(dev, s, data[1]); case INSN_CONFIG_PWM_GET_PERIOD: data[1] = devpriv->pwm_period; return 0; case INSN_CONFIG_PWM_SET_H_BRIDGE: /* * data[1] = value * data[2] = sign (for a relay) */ usbdux_pwm_pattern(dev, s, chan, data[1], (data[2] != 0)); return 0; case INSN_CONFIG_PWM_GET_H_BRIDGE: /* values are not kept in this driver, nothing to return here */ return -EINVAL; } return -EINVAL; } static int usbdux_firmware_upload(struct comedi_device *dev, const u8 *data, size_t size, unsigned long context) { struct usb_device *usb = comedi_to_usb_dev(dev); u8 *buf; u8 *tmp; int ret; if (!data) return 0; if (size > USBDUX_FIRMWARE_MAX_LEN) { dev_err(dev->class_dev, "usbdux firmware binary it too large for FX2.\n"); return -ENOMEM; } /* we generate a local buffer for the firmware */ buf = kmemdup(data, size, GFP_KERNEL); if (!buf) return -ENOMEM; /* we need a malloc'ed buffer for usb_control_msg() */ tmp = kmalloc(1, GFP_KERNEL); if (!tmp) { kfree(buf); return -ENOMEM; } /* stop the current firmware on the device */ *tmp = 1; /* 7f92 to one */ ret = usb_control_msg(usb, usb_sndctrlpipe(usb, 0), USBDUX_FIRMWARE_CMD, VENDOR_DIR_OUT, USBDUX_CPU_CS, 0x0000, tmp, 1, BULK_TIMEOUT); if (ret < 0) { dev_err(dev->class_dev, "can not stop firmware\n"); goto done; } /* upload the new firmware to the device */ ret = usb_control_msg(usb, usb_sndctrlpipe(usb, 0), USBDUX_FIRMWARE_CMD, VENDOR_DIR_OUT, 0, 0x0000, buf, size, BULK_TIMEOUT); if (ret < 0) { dev_err(dev->class_dev, "firmware upload failed\n"); goto done; } /* start the new firmware on the device */ *tmp = 0; /* 7f92 to zero */ ret = usb_control_msg(usb, usb_sndctrlpipe(usb, 0), USBDUX_FIRMWARE_CMD, VENDOR_DIR_OUT, USBDUX_CPU_CS, 0x0000, tmp, 1, BULK_TIMEOUT); if (ret < 0) dev_err(dev->class_dev, "can not start firmware\n"); done: kfree(tmp); kfree(buf); return ret; } static int usbdux_alloc_usb_buffers(struct comedi_device *dev) { struct usb_device *usb = comedi_to_usb_dev(dev); struct usbdux_private *devpriv = dev->private; struct urb *urb; int i; devpriv->dux_commands = kzalloc(SIZEOFDUXBUFFER, GFP_KERNEL); devpriv->in_buf = kzalloc(SIZEINBUF, GFP_KERNEL); devpriv->insn_buf = kzalloc(SIZEINSNBUF, GFP_KERNEL); devpriv->ai_urbs = kcalloc(devpriv->n_ai_urbs, sizeof(void *), GFP_KERNEL); devpriv->ao_urbs = kcalloc(devpriv->n_ao_urbs, sizeof(void *), GFP_KERNEL); if (!devpriv->dux_commands || !devpriv->in_buf || !devpriv->insn_buf || !devpriv->ai_urbs || !devpriv->ao_urbs) return -ENOMEM; for (i = 0; i < devpriv->n_ai_urbs; i++) { /* one frame: 1ms */ urb = usb_alloc_urb(1, GFP_KERNEL); if (!urb) return -ENOMEM; devpriv->ai_urbs[i] = urb; urb->dev = usb; urb->context = dev; urb->pipe = usb_rcvisocpipe(usb, 6); urb->transfer_flags = URB_ISO_ASAP; urb->transfer_buffer = kzalloc(SIZEINBUF, GFP_KERNEL); if (!urb->transfer_buffer) return -ENOMEM; urb->complete = usbduxsub_ai_isoc_irq; urb->number_of_packets = 1; urb->transfer_buffer_length = SIZEINBUF; urb->iso_frame_desc[0].offset = 0; urb->iso_frame_desc[0].length = SIZEINBUF; } for (i = 0; i < devpriv->n_ao_urbs; i++) { /* one frame: 1ms */ urb = usb_alloc_urb(1, GFP_KERNEL); if (!urb) return -ENOMEM; devpriv->ao_urbs[i] = urb; urb->dev = usb; urb->context = dev; urb->pipe = usb_sndisocpipe(usb, 2); urb->transfer_flags = URB_ISO_ASAP; urb->transfer_buffer = kzalloc(SIZEOUTBUF, GFP_KERNEL); if (!urb->transfer_buffer) return -ENOMEM; urb->complete = usbduxsub_ao_isoc_irq; urb->number_of_packets = 1; urb->transfer_buffer_length = SIZEOUTBUF; urb->iso_frame_desc[0].offset = 0; urb->iso_frame_desc[0].length = SIZEOUTBUF; if (devpriv->high_speed) urb->interval = 8; /* uframes */ else urb->interval = 1; /* frames */ } /* pwm */ if (devpriv->pwm_buf_sz) { urb = usb_alloc_urb(0, GFP_KERNEL); if (!urb) return -ENOMEM; devpriv->pwm_urb = urb; /* max bulk ep size in high speed */ urb->transfer_buffer = kzalloc(devpriv->pwm_buf_sz, GFP_KERNEL); if (!urb->transfer_buffer) return -ENOMEM; } return 0; } static void usbdux_free_usb_buffers(struct comedi_device *dev) { struct usbdux_private *devpriv = dev->private; struct urb *urb; int i; urb = devpriv->pwm_urb; if (urb) { kfree(urb->transfer_buffer); usb_free_urb(urb); } if (devpriv->ao_urbs) { for (i = 0; i < devpriv->n_ao_urbs; i++) { urb = devpriv->ao_urbs[i]; if (urb) { kfree(urb->transfer_buffer); usb_free_urb(urb); } } kfree(devpriv->ao_urbs); } if (devpriv->ai_urbs) { for (i = 0; i < devpriv->n_ai_urbs; i++) { urb = devpriv->ai_urbs[i]; if (urb) { kfree(urb->transfer_buffer); usb_free_urb(urb); } } kfree(devpriv->ai_urbs); } kfree(devpriv->insn_buf); kfree(devpriv->in_buf); kfree(devpriv->dux_commands); } static int usbdux_auto_attach(struct comedi_device *dev, unsigned long context_unused) { struct usb_interface *intf = comedi_to_usb_interface(dev); struct usb_device *usb = comedi_to_usb_dev(dev); struct usbdux_private *devpriv; struct comedi_subdevice *s; int ret; devpriv = comedi_alloc_devpriv(dev, sizeof(*devpriv)); if (!devpriv) return -ENOMEM; mutex_init(&devpriv->mut); usb_set_intfdata(intf, devpriv); devpriv->high_speed = (usb->speed == USB_SPEED_HIGH); if (devpriv->high_speed) { devpriv->n_ai_urbs = NUMOFINBUFFERSHIGH; devpriv->n_ao_urbs = NUMOFOUTBUFFERSHIGH; devpriv->pwm_buf_sz = 512; } else { devpriv->n_ai_urbs = NUMOFINBUFFERSFULL; devpriv->n_ao_urbs = NUMOFOUTBUFFERSFULL; } ret = usbdux_alloc_usb_buffers(dev); if (ret) return ret; /* setting to alternate setting 3: enabling iso ep and bulk ep. */ ret = usb_set_interface(usb, intf->altsetting->desc.bInterfaceNumber, 3); if (ret < 0) { dev_err(dev->class_dev, "could not set alternate setting 3 in high speed\n"); return ret; } ret = comedi_load_firmware(dev, &usb->dev, USBDUX_FIRMWARE, usbdux_firmware_upload, 0); if (ret < 0) return ret; ret = comedi_alloc_subdevices(dev, (devpriv->high_speed) ? 5 : 4); if (ret) return ret; /* Analog Input subdevice */ s = &dev->subdevices[0]; dev->read_subdev = s; s->type = COMEDI_SUBD_AI; s->subdev_flags = SDF_READABLE | SDF_GROUND | SDF_CMD_READ; s->n_chan = 8; s->maxdata = 0x0fff; s->len_chanlist = 8; s->range_table = &range_usbdux_ai_range; s->insn_read = usbdux_ai_insn_read; s->do_cmdtest = usbdux_ai_cmdtest; s->do_cmd = usbdux_ai_cmd; s->cancel = usbdux_ai_cancel; /* Analog Output subdevice */ s = &dev->subdevices[1]; dev->write_subdev = s; s->type = COMEDI_SUBD_AO; s->subdev_flags = SDF_WRITABLE | SDF_GROUND | SDF_CMD_WRITE; s->n_chan = 4; s->maxdata = 0x0fff; s->len_chanlist = s->n_chan; s->range_table = &range_usbdux_ao_range; s->do_cmdtest = usbdux_ao_cmdtest; s->do_cmd = usbdux_ao_cmd; s->cancel = usbdux_ao_cancel; s->insn_read = usbdux_ao_insn_read; s->insn_write = usbdux_ao_insn_write; ret = comedi_alloc_subdev_readback(s); if (ret) return ret; /* Digital I/O subdevice */ s = &dev->subdevices[2]; s->type = COMEDI_SUBD_DIO; s->subdev_flags = SDF_READABLE | SDF_WRITABLE; s->n_chan = 8; s->maxdata = 1; s->range_table = &range_digital; s->insn_bits = usbdux_dio_insn_bits; s->insn_config = usbdux_dio_insn_config; /* Counter subdevice */ s = &dev->subdevices[3]; s->type = COMEDI_SUBD_COUNTER; s->subdev_flags = SDF_WRITABLE | SDF_READABLE; s->n_chan = 4; s->maxdata = 0xffff; s->insn_read = usbdux_counter_read; s->insn_write = usbdux_counter_write; s->insn_config = usbdux_counter_config; if (devpriv->high_speed) { /* PWM subdevice */ s = &dev->subdevices[4]; s->type = COMEDI_SUBD_PWM; s->subdev_flags = SDF_WRITABLE | SDF_PWM_HBRIDGE; s->n_chan = 8; s->maxdata = devpriv->pwm_buf_sz; s->insn_write = usbdux_pwm_write; s->insn_config = usbdux_pwm_config; usbdux_pwm_period(dev, s, PWM_DEFAULT_PERIOD); } return 0; } static void usbdux_detach(struct comedi_device *dev) { struct usb_interface *intf = comedi_to_usb_interface(dev); struct usbdux_private *devpriv = dev->private; usb_set_intfdata(intf, NULL); if (!devpriv) return; mutex_lock(&devpriv->mut); /* force unlink all urbs */ usbdux_pwm_stop(dev, 1); usbdux_ao_stop(dev, 1); usbdux_ai_stop(dev, 1); usbdux_free_usb_buffers(dev); mutex_unlock(&devpriv->mut); mutex_destroy(&devpriv->mut); } static struct comedi_driver usbdux_driver = { .driver_name = "usbdux", .module = THIS_MODULE, .auto_attach = usbdux_auto_attach, .detach = usbdux_detach, }; static int usbdux_usb_probe(struct usb_interface *intf, const struct usb_device_id *id) { return comedi_usb_auto_config(intf, &usbdux_driver, 0); } static const struct usb_device_id usbdux_usb_table[] = { { USB_DEVICE(0x13d8, 0x0001) }, { USB_DEVICE(0x13d8, 0x0002) }, { } }; MODULE_DEVICE_TABLE(usb, usbdux_usb_table); static struct usb_driver usbdux_usb_driver = { .name = "usbdux", .probe = usbdux_usb_probe, .disconnect = comedi_usb_auto_unconfig, .id_table = usbdux_usb_table, }; module_comedi_usb_driver(usbdux_driver, usbdux_usb_driver); MODULE_AUTHOR("Bernd Porr, BerndPorr@f2s.com"); MODULE_DESCRIPTION("Stirling/ITL USB-DUX -- Bernd.Porr@f2s.com"); MODULE_LICENSE("GPL"); MODULE_FIRMWARE(USBDUX_FIRMWARE); |
| 1 1 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 | // SPDX-License-Identifier: GPL-2.0 /* * consolemap.c * * Mapping from internal code (such as Latin-1 or Unicode or IBM PC code) * to font positions. * * aeb, 950210 * * Support for multiple unimaps by Jakub Jelinek <jj@ultra.linux.cz>, July 1998 * * Fix bug in inverse translation. Stanislav Voronyi <stas@cnti.uanet.kharkov.ua>, Dec 1998 * * In order to prevent the following circular lock dependency: * &mm->mmap_lock --> cpu_hotplug.lock --> console_lock --> &mm->mmap_lock * * We cannot allow page fault to happen while holding the console_lock. * Therefore, all the userspace copy operations have to be done outside * the console_lock critical sections. * * As all the affected functions are all called directly from vt_ioctl(), we * can allocate some small buffers directly on stack without worrying about * stack overflow. */ #include <linux/bitfield.h> #include <linux/bits.h> #include <linux/module.h> #include <linux/kd.h> #include <linux/errno.h> #include <linux/mm.h> #include <linux/slab.h> #include <linux/init.h> #include <linux/tty.h> #include <linux/uaccess.h> #include <linux/console.h> #include <linux/consolemap.h> #include <linux/vt_kern.h> #include <linux/string.h> static unsigned short translations[][E_TABSZ] = { /* 8-bit Latin-1 mapped to Unicode -- trivial mapping */ [LAT1_MAP] = { 0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007, 0x0008, 0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x000e, 0x000f, 0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017, 0x0018, 0x0019, 0x001a, 0x001b, 0x001c, 0x001d, 0x001e, 0x001f, 0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, 0x0028, 0x0029, 0x002a, 0x002b, 0x002c, 0x002d, 0x002e, 0x002f, 0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037, 0x0038, 0x0039, 0x003a, 0x003b, 0x003c, 0x003d, 0x003e, 0x003f, 0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047, 0x0048, 0x0049, 0x004a, 0x004b, 0x004c, 0x004d, 0x004e, 0x004f, 0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057, 0x0058, 0x0059, 0x005a, 0x005b, 0x005c, 0x005d, 0x005e, 0x005f, 0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067, 0x0068, 0x0069, 0x006a, 0x006b, 0x006c, 0x006d, 0x006e, 0x006f, 0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077, 0x0078, 0x0079, 0x007a, 0x007b, 0x007c, 0x007d, 0x007e, 0x007f, 0x0080, 0x0081, 0x0082, 0x0083, 0x0084, 0x0085, 0x0086, 0x0087, 0x0088, 0x0089, 0x008a, 0x008b, 0x008c, 0x008d, 0x008e, 0x008f, 0x0090, 0x0091, 0x0092, 0x0093, 0x0094, 0x0095, 0x0096, 0x0097, 0x0098, 0x0099, 0x009a, 0x009b, 0x009c, 0x009d, 0x009e, 0x009f, 0x00a0, 0x00a1, 0x00a2, 0x00a3, 0x00a4, 0x00a5, 0x00a6, 0x00a7, 0x00a8, 0x00a9, 0x00aa, 0x00ab, 0x00ac, 0x00ad, 0x00ae, 0x00af, 0x00b0, 0x00b1, 0x00b2, 0x00b3, 0x00b4, 0x00b5, 0x00b6, 0x00b7, 0x00b8, 0x00b9, 0x00ba, 0x00bb, 0x00bc, 0x00bd, 0x00be, 0x00bf, 0x00c0, 0x00c1, 0x00c2, 0x00c3, 0x00c4, 0x00c5, 0x00c6, 0x00c7, 0x00c8, 0x00c9, 0x00ca, 0x00cb, 0x00cc, 0x00cd, 0x00ce, 0x00cf, 0x00d0, 0x00d1, 0x00d2, 0x00d3, 0x00d4, 0x00d5, 0x00d6, 0x00d7, 0x00d8, 0x00d9, 0x00da, 0x00db, 0x00dc, 0x00dd, 0x00de, 0x00df, 0x00e0, 0x00e1, 0x00e2, 0x00e3, 0x00e4, 0x00e5, 0x00e6, 0x00e7, 0x00e8, 0x00e9, 0x00ea, 0x00eb, 0x00ec, 0x00ed, 0x00ee, 0x00ef, 0x00f0, 0x00f1, 0x00f2, 0x00f3, 0x00f4, 0x00f5, 0x00f6, 0x00f7, 0x00f8, 0x00f9, 0x00fa, 0x00fb, 0x00fc, 0x00fd, 0x00fe, 0x00ff }, /* VT100 graphics mapped to Unicode */ [GRAF_MAP] = { 0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007, 0x0008, 0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x000e, 0x000f, 0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017, 0x0018, 0x0019, 0x001a, 0x001b, 0x001c, 0x001d, 0x001e, 0x001f, 0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, 0x0028, 0x0029, 0x002a, 0x2192, 0x2190, 0x2191, 0x2193, 0x002f, 0x2588, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037, 0x0038, 0x0039, 0x003a, 0x003b, 0x003c, 0x003d, 0x003e, 0x003f, 0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047, 0x0048, 0x0049, 0x004a, 0x004b, 0x004c, 0x004d, 0x004e, 0x004f, 0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057, 0x0058, 0x0059, 0x005a, 0x005b, 0x005c, 0x005d, 0x005e, 0x00a0, 0x25c6, 0x2592, 0x2409, 0x240c, 0x240d, 0x240a, 0x00b0, 0x00b1, 0x2591, 0x240b, 0x2518, 0x2510, 0x250c, 0x2514, 0x253c, 0x23ba, 0x23bb, 0x2500, 0x23bc, 0x23bd, 0x251c, 0x2524, 0x2534, 0x252c, 0x2502, 0x2264, 0x2265, 0x03c0, 0x2260, 0x00a3, 0x00b7, 0x007f, 0x0080, 0x0081, 0x0082, 0x0083, 0x0084, 0x0085, 0x0086, 0x0087, 0x0088, 0x0089, 0x008a, 0x008b, 0x008c, 0x008d, 0x008e, 0x008f, 0x0090, 0x0091, 0x0092, 0x0093, 0x0094, 0x0095, 0x0096, 0x0097, 0x0098, 0x0099, 0x009a, 0x009b, 0x009c, 0x009d, 0x009e, 0x009f, 0x00a0, 0x00a1, 0x00a2, 0x00a3, 0x00a4, 0x00a5, 0x00a6, 0x00a7, 0x00a8, 0x00a9, 0x00aa, 0x00ab, 0x00ac, 0x00ad, 0x00ae, 0x00af, 0x00b0, 0x00b1, 0x00b2, 0x00b3, 0x00b4, 0x00b5, 0x00b6, 0x00b7, 0x00b8, 0x00b9, 0x00ba, 0x00bb, 0x00bc, 0x00bd, 0x00be, 0x00bf, 0x00c0, 0x00c1, 0x00c2, 0x00c3, 0x00c4, 0x00c5, 0x00c6, 0x00c7, 0x00c8, 0x00c9, 0x00ca, 0x00cb, 0x00cc, 0x00cd, 0x00ce, 0x00cf, 0x00d0, 0x00d1, 0x00d2, 0x00d3, 0x00d4, 0x00d5, 0x00d6, 0x00d7, 0x00d8, 0x00d9, 0x00da, 0x00db, 0x00dc, 0x00dd, 0x00de, 0x00df, 0x00e0, 0x00e1, 0x00e2, 0x00e3, 0x00e4, 0x00e5, 0x00e6, 0x00e7, 0x00e8, 0x00e9, 0x00ea, 0x00eb, 0x00ec, 0x00ed, 0x00ee, 0x00ef, 0x00f0, 0x00f1, 0x00f2, 0x00f3, 0x00f4, 0x00f5, 0x00f6, 0x00f7, 0x00f8, 0x00f9, 0x00fa, 0x00fb, 0x00fc, 0x00fd, 0x00fe, 0x00ff }, /* IBM Codepage 437 mapped to Unicode */ [IBMPC_MAP] = { 0x0000, 0x263a, 0x263b, 0x2665, 0x2666, 0x2663, 0x2660, 0x2022, 0x25d8, 0x25cb, 0x25d9, 0x2642, 0x2640, 0x266a, 0x266b, 0x263c, 0x25b6, 0x25c0, 0x2195, 0x203c, 0x00b6, 0x00a7, 0x25ac, 0x21a8, 0x2191, 0x2193, 0x2192, 0x2190, 0x221f, 0x2194, 0x25b2, 0x25bc, 0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, 0x0028, 0x0029, 0x002a, 0x002b, 0x002c, 0x002d, 0x002e, 0x002f, 0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037, 0x0038, 0x0039, 0x003a, 0x003b, 0x003c, 0x003d, 0x003e, 0x003f, 0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047, 0x0048, 0x0049, 0x004a, 0x004b, 0x004c, 0x004d, 0x004e, 0x004f, 0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057, 0x0058, 0x0059, 0x005a, 0x005b, 0x005c, 0x005d, 0x005e, 0x005f, 0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067, 0x0068, 0x0069, 0x006a, 0x006b, 0x006c, 0x006d, 0x006e, 0x006f, 0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077, 0x0078, 0x0079, 0x007a, 0x007b, 0x007c, 0x007d, 0x007e, 0x2302, 0x00c7, 0x00fc, 0x00e9, 0x00e2, 0x00e4, 0x00e0, 0x00e5, 0x00e7, 0x00ea, 0x00eb, 0x00e8, 0x00ef, 0x00ee, 0x00ec, 0x00c4, 0x00c5, 0x00c9, 0x00e6, 0x00c6, 0x00f4, 0x00f6, 0x00f2, 0x00fb, 0x00f9, 0x00ff, 0x00d6, 0x00dc, 0x00a2, 0x00a3, 0x00a5, 0x20a7, 0x0192, 0x00e1, 0x00ed, 0x00f3, 0x00fa, 0x00f1, 0x00d1, 0x00aa, 0x00ba, 0x00bf, 0x2310, 0x00ac, 0x00bd, 0x00bc, 0x00a1, 0x00ab, 0x00bb, 0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556, 0x2555, 0x2563, 0x2551, 0x2557, 0x255d, 0x255c, 0x255b, 0x2510, 0x2514, 0x2534, 0x252c, 0x251c, 0x2500, 0x253c, 0x255e, 0x255f, 0x255a, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256c, 0x2567, 0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256b, 0x256a, 0x2518, 0x250c, 0x2588, 0x2584, 0x258c, 0x2590, 0x2580, 0x03b1, 0x00df, 0x0393, 0x03c0, 0x03a3, 0x03c3, 0x00b5, 0x03c4, 0x03a6, 0x0398, 0x03a9, 0x03b4, 0x221e, 0x03c6, 0x03b5, 0x2229, 0x2261, 0x00b1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00f7, 0x2248, 0x00b0, 0x2219, 0x00b7, 0x221a, 0x207f, 0x00b2, 0x25a0, 0x00a0 }, /* User mapping -- default to codes for direct font mapping */ [USER_MAP] = { 0xf000, 0xf001, 0xf002, 0xf003, 0xf004, 0xf005, 0xf006, 0xf007, 0xf008, 0xf009, 0xf00a, 0xf00b, 0xf00c, 0xf00d, 0xf00e, 0xf00f, 0xf010, 0xf011, 0xf012, 0xf013, 0xf014, 0xf015, 0xf016, 0xf017, 0xf018, 0xf019, 0xf01a, 0xf01b, 0xf01c, 0xf01d, 0xf01e, 0xf01f, 0xf020, 0xf021, 0xf022, 0xf023, 0xf024, 0xf025, 0xf026, 0xf027, 0xf028, 0xf029, 0xf02a, 0xf02b, 0xf02c, 0xf02d, 0xf02e, 0xf02f, 0xf030, 0xf031, 0xf032, 0xf033, 0xf034, 0xf035, 0xf036, 0xf037, 0xf038, 0xf039, 0xf03a, 0xf03b, 0xf03c, 0xf03d, 0xf03e, 0xf03f, 0xf040, 0xf041, 0xf042, 0xf043, 0xf044, 0xf045, 0xf046, 0xf047, 0xf048, 0xf049, 0xf04a, 0xf04b, 0xf04c, 0xf04d, 0xf04e, 0xf04f, 0xf050, 0xf051, 0xf052, 0xf053, 0xf054, 0xf055, 0xf056, 0xf057, 0xf058, 0xf059, 0xf05a, 0xf05b, 0xf05c, 0xf05d, 0xf05e, 0xf05f, 0xf060, 0xf061, 0xf062, 0xf063, 0xf064, 0xf065, 0xf066, 0xf067, 0xf068, 0xf069, 0xf06a, 0xf06b, 0xf06c, 0xf06d, 0xf06e, 0xf06f, 0xf070, 0xf071, 0xf072, 0xf073, 0xf074, 0xf075, 0xf076, 0xf077, 0xf078, 0xf079, 0xf07a, 0xf07b, 0xf07c, 0xf07d, 0xf07e, 0xf07f, 0xf080, 0xf081, 0xf082, 0xf083, 0xf084, 0xf085, 0xf086, 0xf087, 0xf088, 0xf089, 0xf08a, 0xf08b, 0xf08c, 0xf08d, 0xf08e, 0xf08f, 0xf090, 0xf091, 0xf092, 0xf093, 0xf094, 0xf095, 0xf096, 0xf097, 0xf098, 0xf099, 0xf09a, 0xf09b, 0xf09c, 0xf09d, 0xf09e, 0xf09f, 0xf0a0, 0xf0a1, 0xf0a2, 0xf0a3, 0xf0a4, 0xf0a5, 0xf0a6, 0xf0a7, 0xf0a8, 0xf0a9, 0xf0aa, 0xf0ab, 0xf0ac, 0xf0ad, 0xf0ae, 0xf0af, 0xf0b0, 0xf0b1, 0xf0b2, 0xf0b3, 0xf0b4, 0xf0b5, 0xf0b6, 0xf0b7, 0xf0b8, 0xf0b9, 0xf0ba, 0xf0bb, 0xf0bc, 0xf0bd, 0xf0be, 0xf0bf, 0xf0c0, 0xf0c1, 0xf0c2, 0xf0c3, 0xf0c4, 0xf0c5, 0xf0c6, 0xf0c7, 0xf0c8, 0xf0c9, 0xf0ca, 0xf0cb, 0xf0cc, 0xf0cd, 0xf0ce, 0xf0cf, 0xf0d0, 0xf0d1, 0xf0d2, 0xf0d3, 0xf0d4, 0xf0d5, 0xf0d6, 0xf0d7, 0xf0d8, 0xf0d9, 0xf0da, 0xf0db, 0xf0dc, 0xf0dd, 0xf0de, 0xf0df, 0xf0e0, 0xf0e1, 0xf0e2, 0xf0e3, 0xf0e4, 0xf0e5, 0xf0e6, 0xf0e7, 0xf0e8, 0xf0e9, 0xf0ea, 0xf0eb, 0xf0ec, 0xf0ed, 0xf0ee, 0xf0ef, 0xf0f0, 0xf0f1, 0xf0f2, 0xf0f3, 0xf0f4, 0xf0f5, 0xf0f6, 0xf0f7, 0xf0f8, 0xf0f9, 0xf0fa, 0xf0fb, 0xf0fc, 0xf0fd, 0xf0fe, 0xf0ff } }; /* The standard kernel character-to-font mappings are not invertible -- this is just a best effort. */ #define MAX_GLYPH 512 /* Max possible glyph value */ static enum translation_map inv_translate[MAX_NR_CONSOLES]; #define UNI_DIRS 32U #define UNI_DIR_ROWS 32U #define UNI_ROW_GLYPHS 64U #define UNI_DIR_BITS GENMASK(15, 11) #define UNI_ROW_BITS GENMASK(10, 6) #define UNI_GLYPH_BITS GENMASK( 5, 0) #define UNI_DIR(uni) FIELD_GET(UNI_DIR_BITS, (uni)) #define UNI_ROW(uni) FIELD_GET(UNI_ROW_BITS, (uni)) #define UNI_GLYPH(uni) FIELD_GET(UNI_GLYPH_BITS, (uni)) #define UNI(dir, row, glyph) (FIELD_PREP(UNI_DIR_BITS, (dir)) | \ FIELD_PREP(UNI_ROW_BITS, (row)) | \ FIELD_PREP(UNI_GLYPH_BITS, (glyph))) /** * struct uni_pagedict - unicode directory * * @uni_pgdir: 32*32*64 table with glyphs * @refcount: reference count of this structure * @sum: checksum * @inverse_translations: best-effort inverse mapping * @inverse_trans_unicode: best-effort inverse mapping to unicode */ struct uni_pagedict { u16 **uni_pgdir[UNI_DIRS]; unsigned long refcount; unsigned long sum; unsigned char *inverse_translations[LAST_MAP + 1]; u16 *inverse_trans_unicode; }; static struct uni_pagedict *dflt; static void set_inverse_transl(struct vc_data *conp, struct uni_pagedict *dict, enum translation_map m) { unsigned short *t = translations[m]; unsigned char *inv; if (!dict) return; inv = dict->inverse_translations[m]; if (!inv) { inv = dict->inverse_translations[m] = kmalloc(MAX_GLYPH, GFP_KERNEL); if (!inv) return; } memset(inv, 0, MAX_GLYPH); for (unsigned int ch = 0; ch < ARRAY_SIZE(translations[m]); ch++) { int glyph = conv_uni_to_pc(conp, t[ch]); if (glyph >= 0 && glyph < MAX_GLYPH && inv[glyph] < 32) { /* prefer '-' above SHY etc. */ inv[glyph] = ch; } } } static void set_inverse_trans_unicode(struct uni_pagedict *dict) { unsigned int d, r, g; u16 *inv; if (!dict) return; inv = dict->inverse_trans_unicode; if (!inv) { inv = dict->inverse_trans_unicode = kmalloc_array(MAX_GLYPH, sizeof(*inv), GFP_KERNEL); if (!inv) return; } memset(inv, 0, MAX_GLYPH * sizeof(*inv)); for (d = 0; d < UNI_DIRS; d++) { u16 **dir = dict->uni_pgdir[d]; if (!dir) continue; for (r = 0; r < UNI_DIR_ROWS; r++) { u16 *row = dir[r]; if (!row) continue; for (g = 0; g < UNI_ROW_GLYPHS; g++) { u16 glyph = row[g]; if (glyph < MAX_GLYPH && inv[glyph] < 32) inv[glyph] = UNI(d, r, g); } } } } unsigned short *set_translate(enum translation_map m, struct vc_data *vc) { inv_translate[vc->vc_num] = m; return translations[m]; } /* * Inverse translation is impossible for several reasons: * 1. The font<->character maps are not 1-1. * 2. The text may have been written while a different translation map * was active. * Still, it is now possible to a certain extent to cut and paste non-ASCII. */ u16 inverse_translate(const struct vc_data *conp, u16 glyph, bool use_unicode) { struct uni_pagedict *p; enum translation_map m; if (glyph >= MAX_GLYPH) return 0; p = *conp->uni_pagedict_loc; if (!p) return glyph; if (use_unicode) { if (!p->inverse_trans_unicode) return glyph; return p->inverse_trans_unicode[glyph]; } m = inv_translate[conp->vc_num]; if (!p->inverse_translations[m]) return glyph; return p->inverse_translations[m][glyph]; } EXPORT_SYMBOL_GPL(inverse_translate); static void update_user_maps(void) { int i; struct uni_pagedict *p, *q = NULL; for (i = 0; i < MAX_NR_CONSOLES; i++) { if (!vc_cons_allocated(i)) continue; p = *vc_cons[i].d->uni_pagedict_loc; if (p && p != q) { set_inverse_transl(vc_cons[i].d, p, USER_MAP); set_inverse_trans_unicode(p); q = p; } } } /* * Load customizable translation table * arg points to a 256 byte translation table. * * The "old" variants are for translation directly to font (using the * 0xf000-0xf0ff "transparent" Unicodes) whereas the "new" variants set * Unicodes explicitly. */ int con_set_trans_old(unsigned char __user * arg) { unsigned short inbuf[E_TABSZ]; unsigned int i; unsigned char ch; for (i = 0; i < ARRAY_SIZE(inbuf); i++) { if (get_user(ch, &arg[i])) return -EFAULT; inbuf[i] = UNI_DIRECT_BASE | ch; } guard(console_lock)(); memcpy(translations[USER_MAP], inbuf, sizeof(inbuf)); update_user_maps(); return 0; } int con_get_trans_old(unsigned char __user * arg) { int i, ch; unsigned short *p = translations[USER_MAP]; unsigned char outbuf[E_TABSZ]; scoped_guard(console_lock) for (i = 0; i < ARRAY_SIZE(outbuf); i++) { ch = conv_uni_to_pc(vc_cons[fg_console].d, p[i]); outbuf[i] = (ch & ~0xff) ? 0 : ch; } return copy_to_user(arg, outbuf, sizeof(outbuf)) ? -EFAULT : 0; } int con_set_trans_new(ushort __user * arg) { unsigned short inbuf[E_TABSZ]; if (copy_from_user(inbuf, arg, sizeof(inbuf))) return -EFAULT; guard(console_lock)(); memcpy(translations[USER_MAP], inbuf, sizeof(inbuf)); update_user_maps(); return 0; } int con_get_trans_new(ushort __user * arg) { unsigned short outbuf[E_TABSZ]; scoped_guard(console_lock) memcpy(outbuf, translations[USER_MAP], sizeof(outbuf)); return copy_to_user(arg, outbuf, sizeof(outbuf)) ? -EFAULT : 0; } /* * Unicode -> current font conversion * * A font has at most 512 chars, usually 256. * But one font position may represent several Unicode chars. * A hashtable is somewhat of a pain to deal with, so use a * "paged table" instead. Simulation has shown the memory cost of * this 3-level paged table scheme to be comparable to a hash table. */ extern u8 dfont_unicount[]; /* Defined in console_defmap.c */ extern u16 dfont_unitable[]; static void con_release_unimap(struct uni_pagedict *dict) { unsigned int d, r; if (dict == dflt) dflt = NULL; for (d = 0; d < UNI_DIRS; d++) { u16 **dir = dict->uni_pgdir[d]; if (dir != NULL) { for (r = 0; r < UNI_DIR_ROWS; r++) kfree(dir[r]); kfree(dir); } dict->uni_pgdir[d] = NULL; } for (r = 0; r < ARRAY_SIZE(dict->inverse_translations); r++) { kfree(dict->inverse_translations[r]); dict->inverse_translations[r] = NULL; } kfree(dict->inverse_trans_unicode); dict->inverse_trans_unicode = NULL; } /* Caller must hold the console lock */ void con_free_unimap(struct vc_data *vc) { struct uni_pagedict *p; p = *vc->uni_pagedict_loc; if (!p) return; *vc->uni_pagedict_loc = NULL; if (--p->refcount) return; con_release_unimap(p); kfree(p); } static int con_unify_unimap(struct vc_data *conp, struct uni_pagedict *dict1) { struct uni_pagedict *dict2; unsigned int cons, d, r; for (cons = 0; cons < MAX_NR_CONSOLES; cons++) { if (!vc_cons_allocated(cons)) continue; dict2 = *vc_cons[cons].d->uni_pagedict_loc; if (!dict2 || dict2 == dict1 || dict2->sum != dict1->sum) continue; for (d = 0; d < UNI_DIRS; d++) { u16 **dir1 = dict1->uni_pgdir[d]; u16 **dir2 = dict2->uni_pgdir[d]; if (!dir1 && !dir2) continue; if (!dir1 || !dir2) break; for (r = 0; r < UNI_DIR_ROWS; r++) { if (!dir1[r] && !dir2[r]) continue; if (!dir1[r] || !dir2[r]) break; if (memcmp(dir1[r], dir2[r], UNI_ROW_GLYPHS * sizeof(*dir1[r]))) break; } if (r < UNI_DIR_ROWS) break; } if (d == UNI_DIRS) { dict2->refcount++; *conp->uni_pagedict_loc = dict2; con_release_unimap(dict1); kfree(dict1); return 1; } } return 0; } static int con_insert_unipair(struct uni_pagedict *p, u_short unicode, u_short fontpos) { u16 **dir, *row; unsigned int n; n = UNI_DIR(unicode); dir = p->uni_pgdir[n]; if (!dir) { dir = p->uni_pgdir[n] = kcalloc(UNI_DIR_ROWS, sizeof(*dir), GFP_KERNEL); if (!dir) return -ENOMEM; } n = UNI_ROW(unicode); row = dir[n]; if (!row) { row = dir[n] = kmalloc_array(UNI_ROW_GLYPHS, sizeof(*row), GFP_KERNEL); if (!row) return -ENOMEM; /* No glyphs for the characters (yet) */ memset(row, 0xff, UNI_ROW_GLYPHS * sizeof(*row)); } row[UNI_GLYPH(unicode)] = fontpos; p->sum += (fontpos << 20U) + unicode; return 0; } static int con_allocate_new(struct vc_data *vc) { struct uni_pagedict *new, *old = *vc->uni_pagedict_loc; new = kzalloc(sizeof(*new), GFP_KERNEL); if (!new) return -ENOMEM; new->refcount = 1; *vc->uni_pagedict_loc = new; if (old) old->refcount--; return 0; } /* Caller must hold the lock */ static int con_do_clear_unimap(struct vc_data *vc) { struct uni_pagedict *old = *vc->uni_pagedict_loc; if (!old || old->refcount > 1) return con_allocate_new(vc); old->sum = 0; con_release_unimap(old); return 0; } int con_clear_unimap(struct vc_data *vc) { guard(console_lock)(); return con_do_clear_unimap(vc); } static struct uni_pagedict *con_unshare_unimap(struct vc_data *vc, struct uni_pagedict *old) { struct uni_pagedict *new; unsigned int d, r, g; int ret; u16 uni = 0; ret = con_allocate_new(vc); if (ret) return ERR_PTR(ret); new = *vc->uni_pagedict_loc; /* * uni_pgdir is a 32*32*64 table with rows allocated when its first * entry is added. The unicode value must still be incremented for * empty rows. We are copying entries from "old" to "new". */ for (d = 0; d < UNI_DIRS; d++) { u16 **dir = old->uni_pgdir[d]; if (!dir) { /* Account for empty table */ uni += UNI_DIR_ROWS * UNI_ROW_GLYPHS; continue; } for (r = 0; r < UNI_DIR_ROWS; r++) { u16 *row = dir[r]; if (!row) { /* Account for row of 64 empty entries */ uni += UNI_ROW_GLYPHS; continue; } for (g = 0; g < UNI_ROW_GLYPHS; g++, uni++) { if (row[g] == 0xffff) continue; /* * Found one, copy entry for unicode uni with * fontpos value row[g]. */ ret = con_insert_unipair(new, uni, row[g]); if (ret) { old->refcount++; *vc->uni_pagedict_loc = old; con_release_unimap(new); kfree(new); return ERR_PTR(ret); } } } } return new; } int con_set_unimap(struct vc_data *vc, ushort ct, struct unipair __user *list) { struct uni_pagedict *dict; struct unipair *plist; int err = 0; if (!ct) return 0; struct unipair *unilist __free(kvfree) = vmemdup_array_user(list, ct, sizeof(*unilist)); if (IS_ERR(unilist)) return PTR_ERR(unilist); guard(console_lock)(); /* Save original vc_unipagdir_loc in case we allocate a new one */ dict = *vc->uni_pagedict_loc; if (!dict) return -EINVAL; if (dict->refcount > 1) { dict = con_unshare_unimap(vc, dict); if (IS_ERR(dict)) return PTR_ERR(dict); } else if (dict == dflt) { dflt = NULL; } /* * Insert user specified unicode pairs into new table. */ for (plist = unilist; ct; ct--, plist++) { int err1 = con_insert_unipair(dict, plist->unicode, plist->fontpos); if (err1) err = err1; } /* * Merge with fontmaps of any other virtual consoles. */ if (con_unify_unimap(vc, dict)) return err; for (enum translation_map m = FIRST_MAP; m <= LAST_MAP; m++) set_inverse_transl(vc, dict, m); set_inverse_trans_unicode(dict); return err; } /** * con_set_default_unimap - set default unicode map * @vc: the console we are updating * * Loads the unimap for the hardware font, as defined in uni_hash.tbl. * The representation used was the most compact I could come up * with. This routine is executed at video setup, and when the * PIO_FONTRESET ioctl is called. * * The caller must hold the console lock */ int con_set_default_unimap(struct vc_data *vc) { struct uni_pagedict *dict; unsigned int fontpos, count; int err = 0, err1; u16 *dfont; if (dflt) { dict = *vc->uni_pagedict_loc; if (dict == dflt) return 0; dflt->refcount++; *vc->uni_pagedict_loc = dflt; if (dict && !--dict->refcount) { con_release_unimap(dict); kfree(dict); } return 0; } /* The default font is always 256 characters */ err = con_do_clear_unimap(vc); if (err) return err; dict = *vc->uni_pagedict_loc; dfont = dfont_unitable; for (fontpos = 0; fontpos < 256U; fontpos++) for (count = dfont_unicount[fontpos]; count; count--) { err1 = con_insert_unipair(dict, *(dfont++), fontpos); if (err1) err = err1; } if (con_unify_unimap(vc, dict)) { dflt = *vc->uni_pagedict_loc; return err; } for (enum translation_map m = FIRST_MAP; m <= LAST_MAP; m++) set_inverse_transl(vc, dict, m); set_inverse_trans_unicode(dict); dflt = dict; return err; } EXPORT_SYMBOL(con_set_default_unimap); /** * con_copy_unimap - copy unimap between two vts * @dst_vc: target * @src_vc: source * * The caller must hold the console lock when invoking this method */ int con_copy_unimap(struct vc_data *dst_vc, struct vc_data *src_vc) { struct uni_pagedict *src; if (!*src_vc->uni_pagedict_loc) return -EINVAL; if (*dst_vc->uni_pagedict_loc == *src_vc->uni_pagedict_loc) return 0; con_free_unimap(dst_vc); src = *src_vc->uni_pagedict_loc; src->refcount++; *dst_vc->uni_pagedict_loc = src; return 0; } EXPORT_SYMBOL(con_copy_unimap); /* * con_get_unimap - get the unicode map * * Read the console unicode data for this console. Called from the ioctl * handlers. */ int con_get_unimap(struct vc_data *vc, ushort ct, ushort __user *uct, struct unipair __user *list) { ushort ect; struct uni_pagedict *dict; unsigned int d, r, g; struct unipair *unilist __free(kvfree) = kvmalloc_array(ct, sizeof(*unilist), GFP_KERNEL); if (!unilist) return -ENOMEM; scoped_guard(console_lock) { ect = 0; dict = *vc->uni_pagedict_loc; if (!dict) break; for (d = 0; d < UNI_DIRS; d++) { u16 **dir = dict->uni_pgdir[d]; if (!dir) continue; for (r = 0; r < UNI_DIR_ROWS; r++) { u16 *row = dir[r]; if (!row) continue; for (g = 0; g < UNI_ROW_GLYPHS; g++, row++) { if (*row >= MAX_GLYPH) continue; if (ect < ct) { unilist[ect].unicode = UNI(d, r, g); unilist[ect].fontpos = *row; } ect++; } } } } if (copy_to_user(list, unilist, min(ect, ct) * sizeof(*unilist))) return -EFAULT; if (put_user(ect, uct)) return -EFAULT; if (ect > ct) return -ENOMEM; return 0; } /* * Always use USER_MAP. These functions are used by the keyboard, * which shouldn't be affected by G0/G1 switching, etc. * If the user map still contains default values, i.e. the * direct-to-font mapping, then assume user is using Latin1. * * FIXME: at some point we need to decide if we want to lock the table * update element itself via the keyboard_event_lock for consistency with the * keyboard driver as well as the consoles */ /* may be called during an interrupt */ u32 conv_8bit_to_uni(unsigned char c) { unsigned short uni = translations[USER_MAP][c]; return uni == (0xf000 | c) ? c : uni; } int conv_uni_to_8bit(u32 uni) { int c; for (c = 0; c < ARRAY_SIZE(translations[USER_MAP]); c++) if (translations[USER_MAP][c] == uni || (translations[USER_MAP][c] == (c | 0xf000) && uni == c)) return c; return -1; } int conv_uni_to_pc(struct vc_data *conp, long ucs) { struct uni_pagedict *dict; u16 **dir, *row, glyph; /* Only 16-bit codes supported at this time */ if (ucs > 0xffff) return -4; /* Not found */ else if (ucs < 0x20) return -1; /* Not a printable character */ /* * UNI_DIRECT_BASE indicates the start of the region in the User Zone * which always has a 1:1 mapping to the currently loaded font. The * UNI_DIRECT_MASK indicates the bit span of the region. */ else if ((ucs & ~UNI_DIRECT_MASK) == UNI_DIRECT_BASE) return ucs & UNI_DIRECT_MASK; dict = *conp->uni_pagedict_loc; if (!dict) return -3; dir = dict->uni_pgdir[UNI_DIR(ucs)]; if (!dir) return -4; row = dir[UNI_ROW(ucs)]; if (!row) return -4; glyph = row[UNI_GLYPH(ucs)]; if (glyph >= MAX_GLYPH) return -4; return glyph; } /* * This is called at sys_setup time, after memory and the console are * initialized. It must be possible to call kmalloc(..., GFP_KERNEL) * from this function, hence the call from sys_setup. */ void __init console_map_init(void) { int i; for (i = 0; i < MAX_NR_CONSOLES; i++) if (vc_cons_allocated(i) && !*vc_cons[i].d->uni_pagedict_loc) con_set_default_unimap(vc_cons[i].d); } |
| 961 959 703 694 169 167 166 166 168 168 5 11 11 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 | // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2022 Christian Brauner <brauner@kernel.org> */ #include <linux/cred.h> #include <linux/fs.h> #include <linux/mnt_idmapping.h> #include <linux/slab.h> #include <linux/user_namespace.h> #include <linux/seq_file.h> #include "internal.h" /* * Outside of this file vfs{g,u}id_t are always created from k{g,u}id_t, * never from raw values. These are just internal helpers. */ #define VFSUIDT_INIT_RAW(val) (vfsuid_t){ val } #define VFSGIDT_INIT_RAW(val) (vfsgid_t){ val } struct mnt_idmap { struct uid_gid_map uid_map; struct uid_gid_map gid_map; refcount_t count; }; /* * Carries the initial idmapping of 0:0:4294967295 which is an identity * mapping. This means that {g,u}id 0 is mapped to {g,u}id 0, {g,u}id 1 is * mapped to {g,u}id 1, [...], {g,u}id 1000 to {g,u}id 1000, [...]. */ struct mnt_idmap nop_mnt_idmap = { .count = REFCOUNT_INIT(1), }; EXPORT_SYMBOL_GPL(nop_mnt_idmap); /* * Carries the invalid idmapping of a full 0-4294967295 {g,u}id range. * This means that all {g,u}ids are mapped to INVALID_VFS{G,U}ID. */ struct mnt_idmap invalid_mnt_idmap = { .count = REFCOUNT_INIT(1), }; EXPORT_SYMBOL_GPL(invalid_mnt_idmap); /** * initial_idmapping - check whether this is the initial mapping * @ns: idmapping to check * * Check whether this is the initial mapping, mapping 0 to 0, 1 to 1, * [...], 1000 to 1000 [...]. * * Return: true if this is the initial mapping, false if not. */ static inline bool initial_idmapping(const struct user_namespace *ns) { return ns == &init_user_ns; } /** * make_vfsuid - map a filesystem kuid according to an idmapping * @idmap: the mount's idmapping * @fs_userns: the filesystem's idmapping * @kuid : kuid to be mapped * * Take a @kuid and remap it from @fs_userns into @idmap. Use this * function when preparing a @kuid to be reported to userspace. * * If initial_idmapping() determines that this is not an idmapped mount * we can simply return @kuid unchanged. * If initial_idmapping() tells us that the filesystem is not mounted with an * idmapping we know the value of @kuid won't change when calling * from_kuid() so we can simply retrieve the value via __kuid_val() * directly. * * Return: @kuid mapped according to @idmap. * If @kuid has no mapping in either @idmap or @fs_userns INVALID_UID is * returned. */ vfsuid_t make_vfsuid(struct mnt_idmap *idmap, struct user_namespace *fs_userns, kuid_t kuid) { uid_t uid; if (idmap == &nop_mnt_idmap) return VFSUIDT_INIT(kuid); if (idmap == &invalid_mnt_idmap) return INVALID_VFSUID; if (initial_idmapping(fs_userns)) uid = __kuid_val(kuid); else uid = from_kuid(fs_userns, kuid); if (uid == (uid_t)-1) return INVALID_VFSUID; return VFSUIDT_INIT_RAW(map_id_down(&idmap->uid_map, uid)); } EXPORT_SYMBOL_GPL(make_vfsuid); /** * make_vfsgid - map a filesystem kgid according to an idmapping * @idmap: the mount's idmapping * @fs_userns: the filesystem's idmapping * @kgid : kgid to be mapped * * Take a @kgid and remap it from @fs_userns into @idmap. Use this * function when preparing a @kgid to be reported to userspace. * * If initial_idmapping() determines that this is not an idmapped mount * we can simply return @kgid unchanged. * If initial_idmapping() tells us that the filesystem is not mounted with an * idmapping we know the value of @kgid won't change when calling * from_kgid() so we can simply retrieve the value via __kgid_val() * directly. * * Return: @kgid mapped according to @idmap. * If @kgid has no mapping in either @idmap or @fs_userns INVALID_GID is * returned. */ vfsgid_t make_vfsgid(struct mnt_idmap *idmap, struct user_namespace *fs_userns, kgid_t kgid) { gid_t gid; if (idmap == &nop_mnt_idmap) return VFSGIDT_INIT(kgid); if (idmap == &invalid_mnt_idmap) return INVALID_VFSGID; if (initial_idmapping(fs_userns)) gid = __kgid_val(kgid); else gid = from_kgid(fs_userns, kgid); if (gid == (gid_t)-1) return INVALID_VFSGID; return VFSGIDT_INIT_RAW(map_id_down(&idmap->gid_map, gid)); } EXPORT_SYMBOL_GPL(make_vfsgid); /** * from_vfsuid - map a vfsuid into the filesystem idmapping * @idmap: the mount's idmapping * @fs_userns: the filesystem's idmapping * @vfsuid : vfsuid to be mapped * * Map @vfsuid into the filesystem idmapping. This function has to be used in * order to e.g. write @vfsuid to inode->i_uid. * * Return: @vfsuid mapped into the filesystem idmapping */ kuid_t from_vfsuid(struct mnt_idmap *idmap, struct user_namespace *fs_userns, vfsuid_t vfsuid) { uid_t uid; if (idmap == &nop_mnt_idmap) return AS_KUIDT(vfsuid); if (idmap == &invalid_mnt_idmap) return INVALID_UID; uid = map_id_up(&idmap->uid_map, __vfsuid_val(vfsuid)); if (uid == (uid_t)-1) return INVALID_UID; if (initial_idmapping(fs_userns)) return KUIDT_INIT(uid); return make_kuid(fs_userns, uid); } EXPORT_SYMBOL_GPL(from_vfsuid); /** * from_vfsgid - map a vfsgid into the filesystem idmapping * @idmap: the mount's idmapping * @fs_userns: the filesystem's idmapping * @vfsgid : vfsgid to be mapped * * Map @vfsgid into the filesystem idmapping. This function has to be used in * order to e.g. write @vfsgid to inode->i_gid. * * Return: @vfsgid mapped into the filesystem idmapping */ kgid_t from_vfsgid(struct mnt_idmap *idmap, struct user_namespace *fs_userns, vfsgid_t vfsgid) { gid_t gid; if (idmap == &nop_mnt_idmap) return AS_KGIDT(vfsgid); if (idmap == &invalid_mnt_idmap) return INVALID_GID; gid = map_id_up(&idmap->gid_map, __vfsgid_val(vfsgid)); if (gid == (gid_t)-1) return INVALID_GID; if (initial_idmapping(fs_userns)) return KGIDT_INIT(gid); return make_kgid(fs_userns, gid); } EXPORT_SYMBOL_GPL(from_vfsgid); #ifdef CONFIG_MULTIUSER /** * vfsgid_in_group_p() - check whether a vfsuid matches the caller's groups * @vfsgid: the mnt gid to match * * This function can be used to determine whether @vfsuid matches any of the * caller's groups. * * Return: 1 if vfsuid matches caller's groups, 0 if not. */ int vfsgid_in_group_p(vfsgid_t vfsgid) { return in_group_p(AS_KGIDT(vfsgid)); } #else int vfsgid_in_group_p(vfsgid_t vfsgid) { return 1; } #endif EXPORT_SYMBOL_GPL(vfsgid_in_group_p); static int copy_mnt_idmap(struct uid_gid_map *map_from, struct uid_gid_map *map_to) { struct uid_gid_extent *forward, *reverse; u32 nr_extents = READ_ONCE(map_from->nr_extents); /* Pairs with smp_wmb() when writing the idmapping. */ smp_rmb(); /* * Don't blindly copy @map_to into @map_from if nr_extents is * smaller or equal to UID_GID_MAP_MAX_BASE_EXTENTS. Since we * read @nr_extents someone could have written an idmapping and * then we might end up with inconsistent data. So just don't do * anything at all. */ if (nr_extents == 0) return -EINVAL; /* * Here we know that nr_extents is greater than zero which means * a map has been written. Since idmappings can't be changed * once they have been written we know that we can safely copy * from @map_to into @map_from. */ if (nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) { *map_to = *map_from; return 0; } forward = kmemdup_array(map_from->forward, nr_extents, sizeof(struct uid_gid_extent), GFP_KERNEL_ACCOUNT); if (!forward) return -ENOMEM; reverse = kmemdup_array(map_from->reverse, nr_extents, sizeof(struct uid_gid_extent), GFP_KERNEL_ACCOUNT); if (!reverse) { kfree(forward); return -ENOMEM; } /* * The idmapping isn't exposed anywhere so we don't need to care * about ordering between extent pointers and @nr_extents * initialization. */ map_to->forward = forward; map_to->reverse = reverse; map_to->nr_extents = nr_extents; return 0; } static void free_mnt_idmap(struct mnt_idmap *idmap) { if (idmap->uid_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) { kfree(idmap->uid_map.forward); kfree(idmap->uid_map.reverse); } if (idmap->gid_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) { kfree(idmap->gid_map.forward); kfree(idmap->gid_map.reverse); } kfree(idmap); } struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns) { struct mnt_idmap *idmap; int ret; idmap = kzalloc(sizeof(struct mnt_idmap), GFP_KERNEL_ACCOUNT); if (!idmap) return ERR_PTR(-ENOMEM); refcount_set(&idmap->count, 1); ret = copy_mnt_idmap(&mnt_userns->uid_map, &idmap->uid_map); if (!ret) ret = copy_mnt_idmap(&mnt_userns->gid_map, &idmap->gid_map); if (ret) { free_mnt_idmap(idmap); idmap = ERR_PTR(ret); } return idmap; } /** * mnt_idmap_get - get a reference to an idmapping * @idmap: the idmap to bump the reference on * * If @idmap is not the @nop_mnt_idmap bump the reference count. * * Return: @idmap with reference count bumped if @not_mnt_idmap isn't passed. */ struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap) { if (idmap != &nop_mnt_idmap && idmap != &invalid_mnt_idmap) refcount_inc(&idmap->count); return idmap; } EXPORT_SYMBOL_GPL(mnt_idmap_get); /** * mnt_idmap_put - put a reference to an idmapping * @idmap: the idmap to put the reference on * * If this is a non-initial idmapping, put the reference count when a mount is * released and free it if we're the last user. */ void mnt_idmap_put(struct mnt_idmap *idmap) { if (idmap != &nop_mnt_idmap && idmap != &invalid_mnt_idmap && refcount_dec_and_test(&idmap->count)) free_mnt_idmap(idmap); } EXPORT_SYMBOL_GPL(mnt_idmap_put); int statmount_mnt_idmap(struct mnt_idmap *idmap, struct seq_file *seq, bool uid_map) { struct uid_gid_map *map, *map_up; u32 idx, nr_mappings; if (!is_valid_mnt_idmap(idmap)) return 0; /* * Idmappings are shown relative to the caller's idmapping. * This is both the most intuitive and most useful solution. */ if (uid_map) { map = &idmap->uid_map; map_up = ¤t_user_ns()->uid_map; } else { map = &idmap->gid_map; map_up = ¤t_user_ns()->gid_map; } for (idx = 0, nr_mappings = 0; idx < map->nr_extents; idx++) { uid_t lower; struct uid_gid_extent *extent; if (map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) extent = &map->extent[idx]; else extent = &map->forward[idx]; /* * Verify that the whole range of the mapping can be * resolved in the caller's idmapping. If it cannot be * resolved skip the mapping. */ lower = map_id_range_up(map_up, extent->lower_first, extent->count); if (lower == (uid_t) -1) continue; seq_printf(seq, "%u %u %u", extent->first, lower, extent->count); seq->count++; /* mappings are separated by \0 */ if (seq_has_overflowed(seq)) return -EAGAIN; nr_mappings++; } return nr_mappings; } |
| 3 2 2 2 3 2 2 2 1 1 1 144 52 9 7 8 3 3 652 7 664 16 656 28 13 18 1 24 504 508 5 49 50 1 1 109 19 2 4 2 3 18 18 18 17 189 182 178 190 22 26 14 11 14 5 4 5 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 | // SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 1991, 1992 Linus Torvalds * * This file contains the interface functions for the various time related * system calls: time, stime, gettimeofday, settimeofday, adjtime * * Modification history: * * 1993-09-02 Philip Gladstone * Created file with time related functions from sched/core.c and adjtimex() * 1993-10-08 Torsten Duwe * adjtime interface update and CMOS clock write code * 1995-08-13 Torsten Duwe * kernel PLL updated to 1994-12-13 specs (rfc-1589) * 1999-01-16 Ulrich Windl * Introduced error checking for many cases in adjtimex(). * Updated NTP code according to technical memorandum Jan '96 * "A Kernel Model for Precision Timekeeping" by Dave Mills * Allow time_constant larger than MAXTC(6) for NTP v4 (MAXTC == 10) * (Even though the technical memorandum forbids it) * 2004-07-14 Christoph Lameter * Added getnstimeofday to allow the posix timer functions to return * with nanosecond accuracy */ #include <linux/export.h> #include <linux/kernel.h> #include <linux/timex.h> #include <linux/capability.h> #include <linux/timekeeper_internal.h> #include <linux/errno.h> #include <linux/syscalls.h> #include <linux/security.h> #include <linux/fs.h> #include <linux/math64.h> #include <linux/ptrace.h> #include <linux/uaccess.h> #include <linux/compat.h> #include <asm/unistd.h> #include <generated/timeconst.h> #include "timekeeping.h" /* * The timezone where the local system is located. Used as a default by some * programs who obtain this value by using gettimeofday. */ struct timezone sys_tz; EXPORT_SYMBOL(sys_tz); #ifdef __ARCH_WANT_SYS_TIME /* * sys_time() can be implemented in user-level using * sys_gettimeofday(). Is this for backwards compatibility? If so, * why not move it into the appropriate arch directory (for those * architectures that need it). */ SYSCALL_DEFINE1(time, __kernel_old_time_t __user *, tloc) { __kernel_old_time_t i = (__kernel_old_time_t)ktime_get_real_seconds(); if (tloc) { if (put_user(i,tloc)) return -EFAULT; } force_successful_syscall_return(); return i; } /* * sys_stime() can be implemented in user-level using * sys_settimeofday(). Is this for backwards compatibility? If so, * why not move it into the appropriate arch directory (for those * architectures that need it). */ SYSCALL_DEFINE1(stime, __kernel_old_time_t __user *, tptr) { struct timespec64 tv; int err; if (get_user(tv.tv_sec, tptr)) return -EFAULT; tv.tv_nsec = 0; err = security_settime64(&tv, NULL); if (err) return err; do_settimeofday64(&tv); return 0; } #endif /* __ARCH_WANT_SYS_TIME */ #ifdef CONFIG_COMPAT_32BIT_TIME #ifdef __ARCH_WANT_SYS_TIME32 /* old_time32_t is a 32 bit "long" and needs to get converted. */ SYSCALL_DEFINE1(time32, old_time32_t __user *, tloc) { old_time32_t i; i = (old_time32_t)ktime_get_real_seconds(); if (tloc) { if (put_user(i,tloc)) return -EFAULT; } force_successful_syscall_return(); return i; } SYSCALL_DEFINE1(stime32, old_time32_t __user *, tptr) { struct timespec64 tv; int err; if (get_user(tv.tv_sec, tptr)) return -EFAULT; tv.tv_nsec = 0; err = security_settime64(&tv, NULL); if (err) return err; do_settimeofday64(&tv); return 0; } #endif /* __ARCH_WANT_SYS_TIME32 */ #endif SYSCALL_DEFINE2(gettimeofday, struct __kernel_old_timeval __user *, tv, struct timezone __user *, tz) { if (likely(tv != NULL)) { struct timespec64 ts; ktime_get_real_ts64(&ts); if (put_user(ts.tv_sec, &tv->tv_sec) || put_user(ts.tv_nsec / 1000, &tv->tv_usec)) return -EFAULT; } if (unlikely(tz != NULL)) { if (copy_to_user(tz, &sys_tz, sizeof(sys_tz))) return -EFAULT; } return 0; } /* * In case for some reason the CMOS clock has not already been running * in UTC, but in some local time: The first time we set the timezone, * we will warp the clock so that it is ticking UTC time instead of * local time. Presumably, if someone is setting the timezone then we * are running in an environment where the programs understand about * timezones. This should be done at boot time in the /etc/rc script, * as soon as possible, so that the clock can be set right. Otherwise, * various programs will get confused when the clock gets warped. */ int do_sys_settimeofday64(const struct timespec64 *tv, const struct timezone *tz) { static int firsttime = 1; int error = 0; if (tv && !timespec64_valid_settod(tv)) return -EINVAL; error = security_settime64(tv, tz); if (error) return error; if (tz) { /* Verify we're within the +-15 hrs range */ if (tz->tz_minuteswest > 15*60 || tz->tz_minuteswest < -15*60) return -EINVAL; sys_tz = *tz; update_vsyscall_tz(); if (firsttime) { firsttime = 0; if (!tv) timekeeping_warp_clock(); } } if (tv) return do_settimeofday64(tv); return 0; } SYSCALL_DEFINE2(settimeofday, struct __kernel_old_timeval __user *, tv, struct timezone __user *, tz) { struct timespec64 new_ts; struct timezone new_tz; if (tv) { if (get_user(new_ts.tv_sec, &tv->tv_sec) || get_user(new_ts.tv_nsec, &tv->tv_usec)) return -EFAULT; if (new_ts.tv_nsec > USEC_PER_SEC || new_ts.tv_nsec < 0) return -EINVAL; new_ts.tv_nsec *= NSEC_PER_USEC; } if (tz) { if (copy_from_user(&new_tz, tz, sizeof(*tz))) return -EFAULT; } return do_sys_settimeofday64(tv ? &new_ts : NULL, tz ? &new_tz : NULL); } #ifdef CONFIG_COMPAT COMPAT_SYSCALL_DEFINE2(gettimeofday, struct old_timeval32 __user *, tv, struct timezone __user *, tz) { if (tv) { struct timespec64 ts; ktime_get_real_ts64(&ts); if (put_user(ts.tv_sec, &tv->tv_sec) || put_user(ts.tv_nsec / 1000, &tv->tv_usec)) return -EFAULT; } if (tz) { if (copy_to_user(tz, &sys_tz, sizeof(sys_tz))) return -EFAULT; } return 0; } COMPAT_SYSCALL_DEFINE2(settimeofday, struct old_timeval32 __user *, tv, struct timezone __user *, tz) { struct timespec64 new_ts; struct timezone new_tz; if (tv) { if (get_user(new_ts.tv_sec, &tv->tv_sec) || get_user(new_ts.tv_nsec, &tv->tv_usec)) return -EFAULT; if (new_ts.tv_nsec > USEC_PER_SEC || new_ts.tv_nsec < 0) return -EINVAL; new_ts.tv_nsec *= NSEC_PER_USEC; } if (tz) { if (copy_from_user(&new_tz, tz, sizeof(*tz))) return -EFAULT; } return do_sys_settimeofday64(tv ? &new_ts : NULL, tz ? &new_tz : NULL); } #endif #ifdef CONFIG_64BIT SYSCALL_DEFINE1(adjtimex, struct __kernel_timex __user *, txc_p) { struct __kernel_timex txc; /* Local copy of parameter */ int ret; /* Copy the user data space into the kernel copy * structure. But bear in mind that the structures * may change */ if (copy_from_user(&txc, txc_p, sizeof(struct __kernel_timex))) return -EFAULT; ret = do_adjtimex(&txc); return copy_to_user(txc_p, &txc, sizeof(struct __kernel_timex)) ? -EFAULT : ret; } #endif #ifdef CONFIG_COMPAT_32BIT_TIME int get_old_timex32(struct __kernel_timex *txc, const struct old_timex32 __user *utp) { struct old_timex32 tx32; memset(txc, 0, sizeof(struct __kernel_timex)); if (copy_from_user(&tx32, utp, sizeof(struct old_timex32))) return -EFAULT; txc->modes = tx32.modes; txc->offset = tx32.offset; txc->freq = tx32.freq; txc->maxerror = tx32.maxerror; txc->esterror = tx32.esterror; txc->status = tx32.status; txc->constant = tx32.constant; txc->precision = tx32.precision; txc->tolerance = tx32.tolerance; txc->time.tv_sec = tx32.time.tv_sec; txc->time.tv_usec = tx32.time.tv_usec; txc->tick = tx32.tick; txc->ppsfreq = tx32.ppsfreq; txc->jitter = tx32.jitter; txc->shift = tx32.shift; txc->stabil = tx32.stabil; txc->jitcnt = tx32.jitcnt; txc->calcnt = tx32.calcnt; txc->errcnt = tx32.errcnt; txc->stbcnt = tx32.stbcnt; return 0; } int put_old_timex32(struct old_timex32 __user *utp, const struct __kernel_timex *txc) { struct old_timex32 tx32; memset(&tx32, 0, sizeof(struct old_timex32)); tx32.modes = txc->modes; tx32.offset = txc->offset; tx32.freq = txc->freq; tx32.maxerror = txc->maxerror; tx32.esterror = txc->esterror; tx32.status = txc->status; tx32.constant = txc->constant; tx32.precision = txc->precision; tx32.tolerance = txc->tolerance; tx32.time.tv_sec = txc->time.tv_sec; tx32.time.tv_usec = txc->time.tv_usec; tx32.tick = txc->tick; tx32.ppsfreq = txc->ppsfreq; tx32.jitter = txc->jitter; tx32.shift = txc->shift; tx32.stabil = txc->stabil; tx32.jitcnt = txc->jitcnt; tx32.calcnt = txc->calcnt; tx32.errcnt = txc->errcnt; tx32.stbcnt = txc->stbcnt; tx32.tai = txc->tai; if (copy_to_user(utp, &tx32, sizeof(struct old_timex32))) return -EFAULT; return 0; } SYSCALL_DEFINE1(adjtimex_time32, struct old_timex32 __user *, utp) { struct __kernel_timex txc; int err, ret; err = get_old_timex32(&txc, utp); if (err) return err; ret = do_adjtimex(&txc); err = put_old_timex32(utp, &txc); if (err) return err; return ret; } #endif /** * jiffies_to_msecs - Convert jiffies to milliseconds * @j: jiffies value * * Avoid unnecessary multiplications/divisions in the * two most common HZ cases. * * Return: milliseconds value */ unsigned int jiffies_to_msecs(const unsigned long j) { #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ) return (MSEC_PER_SEC / HZ) * j; #elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC) return (j + (HZ / MSEC_PER_SEC) - 1)/(HZ / MSEC_PER_SEC); #else # if BITS_PER_LONG == 32 return (HZ_TO_MSEC_MUL32 * j + (1ULL << HZ_TO_MSEC_SHR32) - 1) >> HZ_TO_MSEC_SHR32; # else return DIV_ROUND_UP(j * HZ_TO_MSEC_NUM, HZ_TO_MSEC_DEN); # endif #endif } EXPORT_SYMBOL(jiffies_to_msecs); /** * jiffies_to_usecs - Convert jiffies to microseconds * @j: jiffies value * * Return: microseconds value */ unsigned int jiffies_to_usecs(const unsigned long j) { /* * Hz usually doesn't go much further MSEC_PER_SEC. * jiffies_to_usecs() and usecs_to_jiffies() depend on that. */ BUILD_BUG_ON(HZ > USEC_PER_SEC); #if !(USEC_PER_SEC % HZ) return (USEC_PER_SEC / HZ) * j; #else # if BITS_PER_LONG == 32 return (HZ_TO_USEC_MUL32 * j) >> HZ_TO_USEC_SHR32; # else return (j * HZ_TO_USEC_NUM) / HZ_TO_USEC_DEN; # endif #endif } EXPORT_SYMBOL(jiffies_to_usecs); /** * mktime64 - Converts date to seconds. * @year0: year to convert * @mon0: month to convert * @day: day to convert * @hour: hour to convert * @min: minute to convert * @sec: second to convert * * Converts Gregorian date to seconds since 1970-01-01 00:00:00. * Assumes input in normal date format, i.e. 1980-12-31 23:59:59 * => year=1980, mon=12, day=31, hour=23, min=59, sec=59. * * [For the Julian calendar (which was used in Russia before 1917, * Britain & colonies before 1752, anywhere else before 1582, * and is still in use by some communities) leave out the * -year/100+year/400 terms, and add 10.] * * This algorithm was first published by Gauss (I think). * * A leap second can be indicated by calling this function with sec as * 60 (allowable under ISO 8601). The leap second is treated the same * as the following second since they don't exist in UNIX time. * * An encoding of midnight at the end of the day as 24:00:00 - ie. midnight * tomorrow - (allowable under ISO 8601) is supported. * * Return: seconds since the epoch time for the given input date */ time64_t mktime64(const unsigned int year0, const unsigned int mon0, const unsigned int day, const unsigned int hour, const unsigned int min, const unsigned int sec) { unsigned int mon = mon0, year = year0; /* 1..12 -> 11,12,1..10 */ if (0 >= (int) (mon -= 2)) { mon += 12; /* Puts Feb last since it has leap day */ year -= 1; } return ((((time64_t) (year/4 - year/100 + year/400 + 367*mon/12 + day) + year*365 - 719499 )*24 + hour /* now have hours - midnight tomorrow handled here */ )*60 + min /* now have minutes */ )*60 + sec; /* finally seconds */ } EXPORT_SYMBOL(mktime64); struct __kernel_old_timeval ns_to_kernel_old_timeval(s64 nsec) { struct timespec64 ts = ns_to_timespec64(nsec); struct __kernel_old_timeval tv; tv.tv_sec = ts.tv_sec; tv.tv_usec = (suseconds_t)ts.tv_nsec / 1000; return tv; } EXPORT_SYMBOL(ns_to_kernel_old_timeval); /** * set_normalized_timespec64 - set timespec sec and nsec parts and normalize * * @ts: pointer to timespec variable to be set * @sec: seconds to set * @nsec: nanoseconds to set * * Set seconds and nanoseconds field of a timespec variable and * normalize to the timespec storage format * * Note: The tv_nsec part is always in the range of 0 <= tv_nsec < NSEC_PER_SEC. * For negative values only the tv_sec field is negative ! */ void set_normalized_timespec64(struct timespec64 *ts, time64_t sec, s64 nsec) { while (nsec >= NSEC_PER_SEC) { /* * The following asm() prevents the compiler from * optimising this loop into a modulo operation. See * also __iter_div_u64_rem() in include/linux/time.h */ asm("" : "+rm"(nsec)); nsec -= NSEC_PER_SEC; ++sec; } while (nsec < 0) { asm("" : "+rm"(nsec)); nsec += NSEC_PER_SEC; --sec; } ts->tv_sec = sec; ts->tv_nsec = nsec; } EXPORT_SYMBOL(set_normalized_timespec64); /** * ns_to_timespec64 - Convert nanoseconds to timespec64 * @nsec: the nanoseconds value to be converted * * Return: the timespec64 representation of the nsec parameter. */ struct timespec64 ns_to_timespec64(s64 nsec) { struct timespec64 ts = { 0, 0 }; s32 rem; if (likely(nsec > 0)) { ts.tv_sec = div_u64_rem(nsec, NSEC_PER_SEC, &rem); ts.tv_nsec = rem; } else if (nsec < 0) { /* * With negative times, tv_sec points to the earlier * second, and tv_nsec counts the nanoseconds since * then, so tv_nsec is always a positive number. */ ts.tv_sec = -div_u64_rem(-nsec - 1, NSEC_PER_SEC, &rem) - 1; ts.tv_nsec = NSEC_PER_SEC - rem - 1; } return ts; } EXPORT_SYMBOL(ns_to_timespec64); /** * __msecs_to_jiffies: - convert milliseconds to jiffies * @m: time in milliseconds * * conversion is done as follows: * * - negative values mean 'infinite timeout' (MAX_JIFFY_OFFSET) * * - 'too large' values [that would result in larger than * MAX_JIFFY_OFFSET values] mean 'infinite timeout' too. * * - all other values are converted to jiffies by either multiplying * the input value by a factor or dividing it with a factor and * handling any 32-bit overflows. * for the details see _msecs_to_jiffies() * * msecs_to_jiffies() checks for the passed in value being a constant * via __builtin_constant_p() allowing gcc to eliminate most of the * code, __msecs_to_jiffies() is called if the value passed does not * allow constant folding and the actual conversion must be done at * runtime. * The _msecs_to_jiffies helpers are the HZ dependent conversion * routines found in include/linux/jiffies.h * * Return: jiffies value */ unsigned long __msecs_to_jiffies(const unsigned int m) { /* * Negative value, means infinite timeout: */ if ((int)m < 0) return MAX_JIFFY_OFFSET; return _msecs_to_jiffies(m); } EXPORT_SYMBOL(__msecs_to_jiffies); /** * __usecs_to_jiffies: - convert microseconds to jiffies * @u: time in milliseconds * * Return: jiffies value */ unsigned long __usecs_to_jiffies(const unsigned int u) { if (u > jiffies_to_usecs(MAX_JIFFY_OFFSET)) return MAX_JIFFY_OFFSET; return _usecs_to_jiffies(u); } EXPORT_SYMBOL(__usecs_to_jiffies); /** * timespec64_to_jiffies - convert a timespec64 value to jiffies * @value: pointer to &struct timespec64 * * The TICK_NSEC - 1 rounds up the value to the next resolution. Note * that a remainder subtract here would not do the right thing as the * resolution values don't fall on second boundaries. I.e. the line: * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding. * Note that due to the small error in the multiplier here, this * rounding is incorrect for sufficiently large values of tv_nsec, but * well formed timespecs should have tv_nsec < NSEC_PER_SEC, so we're * OK. * * Rather, we just shift the bits off the right. * * The >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC) converts the scaled nsec * value to a scaled second value. * * Return: jiffies value */ unsigned long timespec64_to_jiffies(const struct timespec64 *value) { u64 sec = value->tv_sec; long nsec = value->tv_nsec + TICK_NSEC - 1; if (sec >= MAX_SEC_IN_JIFFIES){ sec = MAX_SEC_IN_JIFFIES; nsec = 0; } return ((sec * SEC_CONVERSION) + (((u64)nsec * NSEC_CONVERSION) >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC; } EXPORT_SYMBOL(timespec64_to_jiffies); /** * jiffies_to_timespec64 - convert jiffies value to &struct timespec64 * @jiffies: jiffies value * @value: pointer to &struct timespec64 */ void jiffies_to_timespec64(const unsigned long jiffies, struct timespec64 *value) { /* * Convert jiffies to nanoseconds and separate with * one divide. */ u32 rem; value->tv_sec = div_u64_rem((u64)jiffies * TICK_NSEC, NSEC_PER_SEC, &rem); value->tv_nsec = rem; } EXPORT_SYMBOL(jiffies_to_timespec64); /* * Convert jiffies/jiffies_64 to clock_t and back. */ /** * jiffies_to_clock_t - Convert jiffies to clock_t * @x: jiffies value * * Return: jiffies converted to clock_t (CLOCKS_PER_SEC) */ clock_t jiffies_to_clock_t(unsigned long x) { #if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0 # if HZ < USER_HZ return x * (USER_HZ / HZ); # else return x / (HZ / USER_HZ); # endif #else return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ); #endif } EXPORT_SYMBOL(jiffies_to_clock_t); /** * clock_t_to_jiffies - Convert clock_t to jiffies * @x: clock_t value * * Return: clock_t value converted to jiffies */ unsigned long clock_t_to_jiffies(unsigned long x) { #if (HZ % USER_HZ)==0 if (x >= ~0UL / (HZ / USER_HZ)) return ~0UL; return x * (HZ / USER_HZ); #else /* Don't worry about loss of precision here .. */ if (x >= ~0UL / HZ * USER_HZ) return ~0UL; /* .. but do try to contain it here */ return div_u64((u64)x * HZ, USER_HZ); #endif } EXPORT_SYMBOL(clock_t_to_jiffies); /** * jiffies_64_to_clock_t - Convert jiffies_64 to clock_t * @x: jiffies_64 value * * Return: jiffies_64 value converted to 64-bit "clock_t" (CLOCKS_PER_SEC) */ u64 jiffies_64_to_clock_t(u64 x) { #if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0 # if HZ < USER_HZ x = div_u64(x * USER_HZ, HZ); # elif HZ > USER_HZ x = div_u64(x, HZ / USER_HZ); # else /* Nothing to do */ # endif #else /* * There are better ways that don't overflow early, * but even this doesn't overflow in hundreds of years * in 64 bits, so.. */ x = div_u64(x * TICK_NSEC, (NSEC_PER_SEC / USER_HZ)); #endif return x; } EXPORT_SYMBOL(jiffies_64_to_clock_t); /** * nsec_to_clock_t - Convert nsec value to clock_t * @x: nsec value * * Return: nsec value converted to 64-bit "clock_t" (CLOCKS_PER_SEC) */ u64 nsec_to_clock_t(u64 x) { #if (NSEC_PER_SEC % USER_HZ) == 0 return div_u64(x, NSEC_PER_SEC / USER_HZ); #elif (USER_HZ % 512) == 0 return div_u64(x * USER_HZ / 512, NSEC_PER_SEC / 512); #else /* * max relative error 5.7e-8 (1.8s per year) for USER_HZ <= 1024, * overflow after 64.99 years. * exact for HZ=60, 72, 90, 120, 144, 180, 300, 600, 900, ... */ return div_u64(x * 9, (9ull * NSEC_PER_SEC + (USER_HZ / 2)) / USER_HZ); #endif } /** * jiffies64_to_nsecs - Convert jiffies64 to nanoseconds * @j: jiffies64 value * * Return: nanoseconds value */ u64 jiffies64_to_nsecs(u64 j) { #if !(NSEC_PER_SEC % HZ) return (NSEC_PER_SEC / HZ) * j; # else return div_u64(j * HZ_TO_NSEC_NUM, HZ_TO_NSEC_DEN); #endif } EXPORT_SYMBOL(jiffies64_to_nsecs); /** * jiffies64_to_msecs - Convert jiffies64 to milliseconds * @j: jiffies64 value * * Return: milliseconds value */ u64 jiffies64_to_msecs(const u64 j) { #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ) return (MSEC_PER_SEC / HZ) * j; #else return div_u64(j * HZ_TO_MSEC_NUM, HZ_TO_MSEC_DEN); #endif } EXPORT_SYMBOL(jiffies64_to_msecs); /** * nsecs_to_jiffies64 - Convert nsecs in u64 to jiffies64 * * @n: nsecs in u64 * * Unlike {m,u}secs_to_jiffies, type of input is not unsigned int but u64. * And this doesn't return MAX_JIFFY_OFFSET since this function is designed * for scheduler, not for use in device drivers to calculate timeout value. * * note: * NSEC_PER_SEC = 10^9 = (5^9 * 2^9) = (1953125 * 512) * ULLONG_MAX ns = 18446744073.709551615 secs = about 584 years * * Return: nsecs converted to jiffies64 value */ u64 nsecs_to_jiffies64(u64 n) { #if (NSEC_PER_SEC % HZ) == 0 /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */ return div_u64(n, NSEC_PER_SEC / HZ); #elif (HZ % 512) == 0 /* overflow after 292 years if HZ = 1024 */ return div_u64(n * HZ / 512, NSEC_PER_SEC / 512); #else /* * Generic case - optimized for cases where HZ is a multiple of 3. * overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc. */ return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ); #endif } EXPORT_SYMBOL(nsecs_to_jiffies64); /** * nsecs_to_jiffies - Convert nsecs in u64 to jiffies * * @n: nsecs in u64 * * Unlike {m,u}secs_to_jiffies, type of input is not unsigned int but u64. * And this doesn't return MAX_JIFFY_OFFSET since this function is designed * for scheduler, not for use in device drivers to calculate timeout value. * * note: * NSEC_PER_SEC = 10^9 = (5^9 * 2^9) = (1953125 * 512) * ULLONG_MAX ns = 18446744073.709551615 secs = about 584 years * * Return: nsecs converted to jiffies value */ unsigned long nsecs_to_jiffies(u64 n) { return (unsigned long)nsecs_to_jiffies64(n); } EXPORT_SYMBOL_GPL(nsecs_to_jiffies); /** * timespec64_add_safe - Add two timespec64 values and do a safety check * for overflow. * @lhs: first (left) timespec64 to add * @rhs: second (right) timespec64 to add * * It's assumed that both values are valid (>= 0). * And, each timespec64 is in normalized form. * * Return: sum of @lhs + @rhs */ struct timespec64 timespec64_add_safe(const struct timespec64 lhs, const struct timespec64 rhs) { struct timespec64 res; set_normalized_timespec64(&res, (timeu64_t) lhs.tv_sec + rhs.tv_sec, lhs.tv_nsec + rhs.tv_nsec); if (unlikely(res.tv_sec < lhs.tv_sec || res.tv_sec < rhs.tv_sec)) { res.tv_sec = TIME64_MAX; res.tv_nsec = 0; } return res; } EXPORT_SYMBOL_GPL(timespec64_add_safe); /** * get_timespec64 - get user's time value into kernel space * @ts: destination &struct timespec64 * @uts: user's time value as &struct __kernel_timespec * * Handles compat or 32-bit modes. * * Return: 0 on success or negative errno on error */ int get_timespec64(struct timespec64 *ts, const struct __kernel_timespec __user *uts) { struct __kernel_timespec kts; int ret; ret = copy_from_user(&kts, uts, sizeof(kts)); if (ret) return -EFAULT; ts->tv_sec = kts.tv_sec; /* Zero out the padding in compat mode */ if (in_compat_syscall()) kts.tv_nsec &= 0xFFFFFFFFUL; /* In 32-bit mode, this drops the padding */ ts->tv_nsec = kts.tv_nsec; return 0; } EXPORT_SYMBOL_GPL(get_timespec64); /** * put_timespec64 - convert timespec64 value to __kernel_timespec format and * copy the latter to userspace * @ts: input &struct timespec64 * @uts: user's &struct __kernel_timespec * * Return: 0 on success or negative errno on error */ int put_timespec64(const struct timespec64 *ts, struct __kernel_timespec __user *uts) { struct __kernel_timespec kts = { .tv_sec = ts->tv_sec, .tv_nsec = ts->tv_nsec }; return copy_to_user(uts, &kts, sizeof(kts)) ? -EFAULT : 0; } EXPORT_SYMBOL_GPL(put_timespec64); static int __get_old_timespec32(struct timespec64 *ts64, const struct old_timespec32 __user *cts) { struct old_timespec32 ts; int ret; ret = copy_from_user(&ts, cts, sizeof(ts)); if (ret) return -EFAULT; ts64->tv_sec = ts.tv_sec; ts64->tv_nsec = ts.tv_nsec; return 0; } static int __put_old_timespec32(const struct timespec64 *ts64, struct old_timespec32 __user *cts) { struct old_timespec32 ts = { .tv_sec = ts64->tv_sec, .tv_nsec = ts64->tv_nsec }; return copy_to_user(cts, &ts, sizeof(ts)) ? -EFAULT : 0; } /** * get_old_timespec32 - get user's old-format time value into kernel space * @ts: destination &struct timespec64 * @uts: user's old-format time value (&struct old_timespec32) * * Handles X86_X32_ABI compatibility conversion. * * Return: 0 on success or negative errno on error */ int get_old_timespec32(struct timespec64 *ts, const void __user *uts) { if (COMPAT_USE_64BIT_TIME) return copy_from_user(ts, uts, sizeof(*ts)) ? -EFAULT : 0; else return __get_old_timespec32(ts, uts); } EXPORT_SYMBOL_GPL(get_old_timespec32); /** * put_old_timespec32 - convert timespec64 value to &struct old_timespec32 and * copy the latter to userspace * @ts: input &struct timespec64 * @uts: user's &struct old_timespec32 * * Handles X86_X32_ABI compatibility conversion. * * Return: 0 on success or negative errno on error */ int put_old_timespec32(const struct timespec64 *ts, void __user *uts) { if (COMPAT_USE_64BIT_TIME) return copy_to_user(uts, ts, sizeof(*ts)) ? -EFAULT : 0; else return __put_old_timespec32(ts, uts); } EXPORT_SYMBOL_GPL(put_old_timespec32); /** * get_itimerspec64 - get user's &struct __kernel_itimerspec into kernel space * @it: destination &struct itimerspec64 * @uit: user's &struct __kernel_itimerspec * * Return: 0 on success or negative errno on error */ int get_itimerspec64(struct itimerspec64 *it, const struct __kernel_itimerspec __user *uit) { int ret; ret = get_timespec64(&it->it_interval, &uit->it_interval); if (ret) return ret; ret = get_timespec64(&it->it_value, &uit->it_value); return ret; } EXPORT_SYMBOL_GPL(get_itimerspec64); /** * put_itimerspec64 - convert &struct itimerspec64 to __kernel_itimerspec format * and copy the latter to userspace * @it: input &struct itimerspec64 * @uit: user's &struct __kernel_itimerspec * * Return: 0 on success or negative errno on error */ int put_itimerspec64(const struct itimerspec64 *it, struct __kernel_itimerspec __user *uit) { int ret; ret = put_timespec64(&it->it_interval, &uit->it_interval); if (ret) return ret; ret = put_timespec64(&it->it_value, &uit->it_value); return ret; } EXPORT_SYMBOL_GPL(put_itimerspec64); /** * get_old_itimerspec32 - get user's &struct old_itimerspec32 into kernel space * @its: destination &struct itimerspec64 * @uits: user's &struct old_itimerspec32 * * Return: 0 on success or negative errno on error */ int get_old_itimerspec32(struct itimerspec64 *its, const struct old_itimerspec32 __user *uits) { if (__get_old_timespec32(&its->it_interval, &uits->it_interval) || __get_old_timespec32(&its->it_value, &uits->it_value)) return -EFAULT; return 0; } EXPORT_SYMBOL_GPL(get_old_itimerspec32); /** * put_old_itimerspec32 - convert &struct itimerspec64 to &struct * old_itimerspec32 and copy the latter to userspace * @its: input &struct itimerspec64 * @uits: user's &struct old_itimerspec32 * * Return: 0 on success or negative errno on error */ int put_old_itimerspec32(const struct itimerspec64 *its, struct old_itimerspec32 __user *uits) { if (__put_old_timespec32(&its->it_interval, &uits->it_interval) || __put_old_timespec32(&its->it_value, &uits->it_value)) return -EFAULT; return 0; } EXPORT_SYMBOL_GPL(put_old_itimerspec32); |
| 5 5 6 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _NF_NAT_H #define _NF_NAT_H #include <linux/list.h> #include <linux/netfilter_ipv4.h> #include <linux/netfilter/nf_conntrack_pptp.h> #include <net/netfilter/nf_conntrack.h> #include <net/netfilter/nf_conntrack_extend.h> #include <net/netfilter/nf_conntrack_tuple.h> #include <uapi/linux/netfilter/nf_nat.h> enum nf_nat_manip_type { NF_NAT_MANIP_SRC, NF_NAT_MANIP_DST }; /* SRC manip occurs POST_ROUTING or LOCAL_IN */ #define HOOK2MANIP(hooknum) ((hooknum) != NF_INET_POST_ROUTING && \ (hooknum) != NF_INET_LOCAL_IN) /* per conntrack: nat application helper private data */ union nf_conntrack_nat_help { /* insert nat helper private data here */ #if IS_ENABLED(CONFIG_NF_NAT_PPTP) struct nf_nat_pptp nat_pptp_info; #endif }; /* The structure embedded in the conntrack structure. */ struct nf_conn_nat { union nf_conntrack_nat_help help; #if IS_ENABLED(CONFIG_NF_NAT_MASQUERADE) int masq_index; #endif }; /* Set up the info structure to map into this range. */ unsigned int nf_nat_setup_info(struct nf_conn *ct, const struct nf_nat_range2 *range, enum nf_nat_manip_type maniptype); extern unsigned int nf_nat_alloc_null_binding(struct nf_conn *ct, unsigned int hooknum); struct nf_conn_nat *nf_ct_nat_ext_add(struct nf_conn *ct); static inline struct nf_conn_nat *nfct_nat(const struct nf_conn *ct) { #if IS_ENABLED(CONFIG_NF_NAT) return nf_ct_ext_find(ct, NF_CT_EXT_NAT); #else return NULL; #endif } static inline bool nf_nat_oif_changed(unsigned int hooknum, enum ip_conntrack_info ctinfo, struct nf_conn_nat *nat, const struct net_device *out) { #if IS_ENABLED(CONFIG_NF_NAT_MASQUERADE) return nat && nat->masq_index && hooknum == NF_INET_POST_ROUTING && CTINFO2DIR(ctinfo) == IP_CT_DIR_ORIGINAL && nat->masq_index != out->ifindex; #else return false; #endif } int nf_nat_register_fn(struct net *net, u8 pf, const struct nf_hook_ops *ops, const struct nf_hook_ops *nat_ops, unsigned int ops_count); void nf_nat_unregister_fn(struct net *net, u8 pf, const struct nf_hook_ops *ops, unsigned int ops_count); unsigned int nf_nat_packet(struct nf_conn *ct, enum ip_conntrack_info ctinfo, unsigned int hooknum, struct sk_buff *skb); unsigned int nf_nat_manip_pkt(struct sk_buff *skb, struct nf_conn *ct, enum nf_nat_manip_type mtype, enum ip_conntrack_dir dir); void nf_nat_csum_recalc(struct sk_buff *skb, u8 nfproto, u8 proto, void *data, __sum16 *check, int datalen, int oldlen); int nf_nat_icmp_reply_translation(struct sk_buff *skb, struct nf_conn *ct, enum ip_conntrack_info ctinfo, unsigned int hooknum); int nf_nat_icmpv6_reply_translation(struct sk_buff *skb, struct nf_conn *ct, enum ip_conntrack_info ctinfo, unsigned int hooknum, unsigned int hdrlen); int nf_nat_ipv4_register_fn(struct net *net, const struct nf_hook_ops *ops); void nf_nat_ipv4_unregister_fn(struct net *net, const struct nf_hook_ops *ops); int nf_nat_ipv6_register_fn(struct net *net, const struct nf_hook_ops *ops); void nf_nat_ipv6_unregister_fn(struct net *net, const struct nf_hook_ops *ops); int nf_nat_inet_register_fn(struct net *net, const struct nf_hook_ops *ops); void nf_nat_inet_unregister_fn(struct net *net, const struct nf_hook_ops *ops); unsigned int nf_nat_inet_fn(void *priv, struct sk_buff *skb, const struct nf_hook_state *state); int nf_ct_nat(struct sk_buff *skb, struct nf_conn *ct, enum ip_conntrack_info ctinfo, int *action, const struct nf_nat_range2 *range, bool commit); static inline int nf_nat_initialized(const struct nf_conn *ct, enum nf_nat_manip_type manip) { if (manip == NF_NAT_MANIP_SRC) return ct->status & IPS_SRC_NAT_DONE; else return ct->status & IPS_DST_NAT_DONE; } #endif |
| 4 2 11 10 11 11 11 13 13 13 13 13 11 11 4 11 9 8 2 2 2 2 2 2 2 13 13 2 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 | // SPDX-License-Identifier: GPL-2.0-only #include <linux/ethtool_netlink.h> #include <linux/bitmap.h> #include "netlink.h" #include "bitset.h" /* Some bitmaps are internally represented as an array of unsigned long, some * as an array of u32 (some even as single u32 for now). To avoid the need of * wrappers on caller side, we provide two set of functions: those with "32" * suffix in their names expect u32 based bitmaps, those without it expect * unsigned long bitmaps. */ static u32 ethnl_lower_bits(unsigned int n) { return ~(u32)0 >> (32 - n % 32); } static u32 ethnl_upper_bits(unsigned int n) { return ~(u32)0 << (n % 32); } /** * ethnl_bitmap32_clear() - Clear u32 based bitmap * @dst: bitmap to clear * @start: beginning of the interval * @end: end of the interval * @mod: set if bitmap was modified * * Clear @nbits bits of a bitmap with indices @start <= i < @end */ static void ethnl_bitmap32_clear(u32 *dst, unsigned int start, unsigned int end, bool *mod) { unsigned int start_word = start / 32; unsigned int end_word = end / 32; unsigned int i; u32 mask; if (end <= start) return; if (start % 32) { mask = ethnl_upper_bits(start); if (end_word == start_word) { mask &= ethnl_lower_bits(end); if (dst[start_word] & mask) { dst[start_word] &= ~mask; *mod = true; } return; } if (dst[start_word] & mask) { dst[start_word] &= ~mask; *mod = true; } start_word++; } for (i = start_word; i < end_word; i++) { if (dst[i]) { dst[i] = 0; *mod = true; } } if (end % 32) { mask = ethnl_lower_bits(end); if (dst[end_word] & mask) { dst[end_word] &= ~mask; *mod = true; } } } /** * ethnl_bitmap32_not_zero() - Check if any bit is set in an interval * @map: bitmap to test * @start: beginning of the interval * @end: end of the interval * * Return: true if there is non-zero bit with index @start <= i < @end, * false if the whole interval is zero */ static bool ethnl_bitmap32_not_zero(const u32 *map, unsigned int start, unsigned int end) { unsigned int start_word = start / 32; unsigned int end_word = end / 32; u32 mask; if (end <= start) return true; if (start % 32) { mask = ethnl_upper_bits(start); if (end_word == start_word) { mask &= ethnl_lower_bits(end); return map[start_word] & mask; } if (map[start_word] & mask) return true; start_word++; } if (!memchr_inv(map + start_word, '\0', (end_word - start_word) * sizeof(u32))) return true; if (end % 32 == 0) return true; return map[end_word] & ethnl_lower_bits(end); } /** * ethnl_bitmap32_update() - Modify u32 based bitmap according to value/mask * pair * @dst: bitmap to update * @nbits: bit size of the bitmap * @value: values to set * @mask: mask of bits to set * @mod: set to true if bitmap is modified, preserve if not * * Set bits in @dst bitmap which are set in @mask to values from @value, leave * the rest untouched. If destination bitmap was modified, set @mod to true, * leave as it is if not. */ static void ethnl_bitmap32_update(u32 *dst, unsigned int nbits, const u32 *value, const u32 *mask, bool *mod) { while (nbits > 0) { u32 real_mask = mask ? *mask : ~(u32)0; u32 new_value; if (nbits < 32) real_mask &= ethnl_lower_bits(nbits); new_value = (*dst & ~real_mask) | (*value & real_mask); if (new_value != *dst) { *dst = new_value; *mod = true; } if (nbits <= 32) break; dst++; nbits -= 32; value++; if (mask) mask++; } } static bool ethnl_bitmap32_test_bit(const u32 *map, unsigned int index) { return map[index / 32] & (1U << (index % 32)); } /** * ethnl_bitset32_size() - Calculate size of bitset nested attribute * @val: value bitmap (u32 based) * @mask: mask bitmap (u32 based, optional) * @nbits: bit length of the bitset * @names: array of bit names (optional) * @compact: assume compact format for output * * Estimate length of netlink attribute composed by a later call to * ethnl_put_bitset32() call with the same arguments. * * Return: negative error code or attribute length estimate */ int ethnl_bitset32_size(const u32 *val, const u32 *mask, unsigned int nbits, ethnl_string_array_t names, bool compact) { unsigned int len = 0; /* list flag */ if (!mask) len += nla_total_size(sizeof(u32)); /* size */ len += nla_total_size(sizeof(u32)); if (compact) { unsigned int nwords = DIV_ROUND_UP(nbits, 32); /* value, mask */ len += (mask ? 2 : 1) * nla_total_size(nwords * sizeof(u32)); } else { unsigned int bits_len = 0; unsigned int bit_len, i; for (i = 0; i < nbits; i++) { const char *name = names ? names[i] : NULL; if (!ethnl_bitmap32_test_bit(mask ?: val, i)) continue; /* index */ bit_len = nla_total_size(sizeof(u32)); /* name */ if (name) bit_len += ethnl_strz_size(name); /* value */ if (mask && ethnl_bitmap32_test_bit(val, i)) bit_len += nla_total_size(0); /* bit nest */ bits_len += nla_total_size(bit_len); } /* bits nest */ len += nla_total_size(bits_len); } /* outermost nest */ return nla_total_size(len); } /** * ethnl_put_bitset32() - Put a bitset nest into a message * @skb: skb with the message * @attrtype: attribute type for the bitset nest * @val: value bitmap (u32 based) * @mask: mask bitmap (u32 based, optional) * @nbits: bit length of the bitset * @names: array of bit names (optional) * @compact: use compact format for the output * * Compose a nested attribute representing a bitset. If @mask is null, simple * bitmap (bit list) is created, if @mask is provided, represent a value/mask * pair. Bit names are only used in verbose mode and when provided by calller. * * Return: 0 on success, negative error value on error */ int ethnl_put_bitset32(struct sk_buff *skb, int attrtype, const u32 *val, const u32 *mask, unsigned int nbits, ethnl_string_array_t names, bool compact) { struct nlattr *nest; struct nlattr *attr; nest = nla_nest_start(skb, attrtype); if (!nest) return -EMSGSIZE; if (!mask && nla_put_flag(skb, ETHTOOL_A_BITSET_NOMASK)) goto nla_put_failure; if (nla_put_u32(skb, ETHTOOL_A_BITSET_SIZE, nbits)) goto nla_put_failure; if (compact) { unsigned int nwords = DIV_ROUND_UP(nbits, 32); unsigned int nbytes = nwords * sizeof(u32); u32 *dst; attr = nla_reserve(skb, ETHTOOL_A_BITSET_VALUE, nbytes); if (!attr) goto nla_put_failure; dst = nla_data(attr); memcpy(dst, val, nbytes); if (nbits % 32) dst[nwords - 1] &= ethnl_lower_bits(nbits); if (mask) { attr = nla_reserve(skb, ETHTOOL_A_BITSET_MASK, nbytes); if (!attr) goto nla_put_failure; dst = nla_data(attr); memcpy(dst, mask, nbytes); if (nbits % 32) dst[nwords - 1] &= ethnl_lower_bits(nbits); } } else { struct nlattr *bits; unsigned int i; bits = nla_nest_start(skb, ETHTOOL_A_BITSET_BITS); if (!bits) goto nla_put_failure; for (i = 0; i < nbits; i++) { const char *name = names ? names[i] : NULL; if (!ethnl_bitmap32_test_bit(mask ?: val, i)) continue; attr = nla_nest_start(skb, ETHTOOL_A_BITSET_BITS_BIT); if (!attr) goto nla_put_failure; if (nla_put_u32(skb, ETHTOOL_A_BITSET_BIT_INDEX, i)) goto nla_put_failure; if (name && ethnl_put_strz(skb, ETHTOOL_A_BITSET_BIT_NAME, name)) goto nla_put_failure; if (mask && ethnl_bitmap32_test_bit(val, i) && nla_put_flag(skb, ETHTOOL_A_BITSET_BIT_VALUE)) goto nla_put_failure; nla_nest_end(skb, attr); } nla_nest_end(skb, bits); } nla_nest_end(skb, nest); return 0; nla_put_failure: nla_nest_cancel(skb, nest); return -EMSGSIZE; } static const struct nla_policy bitset_policy[] = { [ETHTOOL_A_BITSET_NOMASK] = { .type = NLA_FLAG }, [ETHTOOL_A_BITSET_SIZE] = NLA_POLICY_MAX(NLA_U32, ETHNL_MAX_BITSET_SIZE), [ETHTOOL_A_BITSET_BITS] = { .type = NLA_NESTED }, [ETHTOOL_A_BITSET_VALUE] = { .type = NLA_BINARY }, [ETHTOOL_A_BITSET_MASK] = { .type = NLA_BINARY }, }; static const struct nla_policy bit_policy[] = { [ETHTOOL_A_BITSET_BIT_INDEX] = { .type = NLA_U32 }, [ETHTOOL_A_BITSET_BIT_NAME] = { .type = NLA_NUL_STRING }, [ETHTOOL_A_BITSET_BIT_VALUE] = { .type = NLA_FLAG }, }; /** * ethnl_bitset_is_compact() - check if bitset attribute represents a compact * bitset * @bitset: nested attribute representing a bitset * @compact: pointer for return value * * Return: 0 on success, negative error code on failure */ int ethnl_bitset_is_compact(const struct nlattr *bitset, bool *compact) { struct nlattr *tb[ARRAY_SIZE(bitset_policy)]; int ret; ret = nla_parse_nested(tb, ARRAY_SIZE(bitset_policy) - 1, bitset, bitset_policy, NULL); if (ret < 0) return ret; if (tb[ETHTOOL_A_BITSET_BITS]) { if (tb[ETHTOOL_A_BITSET_VALUE] || tb[ETHTOOL_A_BITSET_MASK]) return -EINVAL; *compact = false; return 0; } if (!tb[ETHTOOL_A_BITSET_SIZE] || !tb[ETHTOOL_A_BITSET_VALUE]) return -EINVAL; *compact = true; return 0; } /** * ethnl_name_to_idx() - look up string index for a name * @names: array of ETH_GSTRING_LEN sized strings * @n_names: number of strings in the array * @name: name to look up * * Return: index of the string if found, -ENOENT if not found */ static int ethnl_name_to_idx(ethnl_string_array_t names, unsigned int n_names, const char *name) { unsigned int i; if (!names) return -ENOENT; for (i = 0; i < n_names; i++) { /* names[i] may not be null terminated */ if (!strncmp(names[i], name, ETH_GSTRING_LEN) && strlen(name) <= ETH_GSTRING_LEN) return i; } return -ENOENT; } static int ethnl_parse_bit(unsigned int *index, bool *val, unsigned int nbits, const struct nlattr *bit_attr, bool no_mask, ethnl_string_array_t names, struct netlink_ext_ack *extack) { struct nlattr *tb[ARRAY_SIZE(bit_policy)]; int ret, idx; ret = nla_parse_nested(tb, ARRAY_SIZE(bit_policy) - 1, bit_attr, bit_policy, extack); if (ret < 0) return ret; if (tb[ETHTOOL_A_BITSET_BIT_INDEX]) { const char *name; idx = nla_get_u32(tb[ETHTOOL_A_BITSET_BIT_INDEX]); if (idx >= nbits) { NL_SET_ERR_MSG_ATTR(extack, tb[ETHTOOL_A_BITSET_BIT_INDEX], "bit index too high"); return -EOPNOTSUPP; } name = names ? names[idx] : NULL; if (tb[ETHTOOL_A_BITSET_BIT_NAME] && name && strncmp(nla_data(tb[ETHTOOL_A_BITSET_BIT_NAME]), name, nla_len(tb[ETHTOOL_A_BITSET_BIT_NAME]))) { NL_SET_ERR_MSG_ATTR(extack, bit_attr, "bit index and name mismatch"); return -EINVAL; } } else if (tb[ETHTOOL_A_BITSET_BIT_NAME]) { idx = ethnl_name_to_idx(names, nbits, nla_data(tb[ETHTOOL_A_BITSET_BIT_NAME])); if (idx < 0) { NL_SET_ERR_MSG_ATTR(extack, tb[ETHTOOL_A_BITSET_BIT_NAME], "bit name not found"); return -EOPNOTSUPP; } } else { NL_SET_ERR_MSG_ATTR(extack, bit_attr, "neither bit index nor name specified"); return -EINVAL; } *index = idx; *val = no_mask || tb[ETHTOOL_A_BITSET_BIT_VALUE]; return 0; } /** * ethnl_bitmap32_equal() - Compare two bitmaps * @map1: first bitmap * @map2: second bitmap * @nbits: bit size to compare * * Return: true if first @nbits are equal, false if not */ static bool ethnl_bitmap32_equal(const u32 *map1, const u32 *map2, unsigned int nbits) { if (memcmp(map1, map2, nbits / 32 * sizeof(u32))) return false; if (nbits % 32 == 0) return true; return !((map1[nbits / 32] ^ map2[nbits / 32]) & ethnl_lower_bits(nbits % 32)); } static int ethnl_update_bitset32_verbose(u32 *bitmap, unsigned int nbits, const struct nlattr *attr, struct nlattr **tb, ethnl_string_array_t names, struct netlink_ext_ack *extack, bool *mod) { u32 *saved_bitmap = NULL; struct nlattr *bit_attr; bool no_mask; int rem; int ret; if (tb[ETHTOOL_A_BITSET_VALUE]) { NL_SET_ERR_MSG_ATTR(extack, tb[ETHTOOL_A_BITSET_VALUE], "value only allowed in compact bitset"); return -EINVAL; } if (tb[ETHTOOL_A_BITSET_MASK]) { NL_SET_ERR_MSG_ATTR(extack, tb[ETHTOOL_A_BITSET_MASK], "mask only allowed in compact bitset"); return -EINVAL; } no_mask = tb[ETHTOOL_A_BITSET_NOMASK]; if (no_mask) { unsigned int nwords = DIV_ROUND_UP(nbits, 32); unsigned int nbytes = nwords * sizeof(u32); bool dummy; /* The bitmap size is only the size of the map part without * its mask part. */ saved_bitmap = kcalloc(nwords, sizeof(u32), GFP_KERNEL); if (!saved_bitmap) return -ENOMEM; memcpy(saved_bitmap, bitmap, nbytes); ethnl_bitmap32_clear(bitmap, 0, nbits, &dummy); } nla_for_each_nested(bit_attr, tb[ETHTOOL_A_BITSET_BITS], rem) { bool old_val, new_val; unsigned int idx; if (nla_type(bit_attr) != ETHTOOL_A_BITSET_BITS_BIT) { NL_SET_ERR_MSG_ATTR(extack, bit_attr, "only ETHTOOL_A_BITSET_BITS_BIT allowed in ETHTOOL_A_BITSET_BITS"); kfree(saved_bitmap); return -EINVAL; } ret = ethnl_parse_bit(&idx, &new_val, nbits, bit_attr, no_mask, names, extack); if (ret < 0) { kfree(saved_bitmap); return ret; } old_val = bitmap[idx / 32] & ((u32)1 << (idx % 32)); if (new_val != old_val) { if (new_val) bitmap[idx / 32] |= ((u32)1 << (idx % 32)); else bitmap[idx / 32] &= ~((u32)1 << (idx % 32)); if (!no_mask) *mod = true; } } if (no_mask && !ethnl_bitmap32_equal(saved_bitmap, bitmap, nbits)) *mod = true; kfree(saved_bitmap); return 0; } static int ethnl_compact_sanity_checks(unsigned int nbits, const struct nlattr *nest, struct nlattr **tb, struct netlink_ext_ack *extack) { bool no_mask = tb[ETHTOOL_A_BITSET_NOMASK]; unsigned int attr_nbits, attr_nwords; const struct nlattr *test_attr; if (no_mask && tb[ETHTOOL_A_BITSET_MASK]) { NL_SET_ERR_MSG_ATTR(extack, tb[ETHTOOL_A_BITSET_MASK], "mask not allowed in list bitset"); return -EINVAL; } if (!tb[ETHTOOL_A_BITSET_SIZE]) { NL_SET_ERR_MSG_ATTR(extack, nest, "missing size in compact bitset"); return -EINVAL; } if (!tb[ETHTOOL_A_BITSET_VALUE]) { NL_SET_ERR_MSG_ATTR(extack, nest, "missing value in compact bitset"); return -EINVAL; } if (!no_mask && !tb[ETHTOOL_A_BITSET_MASK]) { NL_SET_ERR_MSG_ATTR(extack, nest, "missing mask in compact nonlist bitset"); return -EINVAL; } attr_nbits = nla_get_u32(tb[ETHTOOL_A_BITSET_SIZE]); attr_nwords = DIV_ROUND_UP(attr_nbits, 32); if (nla_len(tb[ETHTOOL_A_BITSET_VALUE]) != attr_nwords * sizeof(u32)) { NL_SET_ERR_MSG_ATTR(extack, tb[ETHTOOL_A_BITSET_VALUE], "bitset value length does not match size"); return -EINVAL; } if (tb[ETHTOOL_A_BITSET_MASK] && nla_len(tb[ETHTOOL_A_BITSET_MASK]) != attr_nwords * sizeof(u32)) { NL_SET_ERR_MSG_ATTR(extack, tb[ETHTOOL_A_BITSET_MASK], "bitset mask length does not match size"); return -EINVAL; } if (attr_nbits <= nbits) return 0; test_attr = no_mask ? tb[ETHTOOL_A_BITSET_VALUE] : tb[ETHTOOL_A_BITSET_MASK]; if (ethnl_bitmap32_not_zero(nla_data(test_attr), nbits, attr_nbits)) { NL_SET_ERR_MSG_ATTR(extack, test_attr, "cannot modify bits past kernel bitset size"); return -EINVAL; } return 0; } /** * ethnl_update_bitset32() - Apply a bitset nest to a u32 based bitmap * @bitmap: bitmap to update * @nbits: size of the updated bitmap in bits * @attr: nest attribute to parse and apply * @names: array of bit names; may be null for compact format * @extack: extack for error reporting * @mod: set this to true if bitmap is modified, leave as it is if not * * Apply bitset netsted attribute to a bitmap. If the attribute represents * a bit list, @bitmap is set to its contents; otherwise, bits in mask are * set to values from value. Bitmaps in the attribute may be longer than * @nbits but the message must not request modifying any bits past @nbits. * * Return: negative error code on failure, 0 on success */ int ethnl_update_bitset32(u32 *bitmap, unsigned int nbits, const struct nlattr *attr, ethnl_string_array_t names, struct netlink_ext_ack *extack, bool *mod) { struct nlattr *tb[ARRAY_SIZE(bitset_policy)]; unsigned int change_bits; bool no_mask; int ret; if (!attr) return 0; ret = nla_parse_nested(tb, ARRAY_SIZE(bitset_policy) - 1, attr, bitset_policy, extack); if (ret < 0) return ret; if (tb[ETHTOOL_A_BITSET_BITS]) return ethnl_update_bitset32_verbose(bitmap, nbits, attr, tb, names, extack, mod); ret = ethnl_compact_sanity_checks(nbits, attr, tb, extack); if (ret < 0) return ret; no_mask = tb[ETHTOOL_A_BITSET_NOMASK]; change_bits = min_t(unsigned int, nla_get_u32(tb[ETHTOOL_A_BITSET_SIZE]), nbits); ethnl_bitmap32_update(bitmap, change_bits, nla_data(tb[ETHTOOL_A_BITSET_VALUE]), no_mask ? NULL : nla_data(tb[ETHTOOL_A_BITSET_MASK]), mod); if (no_mask && change_bits < nbits) ethnl_bitmap32_clear(bitmap, change_bits, nbits, mod); return 0; } /** * ethnl_parse_bitset() - Compute effective value and mask from bitset nest * @val: unsigned long based bitmap to put value into * @mask: unsigned long based bitmap to put mask into * @nbits: size of @val and @mask bitmaps * @attr: nest attribute to parse and apply * @names: array of bit names; may be null for compact format * @extack: extack for error reporting * * Provide @nbits size long bitmaps for value and mask so that * x = (val & mask) | (x & ~mask) would modify any @nbits sized bitmap x * the same way ethnl_update_bitset() with the same bitset attribute would. * * Return: negative error code on failure, 0 on success */ int ethnl_parse_bitset(unsigned long *val, unsigned long *mask, unsigned int nbits, const struct nlattr *attr, ethnl_string_array_t names, struct netlink_ext_ack *extack) { struct nlattr *tb[ARRAY_SIZE(bitset_policy)]; const struct nlattr *bit_attr; bool no_mask; int rem; int ret; if (!attr) return 0; ret = nla_parse_nested(tb, ARRAY_SIZE(bitset_policy) - 1, attr, bitset_policy, extack); if (ret < 0) return ret; no_mask = tb[ETHTOOL_A_BITSET_NOMASK]; if (!tb[ETHTOOL_A_BITSET_BITS]) { unsigned int change_bits; ret = ethnl_compact_sanity_checks(nbits, attr, tb, extack); if (ret < 0) return ret; change_bits = nla_get_u32(tb[ETHTOOL_A_BITSET_SIZE]); if (change_bits > nbits) change_bits = nbits; bitmap_from_arr32(val, nla_data(tb[ETHTOOL_A_BITSET_VALUE]), change_bits); if (change_bits < nbits) bitmap_clear(val, change_bits, nbits - change_bits); if (no_mask) { bitmap_fill(mask, nbits); } else { bitmap_from_arr32(mask, nla_data(tb[ETHTOOL_A_BITSET_MASK]), change_bits); if (change_bits < nbits) bitmap_clear(mask, change_bits, nbits - change_bits); } return 0; } if (tb[ETHTOOL_A_BITSET_VALUE]) { NL_SET_ERR_MSG_ATTR(extack, tb[ETHTOOL_A_BITSET_VALUE], "value only allowed in compact bitset"); return -EINVAL; } if (tb[ETHTOOL_A_BITSET_MASK]) { NL_SET_ERR_MSG_ATTR(extack, tb[ETHTOOL_A_BITSET_MASK], "mask only allowed in compact bitset"); return -EINVAL; } bitmap_zero(val, nbits); if (no_mask) bitmap_fill(mask, nbits); else bitmap_zero(mask, nbits); nla_for_each_nested(bit_attr, tb[ETHTOOL_A_BITSET_BITS], rem) { unsigned int idx; bool bit_val; ret = ethnl_parse_bit(&idx, &bit_val, nbits, bit_attr, no_mask, names, extack); if (ret < 0) return ret; if (bit_val) __set_bit(idx, val); if (!no_mask) __set_bit(idx, mask); } return 0; } #if BITS_PER_LONG == 64 && defined(__BIG_ENDIAN) /* 64-bit big endian architectures are the only case when u32 based bitmaps * and unsigned long based bitmaps have different memory layout so that we * cannot simply cast the latter to the former and need actual wrappers * converting the latter to the former. * * To reduce the number of slab allocations, the wrappers use fixed size local * variables for bitmaps up to ETHNL_SMALL_BITMAP_BITS bits which is the * majority of bitmaps used by ethtool. */ #define ETHNL_SMALL_BITMAP_BITS 128 #define ETHNL_SMALL_BITMAP_WORDS DIV_ROUND_UP(ETHNL_SMALL_BITMAP_BITS, 32) int ethnl_bitset_size(const unsigned long *val, const unsigned long *mask, unsigned int nbits, ethnl_string_array_t names, bool compact) { u32 small_mask32[ETHNL_SMALL_BITMAP_WORDS]; u32 small_val32[ETHNL_SMALL_BITMAP_WORDS]; u32 *mask32; u32 *val32; int ret; if (nbits > ETHNL_SMALL_BITMAP_BITS) { unsigned int nwords = DIV_ROUND_UP(nbits, 32); val32 = kmalloc_array(2 * nwords, sizeof(u32), GFP_KERNEL); if (!val32) return -ENOMEM; mask32 = val32 + nwords; } else { val32 = small_val32; mask32 = small_mask32; } bitmap_to_arr32(val32, val, nbits); if (mask) bitmap_to_arr32(mask32, mask, nbits); else mask32 = NULL; ret = ethnl_bitset32_size(val32, mask32, nbits, names, compact); if (nbits > ETHNL_SMALL_BITMAP_BITS) kfree(val32); return ret; } int ethnl_put_bitset(struct sk_buff *skb, int attrtype, const unsigned long *val, const unsigned long *mask, unsigned int nbits, ethnl_string_array_t names, bool compact) { u32 small_mask32[ETHNL_SMALL_BITMAP_WORDS]; u32 small_val32[ETHNL_SMALL_BITMAP_WORDS]; u32 *mask32; u32 *val32; int ret; if (nbits > ETHNL_SMALL_BITMAP_BITS) { unsigned int nwords = DIV_ROUND_UP(nbits, 32); val32 = kmalloc_array(2 * nwords, sizeof(u32), GFP_KERNEL); if (!val32) return -ENOMEM; mask32 = val32 + nwords; } else { val32 = small_val32; mask32 = small_mask32; } bitmap_to_arr32(val32, val, nbits); if (mask) bitmap_to_arr32(mask32, mask, nbits); else mask32 = NULL; ret = ethnl_put_bitset32(skb, attrtype, val32, mask32, nbits, names, compact); if (nbits > ETHNL_SMALL_BITMAP_BITS) kfree(val32); return ret; } int ethnl_update_bitset(unsigned long *bitmap, unsigned int nbits, const struct nlattr *attr, ethnl_string_array_t names, struct netlink_ext_ack *extack, bool *mod) { u32 small_bitmap32[ETHNL_SMALL_BITMAP_WORDS]; u32 *bitmap32 = small_bitmap32; bool u32_mod = false; int ret; if (nbits > ETHNL_SMALL_BITMAP_BITS) { unsigned int dst_words = DIV_ROUND_UP(nbits, 32); bitmap32 = kmalloc_array(dst_words, sizeof(u32), GFP_KERNEL); if (!bitmap32) return -ENOMEM; } bitmap_to_arr32(bitmap32, bitmap, nbits); ret = ethnl_update_bitset32(bitmap32, nbits, attr, names, extack, &u32_mod); if (u32_mod) { bitmap_from_arr32(bitmap, bitmap32, nbits); *mod = true; } if (nbits > ETHNL_SMALL_BITMAP_BITS) kfree(bitmap32); return ret; } #else /* On little endian 64-bit and all 32-bit architectures, an unsigned long * based bitmap can be interpreted as u32 based one using a simple cast. */ int ethnl_bitset_size(const unsigned long *val, const unsigned long *mask, unsigned int nbits, ethnl_string_array_t names, bool compact) { return ethnl_bitset32_size((const u32 *)val, (const u32 *)mask, nbits, names, compact); } int ethnl_put_bitset(struct sk_buff *skb, int attrtype, const unsigned long *val, const unsigned long *mask, unsigned int nbits, ethnl_string_array_t names, bool compact) { return ethnl_put_bitset32(skb, attrtype, (const u32 *)val, (const u32 *)mask, nbits, names, compact); } int ethnl_update_bitset(unsigned long *bitmap, unsigned int nbits, const struct nlattr *attr, ethnl_string_array_t names, struct netlink_ext_ack *extack, bool *mod) { return ethnl_update_bitset32((u32 *)bitmap, nbits, attr, names, extack, mod); } #endif /* BITS_PER_LONG == 64 && defined(__BIG_ENDIAN) */ |
| 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1 2 2 1 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 | // SPDX-License-Identifier: GPL-2.0 #include <linux/syscalls.h> #include <linux/export.h> #include <linux/fs.h> #include <linux/file.h> #include <linux/mount.h> #include <linux/namei.h> #include <linux/statfs.h> #include <linux/security.h> #include <linux/uaccess.h> #include <linux/compat.h> #include "internal.h" static int flags_by_mnt(int mnt_flags) { int flags = 0; if (mnt_flags & MNT_READONLY) flags |= ST_RDONLY; if (mnt_flags & MNT_NOSUID) flags |= ST_NOSUID; if (mnt_flags & MNT_NODEV) flags |= ST_NODEV; if (mnt_flags & MNT_NOEXEC) flags |= ST_NOEXEC; if (mnt_flags & MNT_NOATIME) flags |= ST_NOATIME; if (mnt_flags & MNT_NODIRATIME) flags |= ST_NODIRATIME; if (mnt_flags & MNT_RELATIME) flags |= ST_RELATIME; if (mnt_flags & MNT_NOSYMFOLLOW) flags |= ST_NOSYMFOLLOW; return flags; } static int flags_by_sb(int s_flags) { int flags = 0; if (s_flags & SB_SYNCHRONOUS) flags |= ST_SYNCHRONOUS; if (s_flags & SB_MANDLOCK) flags |= ST_MANDLOCK; if (s_flags & SB_RDONLY) flags |= ST_RDONLY; return flags; } static int calculate_f_flags(struct vfsmount *mnt) { return ST_VALID | flags_by_mnt(mnt->mnt_flags) | flags_by_sb(mnt->mnt_sb->s_flags); } static int statfs_by_dentry(struct dentry *dentry, struct kstatfs *buf) { int retval; if (!dentry->d_sb->s_op->statfs) return -ENOSYS; memset(buf, 0, sizeof(*buf)); retval = security_sb_statfs(dentry); if (retval) return retval; retval = dentry->d_sb->s_op->statfs(dentry, buf); if (retval == 0 && buf->f_frsize == 0) buf->f_frsize = buf->f_bsize; return retval; } int vfs_get_fsid(struct dentry *dentry, __kernel_fsid_t *fsid) { struct kstatfs st; int error; error = statfs_by_dentry(dentry, &st); if (error) return error; *fsid = st.f_fsid; return 0; } EXPORT_SYMBOL(vfs_get_fsid); int vfs_statfs(const struct path *path, struct kstatfs *buf) { int error; error = statfs_by_dentry(path->dentry, buf); if (!error) buf->f_flags = calculate_f_flags(path->mnt); return error; } EXPORT_SYMBOL(vfs_statfs); int user_statfs(const char __user *pathname, struct kstatfs *st) { struct path path; int error; unsigned int lookup_flags = LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT; retry: error = user_path_at(AT_FDCWD, pathname, lookup_flags, &path); if (!error) { error = vfs_statfs(&path, st); path_put(&path); if (retry_estale(error, lookup_flags)) { lookup_flags |= LOOKUP_REVAL; goto retry; } } return error; } int fd_statfs(int fd, struct kstatfs *st) { CLASS(fd_raw, f)(fd); if (fd_empty(f)) return -EBADF; return vfs_statfs(&fd_file(f)->f_path, st); } static int do_statfs_native(struct kstatfs *st, struct statfs __user *p) { struct statfs buf; if (sizeof(buf) == sizeof(*st)) memcpy(&buf, st, sizeof(*st)); else { memset(&buf, 0, sizeof(buf)); if (sizeof buf.f_blocks == 4) { if ((st->f_blocks | st->f_bfree | st->f_bavail | st->f_bsize | st->f_frsize) & 0xffffffff00000000ULL) return -EOVERFLOW; /* * f_files and f_ffree may be -1; it's okay to stuff * that into 32 bits */ if (st->f_files != -1 && (st->f_files & 0xffffffff00000000ULL)) return -EOVERFLOW; if (st->f_ffree != -1 && (st->f_ffree & 0xffffffff00000000ULL)) return -EOVERFLOW; } buf.f_type = st->f_type; buf.f_bsize = st->f_bsize; buf.f_blocks = st->f_blocks; buf.f_bfree = st->f_bfree; buf.f_bavail = st->f_bavail; buf.f_files = st->f_files; buf.f_ffree = st->f_ffree; buf.f_fsid = st->f_fsid; buf.f_namelen = st->f_namelen; buf.f_frsize = st->f_frsize; buf.f_flags = st->f_flags; } if (copy_to_user(p, &buf, sizeof(buf))) return -EFAULT; return 0; } static int do_statfs64(struct kstatfs *st, struct statfs64 __user *p) { struct statfs64 buf; if (sizeof(buf) == sizeof(*st)) memcpy(&buf, st, sizeof(*st)); else { memset(&buf, 0, sizeof(buf)); buf.f_type = st->f_type; buf.f_bsize = st->f_bsize; buf.f_blocks = st->f_blocks; buf.f_bfree = st->f_bfree; buf.f_bavail = st->f_bavail; buf.f_files = st->f_files; buf.f_ffree = st->f_ffree; buf.f_fsid = st->f_fsid; buf.f_namelen = st->f_namelen; buf.f_frsize = st->f_frsize; buf.f_flags = st->f_flags; } if (copy_to_user(p, &buf, sizeof(buf))) return -EFAULT; return 0; } SYSCALL_DEFINE2(statfs, const char __user *, pathname, struct statfs __user *, buf) { struct kstatfs st; int error = user_statfs(pathname, &st); if (!error) error = do_statfs_native(&st, buf); return error; } SYSCALL_DEFINE3(statfs64, const char __user *, pathname, size_t, sz, struct statfs64 __user *, buf) { struct kstatfs st; int error; if (sz != sizeof(*buf)) return -EINVAL; error = user_statfs(pathname, &st); if (!error) error = do_statfs64(&st, buf); return error; } SYSCALL_DEFINE2(fstatfs, unsigned int, fd, struct statfs __user *, buf) { struct kstatfs st; int error = fd_statfs(fd, &st); if (!error) error = do_statfs_native(&st, buf); return error; } SYSCALL_DEFINE3(fstatfs64, unsigned int, fd, size_t, sz, struct statfs64 __user *, buf) { struct kstatfs st; int error; if (sz != sizeof(*buf)) return -EINVAL; error = fd_statfs(fd, &st); if (!error) error = do_statfs64(&st, buf); return error; } static int vfs_ustat(dev_t dev, struct kstatfs *sbuf) { struct super_block *s = user_get_super(dev, false); int err; if (!s) return -EINVAL; err = statfs_by_dentry(s->s_root, sbuf); drop_super(s); return err; } SYSCALL_DEFINE2(ustat, unsigned, dev, struct ustat __user *, ubuf) { struct ustat tmp; struct kstatfs sbuf; int err = vfs_ustat(new_decode_dev(dev), &sbuf); if (err) return err; memset(&tmp,0,sizeof(struct ustat)); tmp.f_tfree = sbuf.f_bfree; if (IS_ENABLED(CONFIG_ARCH_32BIT_USTAT_F_TINODE)) tmp.f_tinode = min_t(u64, sbuf.f_ffree, UINT_MAX); else tmp.f_tinode = sbuf.f_ffree; return copy_to_user(ubuf, &tmp, sizeof(struct ustat)) ? -EFAULT : 0; } #ifdef CONFIG_COMPAT static int put_compat_statfs(struct compat_statfs __user *ubuf, struct kstatfs *kbuf) { struct compat_statfs buf; if (sizeof ubuf->f_blocks == 4) { if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail | kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL) return -EOVERFLOW; /* f_files and f_ffree may be -1; it's okay * to stuff that into 32 bits */ if (kbuf->f_files != 0xffffffffffffffffULL && (kbuf->f_files & 0xffffffff00000000ULL)) return -EOVERFLOW; if (kbuf->f_ffree != 0xffffffffffffffffULL && (kbuf->f_ffree & 0xffffffff00000000ULL)) return -EOVERFLOW; } memset(&buf, 0, sizeof(struct compat_statfs)); buf.f_type = kbuf->f_type; buf.f_bsize = kbuf->f_bsize; buf.f_blocks = kbuf->f_blocks; buf.f_bfree = kbuf->f_bfree; buf.f_bavail = kbuf->f_bavail; buf.f_files = kbuf->f_files; buf.f_ffree = kbuf->f_ffree; buf.f_namelen = kbuf->f_namelen; buf.f_fsid.val[0] = kbuf->f_fsid.val[0]; buf.f_fsid.val[1] = kbuf->f_fsid.val[1]; buf.f_frsize = kbuf->f_frsize; buf.f_flags = kbuf->f_flags; if (copy_to_user(ubuf, &buf, sizeof(struct compat_statfs))) return -EFAULT; return 0; } /* * The following statfs calls are copies of code from fs/statfs.c and * should be checked against those from time to time */ COMPAT_SYSCALL_DEFINE2(statfs, const char __user *, pathname, struct compat_statfs __user *, buf) { struct kstatfs tmp; int error = user_statfs(pathname, &tmp); if (!error) error = put_compat_statfs(buf, &tmp); return error; } COMPAT_SYSCALL_DEFINE2(fstatfs, unsigned int, fd, struct compat_statfs __user *, buf) { struct kstatfs tmp; int error = fd_statfs(fd, &tmp); if (!error) error = put_compat_statfs(buf, &tmp); return error; } static int put_compat_statfs64(struct compat_statfs64 __user *ubuf, struct kstatfs *kbuf) { struct compat_statfs64 buf; if ((kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL) return -EOVERFLOW; memset(&buf, 0, sizeof(struct compat_statfs64)); buf.f_type = kbuf->f_type; buf.f_bsize = kbuf->f_bsize; buf.f_blocks = kbuf->f_blocks; buf.f_bfree = kbuf->f_bfree; buf.f_bavail = kbuf->f_bavail; buf.f_files = kbuf->f_files; buf.f_ffree = kbuf->f_ffree; buf.f_namelen = kbuf->f_namelen; buf.f_fsid.val[0] = kbuf->f_fsid.val[0]; buf.f_fsid.val[1] = kbuf->f_fsid.val[1]; buf.f_frsize = kbuf->f_frsize; buf.f_flags = kbuf->f_flags; if (copy_to_user(ubuf, &buf, sizeof(struct compat_statfs64))) return -EFAULT; return 0; } int kcompat_sys_statfs64(const char __user * pathname, compat_size_t sz, struct compat_statfs64 __user * buf) { struct kstatfs tmp; int error; if (sz != sizeof(*buf)) return -EINVAL; error = user_statfs(pathname, &tmp); if (!error) error = put_compat_statfs64(buf, &tmp); return error; } COMPAT_SYSCALL_DEFINE3(statfs64, const char __user *, pathname, compat_size_t, sz, struct compat_statfs64 __user *, buf) { return kcompat_sys_statfs64(pathname, sz, buf); } int kcompat_sys_fstatfs64(unsigned int fd, compat_size_t sz, struct compat_statfs64 __user * buf) { struct kstatfs tmp; int error; if (sz != sizeof(*buf)) return -EINVAL; error = fd_statfs(fd, &tmp); if (!error) error = put_compat_statfs64(buf, &tmp); return error; } COMPAT_SYSCALL_DEFINE3(fstatfs64, unsigned int, fd, compat_size_t, sz, struct compat_statfs64 __user *, buf) { return kcompat_sys_fstatfs64(fd, sz, buf); } /* * This is a copy of sys_ustat, just dealing with a structure layout. * Given how simple this syscall is that apporach is more maintainable * than the various conversion hacks. */ COMPAT_SYSCALL_DEFINE2(ustat, unsigned, dev, struct compat_ustat __user *, u) { struct compat_ustat tmp; struct kstatfs sbuf; int err = vfs_ustat(new_decode_dev(dev), &sbuf); if (err) return err; memset(&tmp, 0, sizeof(struct compat_ustat)); tmp.f_tfree = sbuf.f_bfree; tmp.f_tinode = sbuf.f_ffree; if (copy_to_user(u, &tmp, sizeof(struct compat_ustat))) return -EFAULT; return 0; } #endif |
| 43 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_SCHED_TASK_STACK_H #define _LINUX_SCHED_TASK_STACK_H /* * task->stack (kernel stack) handling interfaces: */ #include <linux/sched.h> #include <linux/magic.h> #include <linux/refcount.h> #include <linux/kasan.h> #ifdef CONFIG_THREAD_INFO_IN_TASK /* * When accessing the stack of a non-current task that might exit, use * try_get_task_stack() instead. task_stack_page will return a pointer * that could get freed out from under you. */ static __always_inline void *task_stack_page(const struct task_struct *task) { return task->stack; } #define setup_thread_stack(new,old) do { } while(0) static __always_inline unsigned long *end_of_stack(const struct task_struct *task) { #ifdef CONFIG_STACK_GROWSUP return (unsigned long *)((unsigned long)task->stack + THREAD_SIZE) - 1; #else return task->stack; #endif } #else #define task_stack_page(task) ((void *)(task)->stack) static inline void setup_thread_stack(struct task_struct *p, struct task_struct *org) { *task_thread_info(p) = *task_thread_info(org); task_thread_info(p)->task = p; } /* * Return the address of the last usable long on the stack. * * When the stack grows down, this is just above the thread * info struct. Going any lower will corrupt the threadinfo. * * When the stack grows up, this is the highest address. * Beyond that position, we corrupt data on the next page. */ static inline unsigned long *end_of_stack(const struct task_struct *p) { #ifdef CONFIG_STACK_GROWSUP return (unsigned long *)((unsigned long)task_thread_info(p) + THREAD_SIZE) - 1; #else return (unsigned long *)(task_thread_info(p) + 1); #endif } #endif #ifdef CONFIG_THREAD_INFO_IN_TASK static inline void *try_get_task_stack(struct task_struct *tsk) { return refcount_inc_not_zero(&tsk->stack_refcount) ? task_stack_page(tsk) : NULL; } extern void put_task_stack(struct task_struct *tsk); #else static inline void *try_get_task_stack(struct task_struct *tsk) { return task_stack_page(tsk); } static inline void put_task_stack(struct task_struct *tsk) {} #endif void exit_task_stack_account(struct task_struct *tsk); #define task_stack_end_corrupted(task) \ (*(end_of_stack(task)) != STACK_END_MAGIC) static inline int object_is_on_stack(const void *obj) { void *stack = task_stack_page(current); obj = kasan_reset_tag(obj); return (obj >= stack) && (obj < (stack + THREAD_SIZE)); } extern void thread_stack_cache_init(void); #ifdef CONFIG_DEBUG_STACK_USAGE unsigned long stack_not_used(struct task_struct *p); #else static inline unsigned long stack_not_used(struct task_struct *p) { return 0; } #endif extern void set_task_stack_end_magic(struct task_struct *tsk); static inline int kstack_end(void *addr) { /* Reliable end of stack detection: * Some APM bios versions misalign the stack */ return !(((unsigned long)addr+sizeof(void*)-1) & (THREAD_SIZE-sizeof(void*))); } #endif /* _LINUX_SCHED_TASK_STACK_H */ |
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 | /* SPDX-License-Identifier: GPL-2.0-only */ /* * AppArmor security module * * This file contains AppArmor auditing function definitions. * * Copyright (C) 1998-2008 Novell/SUSE * Copyright 2009-2010 Canonical Ltd. */ #ifndef __AA_AUDIT_H #define __AA_AUDIT_H #include <linux/audit.h> #include <linux/fs.h> #include <linux/lsm_audit.h> #include <linux/sched.h> #include <linux/slab.h> #include "file.h" #include "label.h" extern const char *const audit_mode_names[]; #define AUDIT_MAX_INDEX 5 enum audit_mode { AUDIT_NORMAL, /* follow normal auditing of accesses */ AUDIT_QUIET_DENIED, /* quiet all denied access messages */ AUDIT_QUIET, /* quiet all messages */ AUDIT_NOQUIET, /* do not quiet audit messages */ AUDIT_ALL /* audit all accesses */ }; enum audit_type { AUDIT_APPARMOR_AUDIT, AUDIT_APPARMOR_ALLOWED, AUDIT_APPARMOR_DENIED, AUDIT_APPARMOR_HINT, AUDIT_APPARMOR_STATUS, AUDIT_APPARMOR_ERROR, AUDIT_APPARMOR_KILL, AUDIT_APPARMOR_AUTO }; #define OP_NULL NULL #define OP_SYSCTL "sysctl" #define OP_CAPABLE "capable" #define OP_UNLINK "unlink" #define OP_MKDIR "mkdir" #define OP_RMDIR "rmdir" #define OP_MKNOD "mknod" #define OP_TRUNC "truncate" #define OP_LINK "link" #define OP_SYMLINK "symlink" #define OP_RENAME_SRC "rename_src" #define OP_RENAME_DEST "rename_dest" #define OP_CHMOD "chmod" #define OP_CHOWN "chown" #define OP_GETATTR "getattr" #define OP_OPEN "open" #define OP_FRECEIVE "file_receive" #define OP_FPERM "file_perm" #define OP_FLOCK "file_lock" #define OP_FMMAP "file_mmap" #define OP_FMPROT "file_mprotect" #define OP_INHERIT "file_inherit" #define OP_PIVOTROOT "pivotroot" #define OP_MOUNT "mount" #define OP_UMOUNT "umount" #define OP_CREATE "create" #define OP_POST_CREATE "post_create" #define OP_BIND "bind" #define OP_CONNECT "connect" #define OP_LISTEN "listen" #define OP_ACCEPT "accept" #define OP_SENDMSG "sendmsg" #define OP_RECVMSG "recvmsg" #define OP_GETSOCKNAME "getsockname" #define OP_GETPEERNAME "getpeername" #define OP_GETSOCKOPT "getsockopt" #define OP_SETSOCKOPT "setsockopt" #define OP_SHUTDOWN "socket_shutdown" #define OP_PTRACE "ptrace" #define OP_SIGNAL "signal" #define OP_EXEC "exec" #define OP_CHANGE_HAT "change_hat" #define OP_CHANGE_PROFILE "change_profile" #define OP_CHANGE_ONEXEC "change_onexec" #define OP_STACK "stack" #define OP_STACK_ONEXEC "stack_onexec" #define OP_SETPROCATTR "setprocattr" #define OP_SETRLIMIT "setrlimit" #define OP_PROF_REPL "profile_replace" #define OP_PROF_LOAD "profile_load" #define OP_PROF_RM "profile_remove" #define OP_USERNS_CREATE "userns_create" #define OP_URING_OVERRIDE "uring_override" #define OP_URING_SQPOLL "uring_sqpoll" struct apparmor_audit_data { int error; int type; u16 class; const char *op; const struct cred *subj_cred; struct aa_label *subj_label; const char *name; const char *info; u32 request; u32 denied; union { /* these entries require a custom callback fn */ struct { struct aa_label *peer; union { struct { const char *target; kuid_t ouid; } fs; struct { int rlim; unsigned long max; } rlim; struct { int signal; int unmappedsig; }; struct { int type, protocol; void *addr; int addrlen; struct { void *addr; int addrlen; } peer; } net; }; }; struct { struct aa_profile *profile; const char *ns; long pos; } iface; struct { const char *src_name; const char *type; const char *trans; const char *data; unsigned long flags; } mnt; struct { struct aa_label *target; } uring; }; struct common_audit_data common; }; /* macros for dealing with apparmor_audit_data structure */ #define aad(SA) (container_of(SA, struct apparmor_audit_data, common)) #define aad_of_va(VA) aad((struct common_audit_data *)(VA)) #define DEFINE_AUDIT_DATA(NAME, T, C, X) \ /* TODO: cleanup audit init so we don't need _aad = {0,} */ \ struct apparmor_audit_data NAME = { \ .class = (C), \ .op = (X), \ .common.type = (T), \ .common.u.tsk = NULL, \ .common.apparmor_audit_data = &NAME, \ }; void aa_audit_msg(int type, struct apparmor_audit_data *ad, void (*cb) (struct audit_buffer *, void *)); int aa_audit(int type, struct aa_profile *profile, struct apparmor_audit_data *ad, void (*cb) (struct audit_buffer *, void *)); #define aa_audit_error(ERROR, AD, CB) \ ({ \ (AD)->error = (ERROR); \ aa_audit_msg(AUDIT_APPARMOR_ERROR, (AD), (CB)); \ (AD)->error; \ }) static inline int complain_error(int error) { if (error == -EPERM || error == -EACCES) return 0; return error; } void aa_audit_rule_free(void *vrule); int aa_audit_rule_init(u32 field, u32 op, char *rulestr, void **vrule, gfp_t gfp); int aa_audit_rule_known(struct audit_krule *rule); int aa_audit_rule_match(struct lsm_prop *prop, u32 field, u32 op, void *vrule); #endif /* __AA_AUDIT_H */ |
| 56 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Because linux/module.h has tracepoints in the header, and ftrace.h * used to include this file, define_trace.h includes linux/module.h * But we do not want the module.h to override the TRACE_SYSTEM macro * variable that define_trace.h is processing, so we only set it * when module events are being processed, which would happen when * CREATE_TRACE_POINTS is defined. */ #ifdef CREATE_TRACE_POINTS #undef TRACE_SYSTEM #define TRACE_SYSTEM module #endif #if !defined(_TRACE_MODULE_H) || defined(TRACE_HEADER_MULTI_READ) #define _TRACE_MODULE_H #include <linux/tracepoint.h> #ifdef CONFIG_MODULES struct module; #define show_module_flags(flags) __print_flags(flags, "", \ { (1UL << TAINT_PROPRIETARY_MODULE), "P" }, \ { (1UL << TAINT_OOT_MODULE), "O" }, \ { (1UL << TAINT_FORCED_MODULE), "F" }, \ { (1UL << TAINT_CRAP), "C" }, \ { (1UL << TAINT_UNSIGNED_MODULE), "E" }) TRACE_EVENT(module_load, TP_PROTO(struct module *mod), TP_ARGS(mod), TP_STRUCT__entry( __field( unsigned int, taints ) __string( name, mod->name ) ), TP_fast_assign( __entry->taints = mod->taints; __assign_str(name); ), TP_printk("%s %s", __get_str(name), show_module_flags(__entry->taints)) ); TRACE_EVENT(module_free, TP_PROTO(struct module *mod), TP_ARGS(mod), TP_STRUCT__entry( __string( name, mod->name ) ), TP_fast_assign( __assign_str(name); ), TP_printk("%s", __get_str(name)) ); #ifdef CONFIG_MODULE_UNLOAD /* trace_module_get/put are only used if CONFIG_MODULE_UNLOAD is defined */ DECLARE_EVENT_CLASS(module_refcnt, TP_PROTO(struct module *mod, unsigned long ip), TP_ARGS(mod, ip), TP_STRUCT__entry( __field( unsigned long, ip ) __field( int, refcnt ) __string( name, mod->name ) ), TP_fast_assign( __entry->ip = ip; __entry->refcnt = atomic_read(&mod->refcnt); __assign_str(name); ), TP_printk("%s call_site=%ps refcnt=%d", __get_str(name), (void *)__entry->ip, __entry->refcnt) ); DEFINE_EVENT(module_refcnt, module_get, TP_PROTO(struct module *mod, unsigned long ip), TP_ARGS(mod, ip) ); DEFINE_EVENT(module_refcnt, module_put, TP_PROTO(struct module *mod, unsigned long ip), TP_ARGS(mod, ip) ); #endif /* CONFIG_MODULE_UNLOAD */ TRACE_EVENT(module_request, TP_PROTO(char *name, bool wait, unsigned long ip), TP_ARGS(name, wait, ip), TP_STRUCT__entry( __field( unsigned long, ip ) __field( bool, wait ) __string( name, name ) ), TP_fast_assign( __entry->ip = ip; __entry->wait = wait; __assign_str(name); ), TP_printk("%s wait=%d call_site=%ps", __get_str(name), (int)__entry->wait, (void *)__entry->ip) ); #endif /* CONFIG_MODULES */ #endif /* _TRACE_MODULE_H */ /* This part must be outside protection */ #include <trace/define_trace.h> |
| 1 1 1 1 1 1 1 2 2 2 3 2 1 1 1 1 2 2 2 2 1 1 1 1 1 4 4 4 44 3 3 1 3 3 3 1 3 3 10 10 10 1 1 1 3 2 1 3 11 11 11 11 10 10 4 2 11 1 1 1 1 1 1 1 7 7 7 6 6 5 5 4 1 5 4 1 3 3 1 2 2 1 2 2 4 1 1 2 2 2 2 2 2 2 43 43 43 42 42 42 42 1 1 1 1 1 1 1 42 43 43 43 4 2 2 2 1 2 2 2 2 2 2 2 4 6 6 5 5 2 5 3 2 3 3 3 5 3 4 4 2 2 4 5 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 | // SPDX-License-Identifier: GPL-2.0 #include <linux/kernel.h> #include <linux/errno.h> #include <linux/fs.h> #include <linux/file.h> #include <linux/mm.h> #include <linux/slab.h> #include <linux/nospec.h> #include <linux/hugetlb.h> #include <linux/compat.h> #include <linux/io_uring.h> #include <linux/io_uring/cmd.h> #include <uapi/linux/io_uring.h> #include "filetable.h" #include "io_uring.h" #include "openclose.h" #include "rsrc.h" #include "memmap.h" #include "register.h" struct io_rsrc_update { struct file *file; u64 arg; u32 nr_args; u32 offset; }; static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov, struct page **last_hpage); /* only define max */ #define IORING_MAX_FIXED_FILES (1U << 20) #define IORING_MAX_REG_BUFFERS (1U << 14) #define IO_CACHED_BVECS_SEGS 32 int __io_account_mem(struct user_struct *user, unsigned long nr_pages) { unsigned long page_limit, cur_pages, new_pages; if (!nr_pages) return 0; /* Don't allow more pages than we can safely lock */ page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; cur_pages = atomic_long_read(&user->locked_vm); do { new_pages = cur_pages + nr_pages; if (new_pages > page_limit) return -ENOMEM; } while (!atomic_long_try_cmpxchg(&user->locked_vm, &cur_pages, new_pages)); return 0; } void io_unaccount_mem(struct user_struct *user, struct mm_struct *mm_account, unsigned long nr_pages) { if (user) __io_unaccount_mem(user, nr_pages); if (mm_account) atomic64_sub(nr_pages, &mm_account->pinned_vm); } int io_account_mem(struct user_struct *user, struct mm_struct *mm_account, unsigned long nr_pages) { int ret; if (user) { ret = __io_account_mem(user, nr_pages); if (ret) return ret; } if (mm_account) atomic64_add(nr_pages, &mm_account->pinned_vm); return 0; } int io_validate_user_buf_range(u64 uaddr, u64 ulen) { unsigned long tmp, base = (unsigned long)uaddr; unsigned long acct_len = (unsigned long)PAGE_ALIGN(ulen); /* arbitrary limit, but we need something */ if (ulen > SZ_1G || !ulen) return -EFAULT; if (check_add_overflow(base, acct_len, &tmp)) return -EOVERFLOW; return 0; } static int io_buffer_validate(struct iovec *iov) { /* * Don't impose further limits on the size and buffer * constraints here, we'll -EINVAL later when IO is * submitted if they are wrong. */ if (!iov->iov_base) return iov->iov_len ? -EFAULT : 0; return io_validate_user_buf_range((unsigned long)iov->iov_base, iov->iov_len); } static void io_release_ubuf(void *priv) { struct io_mapped_ubuf *imu = priv; unsigned int i; for (i = 0; i < imu->nr_bvecs; i++) { struct folio *folio = page_folio(imu->bvec[i].bv_page); unpin_user_folio(folio, 1); } } static struct io_mapped_ubuf *io_alloc_imu(struct io_ring_ctx *ctx, int nr_bvecs) { if (nr_bvecs <= IO_CACHED_BVECS_SEGS) return io_cache_alloc(&ctx->imu_cache, GFP_KERNEL); return kvmalloc(struct_size_t(struct io_mapped_ubuf, bvec, nr_bvecs), GFP_KERNEL); } static void io_free_imu(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu) { if (imu->nr_bvecs <= IO_CACHED_BVECS_SEGS) io_cache_free(&ctx->imu_cache, imu); else kvfree(imu); } static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu) { if (unlikely(refcount_read(&imu->refs) > 1)) { if (!refcount_dec_and_test(&imu->refs)) return; } if (imu->acct_pages) io_unaccount_mem(ctx->user, ctx->mm_account, imu->acct_pages); imu->release(imu->priv); io_free_imu(ctx, imu); } struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type) { struct io_rsrc_node *node; node = io_cache_alloc(&ctx->node_cache, GFP_KERNEL); if (node) { node->type = type; node->refs = 1; node->tag = 0; node->file_ptr = 0; } return node; } bool io_rsrc_cache_init(struct io_ring_ctx *ctx) { const int imu_cache_size = struct_size_t(struct io_mapped_ubuf, bvec, IO_CACHED_BVECS_SEGS); const int node_size = sizeof(struct io_rsrc_node); bool ret; ret = io_alloc_cache_init(&ctx->node_cache, IO_ALLOC_CACHE_MAX, node_size, 0); ret |= io_alloc_cache_init(&ctx->imu_cache, IO_ALLOC_CACHE_MAX, imu_cache_size, 0); return ret; } void io_rsrc_cache_free(struct io_ring_ctx *ctx) { io_alloc_cache_free(&ctx->node_cache, kfree); io_alloc_cache_free(&ctx->imu_cache, kfree); } static void io_clear_table_tags(struct io_rsrc_data *data) { int i; for (i = 0; i < data->nr; i++) { struct io_rsrc_node *node = data->nodes[i]; if (node) node->tag = 0; } } __cold void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data) { if (!data->nr) return; while (data->nr--) { if (data->nodes[data->nr]) io_put_rsrc_node(ctx, data->nodes[data->nr]); } kvfree(data->nodes); data->nodes = NULL; data->nr = 0; } __cold int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr) { data->nodes = kvmalloc_array(nr, sizeof(struct io_rsrc_node *), GFP_KERNEL_ACCOUNT | __GFP_ZERO); if (data->nodes) { data->nr = nr; return 0; } return -ENOMEM; } static int __io_sqe_files_update(struct io_ring_ctx *ctx, struct io_uring_rsrc_update2 *up, unsigned nr_args) { u64 __user *tags = u64_to_user_ptr(up->tags); __s32 __user *fds = u64_to_user_ptr(up->data); int fd, i, err = 0; unsigned int done; if (!ctx->file_table.data.nr) return -ENXIO; if (up->offset + nr_args > ctx->file_table.data.nr) return -EINVAL; for (done = 0; done < nr_args; done++) { u64 tag = 0; if ((tags && copy_from_user(&tag, &tags[done], sizeof(tag))) || copy_from_user(&fd, &fds[done], sizeof(fd))) { err = -EFAULT; break; } if ((fd == IORING_REGISTER_FILES_SKIP || fd == -1) && tag) { err = -EINVAL; break; } if (fd == IORING_REGISTER_FILES_SKIP) continue; i = up->offset + done; if (io_reset_rsrc_node(ctx, &ctx->file_table.data, i)) io_file_bitmap_clear(&ctx->file_table, i); if (fd != -1) { struct file *file = fget(fd); struct io_rsrc_node *node; if (!file) { err = -EBADF; break; } /* * Don't allow io_uring instances to be registered. */ if (io_is_uring_fops(file)) { fput(file); err = -EBADF; break; } node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE); if (!node) { err = -ENOMEM; fput(file); break; } ctx->file_table.data.nodes[i] = node; if (tag) node->tag = tag; io_fixed_file_set(node, file); io_file_bitmap_set(&ctx->file_table, i); } } return done ? done : err; } static int __io_sqe_buffers_update(struct io_ring_ctx *ctx, struct io_uring_rsrc_update2 *up, unsigned int nr_args) { u64 __user *tags = u64_to_user_ptr(up->tags); struct iovec fast_iov, *iov; struct page *last_hpage = NULL; struct iovec __user *uvec; u64 user_data = up->data; __u32 done; int i, err; if (!ctx->buf_table.nr) return -ENXIO; if (up->offset + nr_args > ctx->buf_table.nr) return -EINVAL; for (done = 0; done < nr_args; done++) { struct io_rsrc_node *node; u64 tag = 0; uvec = u64_to_user_ptr(user_data); iov = iovec_from_user(uvec, 1, 1, &fast_iov, ctx->compat); if (IS_ERR(iov)) { err = PTR_ERR(iov); break; } if (tags && copy_from_user(&tag, &tags[done], sizeof(tag))) { err = -EFAULT; break; } err = io_buffer_validate(iov); if (err) break; node = io_sqe_buffer_register(ctx, iov, &last_hpage); if (IS_ERR(node)) { err = PTR_ERR(node); break; } if (tag) { if (!node) { err = -EINVAL; break; } node->tag = tag; } i = array_index_nospec(up->offset + done, ctx->buf_table.nr); io_reset_rsrc_node(ctx, &ctx->buf_table, i); ctx->buf_table.nodes[i] = node; if (ctx->compat) user_data += sizeof(struct compat_iovec); else user_data += sizeof(struct iovec); } return done ? done : err; } static int __io_register_rsrc_update(struct io_ring_ctx *ctx, unsigned type, struct io_uring_rsrc_update2 *up, unsigned nr_args) { __u32 tmp; lockdep_assert_held(&ctx->uring_lock); if (check_add_overflow(up->offset, nr_args, &tmp)) return -EOVERFLOW; switch (type) { case IORING_RSRC_FILE: return __io_sqe_files_update(ctx, up, nr_args); case IORING_RSRC_BUFFER: return __io_sqe_buffers_update(ctx, up, nr_args); } return -EINVAL; } int io_register_files_update(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args) { struct io_uring_rsrc_update2 up; if (!nr_args) return -EINVAL; memset(&up, 0, sizeof(up)); if (copy_from_user(&up, arg, sizeof(struct io_uring_rsrc_update))) return -EFAULT; if (up.resv || up.resv2) return -EINVAL; return __io_register_rsrc_update(ctx, IORING_RSRC_FILE, &up, nr_args); } int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg, unsigned size, unsigned type) { struct io_uring_rsrc_update2 up; if (size != sizeof(up)) return -EINVAL; if (copy_from_user(&up, arg, sizeof(up))) return -EFAULT; if (!up.nr || up.resv || up.resv2) return -EINVAL; return __io_register_rsrc_update(ctx, type, &up, up.nr); } __cold int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg, unsigned int size, unsigned int type) { struct io_uring_rsrc_register rr; /* keep it extendible */ if (size != sizeof(rr)) return -EINVAL; memset(&rr, 0, sizeof(rr)); if (copy_from_user(&rr, arg, size)) return -EFAULT; if (!rr.nr || rr.resv2) return -EINVAL; if (rr.flags & ~IORING_RSRC_REGISTER_SPARSE) return -EINVAL; switch (type) { case IORING_RSRC_FILE: if (rr.flags & IORING_RSRC_REGISTER_SPARSE && rr.data) break; return io_sqe_files_register(ctx, u64_to_user_ptr(rr.data), rr.nr, u64_to_user_ptr(rr.tags)); case IORING_RSRC_BUFFER: if (rr.flags & IORING_RSRC_REGISTER_SPARSE && rr.data) break; return io_sqe_buffers_register(ctx, u64_to_user_ptr(rr.data), rr.nr, u64_to_user_ptr(rr.tags)); } return -EINVAL; } int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_rsrc_update *up = io_kiocb_to_cmd(req, struct io_rsrc_update); if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT))) return -EINVAL; if (sqe->rw_flags || sqe->splice_fd_in) return -EINVAL; up->offset = READ_ONCE(sqe->off); up->nr_args = READ_ONCE(sqe->len); if (!up->nr_args) return -EINVAL; up->arg = READ_ONCE(sqe->addr); return 0; } static int io_files_update_with_index_alloc(struct io_kiocb *req, unsigned int issue_flags) { struct io_rsrc_update *up = io_kiocb_to_cmd(req, struct io_rsrc_update); __s32 __user *fds = u64_to_user_ptr(up->arg); unsigned int done; struct file *file; int ret, fd; if (!req->ctx->file_table.data.nr) return -ENXIO; for (done = 0; done < up->nr_args; done++) { if (get_user(fd, &fds[done])) { ret = -EFAULT; break; } file = fget(fd); if (!file) { ret = -EBADF; break; } ret = io_fixed_fd_install(req, issue_flags, file, IORING_FILE_INDEX_ALLOC); if (ret < 0) break; if (put_user(ret, &fds[done])) { __io_close_fixed(req->ctx, issue_flags, ret); ret = -EFAULT; break; } } if (done) return done; return ret; } int io_files_update(struct io_kiocb *req, unsigned int issue_flags) { struct io_rsrc_update *up = io_kiocb_to_cmd(req, struct io_rsrc_update); struct io_ring_ctx *ctx = req->ctx; struct io_uring_rsrc_update2 up2; int ret; up2.offset = up->offset; up2.data = up->arg; up2.nr = 0; up2.tags = 0; up2.resv = 0; up2.resv2 = 0; if (up->offset == IORING_FILE_INDEX_ALLOC) { ret = io_files_update_with_index_alloc(req, issue_flags); } else { io_ring_submit_lock(ctx, issue_flags); ret = __io_register_rsrc_update(ctx, IORING_RSRC_FILE, &up2, up->nr_args); io_ring_submit_unlock(ctx, issue_flags); } if (ret < 0) req_set_fail(req); io_req_set_res(req, ret, 0); return IOU_COMPLETE; } void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node) { if (node->tag) io_post_aux_cqe(ctx, node->tag, 0, 0); switch (node->type) { case IORING_RSRC_FILE: fput(io_slot_file(node)); break; case IORING_RSRC_BUFFER: io_buffer_unmap(ctx, node->buf); break; default: WARN_ON_ONCE(1); break; } io_cache_free(&ctx->node_cache, node); } int io_sqe_files_unregister(struct io_ring_ctx *ctx) { if (!ctx->file_table.data.nr) return -ENXIO; io_free_file_tables(ctx, &ctx->file_table); io_file_table_set_alloc_range(ctx, 0, 0); return 0; } int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args, u64 __user *tags) { __s32 __user *fds = (__s32 __user *) arg; struct file *file; int fd, ret; unsigned i; if (ctx->file_table.data.nr) return -EBUSY; if (!nr_args) return -EINVAL; if (nr_args > IORING_MAX_FIXED_FILES) return -EMFILE; if (nr_args > rlimit(RLIMIT_NOFILE)) return -EMFILE; if (!io_alloc_file_tables(ctx, &ctx->file_table, nr_args)) return -ENOMEM; for (i = 0; i < nr_args; i++) { struct io_rsrc_node *node; u64 tag = 0; ret = -EFAULT; if (tags && copy_from_user(&tag, &tags[i], sizeof(tag))) goto fail; if (fds && copy_from_user(&fd, &fds[i], sizeof(fd))) goto fail; /* allow sparse sets */ if (!fds || fd == -1) { ret = -EINVAL; if (tag) goto fail; continue; } file = fget(fd); ret = -EBADF; if (unlikely(!file)) goto fail; /* * Don't allow io_uring instances to be registered. */ if (io_is_uring_fops(file)) { fput(file); goto fail; } ret = -ENOMEM; node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE); if (!node) { fput(file); goto fail; } if (tag) node->tag = tag; ctx->file_table.data.nodes[i] = node; io_fixed_file_set(node, file); io_file_bitmap_set(&ctx->file_table, i); } /* default it to the whole table */ io_file_table_set_alloc_range(ctx, 0, ctx->file_table.data.nr); return 0; fail: io_clear_table_tags(&ctx->file_table.data); io_sqe_files_unregister(ctx); return ret; } int io_sqe_buffers_unregister(struct io_ring_ctx *ctx) { if (!ctx->buf_table.nr) return -ENXIO; io_rsrc_data_free(ctx, &ctx->buf_table); return 0; } /* * Not super efficient, but this is just a registration time. And we do cache * the last compound head, so generally we'll only do a full search if we don't * match that one. * * We check if the given compound head page has already been accounted, to * avoid double accounting it. This allows us to account the full size of the * page, not just the constituent pages of a huge page. */ static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages, int nr_pages, struct page *hpage) { int i, j; /* check current page array */ for (i = 0; i < nr_pages; i++) { if (!PageCompound(pages[i])) continue; if (compound_head(pages[i]) == hpage) return true; } /* check previously registered pages */ for (i = 0; i < ctx->buf_table.nr; i++) { struct io_rsrc_node *node = ctx->buf_table.nodes[i]; struct io_mapped_ubuf *imu; if (!node) continue; imu = node->buf; for (j = 0; j < imu->nr_bvecs; j++) { if (!PageCompound(imu->bvec[j].bv_page)) continue; if (compound_head(imu->bvec[j].bv_page) == hpage) return true; } } return false; } static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages, int nr_pages, struct io_mapped_ubuf *imu, struct page **last_hpage) { int i, ret; imu->acct_pages = 0; for (i = 0; i < nr_pages; i++) { if (!PageCompound(pages[i])) { imu->acct_pages++; } else { struct page *hpage; hpage = compound_head(pages[i]); if (hpage == *last_hpage) continue; *last_hpage = hpage; if (headpage_already_acct(ctx, pages, i, hpage)) continue; imu->acct_pages += page_size(hpage) >> PAGE_SHIFT; } } if (!imu->acct_pages) return 0; ret = io_account_mem(ctx->user, ctx->mm_account, imu->acct_pages); if (ret) imu->acct_pages = 0; return ret; } static bool io_coalesce_buffer(struct page ***pages, int *nr_pages, struct io_imu_folio_data *data) { struct page **page_array = *pages, **new_array = NULL; unsigned nr_pages_left = *nr_pages; unsigned nr_folios = data->nr_folios; unsigned i, j; /* Store head pages only*/ new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL); if (!new_array) return false; for (i = 0, j = 0; i < nr_folios; i++) { struct page *p = compound_head(page_array[j]); struct folio *folio = page_folio(p); unsigned int nr; WARN_ON_ONCE(i > 0 && p != page_array[j]); nr = i ? data->nr_pages_mid : data->nr_pages_head; nr = min(nr, nr_pages_left); /* Drop all but one ref, the entire folio will remain pinned. */ if (nr > 1) unpin_user_folio(folio, nr - 1); j += nr; nr_pages_left -= nr; new_array[i] = p; } WARN_ON_ONCE(j != *nr_pages); kvfree(page_array); *pages = new_array; *nr_pages = nr_folios; return true; } bool io_check_coalesce_buffer(struct page **page_array, int nr_pages, struct io_imu_folio_data *data) { struct folio *folio = page_folio(page_array[0]); unsigned int count = 1, nr_folios = 1; int i; data->nr_pages_mid = folio_nr_pages(folio); data->folio_shift = folio_shift(folio); data->first_folio_page_idx = folio_page_idx(folio, page_array[0]); /* * Check if pages are contiguous inside a folio, and all folios have * the same page count except for the head and tail. */ for (i = 1; i < nr_pages; i++) { if (page_folio(page_array[i]) == folio && page_array[i] == page_array[i-1] + 1) { count++; continue; } if (nr_folios == 1) { if (folio_page_idx(folio, page_array[i-1]) != data->nr_pages_mid - 1) return false; data->nr_pages_head = count; } else if (count != data->nr_pages_mid) { return false; } folio = page_folio(page_array[i]); if (folio_size(folio) != (1UL << data->folio_shift) || folio_page_idx(folio, page_array[i]) != 0) return false; count = 1; nr_folios++; } if (nr_folios == 1) data->nr_pages_head = count; data->nr_folios = nr_folios; return true; } static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov, struct page **last_hpage) { struct io_mapped_ubuf *imu = NULL; struct page **pages = NULL; struct io_rsrc_node *node; unsigned long off; size_t size; int ret, nr_pages, i; struct io_imu_folio_data data; bool coalesced = false; if (!iov->iov_base) return NULL; node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER); if (!node) return ERR_PTR(-ENOMEM); ret = -ENOMEM; pages = io_pin_pages((unsigned long) iov->iov_base, iov->iov_len, &nr_pages); if (IS_ERR(pages)) { ret = PTR_ERR(pages); pages = NULL; goto done; } /* If it's huge page(s), try to coalesce them into fewer bvec entries */ if (nr_pages > 1 && io_check_coalesce_buffer(pages, nr_pages, &data)) { if (data.nr_pages_mid != 1) coalesced = io_coalesce_buffer(&pages, &nr_pages, &data); } imu = io_alloc_imu(ctx, nr_pages); if (!imu) goto done; imu->nr_bvecs = nr_pages; ret = io_buffer_account_pin(ctx, pages, nr_pages, imu, last_hpage); if (ret) goto done; size = iov->iov_len; /* store original address for later verification */ imu->ubuf = (unsigned long) iov->iov_base; imu->len = iov->iov_len; imu->folio_shift = PAGE_SHIFT; imu->release = io_release_ubuf; imu->priv = imu; imu->is_kbuf = false; imu->dir = IO_IMU_DEST | IO_IMU_SOURCE; if (coalesced) imu->folio_shift = data.folio_shift; refcount_set(&imu->refs, 1); off = (unsigned long)iov->iov_base & ~PAGE_MASK; if (coalesced) off += data.first_folio_page_idx << PAGE_SHIFT; node->buf = imu; ret = 0; for (i = 0; i < nr_pages; i++) { size_t vec_len; vec_len = min_t(size_t, size, (1UL << imu->folio_shift) - off); bvec_set_page(&imu->bvec[i], pages[i], vec_len, off); off = 0; size -= vec_len; } done: if (ret) { if (imu) io_free_imu(ctx, imu); if (pages) { for (i = 0; i < nr_pages; i++) unpin_user_folio(page_folio(pages[i]), 1); } io_cache_free(&ctx->node_cache, node); node = ERR_PTR(ret); } kvfree(pages); return node; } int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg, unsigned int nr_args, u64 __user *tags) { struct page *last_hpage = NULL; struct io_rsrc_data data; struct iovec fast_iov, *iov = &fast_iov; const struct iovec __user *uvec; int i, ret; BUILD_BUG_ON(IORING_MAX_REG_BUFFERS >= (1u << 16)); if (ctx->buf_table.nr) return -EBUSY; if (!nr_args || nr_args > IORING_MAX_REG_BUFFERS) return -EINVAL; ret = io_rsrc_data_alloc(&data, nr_args); if (ret) return ret; if (!arg) memset(iov, 0, sizeof(*iov)); for (i = 0; i < nr_args; i++) { struct io_rsrc_node *node; u64 tag = 0; if (arg) { uvec = (struct iovec __user *) arg; iov = iovec_from_user(uvec, 1, 1, &fast_iov, ctx->compat); if (IS_ERR(iov)) { ret = PTR_ERR(iov); break; } ret = io_buffer_validate(iov); if (ret) break; if (ctx->compat) arg += sizeof(struct compat_iovec); else arg += sizeof(struct iovec); } if (tags) { if (copy_from_user(&tag, &tags[i], sizeof(tag))) { ret = -EFAULT; break; } } node = io_sqe_buffer_register(ctx, iov, &last_hpage); if (IS_ERR(node)) { ret = PTR_ERR(node); break; } if (tag) { if (!node) { ret = -EINVAL; break; } node->tag = tag; } data.nodes[i] = node; } ctx->buf_table = data; if (ret) { io_clear_table_tags(&ctx->buf_table); io_sqe_buffers_unregister(ctx); } return ret; } int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq, void (*release)(void *), unsigned int index, unsigned int issue_flags) { struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx; struct io_rsrc_data *data = &ctx->buf_table; struct req_iterator rq_iter; struct io_mapped_ubuf *imu; struct io_rsrc_node *node; struct bio_vec bv; unsigned int nr_bvecs = 0; int ret = 0; io_ring_submit_lock(ctx, issue_flags); if (index >= data->nr) { ret = -EINVAL; goto unlock; } index = array_index_nospec(index, data->nr); if (data->nodes[index]) { ret = -EBUSY; goto unlock; } node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER); if (!node) { ret = -ENOMEM; goto unlock; } /* * blk_rq_nr_phys_segments() may overestimate the number of bvecs * but avoids needing to iterate over the bvecs */ imu = io_alloc_imu(ctx, blk_rq_nr_phys_segments(rq)); if (!imu) { kfree(node); ret = -ENOMEM; goto unlock; } imu->ubuf = 0; imu->len = blk_rq_bytes(rq); imu->acct_pages = 0; imu->folio_shift = PAGE_SHIFT; refcount_set(&imu->refs, 1); imu->release = release; imu->priv = rq; imu->is_kbuf = true; imu->dir = 1 << rq_data_dir(rq); rq_for_each_bvec(bv, rq, rq_iter) imu->bvec[nr_bvecs++] = bv; imu->nr_bvecs = nr_bvecs; node->buf = imu; data->nodes[index] = node; unlock: io_ring_submit_unlock(ctx, issue_flags); return ret; } EXPORT_SYMBOL_GPL(io_buffer_register_bvec); int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index, unsigned int issue_flags) { struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx; struct io_rsrc_data *data = &ctx->buf_table; struct io_rsrc_node *node; int ret = 0; io_ring_submit_lock(ctx, issue_flags); if (index >= data->nr) { ret = -EINVAL; goto unlock; } index = array_index_nospec(index, data->nr); node = data->nodes[index]; if (!node) { ret = -EINVAL; goto unlock; } if (!node->buf->is_kbuf) { ret = -EBUSY; goto unlock; } io_put_rsrc_node(ctx, node); data->nodes[index] = NULL; unlock: io_ring_submit_unlock(ctx, issue_flags); return ret; } EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec); static int validate_fixed_range(u64 buf_addr, size_t len, const struct io_mapped_ubuf *imu) { u64 buf_end; if (unlikely(check_add_overflow(buf_addr, (u64)len, &buf_end))) return -EFAULT; /* not inside the mapped region */ if (unlikely(buf_addr < imu->ubuf || buf_end > (imu->ubuf + imu->len))) return -EFAULT; if (unlikely(len > MAX_RW_COUNT)) return -EFAULT; return 0; } static int io_import_kbuf(int ddir, struct iov_iter *iter, struct io_mapped_ubuf *imu, size_t len, size_t offset) { size_t count = len + offset; iov_iter_bvec(iter, ddir, imu->bvec, imu->nr_bvecs, count); iov_iter_advance(iter, offset); if (count < imu->len) { const struct bio_vec *bvec = iter->bvec; len += iter->iov_offset; while (len > bvec->bv_len) { len -= bvec->bv_len; bvec++; } iter->nr_segs = 1 + bvec - iter->bvec; } return 0; } static int io_import_fixed(int ddir, struct iov_iter *iter, struct io_mapped_ubuf *imu, u64 buf_addr, size_t len) { const struct bio_vec *bvec; size_t folio_mask; unsigned nr_segs; size_t offset; int ret; ret = validate_fixed_range(buf_addr, len, imu); if (unlikely(ret)) return ret; if (!(imu->dir & (1 << ddir))) return -EFAULT; offset = buf_addr - imu->ubuf; if (imu->is_kbuf) return io_import_kbuf(ddir, iter, imu, len, offset); /* * Don't use iov_iter_advance() here, as it's really slow for * using the latter parts of a big fixed buffer - it iterates * over each segment manually. We can cheat a bit here for user * registered nodes, because we know that: * * 1) it's a BVEC iter, we set it up * 2) all bvecs are the same in size, except potentially the * first and last bvec */ folio_mask = (1UL << imu->folio_shift) - 1; bvec = imu->bvec; if (offset >= bvec->bv_len) { unsigned long seg_skip; /* skip first vec */ offset -= bvec->bv_len; seg_skip = 1 + (offset >> imu->folio_shift); bvec += seg_skip; offset &= folio_mask; } nr_segs = (offset + len + bvec->bv_offset + folio_mask) >> imu->folio_shift; iov_iter_bvec(iter, ddir, bvec, nr_segs, len); iter->iov_offset = offset; return 0; } inline struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req, unsigned issue_flags) { struct io_ring_ctx *ctx = req->ctx; struct io_rsrc_node *node; if (req->flags & REQ_F_BUF_NODE) return req->buf_node; req->flags |= REQ_F_BUF_NODE; io_ring_submit_lock(ctx, issue_flags); node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index); if (node) { node->refs++; req->buf_node = node; io_ring_submit_unlock(ctx, issue_flags); return node; } req->flags &= ~REQ_F_BUF_NODE; io_ring_submit_unlock(ctx, issue_flags); return NULL; } int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter, u64 buf_addr, size_t len, int ddir, unsigned issue_flags) { struct io_rsrc_node *node; node = io_find_buf_node(req, issue_flags); if (!node) return -EFAULT; return io_import_fixed(ddir, iter, node->buf, buf_addr, len); } /* Lock two rings at once. The rings must be different! */ static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *ctx2) { if (ctx1 > ctx2) swap(ctx1, ctx2); mutex_lock(&ctx1->uring_lock); mutex_lock_nested(&ctx2->uring_lock, SINGLE_DEPTH_NESTING); } /* Both rings are locked by the caller. */ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx, struct io_uring_clone_buffers *arg) { struct io_rsrc_data data; int i, ret, off, nr; unsigned int nbufs; lockdep_assert_held(&ctx->uring_lock); lockdep_assert_held(&src_ctx->uring_lock); /* * Accounting state is shared between the two rings; that only works if * both rings are accounted towards the same counters. */ if (ctx->user != src_ctx->user || ctx->mm_account != src_ctx->mm_account) return -EINVAL; /* if offsets are given, must have nr specified too */ if (!arg->nr && (arg->dst_off || arg->src_off)) return -EINVAL; /* not allowed unless REPLACE is set */ if (ctx->buf_table.nr && !(arg->flags & IORING_REGISTER_DST_REPLACE)) return -EBUSY; nbufs = src_ctx->buf_table.nr; if (!nbufs) return -ENXIO; if (!arg->nr) arg->nr = nbufs; else if (arg->nr > nbufs) return -EINVAL; else if (arg->nr > IORING_MAX_REG_BUFFERS) return -EINVAL; if (check_add_overflow(arg->nr, arg->src_off, &off) || off > nbufs) return -EOVERFLOW; if (check_add_overflow(arg->nr, arg->dst_off, &nbufs)) return -EOVERFLOW; if (nbufs > IORING_MAX_REG_BUFFERS) return -EINVAL; ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.nr)); if (ret) return ret; /* Copy original dst nodes from before the cloned range */ for (i = 0; i < min(arg->dst_off, ctx->buf_table.nr); i++) { struct io_rsrc_node *node = ctx->buf_table.nodes[i]; if (node) { data.nodes[i] = node; node->refs++; } } off = arg->dst_off; i = arg->src_off; nr = arg->nr; while (nr--) { struct io_rsrc_node *dst_node, *src_node; src_node = io_rsrc_node_lookup(&src_ctx->buf_table, i); if (!src_node) { dst_node = NULL; } else { dst_node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER); if (!dst_node) { io_rsrc_data_free(ctx, &data); return -ENOMEM; } refcount_inc(&src_node->buf->refs); dst_node->buf = src_node->buf; } data.nodes[off++] = dst_node; i++; } /* Copy original dst nodes from after the cloned range */ for (i = nbufs; i < ctx->buf_table.nr; i++) { struct io_rsrc_node *node = ctx->buf_table.nodes[i]; if (node) { data.nodes[i] = node; node->refs++; } } /* * If asked for replace, put the old table. data->nodes[] holds both * old and new nodes at this point. */ if (arg->flags & IORING_REGISTER_DST_REPLACE) io_rsrc_data_free(ctx, &ctx->buf_table); /* * ctx->buf_table must be empty now - either the contents are being * replaced and we just freed the table, or the contents are being * copied to a ring that does not have buffers yet (checked at function * entry). */ WARN_ON_ONCE(ctx->buf_table.nr); ctx->buf_table = data; return 0; } /* * Copy the registered buffers from the source ring whose file descriptor * is given in the src_fd to the current ring. This is identical to registering * the buffers with ctx, except faster as mappings already exist. * * Since the memory is already accounted once, don't account it again. */ int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg) { struct io_uring_clone_buffers buf; struct io_ring_ctx *src_ctx; bool registered_src; struct file *file; int ret; if (copy_from_user(&buf, arg, sizeof(buf))) return -EFAULT; if (buf.flags & ~(IORING_REGISTER_SRC_REGISTERED|IORING_REGISTER_DST_REPLACE)) return -EINVAL; if (!(buf.flags & IORING_REGISTER_DST_REPLACE) && ctx->buf_table.nr) return -EBUSY; if (memchr_inv(buf.pad, 0, sizeof(buf.pad))) return -EINVAL; registered_src = (buf.flags & IORING_REGISTER_SRC_REGISTERED) != 0; file = io_uring_register_get_file(buf.src_fd, registered_src); if (IS_ERR(file)) return PTR_ERR(file); src_ctx = file->private_data; if (src_ctx != ctx) { mutex_unlock(&ctx->uring_lock); lock_two_rings(ctx, src_ctx); if (src_ctx->submitter_task && src_ctx->submitter_task != current) { ret = -EEXIST; goto out; } } ret = io_clone_buffers(ctx, src_ctx, &buf); out: if (src_ctx != ctx) mutex_unlock(&src_ctx->uring_lock); fput(file); return ret; } void io_vec_free(struct iou_vec *iv) { if (!iv->iovec) return; kfree(iv->iovec); iv->iovec = NULL; iv->nr = 0; } int io_vec_realloc(struct iou_vec *iv, unsigned nr_entries) { gfp_t gfp = GFP_KERNEL | __GFP_NOWARN; struct iovec *iov; iov = kmalloc_array(nr_entries, sizeof(iov[0]), gfp); if (!iov) return -ENOMEM; io_vec_free(iv); iv->iovec = iov; iv->nr = nr_entries; return 0; } static int io_vec_fill_bvec(int ddir, struct iov_iter *iter, struct io_mapped_ubuf *imu, struct iovec *iovec, unsigned nr_iovs, struct iou_vec *vec) { unsigned long folio_size = 1 << imu->folio_shift; unsigned long folio_mask = folio_size - 1; struct bio_vec *res_bvec = vec->bvec; size_t total_len = 0; unsigned bvec_idx = 0; unsigned iov_idx; for (iov_idx = 0; iov_idx < nr_iovs; iov_idx++) { size_t iov_len = iovec[iov_idx].iov_len; u64 buf_addr = (u64)(uintptr_t)iovec[iov_idx].iov_base; struct bio_vec *src_bvec; size_t offset; int ret; ret = validate_fixed_range(buf_addr, iov_len, imu); if (unlikely(ret)) return ret; if (unlikely(!iov_len)) return -EFAULT; if (unlikely(check_add_overflow(total_len, iov_len, &total_len))) return -EOVERFLOW; offset = buf_addr - imu->ubuf; /* * Only the first bvec can have non zero bv_offset, account it * here and work with full folios below. */ offset += imu->bvec[0].bv_offset; src_bvec = imu->bvec + (offset >> imu->folio_shift); offset &= folio_mask; for (; iov_len; offset = 0, bvec_idx++, src_bvec++) { size_t seg_size = min_t(size_t, iov_len, folio_size - offset); bvec_set_page(&res_bvec[bvec_idx], src_bvec->bv_page, seg_size, offset); iov_len -= seg_size; } } if (total_len > MAX_RW_COUNT) return -EINVAL; iov_iter_bvec(iter, ddir, res_bvec, bvec_idx, total_len); return 0; } static int io_estimate_bvec_size(struct iovec *iov, unsigned nr_iovs, struct io_mapped_ubuf *imu) { unsigned shift = imu->folio_shift; size_t max_segs = 0; unsigned i; for (i = 0; i < nr_iovs; i++) { max_segs += (iov[i].iov_len >> shift) + 2; if (max_segs > INT_MAX) return -EOVERFLOW; } return max_segs; } static int io_vec_fill_kern_bvec(int ddir, struct iov_iter *iter, struct io_mapped_ubuf *imu, struct iovec *iovec, unsigned nr_iovs, struct iou_vec *vec) { const struct bio_vec *src_bvec = imu->bvec; struct bio_vec *res_bvec = vec->bvec; unsigned res_idx = 0; size_t total_len = 0; unsigned iov_idx; for (iov_idx = 0; iov_idx < nr_iovs; iov_idx++) { size_t offset = (size_t)(uintptr_t)iovec[iov_idx].iov_base; size_t iov_len = iovec[iov_idx].iov_len; struct bvec_iter bi = { .bi_size = offset + iov_len, }; struct bio_vec bv; bvec_iter_advance(src_bvec, &bi, offset); for_each_mp_bvec(bv, src_bvec, bi, bi) res_bvec[res_idx++] = bv; total_len += iov_len; } iov_iter_bvec(iter, ddir, res_bvec, res_idx, total_len); return 0; } static int iov_kern_bvec_size(const struct iovec *iov, const struct io_mapped_ubuf *imu, unsigned int *nr_seg) { size_t offset = (size_t)(uintptr_t)iov->iov_base; const struct bio_vec *bvec = imu->bvec; int start = 0, i = 0; size_t off = 0; int ret; ret = validate_fixed_range(offset, iov->iov_len, imu); if (unlikely(ret)) return ret; for (i = 0; off < offset + iov->iov_len && i < imu->nr_bvecs; off += bvec[i].bv_len, i++) { if (offset >= off && offset < off + bvec[i].bv_len) start = i; } *nr_seg = i - start; return 0; } static int io_kern_bvec_size(struct iovec *iov, unsigned nr_iovs, struct io_mapped_ubuf *imu, unsigned *nr_segs) { unsigned max_segs = 0; size_t total_len = 0; unsigned i; int ret; *nr_segs = 0; for (i = 0; i < nr_iovs; i++) { if (unlikely(!iov[i].iov_len)) return -EFAULT; if (unlikely(check_add_overflow(total_len, iov[i].iov_len, &total_len))) return -EOVERFLOW; ret = iov_kern_bvec_size(&iov[i], imu, &max_segs); if (unlikely(ret)) return ret; *nr_segs += max_segs; } if (total_len > MAX_RW_COUNT) return -EINVAL; return 0; } int io_import_reg_vec(int ddir, struct iov_iter *iter, struct io_kiocb *req, struct iou_vec *vec, unsigned nr_iovs, unsigned issue_flags) { struct io_rsrc_node *node; struct io_mapped_ubuf *imu; unsigned iovec_off; struct iovec *iov; unsigned nr_segs; node = io_find_buf_node(req, issue_flags); if (!node) return -EFAULT; imu = node->buf; if (!(imu->dir & (1 << ddir))) return -EFAULT; iovec_off = vec->nr - nr_iovs; iov = vec->iovec + iovec_off; if (imu->is_kbuf) { int ret = io_kern_bvec_size(iov, nr_iovs, imu, &nr_segs); if (unlikely(ret)) return ret; } else { int ret = io_estimate_bvec_size(iov, nr_iovs, imu); if (ret < 0) return ret; nr_segs = ret; } if (sizeof(struct bio_vec) > sizeof(struct iovec)) { size_t bvec_bytes; bvec_bytes = nr_segs * sizeof(struct bio_vec); nr_segs = (bvec_bytes + sizeof(*iov) - 1) / sizeof(*iov); nr_segs += nr_iovs; } if (nr_segs > vec->nr) { struct iou_vec tmp_vec = {}; int ret; ret = io_vec_realloc(&tmp_vec, nr_segs); if (ret) return ret; iovec_off = tmp_vec.nr - nr_iovs; memcpy(tmp_vec.iovec + iovec_off, iov, sizeof(*iov) * nr_iovs); io_vec_free(vec); *vec = tmp_vec; iov = vec->iovec + iovec_off; req->flags |= REQ_F_NEED_CLEANUP; } if (imu->is_kbuf) return io_vec_fill_kern_bvec(ddir, iter, imu, iov, nr_iovs, vec); return io_vec_fill_bvec(ddir, iter, imu, iov, nr_iovs, vec); } int io_prep_reg_iovec(struct io_kiocb *req, struct iou_vec *iv, const struct iovec __user *uvec, size_t uvec_segs) { struct iovec *iov; int iovec_off, ret; void *res; if (uvec_segs > iv->nr) { ret = io_vec_realloc(iv, uvec_segs); if (ret) return ret; req->flags |= REQ_F_NEED_CLEANUP; } /* pad iovec to the right */ iovec_off = iv->nr - uvec_segs; iov = iv->iovec + iovec_off; res = iovec_from_user(uvec, uvec_segs, uvec_segs, iov, io_is_compat(req->ctx)); if (IS_ERR(res)) return PTR_ERR(res); req->flags |= REQ_F_IMPORT_BUFFER; return 0; } |
| 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (c) 2016 Mellanox Technologies. All rights reserved. * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com> */ #include "devl_internal.h" struct devlink_linecard { struct list_head list; struct devlink *devlink; unsigned int index; const struct devlink_linecard_ops *ops; void *priv; enum devlink_linecard_state state; struct mutex state_lock; /* Protects state */ const char *type; struct devlink_linecard_type *types; unsigned int types_count; u32 rel_index; }; unsigned int devlink_linecard_index(struct devlink_linecard *linecard) { return linecard->index; } static struct devlink_linecard * devlink_linecard_get_by_index(struct devlink *devlink, unsigned int linecard_index) { struct devlink_linecard *devlink_linecard; list_for_each_entry(devlink_linecard, &devlink->linecard_list, list) { if (devlink_linecard->index == linecard_index) return devlink_linecard; } return NULL; } static bool devlink_linecard_index_exists(struct devlink *devlink, unsigned int linecard_index) { return devlink_linecard_get_by_index(devlink, linecard_index); } static struct devlink_linecard * devlink_linecard_get_from_attrs(struct devlink *devlink, struct nlattr **attrs) { if (attrs[DEVLINK_ATTR_LINECARD_INDEX]) { u32 linecard_index = nla_get_u32(attrs[DEVLINK_ATTR_LINECARD_INDEX]); struct devlink_linecard *linecard; linecard = devlink_linecard_get_by_index(devlink, linecard_index); if (!linecard) return ERR_PTR(-ENODEV); return linecard; } return ERR_PTR(-EINVAL); } static struct devlink_linecard * devlink_linecard_get_from_info(struct devlink *devlink, struct genl_info *info) { return devlink_linecard_get_from_attrs(devlink, info->attrs); } struct devlink_linecard_type { const char *type; const void *priv; }; static int devlink_nl_linecard_fill(struct sk_buff *msg, struct devlink *devlink, struct devlink_linecard *linecard, enum devlink_command cmd, u32 portid, u32 seq, int flags, struct netlink_ext_ack *extack) { struct devlink_linecard_type *linecard_type; struct nlattr *attr; void *hdr; int i; hdr = genlmsg_put(msg, portid, seq, &devlink_nl_family, flags, cmd); if (!hdr) return -EMSGSIZE; if (devlink_nl_put_handle(msg, devlink)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_LINECARD_INDEX, linecard->index)) goto nla_put_failure; if (nla_put_u8(msg, DEVLINK_ATTR_LINECARD_STATE, linecard->state)) goto nla_put_failure; if (linecard->type && nla_put_string(msg, DEVLINK_ATTR_LINECARD_TYPE, linecard->type)) goto nla_put_failure; if (linecard->types_count) { attr = nla_nest_start(msg, DEVLINK_ATTR_LINECARD_SUPPORTED_TYPES); if (!attr) goto nla_put_failure; for (i = 0; i < linecard->types_count; i++) { linecard_type = &linecard->types[i]; if (nla_put_string(msg, DEVLINK_ATTR_LINECARD_TYPE, linecard_type->type)) { nla_nest_cancel(msg, attr); goto nla_put_failure; } } nla_nest_end(msg, attr); } if (devlink_rel_devlink_handle_put(msg, devlink, linecard->rel_index, DEVLINK_ATTR_NESTED_DEVLINK, NULL)) goto nla_put_failure; genlmsg_end(msg, hdr); return 0; nla_put_failure: genlmsg_cancel(msg, hdr); return -EMSGSIZE; } static void devlink_linecard_notify(struct devlink_linecard *linecard, enum devlink_command cmd) { struct devlink *devlink = linecard->devlink; struct sk_buff *msg; int err; WARN_ON(cmd != DEVLINK_CMD_LINECARD_NEW && cmd != DEVLINK_CMD_LINECARD_DEL); if (!__devl_is_registered(devlink) || !devlink_nl_notify_need(devlink)) return; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return; err = devlink_nl_linecard_fill(msg, devlink, linecard, cmd, 0, 0, 0, NULL); if (err) { nlmsg_free(msg); return; } devlink_nl_notify_send(devlink, msg); } void devlink_linecards_notify_register(struct devlink *devlink) { struct devlink_linecard *linecard; list_for_each_entry(linecard, &devlink->linecard_list, list) devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_NEW); } void devlink_linecards_notify_unregister(struct devlink *devlink) { struct devlink_linecard *linecard; list_for_each_entry_reverse(linecard, &devlink->linecard_list, list) devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_DEL); } int devlink_nl_linecard_get_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink *devlink = info->user_ptr[0]; struct devlink_linecard *linecard; struct sk_buff *msg; int err; linecard = devlink_linecard_get_from_info(devlink, info); if (IS_ERR(linecard)) return PTR_ERR(linecard); msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; mutex_lock(&linecard->state_lock); err = devlink_nl_linecard_fill(msg, devlink, linecard, DEVLINK_CMD_LINECARD_NEW, info->snd_portid, info->snd_seq, 0, info->extack); mutex_unlock(&linecard->state_lock); if (err) { nlmsg_free(msg); return err; } return genlmsg_reply(msg, info); } static int devlink_nl_linecard_get_dump_one(struct sk_buff *msg, struct devlink *devlink, struct netlink_callback *cb, int flags) { struct devlink_nl_dump_state *state = devlink_dump_state(cb); struct devlink_linecard *linecard; int idx = 0; int err = 0; list_for_each_entry(linecard, &devlink->linecard_list, list) { if (idx < state->idx) { idx++; continue; } mutex_lock(&linecard->state_lock); err = devlink_nl_linecard_fill(msg, devlink, linecard, DEVLINK_CMD_LINECARD_NEW, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, flags, cb->extack); mutex_unlock(&linecard->state_lock); if (err) { state->idx = idx; break; } idx++; } return err; } int devlink_nl_linecard_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) { return devlink_nl_dumpit(skb, cb, devlink_nl_linecard_get_dump_one); } static struct devlink_linecard_type * devlink_linecard_type_lookup(struct devlink_linecard *linecard, const char *type) { struct devlink_linecard_type *linecard_type; int i; for (i = 0; i < linecard->types_count; i++) { linecard_type = &linecard->types[i]; if (!strcmp(type, linecard_type->type)) return linecard_type; } return NULL; } static int devlink_linecard_type_set(struct devlink_linecard *linecard, const char *type, struct netlink_ext_ack *extack) { const struct devlink_linecard_ops *ops = linecard->ops; struct devlink_linecard_type *linecard_type; int err; mutex_lock(&linecard->state_lock); if (linecard->state == DEVLINK_LINECARD_STATE_PROVISIONING) { NL_SET_ERR_MSG(extack, "Line card is currently being provisioned"); err = -EBUSY; goto out; } if (linecard->state == DEVLINK_LINECARD_STATE_UNPROVISIONING) { NL_SET_ERR_MSG(extack, "Line card is currently being unprovisioned"); err = -EBUSY; goto out; } linecard_type = devlink_linecard_type_lookup(linecard, type); if (!linecard_type) { NL_SET_ERR_MSG(extack, "Unsupported line card type provided"); err = -EINVAL; goto out; } if (linecard->state != DEVLINK_LINECARD_STATE_UNPROVISIONED && linecard->state != DEVLINK_LINECARD_STATE_PROVISIONING_FAILED) { NL_SET_ERR_MSG(extack, "Line card already provisioned"); err = -EBUSY; /* Check if the line card is provisioned in the same * way the user asks. In case it is, make the operation * to return success. */ if (ops->same_provision && ops->same_provision(linecard, linecard->priv, linecard_type->type, linecard_type->priv)) err = 0; goto out; } linecard->state = DEVLINK_LINECARD_STATE_PROVISIONING; linecard->type = linecard_type->type; devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_NEW); mutex_unlock(&linecard->state_lock); err = ops->provision(linecard, linecard->priv, linecard_type->type, linecard_type->priv, extack); if (err) { /* Provisioning failed. Assume the linecard is unprovisioned * for future operations. */ mutex_lock(&linecard->state_lock); linecard->state = DEVLINK_LINECARD_STATE_UNPROVISIONED; linecard->type = NULL; devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_NEW); mutex_unlock(&linecard->state_lock); } return err; out: mutex_unlock(&linecard->state_lock); return err; } static int devlink_linecard_type_unset(struct devlink_linecard *linecard, struct netlink_ext_ack *extack) { int err; mutex_lock(&linecard->state_lock); if (linecard->state == DEVLINK_LINECARD_STATE_PROVISIONING) { NL_SET_ERR_MSG(extack, "Line card is currently being provisioned"); err = -EBUSY; goto out; } if (linecard->state == DEVLINK_LINECARD_STATE_UNPROVISIONING) { NL_SET_ERR_MSG(extack, "Line card is currently being unprovisioned"); err = -EBUSY; goto out; } if (linecard->state == DEVLINK_LINECARD_STATE_PROVISIONING_FAILED) { linecard->state = DEVLINK_LINECARD_STATE_UNPROVISIONED; linecard->type = NULL; devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_NEW); err = 0; goto out; } if (linecard->state == DEVLINK_LINECARD_STATE_UNPROVISIONED) { NL_SET_ERR_MSG(extack, "Line card is not provisioned"); err = 0; goto out; } linecard->state = DEVLINK_LINECARD_STATE_UNPROVISIONING; devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_NEW); mutex_unlock(&linecard->state_lock); err = linecard->ops->unprovision(linecard, linecard->priv, extack); if (err) { /* Unprovisioning failed. Assume the linecard is unprovisioned * for future operations. */ mutex_lock(&linecard->state_lock); linecard->state = DEVLINK_LINECARD_STATE_UNPROVISIONED; linecard->type = NULL; devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_NEW); mutex_unlock(&linecard->state_lock); } return err; out: mutex_unlock(&linecard->state_lock); return err; } int devlink_nl_linecard_set_doit(struct sk_buff *skb, struct genl_info *info) { struct netlink_ext_ack *extack = info->extack; struct devlink *devlink = info->user_ptr[0]; struct devlink_linecard *linecard; int err; linecard = devlink_linecard_get_from_info(devlink, info); if (IS_ERR(linecard)) return PTR_ERR(linecard); if (info->attrs[DEVLINK_ATTR_LINECARD_TYPE]) { const char *type; type = nla_data(info->attrs[DEVLINK_ATTR_LINECARD_TYPE]); if (strcmp(type, "")) { err = devlink_linecard_type_set(linecard, type, extack); if (err) return err; } else { err = devlink_linecard_type_unset(linecard, extack); if (err) return err; } } return 0; } static int devlink_linecard_types_init(struct devlink_linecard *linecard) { struct devlink_linecard_type *linecard_type; unsigned int count; int i; count = linecard->ops->types_count(linecard, linecard->priv); linecard->types = kmalloc_array(count, sizeof(*linecard_type), GFP_KERNEL); if (!linecard->types) return -ENOMEM; linecard->types_count = count; for (i = 0; i < count; i++) { linecard_type = &linecard->types[i]; linecard->ops->types_get(linecard, linecard->priv, i, &linecard_type->type, &linecard_type->priv); } return 0; } static void devlink_linecard_types_fini(struct devlink_linecard *linecard) { kfree(linecard->types); } /** * devl_linecard_create - Create devlink linecard * * @devlink: devlink * @linecard_index: driver-specific numerical identifier of the linecard * @ops: linecards ops * @priv: user priv pointer * * Create devlink linecard instance with provided linecard index. * Caller can use any indexing, even hw-related one. * * Return: Line card structure or an ERR_PTR() encoded error code. */ struct devlink_linecard * devl_linecard_create(struct devlink *devlink, unsigned int linecard_index, const struct devlink_linecard_ops *ops, void *priv) { struct devlink_linecard *linecard; int err; if (WARN_ON(!ops || !ops->provision || !ops->unprovision || !ops->types_count || !ops->types_get)) return ERR_PTR(-EINVAL); if (devlink_linecard_index_exists(devlink, linecard_index)) return ERR_PTR(-EEXIST); linecard = kzalloc(sizeof(*linecard), GFP_KERNEL); if (!linecard) return ERR_PTR(-ENOMEM); linecard->devlink = devlink; linecard->index = linecard_index; linecard->ops = ops; linecard->priv = priv; linecard->state = DEVLINK_LINECARD_STATE_UNPROVISIONED; mutex_init(&linecard->state_lock); err = devlink_linecard_types_init(linecard); if (err) { mutex_destroy(&linecard->state_lock); kfree(linecard); return ERR_PTR(err); } list_add_tail(&linecard->list, &devlink->linecard_list); devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_NEW); return linecard; } EXPORT_SYMBOL_GPL(devl_linecard_create); /** * devl_linecard_destroy - Destroy devlink linecard * * @linecard: devlink linecard */ void devl_linecard_destroy(struct devlink_linecard *linecard) { devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_DEL); list_del(&linecard->list); devlink_linecard_types_fini(linecard); mutex_destroy(&linecard->state_lock); kfree(linecard); } EXPORT_SYMBOL_GPL(devl_linecard_destroy); /** * devlink_linecard_provision_set - Set provisioning on linecard * * @linecard: devlink linecard * @type: linecard type * * This is either called directly from the provision() op call or * as a result of the provision() op call asynchronously. */ void devlink_linecard_provision_set(struct devlink_linecard *linecard, const char *type) { mutex_lock(&linecard->state_lock); WARN_ON(linecard->type && strcmp(linecard->type, type)); linecard->state = DEVLINK_LINECARD_STATE_PROVISIONED; linecard->type = type; devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_NEW); mutex_unlock(&linecard->state_lock); } EXPORT_SYMBOL_GPL(devlink_linecard_provision_set); /** * devlink_linecard_provision_clear - Clear provisioning on linecard * * @linecard: devlink linecard * * This is either called directly from the unprovision() op call or * as a result of the unprovision() op call asynchronously. */ void devlink_linecard_provision_clear(struct devlink_linecard *linecard) { mutex_lock(&linecard->state_lock); linecard->state = DEVLINK_LINECARD_STATE_UNPROVISIONED; linecard->type = NULL; devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_NEW); mutex_unlock(&linecard->state_lock); } EXPORT_SYMBOL_GPL(devlink_linecard_provision_clear); /** * devlink_linecard_provision_fail - Fail provisioning on linecard * * @linecard: devlink linecard * * This is either called directly from the provision() op call or * as a result of the provision() op call asynchronously. */ void devlink_linecard_provision_fail(struct devlink_linecard *linecard) { mutex_lock(&linecard->state_lock); linecard->state = DEVLINK_LINECARD_STATE_PROVISIONING_FAILED; devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_NEW); mutex_unlock(&linecard->state_lock); } EXPORT_SYMBOL_GPL(devlink_linecard_provision_fail); /** * devlink_linecard_activate - Set linecard active * * @linecard: devlink linecard */ void devlink_linecard_activate(struct devlink_linecard *linecard) { mutex_lock(&linecard->state_lock); WARN_ON(linecard->state != DEVLINK_LINECARD_STATE_PROVISIONED); linecard->state = DEVLINK_LINECARD_STATE_ACTIVE; devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_NEW); mutex_unlock(&linecard->state_lock); } EXPORT_SYMBOL_GPL(devlink_linecard_activate); /** * devlink_linecard_deactivate - Set linecard inactive * * @linecard: devlink linecard */ void devlink_linecard_deactivate(struct devlink_linecard *linecard) { mutex_lock(&linecard->state_lock); switch (linecard->state) { case DEVLINK_LINECARD_STATE_ACTIVE: linecard->state = DEVLINK_LINECARD_STATE_PROVISIONED; devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_NEW); break; case DEVLINK_LINECARD_STATE_UNPROVISIONING: /* Line card is being deactivated as part * of unprovisioning flow. */ break; default: WARN_ON(1); break; } mutex_unlock(&linecard->state_lock); } EXPORT_SYMBOL_GPL(devlink_linecard_deactivate); static void devlink_linecard_rel_notify_cb(struct devlink *devlink, u32 linecard_index) { struct devlink_linecard *linecard; linecard = devlink_linecard_get_by_index(devlink, linecard_index); if (!linecard) return; devlink_linecard_notify(linecard, DEVLINK_CMD_LINECARD_NEW); } static void devlink_linecard_rel_cleanup_cb(struct devlink *devlink, u32 linecard_index, u32 rel_index) { struct devlink_linecard *linecard; linecard = devlink_linecard_get_by_index(devlink, linecard_index); if (linecard && linecard->rel_index == rel_index) linecard->rel_index = 0; } /** * devlink_linecard_nested_dl_set - Attach/detach nested devlink * instance to linecard. * * @linecard: devlink linecard * @nested_devlink: devlink instance to attach or NULL to detach */ int devlink_linecard_nested_dl_set(struct devlink_linecard *linecard, struct devlink *nested_devlink) { return devlink_rel_nested_in_add(&linecard->rel_index, linecard->devlink->index, linecard->index, devlink_linecard_rel_notify_cb, devlink_linecard_rel_cleanup_cb, nested_devlink); } EXPORT_SYMBOL_GPL(devlink_linecard_nested_dl_set); |
| 522 13 4 5 3 1 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_HIGHMEM_INTERNAL_H #define _LINUX_HIGHMEM_INTERNAL_H /* * Outside of CONFIG_HIGHMEM to support X86 32bit iomap_atomic() cruft. */ #ifdef CONFIG_KMAP_LOCAL void *__kmap_local_pfn_prot(unsigned long pfn, pgprot_t prot); void *__kmap_local_page_prot(const struct page *page, pgprot_t prot); void kunmap_local_indexed(const void *vaddr); void kmap_local_fork(struct task_struct *tsk); void __kmap_local_sched_out(void); void __kmap_local_sched_in(void); static inline void kmap_assert_nomap(void) { DEBUG_LOCKS_WARN_ON(current->kmap_ctrl.idx); } #else static inline void kmap_local_fork(struct task_struct *tsk) { } static inline void kmap_assert_nomap(void) { } #endif #ifdef CONFIG_HIGHMEM #include <asm/highmem.h> #ifndef ARCH_HAS_KMAP_FLUSH_TLB static inline void kmap_flush_tlb(unsigned long addr) { } #endif #ifndef kmap_prot #define kmap_prot PAGE_KERNEL #endif void *kmap_high(struct page *page); void kunmap_high(const struct page *page); void __kmap_flush_unused(void); struct page *__kmap_to_page(void *addr); static inline void *kmap(struct page *page) { void *addr; might_sleep(); if (!PageHighMem(page)) addr = page_address(page); else addr = kmap_high(page); kmap_flush_tlb((unsigned long)addr); return addr; } static inline void kunmap(const struct page *page) { might_sleep(); if (!PageHighMem(page)) return; kunmap_high(page); } static inline struct page *kmap_to_page(void *addr) { return __kmap_to_page(addr); } static inline void kmap_flush_unused(void) { __kmap_flush_unused(); } static inline void *kmap_local_page(const struct page *page) { return __kmap_local_page_prot(page, kmap_prot); } static inline void *kmap_local_page_try_from_panic(const struct page *page) { if (!PageHighMem(page)) return page_address(page); /* If the page is in HighMem, it's not safe to kmap it.*/ return NULL; } static inline void *kmap_local_folio(const struct folio *folio, size_t offset) { const struct page *page = folio_page(folio, offset / PAGE_SIZE); return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE; } static inline void *kmap_local_page_prot(const struct page *page, pgprot_t prot) { return __kmap_local_page_prot(page, prot); } static inline void *kmap_local_pfn(unsigned long pfn) { return __kmap_local_pfn_prot(pfn, kmap_prot); } static inline void __kunmap_local(const void *vaddr) { kunmap_local_indexed(vaddr); } static inline void *kmap_atomic_prot(const struct page *page, pgprot_t prot) { if (IS_ENABLED(CONFIG_PREEMPT_RT)) migrate_disable(); else preempt_disable(); pagefault_disable(); return __kmap_local_page_prot(page, prot); } static inline void *kmap_atomic(const struct page *page) { return kmap_atomic_prot(page, kmap_prot); } static inline void *kmap_atomic_pfn(unsigned long pfn) { if (IS_ENABLED(CONFIG_PREEMPT_RT)) migrate_disable(); else preempt_disable(); pagefault_disable(); return __kmap_local_pfn_prot(pfn, kmap_prot); } static inline void __kunmap_atomic(const void *addr) { kunmap_local_indexed(addr); pagefault_enable(); if (IS_ENABLED(CONFIG_PREEMPT_RT)) migrate_enable(); else preempt_enable(); } unsigned long __nr_free_highpages(void); unsigned long __totalhigh_pages(void); static inline unsigned long nr_free_highpages(void) { return __nr_free_highpages(); } static inline unsigned long totalhigh_pages(void) { return __totalhigh_pages(); } static inline bool is_kmap_addr(const void *x) { unsigned long addr = (unsigned long)x; return (addr >= PKMAP_ADDR(0) && addr < PKMAP_ADDR(LAST_PKMAP)) || (addr >= __fix_to_virt(FIX_KMAP_END) && addr < __fix_to_virt(FIX_KMAP_BEGIN)); } #else /* CONFIG_HIGHMEM */ static inline struct page *kmap_to_page(void *addr) { return virt_to_page(addr); } static inline void *kmap(struct page *page) { might_sleep(); return page_address(page); } static inline void kunmap_high(const struct page *page) { } static inline void kmap_flush_unused(void) { } static inline void kunmap(const struct page *page) { #ifdef ARCH_HAS_FLUSH_ON_KUNMAP kunmap_flush_on_unmap(page_address(page)); #endif } static inline void *kmap_local_page(const struct page *page) { return page_address(page); } static inline void *kmap_local_page_try_from_panic(const struct page *page) { return page_address(page); } static inline void *kmap_local_folio(const struct folio *folio, size_t offset) { return folio_address(folio) + offset; } static inline void *kmap_local_page_prot(const struct page *page, pgprot_t prot) { return kmap_local_page(page); } static inline void *kmap_local_pfn(unsigned long pfn) { return kmap_local_page(pfn_to_page(pfn)); } static inline void __kunmap_local(const void *addr) { #ifdef ARCH_HAS_FLUSH_ON_KUNMAP kunmap_flush_on_unmap(PTR_ALIGN_DOWN(addr, PAGE_SIZE)); #endif } static inline void *kmap_atomic(const struct page *page) { if (IS_ENABLED(CONFIG_PREEMPT_RT)) migrate_disable(); else preempt_disable(); pagefault_disable(); return page_address(page); } static inline void *kmap_atomic_prot(const struct page *page, pgprot_t prot) { return kmap_atomic(page); } static inline void *kmap_atomic_pfn(unsigned long pfn) { return kmap_atomic(pfn_to_page(pfn)); } static inline void __kunmap_atomic(const void *addr) { #ifdef ARCH_HAS_FLUSH_ON_KUNMAP kunmap_flush_on_unmap(PTR_ALIGN_DOWN(addr, PAGE_SIZE)); #endif pagefault_enable(); if (IS_ENABLED(CONFIG_PREEMPT_RT)) migrate_enable(); else preempt_enable(); } static inline unsigned long nr_free_highpages(void) { return 0; } static inline unsigned long totalhigh_pages(void) { return 0; } static inline bool is_kmap_addr(const void *x) { return false; } #endif /* CONFIG_HIGHMEM */ /** * kunmap_atomic - Unmap the virtual address mapped by kmap_atomic() - deprecated! * @__addr: Virtual address to be unmapped * * Unmaps an address previously mapped by kmap_atomic() and re-enables * pagefaults. Depending on PREEMP_RT configuration, re-enables also * migration and preemption. Users should not count on these side effects. * * Mappings should be unmapped in the reverse order that they were mapped. * See kmap_local_page() for details on nesting. * * @__addr can be any address within the mapped page, so there is no need * to subtract any offset that has been added. In contrast to kunmap(), * this function takes the address returned from kmap_atomic(), not the * page passed to it. The compiler will warn you if you pass the page. */ #define kunmap_atomic(__addr) \ do { \ BUILD_BUG_ON(__same_type((__addr), struct page *)); \ __kunmap_atomic(__addr); \ } while (0) /** * kunmap_local - Unmap a page mapped via kmap_local_page(). * @__addr: An address within the page mapped * * @__addr can be any address within the mapped page. Commonly it is the * address return from kmap_local_page(), but it can also include offsets. * * Unmapping should be done in the reverse order of the mapping. See * kmap_local_page() for details. */ #define kunmap_local(__addr) \ do { \ BUILD_BUG_ON(__same_type((__addr), struct page *)); \ __kunmap_local(__addr); \ } while (0) #endif |
| 1 1 1 4 4 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 | #ifndef IOU_ALLOC_CACHE_H #define IOU_ALLOC_CACHE_H #include <linux/io_uring_types.h> /* * Don't allow the cache to grow beyond this size. */ #define IO_ALLOC_CACHE_MAX 128 void io_alloc_cache_free(struct io_alloc_cache *cache, void (*free)(const void *)); bool io_alloc_cache_init(struct io_alloc_cache *cache, unsigned max_nr, unsigned int size, unsigned int init_bytes); void *io_cache_alloc_new(struct io_alloc_cache *cache, gfp_t gfp); static inline bool io_alloc_cache_put(struct io_alloc_cache *cache, void *entry) { if (cache->nr_cached < cache->max_cached) { if (!kasan_mempool_poison_object(entry)) return false; cache->entries[cache->nr_cached++] = entry; return true; } return false; } static inline void *io_alloc_cache_get(struct io_alloc_cache *cache) { if (cache->nr_cached) { void *entry = cache->entries[--cache->nr_cached]; /* * If KASAN is enabled, always clear the initial bytes that * must be zeroed post alloc, in case any of them overlap * with KASAN storage. */ #if defined(CONFIG_KASAN) kasan_mempool_unpoison_object(entry, cache->elem_size); if (cache->init_clear) memset(entry, 0, cache->init_clear); #endif return entry; } return NULL; } static inline void *io_cache_alloc(struct io_alloc_cache *cache, gfp_t gfp) { void *obj; obj = io_alloc_cache_get(cache); if (obj) return obj; return io_cache_alloc_new(cache, gfp); } static inline void io_cache_free(struct io_alloc_cache *cache, void *obj) { if (!io_alloc_cache_put(cache, obj)) kfree(obj); } #endif |
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 | /* SPDX-License-Identifier: GPL-2.0 */ /* * fs-verity: read-only file-based authenticity protection * * This header declares the interface between the fs/verity/ support layer and * filesystems that support fs-verity. * * Copyright 2019 Google LLC */ #ifndef _LINUX_FSVERITY_H #define _LINUX_FSVERITY_H #include <linux/fs.h> #include <linux/mm.h> #include <crypto/hash_info.h> #include <crypto/sha2.h> #include <uapi/linux/fsverity.h> /* * Largest digest size among all hash algorithms supported by fs-verity. * Currently assumed to be <= size of fsverity_descriptor::root_hash. */ #define FS_VERITY_MAX_DIGEST_SIZE SHA512_DIGEST_SIZE /* Arbitrary limit to bound the kmalloc() size. Can be changed. */ #define FS_VERITY_MAX_DESCRIPTOR_SIZE 16384 struct fsverity_info; /* Verity operations for filesystems */ struct fsverity_operations { /** * The offset of the pointer to struct fsverity_info in the * filesystem-specific part of the inode, relative to the beginning of * the common part of the inode (the 'struct inode'). */ ptrdiff_t inode_info_offs; /** * Begin enabling verity on the given file. * * @filp: a readonly file descriptor for the file * * The filesystem must do any needed filesystem-specific preparations * for enabling verity, e.g. evicting inline data. It also must return * -EBUSY if verity is already being enabled on the given file. * * i_rwsem is held for write. * * Return: 0 on success, -errno on failure */ int (*begin_enable_verity)(struct file *filp); /** * End enabling verity on the given file. * * @filp: a readonly file descriptor for the file * @desc: the verity descriptor to write, or NULL on failure * @desc_size: size of verity descriptor, or 0 on failure * @merkle_tree_size: total bytes the Merkle tree took up * * If desc == NULL, then enabling verity failed and the filesystem only * must do any necessary cleanups. Else, it must also store the given * verity descriptor to a fs-specific location associated with the inode * and do any fs-specific actions needed to mark the inode as a verity * inode, e.g. setting a bit in the on-disk inode. The filesystem is * also responsible for setting the S_VERITY flag in the VFS inode. * * i_rwsem is held for write, but it may have been dropped between * ->begin_enable_verity() and ->end_enable_verity(). * * Return: 0 on success, -errno on failure */ int (*end_enable_verity)(struct file *filp, const void *desc, size_t desc_size, u64 merkle_tree_size); /** * Get the verity descriptor of the given inode. * * @inode: an inode with the S_VERITY flag set * @buf: buffer in which to place the verity descriptor * @bufsize: size of @buf, or 0 to retrieve the size only * * If bufsize == 0, then the size of the verity descriptor is returned. * Otherwise the verity descriptor is written to 'buf' and its actual * size is returned; -ERANGE is returned if it's too large. This may be * called by multiple processes concurrently on the same inode. * * Return: the size on success, -errno on failure */ int (*get_verity_descriptor)(struct inode *inode, void *buf, size_t bufsize); /** * Read a Merkle tree page of the given inode. * * @inode: the inode * @index: 0-based index of the page within the Merkle tree * @num_ra_pages: The number of Merkle tree pages that should be * prefetched starting at @index if the page at @index * isn't already cached. Implementations may ignore this * argument; it's only a performance optimization. * * This can be called at any time on an open verity file. It may be * called by multiple processes concurrently, even with the same page. * * Note that this must retrieve a *page*, not necessarily a *block*. * * Return: the page on success, ERR_PTR() on failure */ struct page *(*read_merkle_tree_page)(struct inode *inode, pgoff_t index, unsigned long num_ra_pages); /** * Write a Merkle tree block to the given inode. * * @inode: the inode for which the Merkle tree is being built * @buf: the Merkle tree block to write * @pos: the position of the block in the Merkle tree (in bytes) * @size: the Merkle tree block size (in bytes) * * This is only called between ->begin_enable_verity() and * ->end_enable_verity(). * * Return: 0 on success, -errno on failure */ int (*write_merkle_tree_block)(struct inode *inode, const void *buf, u64 pos, unsigned int size); }; #ifdef CONFIG_FS_VERITY /* * Returns the address of the verity info pointer within the filesystem-specific * part of the inode. (To save memory on filesystems that don't support * fsverity, a field in 'struct inode' itself is no longer used.) */ static inline struct fsverity_info ** fsverity_info_addr(const struct inode *inode) { VFS_WARN_ON_ONCE(inode->i_sb->s_vop->inode_info_offs == 0); return (void *)inode + inode->i_sb->s_vop->inode_info_offs; } static inline struct fsverity_info *fsverity_get_info(const struct inode *inode) { /* * Since this function can be called on inodes belonging to filesystems * that don't support fsverity at all, and fsverity_info_addr() doesn't * work on such filesystems, we have to start with an IS_VERITY() check. * Checking IS_VERITY() here is also useful to minimize the overhead of * fsverity_active() on non-verity files. */ if (!IS_VERITY(inode)) return NULL; /* * Pairs with the cmpxchg_release() in fsverity_set_info(). I.e., * another task may publish the inode's verity info concurrently, * executing a RELEASE barrier. Use smp_load_acquire() here to safely * ACQUIRE the memory the other task published. */ return smp_load_acquire(fsverity_info_addr(inode)); } /* enable.c */ int fsverity_ioctl_enable(struct file *filp, const void __user *arg); /* measure.c */ int fsverity_ioctl_measure(struct file *filp, void __user *arg); int fsverity_get_digest(struct inode *inode, u8 raw_digest[FS_VERITY_MAX_DIGEST_SIZE], u8 *alg, enum hash_algo *halg); /* open.c */ int __fsverity_file_open(struct inode *inode, struct file *filp); int __fsverity_prepare_setattr(struct dentry *dentry, struct iattr *attr); void __fsverity_cleanup_inode(struct inode *inode); /** * fsverity_cleanup_inode() - free the inode's verity info, if present * @inode: an inode being evicted * * Filesystems must call this on inode eviction to free the inode's verity info. */ static inline void fsverity_cleanup_inode(struct inode *inode) { /* * Only IS_VERITY() inodes can have verity info, so start by checking * for IS_VERITY() (which is faster than retrieving the pointer to the * verity info). This minimizes overhead for non-verity inodes. */ if (IS_VERITY(inode)) __fsverity_cleanup_inode(inode); else VFS_WARN_ON_ONCE(*fsverity_info_addr(inode) != NULL); } /* read_metadata.c */ int fsverity_ioctl_read_metadata(struct file *filp, const void __user *uarg); /* verify.c */ bool fsverity_verify_blocks(struct folio *folio, size_t len, size_t offset); void fsverity_verify_bio(struct bio *bio); void fsverity_enqueue_verify_work(struct work_struct *work); #else /* !CONFIG_FS_VERITY */ static inline struct fsverity_info *fsverity_get_info(const struct inode *inode) { return NULL; } /* enable.c */ static inline int fsverity_ioctl_enable(struct file *filp, const void __user *arg) { return -EOPNOTSUPP; } /* measure.c */ static inline int fsverity_ioctl_measure(struct file *filp, void __user *arg) { return -EOPNOTSUPP; } static inline int fsverity_get_digest(struct inode *inode, u8 raw_digest[FS_VERITY_MAX_DIGEST_SIZE], u8 *alg, enum hash_algo *halg) { /* * fsverity is not enabled in the kernel configuration, so always report * that the file doesn't have fsverity enabled (digest size 0). */ return 0; } /* open.c */ static inline int __fsverity_file_open(struct inode *inode, struct file *filp) { return -EOPNOTSUPP; } static inline int __fsverity_prepare_setattr(struct dentry *dentry, struct iattr *attr) { return -EOPNOTSUPP; } static inline void fsverity_cleanup_inode(struct inode *inode) { } /* read_metadata.c */ static inline int fsverity_ioctl_read_metadata(struct file *filp, const void __user *uarg) { return -EOPNOTSUPP; } /* verify.c */ static inline bool fsverity_verify_blocks(struct folio *folio, size_t len, size_t offset) { WARN_ON_ONCE(1); return false; } static inline void fsverity_verify_bio(struct bio *bio) { WARN_ON_ONCE(1); } static inline void fsverity_enqueue_verify_work(struct work_struct *work) { WARN_ON_ONCE(1); } #endif /* !CONFIG_FS_VERITY */ static inline bool fsverity_verify_folio(struct folio *folio) { return fsverity_verify_blocks(folio, folio_size(folio), 0); } static inline bool fsverity_verify_page(struct page *page) { return fsverity_verify_blocks(page_folio(page), PAGE_SIZE, 0); } /** * fsverity_active() - do reads from the inode need to go through fs-verity? * @inode: inode to check * * This checks whether the inode's verity info has been set. * * Filesystems call this from ->readahead() to check whether the pages need to * be verified or not. Don't use IS_VERITY() for this purpose; it's subject to * a race condition where the file is being read concurrently with * FS_IOC_ENABLE_VERITY completing. (S_VERITY is set before the verity info.) * * Return: true if reads need to go through fs-verity, otherwise false */ static inline bool fsverity_active(const struct inode *inode) { return fsverity_get_info(inode) != NULL; } /** * fsverity_file_open() - prepare to open a verity file * @inode: the inode being opened * @filp: the struct file being set up * * When opening a verity file, deny the open if it is for writing. Otherwise, * set up the inode's verity info if not already done. * * When combined with fscrypt, this must be called after fscrypt_file_open(). * Otherwise, we won't have the key set up to decrypt the verity metadata. * * Return: 0 on success, -errno on failure */ static inline int fsverity_file_open(struct inode *inode, struct file *filp) { if (IS_VERITY(inode)) return __fsverity_file_open(inode, filp); return 0; } /** * fsverity_prepare_setattr() - prepare to change a verity inode's attributes * @dentry: dentry through which the inode is being changed * @attr: attributes to change * * Verity files are immutable, so deny truncates. This isn't covered by the * open-time check because sys_truncate() takes a path, not a file descriptor. * * Return: 0 on success, -errno on failure */ static inline int fsverity_prepare_setattr(struct dentry *dentry, struct iattr *attr) { if (IS_VERITY(d_inode(dentry))) return __fsverity_prepare_setattr(dentry, attr); return 0; } #endif /* _LINUX_FSVERITY_H */ |
| 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 5 5 5 5 2 5 5 5 4 5 1 4 5 5 5 4 1 3 4 1 3 3 3 3 3 3 1 3 2 5 3 3 3 3 3 3 3 3 3 1 3 3 2 3 3 3 2 3 3 3 2 3 5 5 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 5 5 3 2 3 3 3 3 2 3 3 3 6 3 6 6 6 6 6 4 5 5 5 4 1 3 5 2 3 4 4 5 1 2 1 2 5 5 1 1 1 1 1 1 1 1 6 6 6 6 6 6 6 3 1 1 1 1 1 2 5 5 5 5 5 5 2 3 3 3 3 6 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 | /* * usbmidi.c - ALSA USB MIDI driver * * Copyright (c) 2002-2009 Clemens Ladisch * All rights reserved. * * Based on the OSS usb-midi driver by NAGANO Daisuke, * NetBSD's umidi driver by Takuya SHIOZAKI, * the "USB Device Class Definition for MIDI Devices" by Roland * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions, and the following disclaimer, * without modification. * 2. The name of the author may not be used to endorse or promote products * derived from this software without specific prior written permission. * * Alternatively, this software may be distributed and/or modified under the * terms of the GNU General Public License as published by the Free Software * Foundation; either version 2 of the License, or (at your option) any later * version. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include <linux/kernel.h> #include <linux/types.h> #include <linux/bitops.h> #include <linux/interrupt.h> #include <linux/spinlock.h> #include <linux/string.h> #include <linux/init.h> #include <linux/slab.h> #include <linux/timer.h> #include <linux/usb.h> #include <linux/wait.h> #include <linux/usb/audio.h> #include <linux/usb/midi.h> #include <linux/module.h> #include <sound/core.h> #include <sound/control.h> #include <sound/rawmidi.h> #include <sound/asequencer.h> #include "usbaudio.h" #include "midi.h" #include "power.h" #include "helper.h" /* * define this to log all USB packets */ /* #define DUMP_PACKETS */ /* * how long to wait after some USB errors, so that hub_wq can disconnect() us * without too many spurious errors */ #define ERROR_DELAY_JIFFIES (HZ / 10) #define OUTPUT_URBS 7 #define INPUT_URBS 7 MODULE_AUTHOR("Clemens Ladisch <clemens@ladisch.de>"); MODULE_DESCRIPTION("USB Audio/MIDI helper module"); MODULE_LICENSE("Dual BSD/GPL"); struct snd_usb_midi_in_endpoint; struct snd_usb_midi_out_endpoint; struct snd_usb_midi_endpoint; struct usb_protocol_ops { void (*input)(struct snd_usb_midi_in_endpoint*, uint8_t*, int); void (*output)(struct snd_usb_midi_out_endpoint *ep, struct urb *urb); void (*output_packet)(struct urb*, uint8_t, uint8_t, uint8_t, uint8_t); void (*init_out_endpoint)(struct snd_usb_midi_out_endpoint *); void (*finish_out_endpoint)(struct snd_usb_midi_out_endpoint *); }; struct snd_usb_midi { struct usb_device *dev; struct snd_card *card; struct usb_interface *iface; const struct snd_usb_audio_quirk *quirk; struct snd_rawmidi *rmidi; const struct usb_protocol_ops *usb_protocol_ops; struct list_head list; struct timer_list error_timer; spinlock_t disc_lock; struct rw_semaphore disc_rwsem; struct mutex mutex; u32 usb_id; int next_midi_device; struct snd_usb_midi_endpoint { struct snd_usb_midi_out_endpoint *out; struct snd_usb_midi_in_endpoint *in; } endpoints[MIDI_MAX_ENDPOINTS]; unsigned long input_triggered; unsigned int opened[2]; unsigned char disconnected; unsigned char input_running; struct snd_kcontrol *roland_load_ctl; }; struct snd_usb_midi_out_endpoint { struct snd_usb_midi *umidi; struct out_urb_context { struct urb *urb; struct snd_usb_midi_out_endpoint *ep; } urbs[OUTPUT_URBS]; unsigned int active_urbs; unsigned int drain_urbs; int max_transfer; /* size of urb buffer */ struct work_struct work; unsigned int next_urb; spinlock_t buffer_lock; struct usbmidi_out_port { struct snd_usb_midi_out_endpoint *ep; struct snd_rawmidi_substream *substream; int active; uint8_t cable; /* cable number << 4 */ uint8_t state; #define STATE_UNKNOWN 0 #define STATE_1PARAM 1 #define STATE_2PARAM_1 2 #define STATE_2PARAM_2 3 #define STATE_SYSEX_0 4 #define STATE_SYSEX_1 5 #define STATE_SYSEX_2 6 uint8_t data[2]; } ports[0x10]; int current_port; wait_queue_head_t drain_wait; }; struct snd_usb_midi_in_endpoint { struct snd_usb_midi *umidi; struct urb *urbs[INPUT_URBS]; struct usbmidi_in_port { struct snd_rawmidi_substream *substream; u8 running_status_length; } ports[0x10]; u8 seen_f5; bool in_sysex; u8 last_cin; u8 error_resubmit; int current_port; }; static void snd_usbmidi_do_output(struct snd_usb_midi_out_endpoint *ep); static const uint8_t snd_usbmidi_cin_length[] = { 0, 0, 2, 3, 3, 1, 2, 3, 3, 3, 3, 3, 2, 2, 3, 1 }; /* * Submits the URB, with error handling. */ static int snd_usbmidi_submit_urb(struct urb *urb, gfp_t flags) { int err = usb_submit_urb(urb, flags); if (err < 0 && err != -ENODEV) dev_err(&urb->dev->dev, "usb_submit_urb: %d\n", err); return err; } /* * Error handling for URB completion functions. */ static int snd_usbmidi_urb_error(const struct urb *urb) { switch (urb->status) { /* manually unlinked, or device gone */ case -ENOENT: case -ECONNRESET: case -ESHUTDOWN: case -ENODEV: return -ENODEV; /* errors that might occur during unplugging */ case -EPROTO: case -ETIME: case -EILSEQ: return -EIO; default: dev_err(&urb->dev->dev, "urb status %d\n", urb->status); return 0; /* continue */ } } /* * Receives a chunk of MIDI data. */ static void snd_usbmidi_input_data(struct snd_usb_midi_in_endpoint *ep, int portidx, uint8_t *data, int length) { struct usbmidi_in_port *port = &ep->ports[portidx]; if (!port->substream) { dev_dbg(&ep->umidi->dev->dev, "unexpected port %d!\n", portidx); return; } if (!test_bit(port->substream->number, &ep->umidi->input_triggered)) return; snd_rawmidi_receive(port->substream, data, length); } #ifdef DUMP_PACKETS static void dump_urb(const char *type, const u8 *data, int length) { pr_debug("%s packet: [", type); for (; length > 0; ++data, --length) pr_cont(" %02x", *data); pr_cont(" ]\n"); } #else #define dump_urb(type, data, length) /* nothing */ #endif /* * Processes the data read from the device. */ static void snd_usbmidi_in_urb_complete(struct urb *urb) { struct snd_usb_midi_in_endpoint *ep = urb->context; if (urb->status == 0) { dump_urb("received", urb->transfer_buffer, urb->actual_length); ep->umidi->usb_protocol_ops->input(ep, urb->transfer_buffer, urb->actual_length); } else { int err = snd_usbmidi_urb_error(urb); if (err < 0) { if (err != -ENODEV) { ep->error_resubmit = 1; mod_timer(&ep->umidi->error_timer, jiffies + ERROR_DELAY_JIFFIES); } return; } } urb->dev = ep->umidi->dev; snd_usbmidi_submit_urb(urb, GFP_ATOMIC); } static void snd_usbmidi_out_urb_complete(struct urb *urb) { struct out_urb_context *context = urb->context; struct snd_usb_midi_out_endpoint *ep = context->ep; unsigned int urb_index; scoped_guard(spinlock_irqsave, &ep->buffer_lock) { urb_index = context - ep->urbs; ep->active_urbs &= ~(1 << urb_index); if (unlikely(ep->drain_urbs)) { ep->drain_urbs &= ~(1 << urb_index); wake_up(&ep->drain_wait); } } if (urb->status < 0) { int err = snd_usbmidi_urb_error(urb); if (err < 0) { if (err != -ENODEV) mod_timer(&ep->umidi->error_timer, jiffies + ERROR_DELAY_JIFFIES); return; } } snd_usbmidi_do_output(ep); } /* * This is called when some data should be transferred to the device * (from one or more substreams). */ static void snd_usbmidi_do_output(struct snd_usb_midi_out_endpoint *ep) { unsigned int urb_index; struct urb *urb; guard(spinlock_irqsave)(&ep->buffer_lock); if (ep->umidi->disconnected) return; urb_index = ep->next_urb; for (;;) { if (!(ep->active_urbs & (1 << urb_index))) { urb = ep->urbs[urb_index].urb; urb->transfer_buffer_length = 0; ep->umidi->usb_protocol_ops->output(ep, urb); if (urb->transfer_buffer_length == 0) break; dump_urb("sending", urb->transfer_buffer, urb->transfer_buffer_length); urb->dev = ep->umidi->dev; if (snd_usbmidi_submit_urb(urb, GFP_ATOMIC) < 0) break; ep->active_urbs |= 1 << urb_index; } if (++urb_index >= OUTPUT_URBS) urb_index = 0; if (urb_index == ep->next_urb) break; } ep->next_urb = urb_index; } static void snd_usbmidi_out_work(struct work_struct *work) { struct snd_usb_midi_out_endpoint *ep = container_of(work, struct snd_usb_midi_out_endpoint, work); snd_usbmidi_do_output(ep); } /* called after transfers had been interrupted due to some USB error */ static void snd_usbmidi_error_timer(struct timer_list *t) { struct snd_usb_midi *umidi = timer_container_of(umidi, t, error_timer); unsigned int i, j; guard(spinlock)(&umidi->disc_lock); if (umidi->disconnected) { return; } for (i = 0; i < MIDI_MAX_ENDPOINTS; ++i) { struct snd_usb_midi_in_endpoint *in = umidi->endpoints[i].in; if (in && in->error_resubmit) { in->error_resubmit = 0; for (j = 0; j < INPUT_URBS; ++j) { if (atomic_read(&in->urbs[j]->use_count)) continue; in->urbs[j]->dev = umidi->dev; snd_usbmidi_submit_urb(in->urbs[j], GFP_ATOMIC); } } if (umidi->endpoints[i].out) snd_usbmidi_do_output(umidi->endpoints[i].out); } } /* helper function to send static data that may not DMA-able */ static int send_bulk_static_data(struct snd_usb_midi_out_endpoint *ep, const void *data, int len) { int err = 0; void *buf = kmemdup(data, len, GFP_KERNEL); if (!buf) return -ENOMEM; dump_urb("sending", buf, len); if (ep->urbs[0].urb) err = usb_bulk_msg(ep->umidi->dev, ep->urbs[0].urb->pipe, buf, len, NULL, 250); kfree(buf); return err; } /* * Standard USB MIDI protocol: see the spec. * Midiman protocol: like the standard protocol, but the control byte is the * fourth byte in each packet, and uses length instead of CIN. */ static void snd_usbmidi_standard_input(struct snd_usb_midi_in_endpoint *ep, uint8_t *buffer, int buffer_length) { int i; for (i = 0; i + 3 < buffer_length; i += 4) if (buffer[i] != 0) { int cable = buffer[i] >> 4; int length = snd_usbmidi_cin_length[buffer[i] & 0x0f]; snd_usbmidi_input_data(ep, cable, &buffer[i + 1], length); } } static void snd_usbmidi_midiman_input(struct snd_usb_midi_in_endpoint *ep, uint8_t *buffer, int buffer_length) { int i; for (i = 0; i + 3 < buffer_length; i += 4) if (buffer[i + 3] != 0) { int port = buffer[i + 3] >> 4; int length = buffer[i + 3] & 3; snd_usbmidi_input_data(ep, port, &buffer[i], length); } } /* * Buggy M-Audio device: running status on input results in a packet that has * the data bytes but not the status byte and that is marked with CIN 4. */ static void snd_usbmidi_maudio_broken_running_status_input( struct snd_usb_midi_in_endpoint *ep, uint8_t *buffer, int buffer_length) { int i; for (i = 0; i + 3 < buffer_length; i += 4) if (buffer[i] != 0) { int cable = buffer[i] >> 4; u8 cin = buffer[i] & 0x0f; struct usbmidi_in_port *port = &ep->ports[cable]; int length; length = snd_usbmidi_cin_length[cin]; if (cin == 0xf && buffer[i + 1] >= 0xf8) ; /* realtime msg: no running status change */ else if (cin >= 0x8 && cin <= 0xe) /* channel msg */ port->running_status_length = length - 1; else if (cin == 0x4 && port->running_status_length != 0 && buffer[i + 1] < 0x80) /* CIN 4 that is not a SysEx */ length = port->running_status_length; else /* * All other msgs cannot begin running status. * (A channel msg sent as two or three CIN 0xF * packets could in theory, but this device * doesn't use this format.) */ port->running_status_length = 0; snd_usbmidi_input_data(ep, cable, &buffer[i + 1], length); } } /* * QinHeng CH345 is buggy: every second packet inside a SysEx has not CIN 4 * but the previously seen CIN, but still with three data bytes. */ static void ch345_broken_sysex_input(struct snd_usb_midi_in_endpoint *ep, uint8_t *buffer, int buffer_length) { unsigned int i, cin, length; for (i = 0; i + 3 < buffer_length; i += 4) { if (buffer[i] == 0 && i > 0) break; cin = buffer[i] & 0x0f; if (ep->in_sysex && cin == ep->last_cin && (buffer[i + 1 + (cin == 0x6)] & 0x80) == 0) cin = 0x4; #if 0 if (buffer[i + 1] == 0x90) { /* * Either a corrupted running status or a real note-on * message; impossible to detect reliably. */ } #endif length = snd_usbmidi_cin_length[cin]; snd_usbmidi_input_data(ep, 0, &buffer[i + 1], length); ep->in_sysex = cin == 0x4; if (!ep->in_sysex) ep->last_cin = cin; } } /* * CME protocol: like the standard protocol, but SysEx commands are sent as a * single USB packet preceded by a 0x0F byte, as are system realtime * messages and MIDI Active Sensing. * Also, multiple messages can be sent in the same packet. */ static void snd_usbmidi_cme_input(struct snd_usb_midi_in_endpoint *ep, uint8_t *buffer, int buffer_length) { int remaining = buffer_length; /* * CME send sysex, song position pointer, system realtime * and active sensing using CIN 0x0f, which in the standard * is only intended for single byte unparsed data. * So we need to interpret these here before sending them on. * By default, we assume single byte data, which is true * for system realtime (midi clock, start, stop and continue) * and active sensing, and handle the other (known) cases * separately. * In contrast to the standard, CME does not split sysex * into multiple 4-byte packets, but lumps everything together * into one. In addition, CME can string multiple messages * together in the same packet; pressing the Record button * on an UF6 sends a sysex message directly followed * by a song position pointer in the same packet. * For it to have any reasonable meaning, a sysex message * needs to be at least 3 bytes in length (0xf0, id, 0xf7), * corresponding to a packet size of 4 bytes, and the ones sent * by CME devices are 6 or 7 bytes, making the packet fragments * 7 or 8 bytes long (six or seven bytes plus preceding CN+CIN byte). * For the other types, the packet size is always 4 bytes, * as per the standard, with the data size being 3 for SPP * and 1 for the others. * Thus all packet fragments are at least 4 bytes long, so we can * skip anything that is shorter; this also conveniantly skips * packets with size 0, which CME devices continuously send when * they have nothing better to do. * Another quirk is that sometimes multiple messages are sent * in the same packet. This has been observed for midi clock * and active sensing i.e. 0x0f 0xf8 0x00 0x00 0x0f 0xfe 0x00 0x00, * but also multiple note ons/offs, and control change together * with MIDI clock. Similarly, some sysex messages are followed by * the song position pointer in the same packet, and occasionally * additionally by a midi clock or active sensing. * We handle this by looping over all data and parsing it along the way. */ while (remaining >= 4) { int source_length = 4; /* default */ if ((buffer[0] & 0x0f) == 0x0f) { int data_length = 1; /* default */ if (buffer[1] == 0xf0) { /* Sysex: Find EOX and send on whole message. */ /* To kick off the search, skip the first * two bytes (CN+CIN and SYSEX (0xf0). */ uint8_t *tmp_buf = buffer + 2; int tmp_length = remaining - 2; while (tmp_length > 1 && *tmp_buf != 0xf7) { tmp_buf++; tmp_length--; } data_length = tmp_buf - buffer; source_length = data_length + 1; } else if (buffer[1] == 0xf2) { /* Three byte song position pointer */ data_length = 3; } snd_usbmidi_input_data(ep, buffer[0] >> 4, &buffer[1], data_length); } else { /* normal channel events */ snd_usbmidi_standard_input(ep, buffer, source_length); } buffer += source_length; remaining -= source_length; } } /* * Adds one USB MIDI packet to the output buffer. */ static void snd_usbmidi_output_standard_packet(struct urb *urb, uint8_t p0, uint8_t p1, uint8_t p2, uint8_t p3) { uint8_t *buf = (uint8_t *)urb->transfer_buffer + urb->transfer_buffer_length; buf[0] = p0; buf[1] = p1; buf[2] = p2; buf[3] = p3; urb->transfer_buffer_length += 4; } /* * Adds one Midiman packet to the output buffer. */ static void snd_usbmidi_output_midiman_packet(struct urb *urb, uint8_t p0, uint8_t p1, uint8_t p2, uint8_t p3) { uint8_t *buf = (uint8_t *)urb->transfer_buffer + urb->transfer_buffer_length; buf[0] = p1; buf[1] = p2; buf[2] = p3; buf[3] = (p0 & 0xf0) | snd_usbmidi_cin_length[p0 & 0x0f]; urb->transfer_buffer_length += 4; } /* * Converts MIDI commands to USB MIDI packets. */ static void snd_usbmidi_transmit_byte(struct usbmidi_out_port *port, uint8_t b, struct urb *urb) { uint8_t p0 = port->cable; void (*output_packet)(struct urb*, uint8_t, uint8_t, uint8_t, uint8_t) = port->ep->umidi->usb_protocol_ops->output_packet; if (b >= 0xf8) { output_packet(urb, p0 | 0x0f, b, 0, 0); } else if (b >= 0xf0) { switch (b) { case 0xf0: port->data[0] = b; port->state = STATE_SYSEX_1; break; case 0xf1: case 0xf3: port->data[0] = b; port->state = STATE_1PARAM; break; case 0xf2: port->data[0] = b; port->state = STATE_2PARAM_1; break; case 0xf4: case 0xf5: port->state = STATE_UNKNOWN; break; case 0xf6: output_packet(urb, p0 | 0x05, 0xf6, 0, 0); port->state = STATE_UNKNOWN; break; case 0xf7: switch (port->state) { case STATE_SYSEX_0: output_packet(urb, p0 | 0x05, 0xf7, 0, 0); break; case STATE_SYSEX_1: output_packet(urb, p0 | 0x06, port->data[0], 0xf7, 0); break; case STATE_SYSEX_2: output_packet(urb, p0 | 0x07, port->data[0], port->data[1], 0xf7); break; } port->state = STATE_UNKNOWN; break; } } else if (b >= 0x80) { port->data[0] = b; if (b >= 0xc0 && b <= 0xdf) port->state = STATE_1PARAM; else port->state = STATE_2PARAM_1; } else { /* b < 0x80 */ switch (port->state) { case STATE_1PARAM: if (port->data[0] < 0xf0) { p0 |= port->data[0] >> 4; } else { p0 |= 0x02; port->state = STATE_UNKNOWN; } output_packet(urb, p0, port->data[0], b, 0); break; case STATE_2PARAM_1: port->data[1] = b; port->state = STATE_2PARAM_2; break; case STATE_2PARAM_2: if (port->data[0] < 0xf0) { p0 |= port->data[0] >> 4; port->state = STATE_2PARAM_1; } else { p0 |= 0x03; port->state = STATE_UNKNOWN; } output_packet(urb, p0, port->data[0], port->data[1], b); break; case STATE_SYSEX_0: port->data[0] = b; port->state = STATE_SYSEX_1; break; case STATE_SYSEX_1: port->data[1] = b; port->state = STATE_SYSEX_2; break; case STATE_SYSEX_2: output_packet(urb, p0 | 0x04, port->data[0], port->data[1], b); port->state = STATE_SYSEX_0; break; } } } static void snd_usbmidi_standard_output(struct snd_usb_midi_out_endpoint *ep, struct urb *urb) { int p; /* FIXME: lower-numbered ports can starve higher-numbered ports */ for (p = 0; p < 0x10; ++p) { struct usbmidi_out_port *port = &ep->ports[p]; if (!port->active) continue; while (urb->transfer_buffer_length + 3 < ep->max_transfer) { uint8_t b; if (snd_rawmidi_transmit(port->substream, &b, 1) != 1) { port->active = 0; break; } snd_usbmidi_transmit_byte(port, b, urb); } } } static const struct usb_protocol_ops snd_usbmidi_standard_ops = { .input = snd_usbmidi_standard_input, .output = snd_usbmidi_standard_output, .output_packet = snd_usbmidi_output_standard_packet, }; static const struct usb_protocol_ops snd_usbmidi_midiman_ops = { .input = snd_usbmidi_midiman_input, .output = snd_usbmidi_standard_output, .output_packet = snd_usbmidi_output_midiman_packet, }; static const struct usb_protocol_ops snd_usbmidi_maudio_broken_running_status_ops = { .input = snd_usbmidi_maudio_broken_running_status_input, .output = snd_usbmidi_standard_output, .output_packet = snd_usbmidi_output_standard_packet, }; static const struct usb_protocol_ops snd_usbmidi_cme_ops = { .input = snd_usbmidi_cme_input, .output = snd_usbmidi_standard_output, .output_packet = snd_usbmidi_output_standard_packet, }; static const struct usb_protocol_ops snd_usbmidi_ch345_broken_sysex_ops = { .input = ch345_broken_sysex_input, .output = snd_usbmidi_standard_output, .output_packet = snd_usbmidi_output_standard_packet, }; /* * AKAI MPD16 protocol: * * For control port (endpoint 1): * ============================== * One or more chunks consisting of first byte of (0x10 | msg_len) and then a * SysEx message (msg_len=9 bytes long). * * For data port (endpoint 2): * =========================== * One or more chunks consisting of first byte of (0x20 | msg_len) and then a * MIDI message (msg_len bytes long) * * Messages sent: Active Sense, Note On, Poly Pressure, Control Change. */ static void snd_usbmidi_akai_input(struct snd_usb_midi_in_endpoint *ep, uint8_t *buffer, int buffer_length) { unsigned int pos = 0; unsigned int len = (unsigned int)buffer_length; while (pos < len) { unsigned int port = (buffer[pos] >> 4) - 1; unsigned int msg_len = buffer[pos] & 0x0f; pos++; if (pos + msg_len <= len && port < 2) snd_usbmidi_input_data(ep, 0, &buffer[pos], msg_len); pos += msg_len; } } #define MAX_AKAI_SYSEX_LEN 9 static void snd_usbmidi_akai_output(struct snd_usb_midi_out_endpoint *ep, struct urb *urb) { uint8_t *msg; int pos, end, count, buf_end; uint8_t tmp[MAX_AKAI_SYSEX_LEN]; struct snd_rawmidi_substream *substream = ep->ports[0].substream; if (!ep->ports[0].active) return; msg = urb->transfer_buffer + urb->transfer_buffer_length; buf_end = ep->max_transfer - MAX_AKAI_SYSEX_LEN - 1; /* only try adding more data when there's space for at least 1 SysEx */ while (urb->transfer_buffer_length < buf_end) { count = snd_rawmidi_transmit_peek(substream, tmp, MAX_AKAI_SYSEX_LEN); if (!count) { ep->ports[0].active = 0; return; } /* try to skip non-SysEx data */ for (pos = 0; pos < count && tmp[pos] != 0xF0; pos++) ; if (pos > 0) { snd_rawmidi_transmit_ack(substream, pos); continue; } /* look for the start or end marker */ for (end = 1; end < count && tmp[end] < 0xF0; end++) ; /* next SysEx started before the end of current one */ if (end < count && tmp[end] == 0xF0) { /* it's incomplete - drop it */ snd_rawmidi_transmit_ack(substream, end); continue; } /* SysEx complete */ if (end < count && tmp[end] == 0xF7) { /* queue it, ack it, and get the next one */ count = end + 1; msg[0] = 0x10 | count; memcpy(&msg[1], tmp, count); snd_rawmidi_transmit_ack(substream, count); urb->transfer_buffer_length += count + 1; msg += count + 1; continue; } /* less than 9 bytes and no end byte - wait for more */ if (count < MAX_AKAI_SYSEX_LEN) { ep->ports[0].active = 0; return; } /* 9 bytes and no end marker in sight - malformed, skip it */ snd_rawmidi_transmit_ack(substream, count); } } static const struct usb_protocol_ops snd_usbmidi_akai_ops = { .input = snd_usbmidi_akai_input, .output = snd_usbmidi_akai_output, }; /* * Novation USB MIDI protocol: number of data bytes is in the first byte * (when receiving) (+1!) or in the second byte (when sending); data begins * at the third byte. */ static void snd_usbmidi_novation_input(struct snd_usb_midi_in_endpoint *ep, uint8_t *buffer, int buffer_length) { if (buffer_length < 2 || !buffer[0] || buffer_length < buffer[0] + 1) return; snd_usbmidi_input_data(ep, 0, &buffer[2], buffer[0] - 1); } static void snd_usbmidi_novation_output(struct snd_usb_midi_out_endpoint *ep, struct urb *urb) { uint8_t *transfer_buffer; int count; if (!ep->ports[0].active) return; transfer_buffer = urb->transfer_buffer; count = snd_rawmidi_transmit(ep->ports[0].substream, &transfer_buffer[2], ep->max_transfer - 2); if (count < 1) { ep->ports[0].active = 0; return; } transfer_buffer[0] = 0; transfer_buffer[1] = count; urb->transfer_buffer_length = 2 + count; } static const struct usb_protocol_ops snd_usbmidi_novation_ops = { .input = snd_usbmidi_novation_input, .output = snd_usbmidi_novation_output, }; /* * "raw" protocol: just move raw MIDI bytes from/to the endpoint */ static void snd_usbmidi_raw_input(struct snd_usb_midi_in_endpoint *ep, uint8_t *buffer, int buffer_length) { snd_usbmidi_input_data(ep, 0, buffer, buffer_length); } static void snd_usbmidi_raw_output(struct snd_usb_midi_out_endpoint *ep, struct urb *urb) { int count; if (!ep->ports[0].active) return; count = snd_rawmidi_transmit(ep->ports[0].substream, urb->transfer_buffer, ep->max_transfer); if (count < 1) { ep->ports[0].active = 0; return; } urb->transfer_buffer_length = count; } static const struct usb_protocol_ops snd_usbmidi_raw_ops = { .input = snd_usbmidi_raw_input, .output = snd_usbmidi_raw_output, }; /* * FTDI protocol: raw MIDI bytes, but input packets have two modem status bytes. */ static void snd_usbmidi_ftdi_input(struct snd_usb_midi_in_endpoint *ep, uint8_t *buffer, int buffer_length) { if (buffer_length > 2) snd_usbmidi_input_data(ep, 0, buffer + 2, buffer_length - 2); } static const struct usb_protocol_ops snd_usbmidi_ftdi_ops = { .input = snd_usbmidi_ftdi_input, .output = snd_usbmidi_raw_output, }; static void snd_usbmidi_us122l_input(struct snd_usb_midi_in_endpoint *ep, uint8_t *buffer, int buffer_length) { if (buffer_length != 9) return; buffer_length = 8; while (buffer_length && buffer[buffer_length - 1] == 0xFD) buffer_length--; if (buffer_length) snd_usbmidi_input_data(ep, 0, buffer, buffer_length); } static void snd_usbmidi_us122l_output(struct snd_usb_midi_out_endpoint *ep, struct urb *urb) { int count; if (!ep->ports[0].active) return; switch (snd_usb_get_speed(ep->umidi->dev)) { case USB_SPEED_HIGH: case USB_SPEED_SUPER: case USB_SPEED_SUPER_PLUS: count = 1; break; default: count = 2; } count = snd_rawmidi_transmit(ep->ports[0].substream, urb->transfer_buffer, count); if (count < 1) { ep->ports[0].active = 0; return; } memset(urb->transfer_buffer + count, 0xFD, ep->max_transfer - count); urb->transfer_buffer_length = ep->max_transfer; } static const struct usb_protocol_ops snd_usbmidi_122l_ops = { .input = snd_usbmidi_us122l_input, .output = snd_usbmidi_us122l_output, }; /* * Emagic USB MIDI protocol: raw MIDI with "F5 xx" port switching. */ static void snd_usbmidi_emagic_init_out(struct snd_usb_midi_out_endpoint *ep) { static const u8 init_data[] = { /* initialization magic: "get version" */ 0xf0, 0x00, 0x20, 0x31, /* Emagic */ 0x64, /* Unitor8 */ 0x0b, /* version number request */ 0x00, /* command version */ 0x00, /* EEPROM, box 0 */ 0xf7 }; send_bulk_static_data(ep, init_data, sizeof(init_data)); /* while we're at it, pour on more magic */ send_bulk_static_data(ep, init_data, sizeof(init_data)); } static void snd_usbmidi_emagic_finish_out(struct snd_usb_midi_out_endpoint *ep) { static const u8 finish_data[] = { /* switch to patch mode with last preset */ 0xf0, 0x00, 0x20, 0x31, /* Emagic */ 0x64, /* Unitor8 */ 0x10, /* patch switch command */ 0x00, /* command version */ 0x7f, /* to all boxes */ 0x40, /* last preset in EEPROM */ 0xf7 }; send_bulk_static_data(ep, finish_data, sizeof(finish_data)); } static void snd_usbmidi_emagic_input(struct snd_usb_midi_in_endpoint *ep, uint8_t *buffer, int buffer_length) { int i; /* FF indicates end of valid data */ for (i = 0; i < buffer_length; ++i) if (buffer[i] == 0xff) { buffer_length = i; break; } /* handle F5 at end of last buffer */ if (ep->seen_f5) goto switch_port; while (buffer_length > 0) { /* determine size of data until next F5 */ for (i = 0; i < buffer_length; ++i) if (buffer[i] == 0xf5) break; snd_usbmidi_input_data(ep, ep->current_port, buffer, i); buffer += i; buffer_length -= i; if (buffer_length <= 0) break; /* assert(buffer[0] == 0xf5); */ ep->seen_f5 = 1; ++buffer; --buffer_length; switch_port: if (buffer_length <= 0) break; if (buffer[0] < 0x80) { ep->current_port = (buffer[0] - 1) & 15; ++buffer; --buffer_length; } ep->seen_f5 = 0; } } static void snd_usbmidi_emagic_output(struct snd_usb_midi_out_endpoint *ep, struct urb *urb) { int port0 = ep->current_port; uint8_t *buf = urb->transfer_buffer; int buf_free = ep->max_transfer; int length, i; for (i = 0; i < 0x10; ++i) { /* round-robin, starting at the last current port */ int portnum = (port0 + i) & 15; struct usbmidi_out_port *port = &ep->ports[portnum]; if (!port->active) continue; if (snd_rawmidi_transmit_peek(port->substream, buf, 1) != 1) { port->active = 0; continue; } if (portnum != ep->current_port) { if (buf_free < 2) break; ep->current_port = portnum; buf[0] = 0xf5; buf[1] = (portnum + 1) & 15; buf += 2; buf_free -= 2; } if (buf_free < 1) break; length = snd_rawmidi_transmit(port->substream, buf, buf_free); if (length > 0) { buf += length; buf_free -= length; if (buf_free < 1) break; } } if (buf_free < ep->max_transfer && buf_free > 0) { *buf = 0xff; --buf_free; } urb->transfer_buffer_length = ep->max_transfer - buf_free; } static const struct usb_protocol_ops snd_usbmidi_emagic_ops = { .input = snd_usbmidi_emagic_input, .output = snd_usbmidi_emagic_output, .init_out_endpoint = snd_usbmidi_emagic_init_out, .finish_out_endpoint = snd_usbmidi_emagic_finish_out, }; static void update_roland_altsetting(struct snd_usb_midi *umidi) { struct usb_interface *intf; struct usb_host_interface *hostif; struct usb_interface_descriptor *intfd; int is_light_load; intf = umidi->iface; is_light_load = intf->cur_altsetting != intf->altsetting; if (umidi->roland_load_ctl->private_value == is_light_load) return; hostif = &intf->altsetting[umidi->roland_load_ctl->private_value]; intfd = get_iface_desc(hostif); snd_usbmidi_input_stop(&umidi->list); usb_set_interface(umidi->dev, intfd->bInterfaceNumber, intfd->bAlternateSetting); snd_usbmidi_input_start(&umidi->list); } static int substream_open(struct snd_rawmidi_substream *substream, int dir, int open) { struct snd_usb_midi *umidi = substream->rmidi->private_data; struct snd_kcontrol *ctl; guard(rwsem_read)(&umidi->disc_rwsem); if (umidi->disconnected) return open ? -ENODEV : 0; guard(mutex)(&umidi->mutex); if (open) { if (!umidi->opened[0] && !umidi->opened[1]) { if (umidi->roland_load_ctl) { ctl = umidi->roland_load_ctl; ctl->vd[0].access |= SNDRV_CTL_ELEM_ACCESS_INACTIVE; snd_ctl_notify(umidi->card, SNDRV_CTL_EVENT_MASK_INFO, &ctl->id); update_roland_altsetting(umidi); } } umidi->opened[dir]++; if (umidi->opened[1]) snd_usbmidi_input_start(&umidi->list); } else { umidi->opened[dir]--; if (!umidi->opened[1]) snd_usbmidi_input_stop(&umidi->list); if (!umidi->opened[0] && !umidi->opened[1]) { if (umidi->roland_load_ctl) { ctl = umidi->roland_load_ctl; ctl->vd[0].access &= ~SNDRV_CTL_ELEM_ACCESS_INACTIVE; snd_ctl_notify(umidi->card, SNDRV_CTL_EVENT_MASK_INFO, &ctl->id); } } } return 0; } static int snd_usbmidi_output_open(struct snd_rawmidi_substream *substream) { struct snd_usb_midi *umidi = substream->rmidi->private_data; struct usbmidi_out_port *port = NULL; int i, j; for (i = 0; i < MIDI_MAX_ENDPOINTS; ++i) if (umidi->endpoints[i].out) for (j = 0; j < 0x10; ++j) if (umidi->endpoints[i].out->ports[j].substream == substream) { port = &umidi->endpoints[i].out->ports[j]; break; } if (!port) return -ENXIO; substream->runtime->private_data = port; port->state = STATE_UNKNOWN; return substream_open(substream, 0, 1); } static int snd_usbmidi_output_close(struct snd_rawmidi_substream *substream) { struct usbmidi_out_port *port = substream->runtime->private_data; flush_work(&port->ep->work); return substream_open(substream, 0, 0); } static void snd_usbmidi_output_trigger(struct snd_rawmidi_substream *substream, int up) { struct usbmidi_out_port *port = (struct usbmidi_out_port *)substream->runtime->private_data; port->active = up; if (up) { if (port->ep->umidi->disconnected) { /* gobble up remaining bytes to prevent wait in * snd_rawmidi_drain_output */ snd_rawmidi_proceed(substream); return; } queue_work(system_highpri_wq, &port->ep->work); } } static void snd_usbmidi_output_drain(struct snd_rawmidi_substream *substream) { struct usbmidi_out_port *port = substream->runtime->private_data; struct snd_usb_midi_out_endpoint *ep = port->ep; unsigned int drain_urbs; DEFINE_WAIT(wait); long timeout = msecs_to_jiffies(50); if (ep->umidi->disconnected) return; /* * The substream buffer is empty, but some data might still be in the * currently active URBs, so we have to wait for those to complete. */ spin_lock_irq(&ep->buffer_lock); drain_urbs = ep->active_urbs; if (drain_urbs) { ep->drain_urbs |= drain_urbs; do { prepare_to_wait(&ep->drain_wait, &wait, TASK_UNINTERRUPTIBLE); spin_unlock_irq(&ep->buffer_lock); timeout = schedule_timeout(timeout); spin_lock_irq(&ep->buffer_lock); drain_urbs &= ep->drain_urbs; } while (drain_urbs && timeout); finish_wait(&ep->drain_wait, &wait); } port->active = 0; spin_unlock_irq(&ep->buffer_lock); } static int snd_usbmidi_input_open(struct snd_rawmidi_substream *substream) { return substream_open(substream, 1, 1); } static int snd_usbmidi_input_close(struct snd_rawmidi_substream *substream) { return substream_open(substream, 1, 0); } static void snd_usbmidi_input_trigger(struct snd_rawmidi_substream *substream, int up) { struct snd_usb_midi *umidi = substream->rmidi->private_data; if (up) set_bit(substream->number, &umidi->input_triggered); else clear_bit(substream->number, &umidi->input_triggered); } static const struct snd_rawmidi_ops snd_usbmidi_output_ops = { .open = snd_usbmidi_output_open, .close = snd_usbmidi_output_close, .trigger = snd_usbmidi_output_trigger, .drain = snd_usbmidi_output_drain, }; static const struct snd_rawmidi_ops snd_usbmidi_input_ops = { .open = snd_usbmidi_input_open, .close = snd_usbmidi_input_close, .trigger = snd_usbmidi_input_trigger }; static void free_urb_and_buffer(struct snd_usb_midi *umidi, struct urb *urb, unsigned int buffer_length) { usb_free_coherent(umidi->dev, buffer_length, urb->transfer_buffer, urb->transfer_dma); usb_free_urb(urb); } /* * Frees an input endpoint. * May be called when ep hasn't been initialized completely. */ static void snd_usbmidi_in_endpoint_delete(struct snd_usb_midi_in_endpoint *ep) { unsigned int i; for (i = 0; i < INPUT_URBS; ++i) if (ep->urbs[i]) free_urb_and_buffer(ep->umidi, ep->urbs[i], ep->urbs[i]->transfer_buffer_length); kfree(ep); } /* * Creates an input endpoint. */ static int snd_usbmidi_in_endpoint_create(struct snd_usb_midi *umidi, struct snd_usb_midi_endpoint_info *ep_info, struct snd_usb_midi_endpoint *rep) { struct snd_usb_midi_in_endpoint *ep; void *buffer; unsigned int pipe; int length; unsigned int i; int err; rep->in = NULL; ep = kzalloc(sizeof(*ep), GFP_KERNEL); if (!ep) return -ENOMEM; ep->umidi = umidi; for (i = 0; i < INPUT_URBS; ++i) { ep->urbs[i] = usb_alloc_urb(0, GFP_KERNEL); if (!ep->urbs[i]) { err = -ENOMEM; goto error; } } if (ep_info->in_interval) pipe = usb_rcvintpipe(umidi->dev, ep_info->in_ep); else pipe = usb_rcvbulkpipe(umidi->dev, ep_info->in_ep); length = usb_maxpacket(umidi->dev, pipe); for (i = 0; i < INPUT_URBS; ++i) { buffer = usb_alloc_coherent(umidi->dev, length, GFP_KERNEL, &ep->urbs[i]->transfer_dma); if (!buffer) { err = -ENOMEM; goto error; } if (ep_info->in_interval) usb_fill_int_urb(ep->urbs[i], umidi->dev, pipe, buffer, length, snd_usbmidi_in_urb_complete, ep, ep_info->in_interval); else usb_fill_bulk_urb(ep->urbs[i], umidi->dev, pipe, buffer, length, snd_usbmidi_in_urb_complete, ep); ep->urbs[i]->transfer_flags = URB_NO_TRANSFER_DMA_MAP; err = usb_urb_ep_type_check(ep->urbs[i]); if (err < 0) { dev_err(&umidi->dev->dev, "invalid MIDI in EP %x\n", ep_info->in_ep); goto error; } } rep->in = ep; return 0; error: snd_usbmidi_in_endpoint_delete(ep); return err; } /* * Frees an output endpoint. * May be called when ep hasn't been initialized completely. */ static void snd_usbmidi_out_endpoint_clear(struct snd_usb_midi_out_endpoint *ep) { unsigned int i; for (i = 0; i < OUTPUT_URBS; ++i) if (ep->urbs[i].urb) { free_urb_and_buffer(ep->umidi, ep->urbs[i].urb, ep->max_transfer); ep->urbs[i].urb = NULL; } } static void snd_usbmidi_out_endpoint_delete(struct snd_usb_midi_out_endpoint *ep) { snd_usbmidi_out_endpoint_clear(ep); kfree(ep); } /* * Creates an output endpoint, and initializes output ports. */ static int snd_usbmidi_out_endpoint_create(struct snd_usb_midi *umidi, struct snd_usb_midi_endpoint_info *ep_info, struct snd_usb_midi_endpoint *rep) { struct snd_usb_midi_out_endpoint *ep; unsigned int i; unsigned int pipe; void *buffer; int err; rep->out = NULL; ep = kzalloc(sizeof(*ep), GFP_KERNEL); if (!ep) return -ENOMEM; ep->umidi = umidi; for (i = 0; i < OUTPUT_URBS; ++i) { ep->urbs[i].urb = usb_alloc_urb(0, GFP_KERNEL); if (!ep->urbs[i].urb) { err = -ENOMEM; goto error; } ep->urbs[i].ep = ep; } if (ep_info->out_interval) pipe = usb_sndintpipe(umidi->dev, ep_info->out_ep); else pipe = usb_sndbulkpipe(umidi->dev, ep_info->out_ep); switch (umidi->usb_id) { default: ep->max_transfer = usb_maxpacket(umidi->dev, pipe); break; /* * Various chips declare a packet size larger than 4 bytes, but * do not actually work with larger packets: */ case USB_ID(0x0a67, 0x5011): /* Medeli DD305 */ case USB_ID(0x0a92, 0x1020): /* ESI M4U */ case USB_ID(0x1430, 0x474b): /* RedOctane GH MIDI INTERFACE */ case USB_ID(0x15ca, 0x0101): /* Textech USB Midi Cable */ case USB_ID(0x15ca, 0x1806): /* Textech USB Midi Cable */ case USB_ID(0x1a86, 0x752d): /* QinHeng CH345 "USB2.0-MIDI" */ case USB_ID(0xfc08, 0x0101): /* Unknown vendor Cable */ ep->max_transfer = 4; break; /* * Some devices only work with 9 bytes packet size: */ case USB_ID(0x0644, 0x800e): /* Tascam US-122L */ case USB_ID(0x0644, 0x800f): /* Tascam US-144 */ ep->max_transfer = 9; break; } for (i = 0; i < OUTPUT_URBS; ++i) { buffer = usb_alloc_coherent(umidi->dev, ep->max_transfer, GFP_KERNEL, &ep->urbs[i].urb->transfer_dma); if (!buffer) { err = -ENOMEM; goto error; } if (ep_info->out_interval) usb_fill_int_urb(ep->urbs[i].urb, umidi->dev, pipe, buffer, ep->max_transfer, snd_usbmidi_out_urb_complete, &ep->urbs[i], ep_info->out_interval); else usb_fill_bulk_urb(ep->urbs[i].urb, umidi->dev, pipe, buffer, ep->max_transfer, snd_usbmidi_out_urb_complete, &ep->urbs[i]); err = usb_urb_ep_type_check(ep->urbs[i].urb); if (err < 0) { dev_err(&umidi->dev->dev, "invalid MIDI out EP %x\n", ep_info->out_ep); goto error; } ep->urbs[i].urb->transfer_flags = URB_NO_TRANSFER_DMA_MAP; } spin_lock_init(&ep->buffer_lock); INIT_WORK(&ep->work, snd_usbmidi_out_work); init_waitqueue_head(&ep->drain_wait); for (i = 0; i < 0x10; ++i) if (ep_info->out_cables & (1 << i)) { ep->ports[i].ep = ep; ep->ports[i].cable = i << 4; } if (umidi->usb_protocol_ops->init_out_endpoint) umidi->usb_protocol_ops->init_out_endpoint(ep); rep->out = ep; return 0; error: snd_usbmidi_out_endpoint_delete(ep); return err; } /* * Frees everything. */ static void snd_usbmidi_free(struct snd_usb_midi *umidi) { int i; if (!umidi->disconnected) snd_usbmidi_disconnect(&umidi->list); for (i = 0; i < MIDI_MAX_ENDPOINTS; ++i) { struct snd_usb_midi_endpoint *ep = &umidi->endpoints[i]; kfree(ep->out); } mutex_destroy(&umidi->mutex); kfree(umidi); } /* * Unlinks all URBs (must be done before the usb_device is deleted). */ void snd_usbmidi_disconnect(struct list_head *p) { struct snd_usb_midi *umidi; unsigned int i, j; umidi = list_entry(p, struct snd_usb_midi, list); /* * an URB's completion handler may start the timer and * a timer may submit an URB. To reliably break the cycle * a flag under lock must be used */ scoped_guard(rwsem_write, &umidi->disc_rwsem) { guard(spinlock_irq)(&umidi->disc_lock); umidi->disconnected = 1; } timer_shutdown_sync(&umidi->error_timer); for (i = 0; i < MIDI_MAX_ENDPOINTS; ++i) { struct snd_usb_midi_endpoint *ep = &umidi->endpoints[i]; if (ep->out) cancel_work_sync(&ep->out->work); if (ep->out) { for (j = 0; j < OUTPUT_URBS; ++j) usb_kill_urb(ep->out->urbs[j].urb); if (umidi->usb_protocol_ops->finish_out_endpoint) umidi->usb_protocol_ops->finish_out_endpoint(ep->out); ep->out->active_urbs = 0; if (ep->out->drain_urbs) { ep->out->drain_urbs = 0; wake_up(&ep->out->drain_wait); } } if (ep->in) for (j = 0; j < INPUT_URBS; ++j) usb_kill_urb(ep->in->urbs[j]); /* free endpoints here; later call can result in Oops */ if (ep->out) snd_usbmidi_out_endpoint_clear(ep->out); if (ep->in) { snd_usbmidi_in_endpoint_delete(ep->in); ep->in = NULL; } } } EXPORT_SYMBOL(snd_usbmidi_disconnect); static void snd_usbmidi_rawmidi_free(struct snd_rawmidi *rmidi) { struct snd_usb_midi *umidi = rmidi->private_data; snd_usbmidi_free(umidi); } static struct snd_rawmidi_substream *snd_usbmidi_find_substream(struct snd_usb_midi *umidi, int stream, int number) { struct snd_rawmidi_substream *substream; list_for_each_entry(substream, &umidi->rmidi->streams[stream].substreams, list) { if (substream->number == number) return substream; } return NULL; } /* * This list specifies names for ports that do not fit into the standard * "(product) MIDI (n)" schema because they aren't external MIDI ports, * such as internal control or synthesizer ports. */ static struct port_info { u32 id; short int port; short int voices; const char *name; unsigned int seq_flags; } snd_usbmidi_port_info[] = { #define PORT_INFO(vendor, product, num, name_, voices_, flags) \ { .id = USB_ID(vendor, product), \ .port = num, .voices = voices_, \ .name = name_, .seq_flags = flags } #define EXTERNAL_PORT(vendor, product, num, name) \ PORT_INFO(vendor, product, num, name, 0, \ SNDRV_SEQ_PORT_TYPE_MIDI_GENERIC | \ SNDRV_SEQ_PORT_TYPE_HARDWARE | \ SNDRV_SEQ_PORT_TYPE_PORT) #define CONTROL_PORT(vendor, product, num, name) \ PORT_INFO(vendor, product, num, name, 0, \ SNDRV_SEQ_PORT_TYPE_MIDI_GENERIC | \ SNDRV_SEQ_PORT_TYPE_HARDWARE) #define GM_SYNTH_PORT(vendor, product, num, name, voices) \ PORT_INFO(vendor, product, num, name, voices, \ SNDRV_SEQ_PORT_TYPE_MIDI_GENERIC | \ SNDRV_SEQ_PORT_TYPE_MIDI_GM | \ SNDRV_SEQ_PORT_TYPE_HARDWARE | \ SNDRV_SEQ_PORT_TYPE_SYNTHESIZER) #define ROLAND_SYNTH_PORT(vendor, product, num, name, voices) \ PORT_INFO(vendor, product, num, name, voices, \ SNDRV_SEQ_PORT_TYPE_MIDI_GENERIC | \ SNDRV_SEQ_PORT_TYPE_MIDI_GM | \ SNDRV_SEQ_PORT_TYPE_MIDI_GM2 | \ SNDRV_SEQ_PORT_TYPE_MIDI_GS | \ SNDRV_SEQ_PORT_TYPE_MIDI_XG | \ SNDRV_SEQ_PORT_TYPE_HARDWARE | \ SNDRV_SEQ_PORT_TYPE_SYNTHESIZER) #define SOUNDCANVAS_PORT(vendor, product, num, name, voices) \ PORT_INFO(vendor, product, num, name, voices, \ SNDRV_SEQ_PORT_TYPE_MIDI_GENERIC | \ SNDRV_SEQ_PORT_TYPE_MIDI_GM | \ SNDRV_SEQ_PORT_TYPE_MIDI_GM2 | \ SNDRV_SEQ_PORT_TYPE_MIDI_GS | \ SNDRV_SEQ_PORT_TYPE_MIDI_XG | \ SNDRV_SEQ_PORT_TYPE_MIDI_MT32 | \ SNDRV_SEQ_PORT_TYPE_HARDWARE | \ SNDRV_SEQ_PORT_TYPE_SYNTHESIZER) /* Yamaha MOTIF XF */ GM_SYNTH_PORT(0x0499, 0x105c, 0, "%s Tone Generator", 128), CONTROL_PORT(0x0499, 0x105c, 1, "%s Remote Control"), EXTERNAL_PORT(0x0499, 0x105c, 2, "%s Thru"), CONTROL_PORT(0x0499, 0x105c, 3, "%s Editor"), /* Roland UA-100 */ CONTROL_PORT(0x0582, 0x0000, 2, "%s Control"), /* Roland SC-8850 */ SOUNDCANVAS_PORT(0x0582, 0x0003, 0, "%s Part A", 128), SOUNDCANVAS_PORT(0x0582, 0x0003, 1, "%s Part B", 128), SOUNDCANVAS_PORT(0x0582, 0x0003, 2, "%s Part C", 128), SOUNDCANVAS_PORT(0x0582, 0x0003, 3, "%s Part D", 128), EXTERNAL_PORT(0x0582, 0x0003, 4, "%s MIDI 1"), EXTERNAL_PORT(0x0582, 0x0003, 5, "%s MIDI 2"), /* Roland U-8 */ EXTERNAL_PORT(0x0582, 0x0004, 0, "%s MIDI"), CONTROL_PORT(0x0582, 0x0004, 1, "%s Control"), /* Roland SC-8820 */ SOUNDCANVAS_PORT(0x0582, 0x0007, 0, "%s Part A", 64), SOUNDCANVAS_PORT(0x0582, 0x0007, 1, "%s Part B", 64), EXTERNAL_PORT(0x0582, 0x0007, 2, "%s MIDI"), /* Roland SK-500 */ SOUNDCANVAS_PORT(0x0582, 0x000b, 0, "%s Part A", 64), SOUNDCANVAS_PORT(0x0582, 0x000b, 1, "%s Part B", 64), EXTERNAL_PORT(0x0582, 0x000b, 2, "%s MIDI"), /* Roland SC-D70 */ SOUNDCANVAS_PORT(0x0582, 0x000c, 0, "%s Part A", 64), SOUNDCANVAS_PORT(0x0582, 0x000c, 1, "%s Part B", 64), EXTERNAL_PORT(0x0582, 0x000c, 2, "%s MIDI"), /* Edirol UM-880 */ CONTROL_PORT(0x0582, 0x0014, 8, "%s Control"), /* Edirol SD-90 */ ROLAND_SYNTH_PORT(0x0582, 0x0016, 0, "%s Part A", 128), ROLAND_SYNTH_PORT(0x0582, 0x0016, 1, "%s Part B", 128), EXTERNAL_PORT(0x0582, 0x0016, 2, "%s MIDI 1"), EXTERNAL_PORT(0x0582, 0x0016, 3, "%s MIDI 2"), /* Edirol UM-550 */ CONTROL_PORT(0x0582, 0x0023, 5, "%s Control"), /* Edirol SD-20 */ ROLAND_SYNTH_PORT(0x0582, 0x0027, 0, "%s Part A", 64), ROLAND_SYNTH_PORT(0x0582, 0x0027, 1, "%s Part B", 64), EXTERNAL_PORT(0x0582, 0x0027, 2, "%s MIDI"), /* Edirol SD-80 */ ROLAND_SYNTH_PORT(0x0582, 0x0029, 0, "%s Part A", 128), ROLAND_SYNTH_PORT(0x0582, 0x0029, 1, "%s Part B", 128), EXTERNAL_PORT(0x0582, 0x0029, 2, "%s MIDI 1"), EXTERNAL_PORT(0x0582, 0x0029, 3, "%s MIDI 2"), /* Edirol UA-700 */ EXTERNAL_PORT(0x0582, 0x002b, 0, "%s MIDI"), CONTROL_PORT(0x0582, 0x002b, 1, "%s Control"), /* Roland VariOS */ EXTERNAL_PORT(0x0582, 0x002f, 0, "%s MIDI"), EXTERNAL_PORT(0x0582, 0x002f, 1, "%s External MIDI"), EXTERNAL_PORT(0x0582, 0x002f, 2, "%s Sync"), /* Edirol PCR */ EXTERNAL_PORT(0x0582, 0x0033, 0, "%s MIDI"), EXTERNAL_PORT(0x0582, 0x0033, 1, "%s 1"), EXTERNAL_PORT(0x0582, 0x0033, 2, "%s 2"), /* BOSS GS-10 */ EXTERNAL_PORT(0x0582, 0x003b, 0, "%s MIDI"), CONTROL_PORT(0x0582, 0x003b, 1, "%s Control"), /* Edirol UA-1000 */ EXTERNAL_PORT(0x0582, 0x0044, 0, "%s MIDI"), CONTROL_PORT(0x0582, 0x0044, 1, "%s Control"), /* Edirol UR-80 */ EXTERNAL_PORT(0x0582, 0x0048, 0, "%s MIDI"), EXTERNAL_PORT(0x0582, 0x0048, 1, "%s 1"), EXTERNAL_PORT(0x0582, 0x0048, 2, "%s 2"), /* Edirol PCR-A */ EXTERNAL_PORT(0x0582, 0x004d, 0, "%s MIDI"), EXTERNAL_PORT(0x0582, 0x004d, 1, "%s 1"), EXTERNAL_PORT(0x0582, 0x004d, 2, "%s 2"), /* BOSS GT-PRO */ CONTROL_PORT(0x0582, 0x0089, 0, "%s Control"), /* Edirol UM-3EX */ CONTROL_PORT(0x0582, 0x009a, 3, "%s Control"), /* Roland VG-99 */ CONTROL_PORT(0x0582, 0x00b2, 0, "%s Control"), EXTERNAL_PORT(0x0582, 0x00b2, 1, "%s MIDI"), /* Cakewalk Sonar V-Studio 100 */ EXTERNAL_PORT(0x0582, 0x00eb, 0, "%s MIDI"), CONTROL_PORT(0x0582, 0x00eb, 1, "%s Control"), /* Roland VB-99 */ CONTROL_PORT(0x0582, 0x0102, 0, "%s Control"), EXTERNAL_PORT(0x0582, 0x0102, 1, "%s MIDI"), /* Roland A-PRO */ EXTERNAL_PORT(0x0582, 0x010f, 0, "%s MIDI"), CONTROL_PORT(0x0582, 0x010f, 1, "%s 1"), CONTROL_PORT(0x0582, 0x010f, 2, "%s 2"), /* Roland SD-50 */ ROLAND_SYNTH_PORT(0x0582, 0x0114, 0, "%s Synth", 128), EXTERNAL_PORT(0x0582, 0x0114, 1, "%s MIDI"), CONTROL_PORT(0x0582, 0x0114, 2, "%s Control"), /* Roland OCTA-CAPTURE */ EXTERNAL_PORT(0x0582, 0x0120, 0, "%s MIDI"), CONTROL_PORT(0x0582, 0x0120, 1, "%s Control"), EXTERNAL_PORT(0x0582, 0x0121, 0, "%s MIDI"), CONTROL_PORT(0x0582, 0x0121, 1, "%s Control"), /* Roland SPD-SX */ CONTROL_PORT(0x0582, 0x0145, 0, "%s Control"), EXTERNAL_PORT(0x0582, 0x0145, 1, "%s MIDI"), /* Roland A-Series */ CONTROL_PORT(0x0582, 0x0156, 0, "%s Keyboard"), EXTERNAL_PORT(0x0582, 0x0156, 1, "%s MIDI"), /* Roland INTEGRA-7 */ ROLAND_SYNTH_PORT(0x0582, 0x015b, 0, "%s Synth", 128), CONTROL_PORT(0x0582, 0x015b, 1, "%s Control"), /* M-Audio MidiSport 8x8 */ CONTROL_PORT(0x0763, 0x1031, 8, "%s Control"), CONTROL_PORT(0x0763, 0x1033, 8, "%s Control"), /* MOTU Fastlane */ EXTERNAL_PORT(0x07fd, 0x0001, 0, "%s MIDI A"), EXTERNAL_PORT(0x07fd, 0x0001, 1, "%s MIDI B"), /* Emagic Unitor8/AMT8/MT4 */ EXTERNAL_PORT(0x086a, 0x0001, 8, "%s Broadcast"), EXTERNAL_PORT(0x086a, 0x0002, 8, "%s Broadcast"), EXTERNAL_PORT(0x086a, 0x0003, 4, "%s Broadcast"), /* Akai MPD16 */ CONTROL_PORT(0x09e8, 0x0062, 0, "%s Control"), PORT_INFO(0x09e8, 0x0062, 1, "%s MIDI", 0, SNDRV_SEQ_PORT_TYPE_MIDI_GENERIC | SNDRV_SEQ_PORT_TYPE_HARDWARE), /* Access Music Virus TI */ EXTERNAL_PORT(0x133e, 0x0815, 0, "%s MIDI"), PORT_INFO(0x133e, 0x0815, 1, "%s Synth", 0, SNDRV_SEQ_PORT_TYPE_MIDI_GENERIC | SNDRV_SEQ_PORT_TYPE_HARDWARE | SNDRV_SEQ_PORT_TYPE_SYNTHESIZER), }; static struct port_info *find_port_info(struct snd_usb_midi *umidi, int number) { int i; for (i = 0; i < ARRAY_SIZE(snd_usbmidi_port_info); ++i) { if (snd_usbmidi_port_info[i].id == umidi->usb_id && snd_usbmidi_port_info[i].port == number) return &snd_usbmidi_port_info[i]; } return NULL; } static void snd_usbmidi_get_port_info(struct snd_rawmidi *rmidi, int number, struct snd_seq_port_info *seq_port_info) { struct snd_usb_midi *umidi = rmidi->private_data; struct port_info *port_info; /* TODO: read port flags from descriptors */ port_info = find_port_info(umidi, number); if (port_info) { seq_port_info->type = port_info->seq_flags; seq_port_info->midi_voices = port_info->voices; } } /* return iJack for the corresponding jackID */ static int find_usb_ijack(struct usb_host_interface *hostif, uint8_t jack_id) { unsigned char *extra = hostif->extra; int extralen = hostif->extralen; struct usb_descriptor_header *h; struct usb_midi_out_jack_descriptor *outjd; struct usb_midi_in_jack_descriptor *injd; size_t sz; while (extralen > 4) { h = (struct usb_descriptor_header *)extra; if (h->bDescriptorType != USB_DT_CS_INTERFACE) goto next; outjd = (struct usb_midi_out_jack_descriptor *)h; if (h->bLength >= sizeof(*outjd) && outjd->bDescriptorSubtype == UAC_MIDI_OUT_JACK && outjd->bJackID == jack_id) { sz = USB_DT_MIDI_OUT_SIZE(outjd->bNrInputPins); if (outjd->bLength < sz) goto next; return *(extra + sz - 1); } injd = (struct usb_midi_in_jack_descriptor *)h; if (injd->bLength >= sizeof(*injd) && injd->bDescriptorSubtype == UAC_MIDI_IN_JACK && injd->bJackID == jack_id) return injd->iJack; next: if (!extra[0]) break; extralen -= extra[0]; extra += extra[0]; } return 0; } static void snd_usbmidi_init_substream(struct snd_usb_midi *umidi, int stream, int number, int jack_id, struct snd_rawmidi_substream **rsubstream) { struct port_info *port_info; const char *name_format; struct usb_interface *intf; struct usb_host_interface *hostif; uint8_t jack_name_buf[32]; uint8_t *default_jack_name = "MIDI"; uint8_t *jack_name = default_jack_name; uint8_t iJack; int res; struct snd_rawmidi_substream *substream = snd_usbmidi_find_substream(umidi, stream, number); if (!substream) { dev_err(&umidi->dev->dev, "substream %d:%d not found\n", stream, number); return; } intf = umidi->iface; if (intf && jack_id >= 0) { hostif = intf->cur_altsetting; iJack = find_usb_ijack(hostif, jack_id); if (iJack != 0) { res = usb_string(umidi->dev, iJack, jack_name_buf, ARRAY_SIZE(jack_name_buf)); if (res) jack_name = jack_name_buf; } } port_info = find_port_info(umidi, number); if (port_info || jack_name == default_jack_name || strncmp(umidi->card->shortname, jack_name, strlen(umidi->card->shortname)) != 0) { name_format = port_info ? port_info->name : (jack_name != default_jack_name ? "%s %s" : "%s %s %d"); snprintf(substream->name, sizeof(substream->name), name_format, umidi->card->shortname, jack_name, number + 1); } else { /* The manufacturer included the iProduct name in the jack * name, do not use both */ strscpy(substream->name, jack_name); } *rsubstream = substream; } /* * Creates the endpoints and their ports. */ static int snd_usbmidi_create_endpoints(struct snd_usb_midi *umidi, struct snd_usb_midi_endpoint_info *endpoints) { int i, j, err; int out_ports = 0, in_ports = 0; for (i = 0; i < MIDI_MAX_ENDPOINTS; ++i) { if (endpoints[i].out_cables) { err = snd_usbmidi_out_endpoint_create(umidi, &endpoints[i], &umidi->endpoints[i]); if (err < 0) return err; } if (endpoints[i].in_cables) { err = snd_usbmidi_in_endpoint_create(umidi, &endpoints[i], &umidi->endpoints[i]); if (err < 0) return err; } for (j = 0; j < 0x10; ++j) { if (endpoints[i].out_cables & (1 << j)) { snd_usbmidi_init_substream(umidi, SNDRV_RAWMIDI_STREAM_OUTPUT, out_ports, endpoints[i].assoc_out_jacks[j], &umidi->endpoints[i].out->ports[j].substream); ++out_ports; } if (endpoints[i].in_cables & (1 << j)) { snd_usbmidi_init_substream(umidi, SNDRV_RAWMIDI_STREAM_INPUT, in_ports, endpoints[i].assoc_in_jacks[j], &umidi->endpoints[i].in->ports[j].substream); ++in_ports; } } } dev_dbg(&umidi->dev->dev, "created %d output and %d input ports\n", out_ports, in_ports); return 0; } static struct usb_ms_endpoint_descriptor *find_usb_ms_endpoint_descriptor( struct usb_host_endpoint *hostep) { unsigned char *extra = hostep->extra; int extralen = hostep->extralen; while (extralen > 3) { struct usb_ms_endpoint_descriptor *ms_ep = (struct usb_ms_endpoint_descriptor *)extra; if (ms_ep->bLength > 3 && ms_ep->bDescriptorType == USB_DT_CS_ENDPOINT && ms_ep->bDescriptorSubtype == UAC_MS_GENERAL) return ms_ep; if (!extra[0]) break; extralen -= extra[0]; extra += extra[0]; } return NULL; } /* * Returns MIDIStreaming device capabilities. */ static int snd_usbmidi_get_ms_info(struct snd_usb_midi *umidi, struct snd_usb_midi_endpoint_info *endpoints) { struct usb_interface *intf; struct usb_host_interface *hostif; struct usb_interface_descriptor *intfd; struct usb_ms_header_descriptor *ms_header; struct usb_host_endpoint *hostep; struct usb_endpoint_descriptor *ep; struct usb_ms_endpoint_descriptor *ms_ep; int i, j, epidx; intf = umidi->iface; if (!intf) return -ENXIO; hostif = &intf->altsetting[0]; intfd = get_iface_desc(hostif); ms_header = (struct usb_ms_header_descriptor *)hostif->extra; if (hostif->extralen >= 7 && ms_header->bLength >= 7 && ms_header->bDescriptorType == USB_DT_CS_INTERFACE && ms_header->bDescriptorSubtype == UAC_HEADER) dev_dbg(&umidi->dev->dev, "MIDIStreaming version %02x.%02x\n", ((uint8_t *)&ms_header->bcdMSC)[1], ((uint8_t *)&ms_header->bcdMSC)[0]); else dev_warn(&umidi->dev->dev, "MIDIStreaming interface descriptor not found\n"); epidx = 0; for (i = 0; i < intfd->bNumEndpoints; ++i) { hostep = &hostif->endpoint[i]; ep = get_ep_desc(hostep); if (!usb_endpoint_xfer_bulk(ep) && !usb_endpoint_xfer_int(ep)) continue; ms_ep = find_usb_ms_endpoint_descriptor(hostep); if (!ms_ep) continue; if (ms_ep->bLength <= sizeof(*ms_ep)) continue; if (ms_ep->bNumEmbMIDIJack > 0x10) continue; if (ms_ep->bLength < sizeof(*ms_ep) + ms_ep->bNumEmbMIDIJack) continue; if (usb_endpoint_dir_out(ep)) { if (endpoints[epidx].out_ep) { if (++epidx >= MIDI_MAX_ENDPOINTS) { dev_warn(&umidi->dev->dev, "too many endpoints\n"); break; } } endpoints[epidx].out_ep = usb_endpoint_num(ep); if (usb_endpoint_xfer_int(ep)) endpoints[epidx].out_interval = ep->bInterval; else if (snd_usb_get_speed(umidi->dev) == USB_SPEED_LOW) /* * Low speed bulk transfers don't exist, so * force interrupt transfers for devices like * ESI MIDI Mate that try to use them anyway. */ endpoints[epidx].out_interval = 1; endpoints[epidx].out_cables = (1 << ms_ep->bNumEmbMIDIJack) - 1; for (j = 0; j < ms_ep->bNumEmbMIDIJack; ++j) endpoints[epidx].assoc_out_jacks[j] = ms_ep->baAssocJackID[j]; for (; j < ARRAY_SIZE(endpoints[epidx].assoc_out_jacks); ++j) endpoints[epidx].assoc_out_jacks[j] = -1; dev_dbg(&umidi->dev->dev, "EP %02X: %d jack(s)\n", ep->bEndpointAddress, ms_ep->bNumEmbMIDIJack); } else { if (endpoints[epidx].in_ep) { if (++epidx >= MIDI_MAX_ENDPOINTS) { dev_warn(&umidi->dev->dev, "too many endpoints\n"); break; } } endpoints[epidx].in_ep = usb_endpoint_num(ep); if (usb_endpoint_xfer_int(ep)) endpoints[epidx].in_interval = ep->bInterval; else if (snd_usb_get_speed(umidi->dev) == USB_SPEED_LOW) endpoints[epidx].in_interval = 1; endpoints[epidx].in_cables = (1 << ms_ep->bNumEmbMIDIJack) - 1; for (j = 0; j < ms_ep->bNumEmbMIDIJack; ++j) endpoints[epidx].assoc_in_jacks[j] = ms_ep->baAssocJackID[j]; for (; j < ARRAY_SIZE(endpoints[epidx].assoc_in_jacks); ++j) endpoints[epidx].assoc_in_jacks[j] = -1; dev_dbg(&umidi->dev->dev, "EP %02X: %d jack(s)\n", ep->bEndpointAddress, ms_ep->bNumEmbMIDIJack); } } return 0; } static int roland_load_info(struct snd_kcontrol *kcontrol, struct snd_ctl_elem_info *info) { static const char *const names[] = { "High Load", "Light Load" }; return snd_ctl_enum_info(info, 1, 2, names); } static int roland_load_get(struct snd_kcontrol *kcontrol, struct snd_ctl_elem_value *value) { value->value.enumerated.item[0] = kcontrol->private_value; return 0; } static int roland_load_put(struct snd_kcontrol *kcontrol, struct snd_ctl_elem_value *value) { struct snd_usb_midi *umidi = snd_kcontrol_chip(kcontrol); int changed; if (value->value.enumerated.item[0] > 1) return -EINVAL; guard(mutex)(&umidi->mutex); changed = value->value.enumerated.item[0] != kcontrol->private_value; if (changed) kcontrol->private_value = value->value.enumerated.item[0]; return changed; } static const struct snd_kcontrol_new roland_load_ctl = { .iface = SNDRV_CTL_ELEM_IFACE_MIXER, .name = "MIDI Input Mode", .info = roland_load_info, .get = roland_load_get, .put = roland_load_put, .private_value = 1, }; /* * On Roland devices, use the second alternate setting to be able to use * the interrupt input endpoint. */ static void snd_usbmidi_switch_roland_altsetting(struct snd_usb_midi *umidi) { struct usb_interface *intf; struct usb_host_interface *hostif; struct usb_interface_descriptor *intfd; intf = umidi->iface; if (!intf || intf->num_altsetting != 2) return; hostif = &intf->altsetting[1]; intfd = get_iface_desc(hostif); /* If either or both of the endpoints support interrupt transfer, * then use the alternate setting */ if (intfd->bNumEndpoints != 2 || !((get_endpoint(hostif, 0)->bmAttributes & USB_ENDPOINT_XFERTYPE_MASK) == USB_ENDPOINT_XFER_INT || (get_endpoint(hostif, 1)->bmAttributes & USB_ENDPOINT_XFERTYPE_MASK) == USB_ENDPOINT_XFER_INT)) return; dev_dbg(&umidi->dev->dev, "switching to altsetting %d with int ep\n", intfd->bAlternateSetting); usb_set_interface(umidi->dev, intfd->bInterfaceNumber, intfd->bAlternateSetting); umidi->roland_load_ctl = snd_ctl_new1(&roland_load_ctl, umidi); if (snd_ctl_add(umidi->card, umidi->roland_load_ctl) < 0) umidi->roland_load_ctl = NULL; } /* * Try to find any usable endpoints in the interface. */ static int snd_usbmidi_detect_endpoints(struct snd_usb_midi *umidi, struct snd_usb_midi_endpoint_info *endpoint, int max_endpoints) { struct usb_interface *intf; struct usb_host_interface *hostif; struct usb_interface_descriptor *intfd; struct usb_endpoint_descriptor *epd; int i, out_eps = 0, in_eps = 0; if (USB_ID_VENDOR(umidi->usb_id) == 0x0582) snd_usbmidi_switch_roland_altsetting(umidi); if (endpoint[0].out_ep || endpoint[0].in_ep) return 0; intf = umidi->iface; if (!intf || intf->num_altsetting < 1) return -ENOENT; hostif = intf->cur_altsetting; intfd = get_iface_desc(hostif); for (i = 0; i < intfd->bNumEndpoints; ++i) { epd = get_endpoint(hostif, i); if (!usb_endpoint_xfer_bulk(epd) && !usb_endpoint_xfer_int(epd)) continue; if (out_eps < max_endpoints && usb_endpoint_dir_out(epd)) { endpoint[out_eps].out_ep = usb_endpoint_num(epd); if (usb_endpoint_xfer_int(epd)) endpoint[out_eps].out_interval = epd->bInterval; ++out_eps; } if (in_eps < max_endpoints && usb_endpoint_dir_in(epd)) { endpoint[in_eps].in_ep = usb_endpoint_num(epd); if (usb_endpoint_xfer_int(epd)) endpoint[in_eps].in_interval = epd->bInterval; ++in_eps; } } return (out_eps || in_eps) ? 0 : -ENOENT; } /* * Detects the endpoints for one-port-per-endpoint protocols. */ static int snd_usbmidi_detect_per_port_endpoints(struct snd_usb_midi *umidi, struct snd_usb_midi_endpoint_info *endpoints) { int err, i; err = snd_usbmidi_detect_endpoints(umidi, endpoints, MIDI_MAX_ENDPOINTS); for (i = 0; i < MIDI_MAX_ENDPOINTS; ++i) { if (endpoints[i].out_ep) endpoints[i].out_cables = 0x0001; if (endpoints[i].in_ep) endpoints[i].in_cables = 0x0001; } return err; } /* * Detects the endpoints and ports of Yamaha devices. */ static int snd_usbmidi_detect_yamaha(struct snd_usb_midi *umidi, struct snd_usb_midi_endpoint_info *endpoint) { struct usb_interface *intf; struct usb_host_interface *hostif; struct usb_interface_descriptor *intfd; uint8_t *cs_desc; intf = umidi->iface; if (!intf) return -ENOENT; hostif = intf->altsetting; intfd = get_iface_desc(hostif); if (intfd->bNumEndpoints < 1) return -ENOENT; /* * For each port there is one MIDI_IN/OUT_JACK descriptor, not * necessarily with any useful contents. So simply count 'em. */ for (cs_desc = hostif->extra; cs_desc < hostif->extra + hostif->extralen && cs_desc[0] >= 2; cs_desc += cs_desc[0]) { if (cs_desc[1] == USB_DT_CS_INTERFACE) { if (cs_desc[2] == UAC_MIDI_IN_JACK) endpoint->in_cables = (endpoint->in_cables << 1) | 1; else if (cs_desc[2] == UAC_MIDI_OUT_JACK) endpoint->out_cables = (endpoint->out_cables << 1) | 1; } } if (!endpoint->in_cables && !endpoint->out_cables) return -ENOENT; return snd_usbmidi_detect_endpoints(umidi, endpoint, 1); } /* * Detects the endpoints and ports of Roland devices. */ static int snd_usbmidi_detect_roland(struct snd_usb_midi *umidi, struct snd_usb_midi_endpoint_info *endpoint) { struct usb_interface *intf; struct usb_host_interface *hostif; u8 *cs_desc; intf = umidi->iface; if (!intf) return -ENOENT; hostif = intf->altsetting; /* * Some devices have a descriptor <06 24 F1 02 <inputs> <outputs>>, * some have standard class descriptors, or both kinds, or neither. */ for (cs_desc = hostif->extra; cs_desc < hostif->extra + hostif->extralen && cs_desc[0] >= 2; cs_desc += cs_desc[0]) { if (cs_desc[0] >= 6 && cs_desc[1] == USB_DT_CS_INTERFACE && cs_desc[2] == 0xf1 && cs_desc[3] == 0x02) { if (cs_desc[4] > 0x10 || cs_desc[5] > 0x10) continue; endpoint->in_cables = (1 << cs_desc[4]) - 1; endpoint->out_cables = (1 << cs_desc[5]) - 1; return snd_usbmidi_detect_endpoints(umidi, endpoint, 1); } else if (cs_desc[0] >= 7 && cs_desc[1] == USB_DT_CS_INTERFACE && cs_desc[2] == UAC_HEADER) { return snd_usbmidi_get_ms_info(umidi, endpoint); } } return -ENODEV; } /* * Creates the endpoints and their ports for Midiman devices. */ static int snd_usbmidi_create_endpoints_midiman(struct snd_usb_midi *umidi, struct snd_usb_midi_endpoint_info *endpoint) { struct snd_usb_midi_endpoint_info ep_info; struct usb_interface *intf; struct usb_host_interface *hostif; struct usb_interface_descriptor *intfd; struct usb_endpoint_descriptor *epd; int cable, err; intf = umidi->iface; if (!intf) return -ENOENT; hostif = intf->altsetting; intfd = get_iface_desc(hostif); /* * The various MidiSport devices have more or less random endpoint * numbers, so we have to identify the endpoints by their index in * the descriptor array, like the driver for that other OS does. * * There is one interrupt input endpoint for all input ports, one * bulk output endpoint for even-numbered ports, and one for odd- * numbered ports. Both bulk output endpoints have corresponding * input bulk endpoints (at indices 1 and 3) which aren't used. */ if (intfd->bNumEndpoints < (endpoint->out_cables > 0x0001 ? 5 : 3)) { dev_dbg(&umidi->dev->dev, "not enough endpoints\n"); return -ENOENT; } epd = get_endpoint(hostif, 0); if (!usb_endpoint_dir_in(epd) || !usb_endpoint_xfer_int(epd)) { dev_dbg(&umidi->dev->dev, "endpoint[0] isn't interrupt\n"); return -ENXIO; } epd = get_endpoint(hostif, 2); if (!usb_endpoint_dir_out(epd) || !usb_endpoint_xfer_bulk(epd)) { dev_dbg(&umidi->dev->dev, "endpoint[2] isn't bulk output\n"); return -ENXIO; } if (endpoint->out_cables > 0x0001) { epd = get_endpoint(hostif, 4); if (!usb_endpoint_dir_out(epd) || !usb_endpoint_xfer_bulk(epd)) { dev_dbg(&umidi->dev->dev, "endpoint[4] isn't bulk output\n"); return -ENXIO; } } ep_info.out_ep = get_endpoint(hostif, 2)->bEndpointAddress & USB_ENDPOINT_NUMBER_MASK; ep_info.out_interval = 0; ep_info.out_cables = endpoint->out_cables & 0x5555; err = snd_usbmidi_out_endpoint_create(umidi, &ep_info, &umidi->endpoints[0]); if (err < 0) return err; ep_info.in_ep = get_endpoint(hostif, 0)->bEndpointAddress & USB_ENDPOINT_NUMBER_MASK; ep_info.in_interval = get_endpoint(hostif, 0)->bInterval; ep_info.in_cables = endpoint->in_cables; err = snd_usbmidi_in_endpoint_create(umidi, &ep_info, &umidi->endpoints[0]); if (err < 0) return err; if (endpoint->out_cables > 0x0001) { ep_info.out_ep = get_endpoint(hostif, 4)->bEndpointAddress & USB_ENDPOINT_NUMBER_MASK; ep_info.out_cables = endpoint->out_cables & 0xaaaa; err = snd_usbmidi_out_endpoint_create(umidi, &ep_info, &umidi->endpoints[1]); if (err < 0) return err; } for (cable = 0; cable < 0x10; ++cable) { if (endpoint->out_cables & (1 << cable)) snd_usbmidi_init_substream(umidi, SNDRV_RAWMIDI_STREAM_OUTPUT, cable, -1 /* prevent trying to find jack */, &umidi->endpoints[cable & 1].out->ports[cable].substream); if (endpoint->in_cables & (1 << cable)) snd_usbmidi_init_substream(umidi, SNDRV_RAWMIDI_STREAM_INPUT, cable, -1 /* prevent trying to find jack */, &umidi->endpoints[0].in->ports[cable].substream); } return 0; } static const struct snd_rawmidi_global_ops snd_usbmidi_ops = { .get_port_info = snd_usbmidi_get_port_info, }; static int snd_usbmidi_create_rawmidi(struct snd_usb_midi *umidi, int out_ports, int in_ports) { struct snd_rawmidi *rmidi; int err; err = snd_rawmidi_new(umidi->card, "USB MIDI", umidi->next_midi_device++, out_ports, in_ports, &rmidi); if (err < 0) return err; strscpy(rmidi->name, umidi->card->shortname); rmidi->info_flags = SNDRV_RAWMIDI_INFO_OUTPUT | SNDRV_RAWMIDI_INFO_INPUT | SNDRV_RAWMIDI_INFO_DUPLEX; rmidi->ops = &snd_usbmidi_ops; rmidi->private_data = umidi; rmidi->private_free = snd_usbmidi_rawmidi_free; snd_rawmidi_set_ops(rmidi, SNDRV_RAWMIDI_STREAM_OUTPUT, &snd_usbmidi_output_ops); snd_rawmidi_set_ops(rmidi, SNDRV_RAWMIDI_STREAM_INPUT, &snd_usbmidi_input_ops); umidi->rmidi = rmidi; return 0; } /* * Temporarily stop input. */ void snd_usbmidi_input_stop(struct list_head *p) { struct snd_usb_midi *umidi; unsigned int i, j; umidi = list_entry(p, struct snd_usb_midi, list); if (!umidi->input_running) return; for (i = 0; i < MIDI_MAX_ENDPOINTS; ++i) { struct snd_usb_midi_endpoint *ep = &umidi->endpoints[i]; if (ep->in) for (j = 0; j < INPUT_URBS; ++j) usb_kill_urb(ep->in->urbs[j]); } umidi->input_running = 0; } EXPORT_SYMBOL(snd_usbmidi_input_stop); static void snd_usbmidi_input_start_ep(struct snd_usb_midi *umidi, struct snd_usb_midi_in_endpoint *ep) { unsigned int i; if (!ep) return; for (i = 0; i < INPUT_URBS; ++i) { struct urb *urb = ep->urbs[i]; scoped_guard(spinlock_irqsave, &umidi->disc_lock) { if (!atomic_read(&urb->use_count)) { urb->dev = ep->umidi->dev; snd_usbmidi_submit_urb(urb, GFP_ATOMIC); } } } } /* * Resume input after a call to snd_usbmidi_input_stop(). */ void snd_usbmidi_input_start(struct list_head *p) { struct snd_usb_midi *umidi; int i; umidi = list_entry(p, struct snd_usb_midi, list); if (umidi->input_running || !umidi->opened[1]) return; for (i = 0; i < MIDI_MAX_ENDPOINTS; ++i) snd_usbmidi_input_start_ep(umidi, umidi->endpoints[i].in); umidi->input_running = 1; } EXPORT_SYMBOL(snd_usbmidi_input_start); /* * Prepare for suspend. Typically called from the USB suspend callback. */ void snd_usbmidi_suspend(struct list_head *p) { struct snd_usb_midi *umidi; umidi = list_entry(p, struct snd_usb_midi, list); guard(mutex)(&umidi->mutex); snd_usbmidi_input_stop(p); } EXPORT_SYMBOL(snd_usbmidi_suspend); /* * Resume. Typically called from the USB resume callback. */ void snd_usbmidi_resume(struct list_head *p) { struct snd_usb_midi *umidi; umidi = list_entry(p, struct snd_usb_midi, list); guard(mutex)(&umidi->mutex); snd_usbmidi_input_start(p); } EXPORT_SYMBOL(snd_usbmidi_resume); /* * Creates and registers everything needed for a MIDI streaming interface. */ int __snd_usbmidi_create(struct snd_card *card, struct usb_interface *iface, struct list_head *midi_list, const struct snd_usb_audio_quirk *quirk, unsigned int usb_id, unsigned int *num_rawmidis) { struct snd_usb_midi *umidi; struct snd_usb_midi_endpoint_info endpoints[MIDI_MAX_ENDPOINTS]; int out_ports, in_ports; int i, err; umidi = kzalloc(sizeof(*umidi), GFP_KERNEL); if (!umidi) return -ENOMEM; umidi->dev = interface_to_usbdev(iface); umidi->card = card; umidi->iface = iface; umidi->quirk = quirk; umidi->usb_protocol_ops = &snd_usbmidi_standard_ops; if (num_rawmidis) umidi->next_midi_device = *num_rawmidis; spin_lock_init(&umidi->disc_lock); init_rwsem(&umidi->disc_rwsem); mutex_init(&umidi->mutex); if (!usb_id) usb_id = USB_ID(le16_to_cpu(umidi->dev->descriptor.idVendor), le16_to_cpu(umidi->dev->descriptor.idProduct)); umidi->usb_id = usb_id; timer_setup(&umidi->error_timer, snd_usbmidi_error_timer, 0); /* detect the endpoint(s) to use */ memset(endpoints, 0, sizeof(endpoints)); switch (quirk ? quirk->type : QUIRK_MIDI_STANDARD_INTERFACE) { case QUIRK_MIDI_STANDARD_INTERFACE: err = snd_usbmidi_get_ms_info(umidi, endpoints); if (umidi->usb_id == USB_ID(0x0763, 0x0150)) /* M-Audio Uno */ umidi->usb_protocol_ops = &snd_usbmidi_maudio_broken_running_status_ops; break; case QUIRK_MIDI_US122L: umidi->usb_protocol_ops = &snd_usbmidi_122l_ops; fallthrough; case QUIRK_MIDI_FIXED_ENDPOINT: memcpy(&endpoints[0], quirk->data, sizeof(struct snd_usb_midi_endpoint_info)); err = snd_usbmidi_detect_endpoints(umidi, &endpoints[0], 1); break; case QUIRK_MIDI_YAMAHA: err = snd_usbmidi_detect_yamaha(umidi, &endpoints[0]); break; case QUIRK_MIDI_ROLAND: err = snd_usbmidi_detect_roland(umidi, &endpoints[0]); break; case QUIRK_MIDI_MIDIMAN: umidi->usb_protocol_ops = &snd_usbmidi_midiman_ops; memcpy(&endpoints[0], quirk->data, sizeof(struct snd_usb_midi_endpoint_info)); err = 0; break; case QUIRK_MIDI_NOVATION: umidi->usb_protocol_ops = &snd_usbmidi_novation_ops; err = snd_usbmidi_detect_per_port_endpoints(umidi, endpoints); break; case QUIRK_MIDI_RAW_BYTES: umidi->usb_protocol_ops = &snd_usbmidi_raw_ops; /* * Interface 1 contains isochronous endpoints, but with the same * numbers as in interface 0. Since it is interface 1 that the * USB core has most recently seen, these descriptors are now * associated with the endpoint numbers. This will foul up our * attempts to submit bulk/interrupt URBs to the endpoints in * interface 0, so we have to make sure that the USB core looks * again at interface 0 by calling usb_set_interface() on it. */ if (umidi->usb_id == USB_ID(0x07fd, 0x0001)) /* MOTU Fastlane */ usb_set_interface(umidi->dev, 0, 0); err = snd_usbmidi_detect_per_port_endpoints(umidi, endpoints); break; case QUIRK_MIDI_EMAGIC: umidi->usb_protocol_ops = &snd_usbmidi_emagic_ops; memcpy(&endpoints[0], quirk->data, sizeof(struct snd_usb_midi_endpoint_info)); err = snd_usbmidi_detect_endpoints(umidi, &endpoints[0], 1); break; case QUIRK_MIDI_CME: umidi->usb_protocol_ops = &snd_usbmidi_cme_ops; err = snd_usbmidi_detect_per_port_endpoints(umidi, endpoints); break; case QUIRK_MIDI_AKAI: umidi->usb_protocol_ops = &snd_usbmidi_akai_ops; err = snd_usbmidi_detect_per_port_endpoints(umidi, endpoints); /* endpoint 1 is input-only */ endpoints[1].out_cables = 0; break; case QUIRK_MIDI_FTDI: umidi->usb_protocol_ops = &snd_usbmidi_ftdi_ops; /* set baud rate to 31250 (48 MHz / 16 / 96) */ err = usb_control_msg(umidi->dev, usb_sndctrlpipe(umidi->dev, 0), 3, 0x40, 0x60, 0, NULL, 0, 1000); if (err < 0) break; err = snd_usbmidi_detect_per_port_endpoints(umidi, endpoints); break; case QUIRK_MIDI_CH345: umidi->usb_protocol_ops = &snd_usbmidi_ch345_broken_sysex_ops; err = snd_usbmidi_detect_per_port_endpoints(umidi, endpoints); break; default: dev_err(&umidi->dev->dev, "invalid quirk type %d\n", quirk->type); err = -ENXIO; break; } if (err < 0) goto free_midi; /* create rawmidi device */ out_ports = 0; in_ports = 0; for (i = 0; i < MIDI_MAX_ENDPOINTS; ++i) { out_ports += hweight16(endpoints[i].out_cables); in_ports += hweight16(endpoints[i].in_cables); } err = snd_usbmidi_create_rawmidi(umidi, out_ports, in_ports); if (err < 0) goto free_midi; /* create endpoint/port structures */ if (quirk && quirk->type == QUIRK_MIDI_MIDIMAN) err = snd_usbmidi_create_endpoints_midiman(umidi, &endpoints[0]); else err = snd_usbmidi_create_endpoints(umidi, endpoints); if (err < 0) goto exit; usb_autopm_get_interface_no_resume(umidi->iface); list_add_tail(&umidi->list, midi_list); if (num_rawmidis) *num_rawmidis = umidi->next_midi_device; return 0; free_midi: kfree(umidi); exit: return err; } EXPORT_SYMBOL(__snd_usbmidi_create); |
| 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 137 3 135 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 | // SPDX-License-Identifier: GPL-2.0-or-later /* * net/core/netprio_cgroup.c Priority Control Group * * Authors: Neil Horman <nhorman@tuxdriver.com> */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/module.h> #include <linux/slab.h> #include <linux/types.h> #include <linux/string.h> #include <linux/errno.h> #include <linux/skbuff.h> #include <linux/cgroup.h> #include <linux/rcupdate.h> #include <linux/atomic.h> #include <linux/sched/task.h> #include <net/rtnetlink.h> #include <net/pkt_cls.h> #include <net/sock.h> #include <net/netprio_cgroup.h> #include <linux/fdtable.h> /* * netprio allocates per-net_device priomap array which is indexed by * css->id. Limiting css ID to 16bits doesn't lose anything. */ #define NETPRIO_ID_MAX USHRT_MAX #define PRIOMAP_MIN_SZ 128 /* * Extend @dev->priomap so that it's large enough to accommodate * @target_idx. @dev->priomap.priomap_len > @target_idx after successful * return. Must be called under rtnl lock. */ static int extend_netdev_table(struct net_device *dev, u32 target_idx) { struct netprio_map *old, *new; size_t new_sz, new_len; /* is the existing priomap large enough? */ old = rtnl_dereference(dev->priomap); if (old && old->priomap_len > target_idx) return 0; /* * Determine the new size. Let's keep it power-of-two. We start * from PRIOMAP_MIN_SZ and double it until it's large enough to * accommodate @target_idx. */ new_sz = PRIOMAP_MIN_SZ; while (true) { new_len = (new_sz - offsetof(struct netprio_map, priomap)) / sizeof(new->priomap[0]); if (new_len > target_idx) break; new_sz *= 2; /* overflowed? */ if (WARN_ON(new_sz < PRIOMAP_MIN_SZ)) return -ENOSPC; } /* allocate & copy */ new = kzalloc(new_sz, GFP_KERNEL); if (!new) return -ENOMEM; if (old) memcpy(new->priomap, old->priomap, old->priomap_len * sizeof(old->priomap[0])); new->priomap_len = new_len; /* install the new priomap */ rcu_assign_pointer(dev->priomap, new); if (old) kfree_rcu(old, rcu); return 0; } /** * netprio_prio - return the effective netprio of a cgroup-net_device pair * @css: css part of the target pair * @dev: net_device part of the target pair * * Should be called under RCU read or rtnl lock. */ static u32 netprio_prio(struct cgroup_subsys_state *css, struct net_device *dev) { struct netprio_map *map = rcu_dereference_rtnl(dev->priomap); int id = css->id; if (map && id < map->priomap_len) return map->priomap[id]; return 0; } /** * netprio_set_prio - set netprio on a cgroup-net_device pair * @css: css part of the target pair * @dev: net_device part of the target pair * @prio: prio to set * * Set netprio to @prio on @css-@dev pair. Should be called under rtnl * lock and may fail under memory pressure for non-zero @prio. */ static int netprio_set_prio(struct cgroup_subsys_state *css, struct net_device *dev, u32 prio) { struct netprio_map *map; int id = css->id; int ret; /* avoid extending priomap for zero writes */ map = rtnl_dereference(dev->priomap); if (!prio && (!map || map->priomap_len <= id)) return 0; ret = extend_netdev_table(dev, id); if (ret) return ret; map = rtnl_dereference(dev->priomap); map->priomap[id] = prio; return 0; } static struct cgroup_subsys_state * cgrp_css_alloc(struct cgroup_subsys_state *parent_css) { struct cgroup_subsys_state *css; css = kzalloc(sizeof(*css), GFP_KERNEL); if (!css) return ERR_PTR(-ENOMEM); return css; } static int cgrp_css_online(struct cgroup_subsys_state *css) { struct cgroup_subsys_state *parent_css = css->parent; struct net_device *dev; int ret = 0; if (css->id > NETPRIO_ID_MAX) return -ENOSPC; if (!parent_css) return 0; rtnl_lock(); /* * Inherit prios from the parent. As all prios are set during * onlining, there is no need to clear them on offline. */ for_each_netdev(&init_net, dev) { u32 prio = netprio_prio(parent_css, dev); ret = netprio_set_prio(css, dev, prio); if (ret) break; } rtnl_unlock(); return ret; } static void cgrp_css_free(struct cgroup_subsys_state *css) { kfree(css); } static u64 read_prioidx(struct cgroup_subsys_state *css, struct cftype *cft) { return css->id; } static int read_priomap(struct seq_file *sf, void *v) { struct net_device *dev; rcu_read_lock(); for_each_netdev_rcu(&init_net, dev) seq_printf(sf, "%s %u\n", dev->name, netprio_prio(seq_css(sf), dev)); rcu_read_unlock(); return 0; } static ssize_t write_priomap(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { char devname[IFNAMSIZ + 1]; struct net_device *dev; u32 prio; int ret; if (sscanf(buf, "%"__stringify(IFNAMSIZ)"s %u", devname, &prio) != 2) return -EINVAL; dev = dev_get_by_name(&init_net, devname); if (!dev) return -ENODEV; rtnl_lock(); ret = netprio_set_prio(of_css(of), dev, prio); rtnl_unlock(); dev_put(dev); return ret ?: nbytes; } static int update_netprio(const void *v, struct file *file, unsigned n) { struct socket *sock = sock_from_file(file); if (sock) sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data, (unsigned long)v); return 0; } static void net_prio_attach(struct cgroup_taskset *tset) { struct task_struct *p; struct cgroup_subsys_state *css; cgroup_taskset_for_each(p, css, tset) { void *v = (void *)(unsigned long)css->id; task_lock(p); iterate_fd(p->files, 0, update_netprio, v); task_unlock(p); } } static struct cftype ss_files[] = { { .name = "prioidx", .read_u64 = read_prioidx, }, { .name = "ifpriomap", .seq_show = read_priomap, .write = write_priomap, }, { } /* terminate */ }; struct cgroup_subsys net_prio_cgrp_subsys = { .css_alloc = cgrp_css_alloc, .css_online = cgrp_css_online, .css_free = cgrp_css_free, .attach = net_prio_attach, .legacy_cftypes = ss_files, }; static int netprio_device_event(struct notifier_block *unused, unsigned long event, void *ptr) { struct net_device *dev = netdev_notifier_info_to_dev(ptr); struct netprio_map *old; /* * Note this is called with rtnl_lock held so we have update side * protection on our rcu assignments */ switch (event) { case NETDEV_UNREGISTER: old = rtnl_dereference(dev->priomap); RCU_INIT_POINTER(dev->priomap, NULL); if (old) kfree_rcu(old, rcu); break; } return NOTIFY_DONE; } static struct notifier_block netprio_device_notifier = { .notifier_call = netprio_device_event }; static int __init init_cgroup_netprio(void) { register_netdevice_notifier(&netprio_device_notifier); return 0; } subsys_initcall(init_cgroup_netprio); |
| 118 8 118 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 | /* * Copyright (C) 2016 Red Hat * * Permission is hereby granted, free of charge, to any person obtaining a * copy of this software and associated documentation files (the "Software"), * to deal in the Software without restriction, including without limitation * the rights to use, copy, modify, merge, publish, distribute, sublicense, * and/or sell copies of the Software, and to permit persons to whom the * Software is furnished to do so, subject to the following conditions: * * The above copyright notice and this permission notice shall be included in * all copies or substantial portions of the Software. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR * OTHER DEALINGS IN THE SOFTWARE. * * Authors: * Rob Clark <robdclark@gmail.com> */ #ifndef DRM_PRINT_H_ #define DRM_PRINT_H_ #include <linux/compiler.h> #include <linux/printk.h> #include <linux/device.h> #include <linux/dynamic_debug.h> #include <drm/drm.h> #include <drm/drm_device.h> struct debugfs_regset32; struct drm_device; struct seq_file; /* Do *not* use outside of drm_print.[ch]! */ extern unsigned long __drm_debug; /** * DOC: print * * A simple wrapper for dev_printk(), seq_printf(), etc. Allows same * debug code to be used for both debugfs and printk logging. * * For example:: * * void log_some_info(struct drm_printer *p) * { * drm_printf(p, "foo=%d\n", foo); * drm_printf(p, "bar=%d\n", bar); * } * * #ifdef CONFIG_DEBUG_FS * void debugfs_show(struct seq_file *f) * { * struct drm_printer p = drm_seq_file_printer(f); * log_some_info(&p); * } * #endif * * void some_other_function(...) * { * struct drm_printer p = drm_info_printer(drm->dev); * log_some_info(&p); * } */ /** * enum drm_debug_category - The DRM debug categories * * Each of the DRM debug logging macros use a specific category, and the logging * is filtered by the drm.debug module parameter. This enum specifies the values * for the interface. * * Each DRM_DEBUG_<CATEGORY> macro logs to DRM_UT_<CATEGORY> category, except * DRM_DEBUG() logs to DRM_UT_CORE. * * Enabling verbose debug messages is done through the drm.debug parameter, each * category being enabled by a bit: * * - drm.debug=0x1 will enable CORE messages * - drm.debug=0x2 will enable DRIVER messages * - drm.debug=0x3 will enable CORE and DRIVER messages * - ... * - drm.debug=0x1ff will enable all messages * * An interesting feature is that it's possible to enable verbose logging at * run-time by echoing the debug value in its sysfs node:: * * # echo 0xf > /sys/module/drm/parameters/debug * */ enum drm_debug_category { /* These names must match those in DYNAMIC_DEBUG_CLASSBITS */ /** * @DRM_UT_CORE: Used in the generic drm code: drm_ioctl.c, drm_mm.c, * drm_memory.c, ... */ DRM_UT_CORE, /** * @DRM_UT_DRIVER: Used in the vendor specific part of the driver: i915, * radeon, ... macro. */ DRM_UT_DRIVER, /** * @DRM_UT_KMS: Used in the modesetting code. */ DRM_UT_KMS, /** * @DRM_UT_PRIME: Used in the prime code. */ DRM_UT_PRIME, /** * @DRM_UT_ATOMIC: Used in the atomic code. */ DRM_UT_ATOMIC, /** * @DRM_UT_VBL: Used for verbose debug message in the vblank code. */ DRM_UT_VBL, /** * @DRM_UT_STATE: Used for verbose atomic state debugging. */ DRM_UT_STATE, /** * @DRM_UT_LEASE: Used in the lease code. */ DRM_UT_LEASE, /** * @DRM_UT_DP: Used in the DP code. */ DRM_UT_DP, /** * @DRM_UT_DRMRES: Used in the drm managed resources code. */ DRM_UT_DRMRES }; static inline bool drm_debug_enabled_raw(enum drm_debug_category category) { return unlikely(__drm_debug & BIT(category)); } #define drm_debug_enabled_instrumented(category) \ ({ \ pr_debug("todo: is this frequent enough to optimize ?\n"); \ drm_debug_enabled_raw(category); \ }) #if defined(CONFIG_DRM_USE_DYNAMIC_DEBUG) /* * the drm.debug API uses dyndbg, so each drm_*dbg macro/callsite gets * a descriptor, and only enabled callsites are reachable. They use * the private macro to avoid re-testing the enable-bit. */ #define __drm_debug_enabled(category) true #define drm_debug_enabled(category) drm_debug_enabled_instrumented(category) #else #define __drm_debug_enabled(category) drm_debug_enabled_raw(category) #define drm_debug_enabled(category) drm_debug_enabled_raw(category) #endif /** * struct drm_printer - drm output "stream" * * Do not use struct members directly. Use drm_printer_seq_file(), * drm_printer_info(), etc to initialize. And drm_printf() for output. */ struct drm_printer { /* private: */ void (*printfn)(struct drm_printer *p, struct va_format *vaf); void (*puts)(struct drm_printer *p, const char *str); void *arg; const void *origin; const char *prefix; struct { unsigned int series; unsigned int counter; } line; enum drm_debug_category category; }; void __drm_printfn_coredump(struct drm_printer *p, struct va_format *vaf); void __drm_puts_coredump(struct drm_printer *p, const char *str); void __drm_printfn_seq_file(struct drm_printer *p, struct va_format *vaf); void __drm_puts_seq_file(struct drm_printer *p, const char *str); void __drm_printfn_info(struct drm_printer *p, struct va_format *vaf); void __drm_printfn_dbg(struct drm_printer *p, struct va_format *vaf); void __drm_printfn_err(struct drm_printer *p, struct va_format *vaf); void __drm_printfn_line(struct drm_printer *p, struct va_format *vaf); __printf(2, 3) void drm_printf(struct drm_printer *p, const char *f, ...); void drm_puts(struct drm_printer *p, const char *str); void drm_print_regset32(struct drm_printer *p, struct debugfs_regset32 *regset); void drm_print_bits(struct drm_printer *p, unsigned long value, const char * const bits[], unsigned int nbits); void drm_print_hex_dump(struct drm_printer *p, const char *prefix, const u8 *buf, size_t len); __printf(2, 0) /** * drm_vprintf - print to a &drm_printer stream * @p: the &drm_printer * @fmt: format string * @va: the va_list */ static inline void drm_vprintf(struct drm_printer *p, const char *fmt, va_list *va) { struct va_format vaf = { .fmt = fmt, .va = va }; p->printfn(p, &vaf); } /** * drm_printf_indent - Print to a &drm_printer stream with indentation * @printer: DRM printer * @indent: Tab indentation level (max 5) * @fmt: Format string */ #define drm_printf_indent(printer, indent, fmt, ...) \ drm_printf((printer), "%.*s" fmt, (indent), "\t\t\t\t\tX", ##__VA_ARGS__) /** * struct drm_print_iterator - local struct used with drm_printer_coredump * @data: Pointer to the devcoredump output buffer, can be NULL if using * drm_printer_coredump to determine size of devcoredump * @start: The offset within the buffer to start writing * @remain: The number of bytes to write for this iteration */ struct drm_print_iterator { void *data; ssize_t start; ssize_t remain; /* private: */ ssize_t offset; }; /** * drm_coredump_printer - construct a &drm_printer that can output to a buffer * from the read function for devcoredump * @iter: A pointer to a struct drm_print_iterator for the read instance * * This wrapper extends drm_printf() to work with a dev_coredumpm() callback * function. The passed in drm_print_iterator struct contains the buffer * pointer, size and offset as passed in from devcoredump. * * For example:: * * void coredump_read(char *buffer, loff_t offset, size_t count, * void *data, size_t datalen) * { * struct drm_print_iterator iter; * struct drm_printer p; * * iter.data = buffer; * iter.start = offset; * iter.remain = count; * * p = drm_coredump_printer(&iter); * * drm_printf(p, "foo=%d\n", foo); * } * * void makecoredump(...) * { * ... * dev_coredumpm(dev, THIS_MODULE, data, 0, GFP_KERNEL, * coredump_read, ...) * } * * The above example has a time complexity of O(N^2), where N is the size of the * devcoredump. This is acceptable for small devcoredumps but scales poorly for * larger ones. * * Another use case for drm_coredump_printer is to capture the devcoredump into * a saved buffer before the dev_coredump() callback. This involves two passes: * one to determine the size of the devcoredump and another to print it to a * buffer. Then, in dev_coredump(), copy from the saved buffer into the * devcoredump read buffer. * * For example:: * * char *devcoredump_saved_buffer; * * ssize_t __coredump_print(char *buffer, ssize_t count, ...) * { * struct drm_print_iterator iter; * struct drm_printer p; * * iter.data = buffer; * iter.start = 0; * iter.remain = count; * * p = drm_coredump_printer(&iter); * * drm_printf(p, "foo=%d\n", foo); * ... * return count - iter.remain; * } * * void coredump_print(...) * { * ssize_t count; * * count = __coredump_print(NULL, INT_MAX, ...); * devcoredump_saved_buffer = kvmalloc(count, GFP_KERNEL); * __coredump_print(devcoredump_saved_buffer, count, ...); * } * * void coredump_read(char *buffer, loff_t offset, size_t count, * void *data, size_t datalen) * { * ... * memcpy(buffer, devcoredump_saved_buffer + offset, count); * ... * } * * The above example has a time complexity of O(N*2), where N is the size of the * devcoredump. This scales better than the previous example for larger * devcoredumps. * * RETURNS: * The &drm_printer object */ static inline struct drm_printer drm_coredump_printer(struct drm_print_iterator *iter) { struct drm_printer p = { .printfn = __drm_printfn_coredump, .puts = __drm_puts_coredump, .arg = iter, }; /* Set the internal offset of the iterator to zero */ iter->offset = 0; return p; } /** * drm_coredump_printer_is_full() - DRM coredump printer output is full * @p: DRM coredump printer * * DRM printer output is full, useful to short circuit coredump printing once * printer is full. * * RETURNS: * True if DRM coredump printer output buffer is full, False otherwise */ static inline bool drm_coredump_printer_is_full(struct drm_printer *p) { struct drm_print_iterator *iterator = p->arg; if (p->printfn != __drm_printfn_coredump) return true; return !iterator->remain; } /** * drm_seq_file_printer - construct a &drm_printer that outputs to &seq_file * @f: the &struct seq_file to output to * * RETURNS: * The &drm_printer object */ static inline struct drm_printer drm_seq_file_printer(struct seq_file *f) { struct drm_printer p = { .printfn = __drm_printfn_seq_file, .puts = __drm_puts_seq_file, .arg = f, }; return p; } /** * drm_info_printer - construct a &drm_printer that outputs to dev_printk() * @dev: the &struct device pointer * * RETURNS: * The &drm_printer object */ static inline struct drm_printer drm_info_printer(struct device *dev) { struct drm_printer p = { .printfn = __drm_printfn_info, .arg = dev, }; return p; } /** * drm_dbg_printer - construct a &drm_printer for drm device specific output * @drm: the &struct drm_device pointer, or NULL * @category: the debug category to use * @prefix: debug output prefix, or NULL for no prefix * * RETURNS: * The &drm_printer object */ static inline struct drm_printer drm_dbg_printer(struct drm_device *drm, enum drm_debug_category category, const char *prefix) { struct drm_printer p = { .printfn = __drm_printfn_dbg, .arg = drm, .origin = (const void *)_THIS_IP_, /* it's fine as we will be inlined */ .prefix = prefix, .category = category, }; return p; } /** * drm_err_printer - construct a &drm_printer that outputs to drm_err() * @drm: the &struct drm_device pointer * @prefix: debug output prefix, or NULL for no prefix * * RETURNS: * The &drm_printer object */ static inline struct drm_printer drm_err_printer(struct drm_device *drm, const char *prefix) { struct drm_printer p = { .printfn = __drm_printfn_err, .arg = drm, .prefix = prefix }; return p; } /** * drm_line_printer - construct a &drm_printer that prefixes outputs with line numbers * @p: the &struct drm_printer which actually generates the output * @prefix: optional output prefix, or NULL for no prefix * @series: optional unique series identifier, or 0 to omit identifier in the output * * This printer can be used to increase the robustness of the captured output * to make sure we didn't lost any intermediate lines of the output. Helpful * while capturing some crash data. * * Example 1:: * * void crash_dump(struct drm_device *drm) * { * static unsigned int id; * struct drm_printer p = drm_err_printer(drm, "crash"); * struct drm_printer lp = drm_line_printer(&p, "dump", ++id); * * drm_printf(&lp, "foo"); * drm_printf(&lp, "bar"); * } * * Above code will print into the dmesg something like:: * * [ ] 0000:00:00.0: [drm] *ERROR* crash dump 1.1: foo * [ ] 0000:00:00.0: [drm] *ERROR* crash dump 1.2: bar * * Example 2:: * * void line_dump(struct device *dev) * { * struct drm_printer p = drm_info_printer(dev); * struct drm_printer lp = drm_line_printer(&p, NULL, 0); * * drm_printf(&lp, "foo"); * drm_printf(&lp, "bar"); * } * * Above code will print:: * * [ ] 0000:00:00.0: [drm] 1: foo * [ ] 0000:00:00.0: [drm] 2: bar * * RETURNS: * The &drm_printer object */ static inline struct drm_printer drm_line_printer(struct drm_printer *p, const char *prefix, unsigned int series) { struct drm_printer lp = { .printfn = __drm_printfn_line, .arg = p, .prefix = prefix, .line = { .series = series, }, }; return lp; } /* * struct device based logging * * Prefer drm_device based logging over device or printk based logging. */ __printf(3, 4) void drm_dev_printk(const struct device *dev, const char *level, const char *format, ...); struct _ddebug; __printf(4, 5) void __drm_dev_dbg(struct _ddebug *desc, const struct device *dev, enum drm_debug_category category, const char *format, ...); /** * DRM_DEV_ERROR() - Error output. * * NOTE: this is deprecated in favor of drm_err() or dev_err(). * * @dev: device pointer * @fmt: printf() like format string. */ #define DRM_DEV_ERROR(dev, fmt, ...) \ drm_dev_printk(dev, KERN_ERR, "*ERROR* " fmt, ##__VA_ARGS__) /** * DRM_DEV_ERROR_RATELIMITED() - Rate limited error output. * * NOTE: this is deprecated in favor of drm_err_ratelimited() or * dev_err_ratelimited(). * * @dev: device pointer * @fmt: printf() like format string. * * Like DRM_ERROR() but won't flood the log. */ #define DRM_DEV_ERROR_RATELIMITED(dev, fmt, ...) \ ({ \ static DEFINE_RATELIMIT_STATE(_rs, \ DEFAULT_RATELIMIT_INTERVAL, \ DEFAULT_RATELIMIT_BURST); \ \ if (__ratelimit(&_rs)) \ DRM_DEV_ERROR(dev, fmt, ##__VA_ARGS__); \ }) /* NOTE: this is deprecated in favor of drm_info() or dev_info(). */ #define DRM_DEV_INFO(dev, fmt, ...) \ drm_dev_printk(dev, KERN_INFO, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of drm_info_once() or dev_info_once(). */ #define DRM_DEV_INFO_ONCE(dev, fmt, ...) \ ({ \ static bool __print_once __read_mostly; \ if (!__print_once) { \ __print_once = true; \ DRM_DEV_INFO(dev, fmt, ##__VA_ARGS__); \ } \ }) #if !defined(CONFIG_DRM_USE_DYNAMIC_DEBUG) #define drm_dev_dbg(dev, cat, fmt, ...) \ __drm_dev_dbg(NULL, dev, cat, fmt, ##__VA_ARGS__) #else #define drm_dev_dbg(dev, cat, fmt, ...) \ _dynamic_func_call_cls(cat, fmt, __drm_dev_dbg, \ dev, cat, fmt, ##__VA_ARGS__) #endif /** * DRM_DEV_DEBUG() - Debug output for generic drm code * * NOTE: this is deprecated in favor of drm_dbg_core(). * * @dev: device pointer * @fmt: printf() like format string. */ #define DRM_DEV_DEBUG(dev, fmt, ...) \ drm_dev_dbg(dev, DRM_UT_CORE, fmt, ##__VA_ARGS__) /** * DRM_DEV_DEBUG_DRIVER() - Debug output for vendor specific part of the driver * * NOTE: this is deprecated in favor of drm_dbg() or dev_dbg(). * * @dev: device pointer * @fmt: printf() like format string. */ #define DRM_DEV_DEBUG_DRIVER(dev, fmt, ...) \ drm_dev_dbg(dev, DRM_UT_DRIVER, fmt, ##__VA_ARGS__) /** * DRM_DEV_DEBUG_KMS() - Debug output for modesetting code * * NOTE: this is deprecated in favor of drm_dbg_kms(). * * @dev: device pointer * @fmt: printf() like format string. */ #define DRM_DEV_DEBUG_KMS(dev, fmt, ...) \ drm_dev_dbg(dev, DRM_UT_KMS, fmt, ##__VA_ARGS__) /* * struct drm_device based logging * * Prefer drm_device based logging over device or prink based logging. */ /* Helper to enforce struct drm_device type */ static inline struct device *__drm_to_dev(const struct drm_device *drm) { return drm ? drm->dev : NULL; } /* Helper for struct drm_device based logging. */ #define __drm_printk(drm, level, type, fmt, ...) \ dev_##level##type(__drm_to_dev(drm), "[drm] " fmt, ##__VA_ARGS__) #define drm_info(drm, fmt, ...) \ __drm_printk((drm), info,, fmt, ##__VA_ARGS__) #define drm_notice(drm, fmt, ...) \ __drm_printk((drm), notice,, fmt, ##__VA_ARGS__) #define drm_warn(drm, fmt, ...) \ __drm_printk((drm), warn,, fmt, ##__VA_ARGS__) #define drm_err(drm, fmt, ...) \ __drm_printk((drm), err,, "*ERROR* " fmt, ##__VA_ARGS__) #define drm_info_once(drm, fmt, ...) \ __drm_printk((drm), info, _once, fmt, ##__VA_ARGS__) #define drm_notice_once(drm, fmt, ...) \ __drm_printk((drm), notice, _once, fmt, ##__VA_ARGS__) #define drm_warn_once(drm, fmt, ...) \ __drm_printk((drm), warn, _once, fmt, ##__VA_ARGS__) #define drm_err_once(drm, fmt, ...) \ __drm_printk((drm), err, _once, "*ERROR* " fmt, ##__VA_ARGS__) #define drm_err_ratelimited(drm, fmt, ...) \ __drm_printk((drm), err, _ratelimited, "*ERROR* " fmt, ##__VA_ARGS__) #define drm_dbg_core(drm, fmt, ...) \ drm_dev_dbg(__drm_to_dev(drm), DRM_UT_CORE, fmt, ##__VA_ARGS__) #define drm_dbg_driver(drm, fmt, ...) \ drm_dev_dbg(__drm_to_dev(drm), DRM_UT_DRIVER, fmt, ##__VA_ARGS__) #define drm_dbg_kms(drm, fmt, ...) \ drm_dev_dbg(__drm_to_dev(drm), DRM_UT_KMS, fmt, ##__VA_ARGS__) #define drm_dbg_prime(drm, fmt, ...) \ drm_dev_dbg(__drm_to_dev(drm), DRM_UT_PRIME, fmt, ##__VA_ARGS__) #define drm_dbg_atomic(drm, fmt, ...) \ drm_dev_dbg(__drm_to_dev(drm), DRM_UT_ATOMIC, fmt, ##__VA_ARGS__) #define drm_dbg_vbl(drm, fmt, ...) \ drm_dev_dbg(__drm_to_dev(drm), DRM_UT_VBL, fmt, ##__VA_ARGS__) #define drm_dbg_state(drm, fmt, ...) \ drm_dev_dbg(__drm_to_dev(drm), DRM_UT_STATE, fmt, ##__VA_ARGS__) #define drm_dbg_lease(drm, fmt, ...) \ drm_dev_dbg(__drm_to_dev(drm), DRM_UT_LEASE, fmt, ##__VA_ARGS__) #define drm_dbg_dp(drm, fmt, ...) \ drm_dev_dbg(__drm_to_dev(drm), DRM_UT_DP, fmt, ##__VA_ARGS__) #define drm_dbg_drmres(drm, fmt, ...) \ drm_dev_dbg(__drm_to_dev(drm), DRM_UT_DRMRES, fmt, ##__VA_ARGS__) #define drm_dbg(drm, fmt, ...) drm_dbg_driver(drm, fmt, ##__VA_ARGS__) /* * printk based logging * * Prefer drm_device based logging over device or prink based logging. */ __printf(1, 2) void __drm_err(const char *format, ...); #if !defined(CONFIG_DRM_USE_DYNAMIC_DEBUG) #define __drm_dbg(cat, fmt, ...) __drm_dev_dbg(NULL, NULL, cat, fmt, ##__VA_ARGS__) #else #define __drm_dbg(cat, fmt, ...) \ _dynamic_func_call_cls(cat, fmt, __drm_dev_dbg, \ NULL, cat, fmt, ##__VA_ARGS__) #endif /* Macros to make printk easier */ #define _DRM_PRINTK(once, level, fmt, ...) \ printk##once(KERN_##level "[" DRM_NAME "] " fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of pr_info(). */ #define DRM_INFO(fmt, ...) \ _DRM_PRINTK(, INFO, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of pr_notice(). */ #define DRM_NOTE(fmt, ...) \ _DRM_PRINTK(, NOTICE, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of pr_warn(). */ #define DRM_WARN(fmt, ...) \ _DRM_PRINTK(, WARNING, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of pr_info_once(). */ #define DRM_INFO_ONCE(fmt, ...) \ _DRM_PRINTK(_once, INFO, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of pr_notice_once(). */ #define DRM_NOTE_ONCE(fmt, ...) \ _DRM_PRINTK(_once, NOTICE, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of pr_warn_once(). */ #define DRM_WARN_ONCE(fmt, ...) \ _DRM_PRINTK(_once, WARNING, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of pr_err(). */ #define DRM_ERROR(fmt, ...) \ __drm_err(fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of pr_err_ratelimited(). */ #define DRM_ERROR_RATELIMITED(fmt, ...) \ DRM_DEV_ERROR_RATELIMITED(NULL, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of drm_dbg_core(NULL, ...). */ #define DRM_DEBUG(fmt, ...) \ __drm_dbg(DRM_UT_CORE, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of drm_dbg(NULL, ...). */ #define DRM_DEBUG_DRIVER(fmt, ...) \ __drm_dbg(DRM_UT_DRIVER, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of drm_dbg_kms(NULL, ...). */ #define DRM_DEBUG_KMS(fmt, ...) \ __drm_dbg(DRM_UT_KMS, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of drm_dbg_prime(NULL, ...). */ #define DRM_DEBUG_PRIME(fmt, ...) \ __drm_dbg(DRM_UT_PRIME, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of drm_dbg_atomic(NULL, ...). */ #define DRM_DEBUG_ATOMIC(fmt, ...) \ __drm_dbg(DRM_UT_ATOMIC, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of drm_dbg_vbl(NULL, ...). */ #define DRM_DEBUG_VBL(fmt, ...) \ __drm_dbg(DRM_UT_VBL, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of drm_dbg_lease(NULL, ...). */ #define DRM_DEBUG_LEASE(fmt, ...) \ __drm_dbg(DRM_UT_LEASE, fmt, ##__VA_ARGS__) /* NOTE: this is deprecated in favor of drm_dbg_dp(NULL, ...). */ #define DRM_DEBUG_DP(fmt, ...) \ __drm_dbg(DRM_UT_DP, fmt, ## __VA_ARGS__) #define __DRM_DEFINE_DBG_RATELIMITED(category, drm, fmt, ...) \ ({ \ static DEFINE_RATELIMIT_STATE(rs_, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);\ \ if (drm_debug_enabled(DRM_UT_ ## category) && __ratelimit(&rs_)) \ drm_dev_printk(__drm_to_dev(drm), KERN_DEBUG, fmt, ## __VA_ARGS__); \ }) #define drm_dbg_ratelimited(drm, fmt, ...) \ __DRM_DEFINE_DBG_RATELIMITED(DRIVER, drm, fmt, ## __VA_ARGS__) #define drm_dbg_kms_ratelimited(drm, fmt, ...) \ __DRM_DEFINE_DBG_RATELIMITED(KMS, drm, fmt, ## __VA_ARGS__) /* * struct drm_device based WARNs * * drm_WARN*() acts like WARN*(), but with the key difference of * using device specific information so that we know from which device * warning is originating from. * * Prefer drm_device based drm_WARN* over regular WARN* */ /* Helper for struct drm_device based WARNs */ #define drm_WARN(drm, condition, format, arg...) \ WARN(condition, "%s %s: [drm] " format, \ dev_driver_string(__drm_to_dev(drm)), \ dev_name(__drm_to_dev(drm)), ## arg) #define drm_WARN_ONCE(drm, condition, format, arg...) \ WARN_ONCE(condition, "%s %s: [drm] " format, \ dev_driver_string(__drm_to_dev(drm)), \ dev_name(__drm_to_dev(drm)), ## arg) #define drm_WARN_ON(drm, x) \ drm_WARN((drm), (x), "%s", \ "drm_WARN_ON(" __stringify(x) ")") #define drm_WARN_ON_ONCE(drm, x) \ drm_WARN_ONCE((drm), (x), "%s", \ "drm_WARN_ON_ONCE(" __stringify(x) ")") #endif /* DRM_PRINT_H_ */ |
| 1 8 1 4 4 80 1 1 81 4 4 4 4 4 4 3 1 1 2 13 2 13 2 2 11 5 5 5 5 5 5 5 5 8 8 8 8 8 2 2 1 4 3 2 1 2 2 4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3773 3774 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 3841 3842 3843 3844 3845 3846 3847 3848 3849 3850 3851 3852 3853 3854 3855 3856 3857 3858 3859 3860 3861 3862 3863 3864 3865 3866 3867 3868 3869 3870 3871 3872 3873 3874 3875 3876 3877 3878 3879 3880 3881 3882 3883 3884 3885 3886 3887 3888 3889 3890 3891 3892 3893 3894 3895 3896 3897 3898 3899 3900 3901 3902 3903 3904 3905 3906 3907 3908 3909 3910 3911 3912 3913 3914 3915 3916 3917 3918 3919 3920 3921 3922 3923 3924 3925 3926 3927 3928 3929 3930 3931 3932 3933 3934 3935 3936 3937 3938 3939 3940 3941 3942 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 3953 3954 3955 3956 3957 3958 3959 3960 3961 3962 3963 3964 3965 3966 3967 3968 3969 3970 3971 3972 3973 3974 3975 3976 3977 3978 3979 3980 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009 4010 4011 4012 4013 4014 4015 4016 4017 4018 4019 4020 | // SPDX-License-Identifier: GPL-2.0-only /* * Memory merging support. * * This code enables dynamic sharing of identical pages found in different * memory areas, even if they are not shared by fork() * * Copyright (C) 2008-2009 Red Hat, Inc. * Authors: * Izik Eidus * Andrea Arcangeli * Chris Wright * Hugh Dickins */ #include <linux/errno.h> #include <linux/mm.h> #include <linux/mm_inline.h> #include <linux/fs.h> #include <linux/mman.h> #include <linux/sched.h> #include <linux/sched/mm.h> #include <linux/sched/cputime.h> #include <linux/rwsem.h> #include <linux/pagemap.h> #include <linux/rmap.h> #include <linux/spinlock.h> #include <linux/xxhash.h> #include <linux/delay.h> #include <linux/kthread.h> #include <linux/wait.h> #include <linux/slab.h> #include <linux/rbtree.h> #include <linux/memory.h> #include <linux/mmu_notifier.h> #include <linux/swap.h> #include <linux/ksm.h> #include <linux/hashtable.h> #include <linux/freezer.h> #include <linux/oom.h> #include <linux/numa.h> #include <linux/pagewalk.h> #include <asm/tlbflush.h> #include "internal.h" #include "mm_slot.h" #define CREATE_TRACE_POINTS #include <trace/events/ksm.h> #ifdef CONFIG_NUMA #define NUMA(x) (x) #define DO_NUMA(x) do { (x); } while (0) #else #define NUMA(x) (0) #define DO_NUMA(x) do { } while (0) #endif typedef u8 rmap_age_t; /** * DOC: Overview * * A few notes about the KSM scanning process, * to make it easier to understand the data structures below: * * In order to reduce excessive scanning, KSM sorts the memory pages by their * contents into a data structure that holds pointers to the pages' locations. * * Since the contents of the pages may change at any moment, KSM cannot just * insert the pages into a normal sorted tree and expect it to find anything. * Therefore KSM uses two data structures - the stable and the unstable tree. * * The stable tree holds pointers to all the merged pages (ksm pages), sorted * by their contents. Because each such page is write-protected, searching on * this tree is fully assured to be working (except when pages are unmapped), * and therefore this tree is called the stable tree. * * The stable tree node includes information required for reverse * mapping from a KSM page to virtual addresses that map this page. * * In order to avoid large latencies of the rmap walks on KSM pages, * KSM maintains two types of nodes in the stable tree: * * * the regular nodes that keep the reverse mapping structures in a * linked list * * the "chains" that link nodes ("dups") that represent the same * write protected memory content, but each "dup" corresponds to a * different KSM page copy of that content * * Internally, the regular nodes, "dups" and "chains" are represented * using the same struct ksm_stable_node structure. * * In addition to the stable tree, KSM uses a second data structure called the * unstable tree: this tree holds pointers to pages which have been found to * be "unchanged for a period of time". The unstable tree sorts these pages * by their contents, but since they are not write-protected, KSM cannot rely * upon the unstable tree to work correctly - the unstable tree is liable to * be corrupted as its contents are modified, and so it is called unstable. * * KSM solves this problem by several techniques: * * 1) The unstable tree is flushed every time KSM completes scanning all * memory areas, and then the tree is rebuilt again from the beginning. * 2) KSM will only insert into the unstable tree, pages whose hash value * has not changed since the previous scan of all memory areas. * 3) The unstable tree is a RedBlack Tree - so its balancing is based on the * colors of the nodes and not on their contents, assuring that even when * the tree gets "corrupted" it won't get out of balance, so scanning time * remains the same (also, searching and inserting nodes in an rbtree uses * the same algorithm, so we have no overhead when we flush and rebuild). * 4) KSM never flushes the stable tree, which means that even if it were to * take 10 attempts to find a page in the unstable tree, once it is found, * it is secured in the stable tree. (When we scan a new page, we first * compare it against the stable tree, and then against the unstable tree.) * * If the merge_across_nodes tunable is unset, then KSM maintains multiple * stable trees and multiple unstable trees: one of each for each NUMA node. */ /** * struct ksm_mm_slot - ksm information per mm that is being scanned * @slot: hash lookup from mm to mm_slot * @rmap_list: head for this mm_slot's singly-linked list of rmap_items */ struct ksm_mm_slot { struct mm_slot slot; struct ksm_rmap_item *rmap_list; }; /** * struct ksm_scan - cursor for scanning * @mm_slot: the current mm_slot we are scanning * @address: the next address inside that to be scanned * @rmap_list: link to the next rmap to be scanned in the rmap_list * @seqnr: count of completed full scans (needed when removing unstable node) * * There is only the one ksm_scan instance of this cursor structure. */ struct ksm_scan { struct ksm_mm_slot *mm_slot; unsigned long address; struct ksm_rmap_item **rmap_list; unsigned long seqnr; }; /** * struct ksm_stable_node - node of the stable rbtree * @node: rb node of this ksm page in the stable tree * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list * @hlist_dup: linked into the stable_node->hlist with a stable_node chain * @list: linked into migrate_nodes, pending placement in the proper node tree * @hlist: hlist head of rmap_items using this ksm page * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid) * @chain_prune_time: time of the last full garbage collection * @rmap_hlist_len: number of rmap_item entries in hlist or STABLE_NODE_CHAIN * @nid: NUMA node id of stable tree in which linked (may not match kpfn) */ struct ksm_stable_node { union { struct rb_node node; /* when node of stable tree */ struct { /* when listed for migration */ struct list_head *head; struct { struct hlist_node hlist_dup; struct list_head list; }; }; }; struct hlist_head hlist; union { unsigned long kpfn; unsigned long chain_prune_time; }; /* * STABLE_NODE_CHAIN can be any negative number in * rmap_hlist_len negative range, but better not -1 to be able * to reliably detect underflows. */ #define STABLE_NODE_CHAIN -1024 int rmap_hlist_len; #ifdef CONFIG_NUMA int nid; #endif }; /** * struct ksm_rmap_item - reverse mapping item for virtual addresses * @rmap_list: next rmap_item in mm_slot's singly-linked rmap_list * @anon_vma: pointer to anon_vma for this mm,address, when in stable tree * @nid: NUMA node id of unstable tree in which linked (may not match page) * @mm: the memory structure this rmap_item is pointing into * @address: the virtual address this rmap_item tracks (+ flags in low bits) * @oldchecksum: previous checksum of the page at that virtual address * @node: rb node of this rmap_item in the unstable tree * @head: pointer to stable_node heading this list in the stable tree * @hlist: link into hlist of rmap_items hanging off that stable_node * @age: number of scan iterations since creation * @remaining_skips: how many scans to skip */ struct ksm_rmap_item { struct ksm_rmap_item *rmap_list; union { struct anon_vma *anon_vma; /* when stable */ #ifdef CONFIG_NUMA int nid; /* when node of unstable tree */ #endif }; struct mm_struct *mm; unsigned long address; /* + low bits used for flags below */ unsigned int oldchecksum; /* when unstable */ rmap_age_t age; rmap_age_t remaining_skips; union { struct rb_node node; /* when node of unstable tree */ struct { /* when listed from stable tree */ struct ksm_stable_node *head; struct hlist_node hlist; }; }; }; #define SEQNR_MASK 0x0ff /* low bits of unstable tree seqnr */ #define UNSTABLE_FLAG 0x100 /* is a node of the unstable tree */ #define STABLE_FLAG 0x200 /* is listed from the stable tree */ /* The stable and unstable tree heads */ static struct rb_root one_stable_tree[1] = { RB_ROOT }; static struct rb_root one_unstable_tree[1] = { RB_ROOT }; static struct rb_root *root_stable_tree = one_stable_tree; static struct rb_root *root_unstable_tree = one_unstable_tree; /* Recently migrated nodes of stable tree, pending proper placement */ static LIST_HEAD(migrate_nodes); #define STABLE_NODE_DUP_HEAD ((struct list_head *)&migrate_nodes.prev) #define MM_SLOTS_HASH_BITS 10 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); static struct ksm_mm_slot ksm_mm_head = { .slot.mm_node = LIST_HEAD_INIT(ksm_mm_head.slot.mm_node), }; static struct ksm_scan ksm_scan = { .mm_slot = &ksm_mm_head, }; static struct kmem_cache *rmap_item_cache; static struct kmem_cache *stable_node_cache; static struct kmem_cache *mm_slot_cache; /* Default number of pages to scan per batch */ #define DEFAULT_PAGES_TO_SCAN 100 /* The number of pages scanned */ static unsigned long ksm_pages_scanned; /* The number of nodes in the stable tree */ static unsigned long ksm_pages_shared; /* The number of page slots additionally sharing those nodes */ static unsigned long ksm_pages_sharing; /* The number of nodes in the unstable tree */ static unsigned long ksm_pages_unshared; /* The number of rmap_items in use: to calculate pages_volatile */ static unsigned long ksm_rmap_items; /* The number of stable_node chains */ static unsigned long ksm_stable_node_chains; /* The number of stable_node dups linked to the stable_node chains */ static unsigned long ksm_stable_node_dups; /* Delay in pruning stale stable_node_dups in the stable_node_chains */ static unsigned int ksm_stable_node_chains_prune_millisecs = 2000; /* Maximum number of page slots sharing a stable node */ static int ksm_max_page_sharing = 256; /* Number of pages ksmd should scan in one batch */ static unsigned int ksm_thread_pages_to_scan = DEFAULT_PAGES_TO_SCAN; /* Milliseconds ksmd should sleep between batches */ static unsigned int ksm_thread_sleep_millisecs = 20; /* Checksum of an empty (zeroed) page */ static unsigned int zero_checksum __read_mostly; /* Whether to merge empty (zeroed) pages with actual zero pages */ static bool ksm_use_zero_pages __read_mostly; /* Skip pages that couldn't be de-duplicated previously */ /* Default to true at least temporarily, for testing */ static bool ksm_smart_scan = true; /* The number of zero pages which is placed by KSM */ atomic_long_t ksm_zero_pages = ATOMIC_LONG_INIT(0); /* The number of pages that have been skipped due to "smart scanning" */ static unsigned long ksm_pages_skipped; /* Don't scan more than max pages per batch. */ static unsigned long ksm_advisor_max_pages_to_scan = 30000; /* Min CPU for scanning pages per scan */ #define KSM_ADVISOR_MIN_CPU 10 /* Max CPU for scanning pages per scan */ static unsigned int ksm_advisor_max_cpu = 70; /* Target scan time in seconds to analyze all KSM candidate pages. */ static unsigned long ksm_advisor_target_scan_time = 200; /* Exponentially weighted moving average. */ #define EWMA_WEIGHT 30 /** * struct advisor_ctx - metadata for KSM advisor * @start_scan: start time of the current scan * @scan_time: scan time of previous scan * @change: change in percent to pages_to_scan parameter * @cpu_time: cpu time consumed by the ksmd thread in the previous scan */ struct advisor_ctx { ktime_t start_scan; unsigned long scan_time; unsigned long change; unsigned long long cpu_time; }; static struct advisor_ctx advisor_ctx; /* Define different advisor's */ enum ksm_advisor_type { KSM_ADVISOR_NONE, KSM_ADVISOR_SCAN_TIME, }; static enum ksm_advisor_type ksm_advisor; #ifdef CONFIG_SYSFS /* * Only called through the sysfs control interface: */ /* At least scan this many pages per batch. */ static unsigned long ksm_advisor_min_pages_to_scan = 500; static void set_advisor_defaults(void) { if (ksm_advisor == KSM_ADVISOR_NONE) { ksm_thread_pages_to_scan = DEFAULT_PAGES_TO_SCAN; } else if (ksm_advisor == KSM_ADVISOR_SCAN_TIME) { advisor_ctx = (const struct advisor_ctx){ 0 }; ksm_thread_pages_to_scan = ksm_advisor_min_pages_to_scan; } } #endif /* CONFIG_SYSFS */ static inline void advisor_start_scan(void) { if (ksm_advisor == KSM_ADVISOR_SCAN_TIME) advisor_ctx.start_scan = ktime_get(); } /* * Use previous scan time if available, otherwise use current scan time as an * approximation for the previous scan time. */ static inline unsigned long prev_scan_time(struct advisor_ctx *ctx, unsigned long scan_time) { return ctx->scan_time ? ctx->scan_time : scan_time; } /* Calculate exponential weighted moving average */ static unsigned long ewma(unsigned long prev, unsigned long curr) { return ((100 - EWMA_WEIGHT) * prev + EWMA_WEIGHT * curr) / 100; } /* * The scan time advisor is based on the current scan rate and the target * scan rate. * * new_pages_to_scan = pages_to_scan * (scan_time / target_scan_time) * * To avoid perturbations it calculates a change factor of previous changes. * A new change factor is calculated for each iteration and it uses an * exponentially weighted moving average. The new pages_to_scan value is * multiplied with that change factor: * * new_pages_to_scan *= change factor * * The new_pages_to_scan value is limited by the cpu min and max values. It * calculates the cpu percent for the last scan and calculates the new * estimated cpu percent cost for the next scan. That value is capped by the * cpu min and max setting. * * In addition the new pages_to_scan value is capped by the max and min * limits. */ static void scan_time_advisor(void) { unsigned int cpu_percent; unsigned long cpu_time; unsigned long cpu_time_diff; unsigned long cpu_time_diff_ms; unsigned long pages; unsigned long per_page_cost; unsigned long factor; unsigned long change; unsigned long last_scan_time; unsigned long scan_time; /* Convert scan time to seconds */ scan_time = div_s64(ktime_ms_delta(ktime_get(), advisor_ctx.start_scan), MSEC_PER_SEC); scan_time = scan_time ? scan_time : 1; /* Calculate CPU consumption of ksmd background thread */ cpu_time = task_sched_runtime(current); cpu_time_diff = cpu_time - advisor_ctx.cpu_time; cpu_time_diff_ms = cpu_time_diff / 1000 / 1000; cpu_percent = (cpu_time_diff_ms * 100) / (scan_time * 1000); cpu_percent = cpu_percent ? cpu_percent : 1; last_scan_time = prev_scan_time(&advisor_ctx, scan_time); /* Calculate scan time as percentage of target scan time */ factor = ksm_advisor_target_scan_time * 100 / scan_time; factor = factor ? factor : 1; /* * Calculate scan time as percentage of last scan time and use * exponentially weighted average to smooth it */ change = scan_time * 100 / last_scan_time; change = change ? change : 1; change = ewma(advisor_ctx.change, change); /* Calculate new scan rate based on target scan rate. */ pages = ksm_thread_pages_to_scan * 100 / factor; /* Update pages_to_scan by weighted change percentage. */ pages = pages * change / 100; /* Cap new pages_to_scan value */ per_page_cost = ksm_thread_pages_to_scan / cpu_percent; per_page_cost = per_page_cost ? per_page_cost : 1; pages = min(pages, per_page_cost * ksm_advisor_max_cpu); pages = max(pages, per_page_cost * KSM_ADVISOR_MIN_CPU); pages = min(pages, ksm_advisor_max_pages_to_scan); /* Update advisor context */ advisor_ctx.change = change; advisor_ctx.scan_time = scan_time; advisor_ctx.cpu_time = cpu_time; ksm_thread_pages_to_scan = pages; trace_ksm_advisor(scan_time, pages, cpu_percent); } static void advisor_stop_scan(void) { if (ksm_advisor == KSM_ADVISOR_SCAN_TIME) scan_time_advisor(); } #ifdef CONFIG_NUMA /* Zeroed when merging across nodes is not allowed */ static unsigned int ksm_merge_across_nodes = 1; static int ksm_nr_node_ids = 1; #else #define ksm_merge_across_nodes 1U #define ksm_nr_node_ids 1 #endif #define KSM_RUN_STOP 0 #define KSM_RUN_MERGE 1 #define KSM_RUN_UNMERGE 2 #define KSM_RUN_OFFLINE 4 static unsigned long ksm_run = KSM_RUN_STOP; static void wait_while_offlining(void); static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait); static DECLARE_WAIT_QUEUE_HEAD(ksm_iter_wait); static DEFINE_MUTEX(ksm_thread_mutex); static DEFINE_SPINLOCK(ksm_mmlist_lock); static int __init ksm_slab_init(void) { rmap_item_cache = KMEM_CACHE(ksm_rmap_item, 0); if (!rmap_item_cache) goto out; stable_node_cache = KMEM_CACHE(ksm_stable_node, 0); if (!stable_node_cache) goto out_free1; mm_slot_cache = KMEM_CACHE(ksm_mm_slot, 0); if (!mm_slot_cache) goto out_free2; return 0; out_free2: kmem_cache_destroy(stable_node_cache); out_free1: kmem_cache_destroy(rmap_item_cache); out: return -ENOMEM; } static void __init ksm_slab_free(void) { kmem_cache_destroy(mm_slot_cache); kmem_cache_destroy(stable_node_cache); kmem_cache_destroy(rmap_item_cache); mm_slot_cache = NULL; } static __always_inline bool is_stable_node_chain(struct ksm_stable_node *chain) { return chain->rmap_hlist_len == STABLE_NODE_CHAIN; } static __always_inline bool is_stable_node_dup(struct ksm_stable_node *dup) { return dup->head == STABLE_NODE_DUP_HEAD; } static inline void stable_node_chain_add_dup(struct ksm_stable_node *dup, struct ksm_stable_node *chain) { VM_BUG_ON(is_stable_node_dup(dup)); dup->head = STABLE_NODE_DUP_HEAD; VM_BUG_ON(!is_stable_node_chain(chain)); hlist_add_head(&dup->hlist_dup, &chain->hlist); ksm_stable_node_dups++; } static inline void __stable_node_dup_del(struct ksm_stable_node *dup) { VM_BUG_ON(!is_stable_node_dup(dup)); hlist_del(&dup->hlist_dup); ksm_stable_node_dups--; } static inline void stable_node_dup_del(struct ksm_stable_node *dup) { VM_BUG_ON(is_stable_node_chain(dup)); if (is_stable_node_dup(dup)) __stable_node_dup_del(dup); else rb_erase(&dup->node, root_stable_tree + NUMA(dup->nid)); #ifdef CONFIG_DEBUG_VM dup->head = NULL; #endif } static inline struct ksm_rmap_item *alloc_rmap_item(void) { struct ksm_rmap_item *rmap_item; rmap_item = kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN); if (rmap_item) ksm_rmap_items++; return rmap_item; } static inline void free_rmap_item(struct ksm_rmap_item *rmap_item) { ksm_rmap_items--; rmap_item->mm->ksm_rmap_items--; rmap_item->mm = NULL; /* debug safety */ kmem_cache_free(rmap_item_cache, rmap_item); } static inline struct ksm_stable_node *alloc_stable_node(void) { /* * The allocation can take too long with GFP_KERNEL when memory is under * pressure, which may lead to hung task warnings. Adding __GFP_HIGH * grants access to memory reserves, helping to avoid this problem. */ return kmem_cache_alloc(stable_node_cache, GFP_KERNEL | __GFP_HIGH); } static inline void free_stable_node(struct ksm_stable_node *stable_node) { VM_BUG_ON(stable_node->rmap_hlist_len && !is_stable_node_chain(stable_node)); kmem_cache_free(stable_node_cache, stable_node); } /* * ksmd, and unmerge_and_remove_all_rmap_items(), must not touch an mm's * page tables after it has passed through ksm_exit() - which, if necessary, * takes mmap_lock briefly to serialize against them. ksm_exit() does not set * a special flag: they can just back out as soon as mm_users goes to zero. * ksm_test_exit() is used throughout to make this test for exit: in some * places for correctness, in some places just to avoid unnecessary work. */ static inline bool ksm_test_exit(struct mm_struct *mm) { return atomic_read(&mm->mm_users) == 0; } static int break_ksm_pmd_entry(pmd_t *pmdp, unsigned long addr, unsigned long end, struct mm_walk *walk) { unsigned long *found_addr = (unsigned long *) walk->private; struct mm_struct *mm = walk->mm; pte_t *start_ptep, *ptep; spinlock_t *ptl; int found = 0; if (ksm_test_exit(walk->mm)) return 0; if (signal_pending(current)) return -ERESTARTSYS; start_ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); if (!start_ptep) return 0; for (ptep = start_ptep; addr < end; ptep++, addr += PAGE_SIZE) { pte_t pte = ptep_get(ptep); struct folio *folio = NULL; if (pte_present(pte)) { folio = vm_normal_folio(walk->vma, addr, pte); } else if (!pte_none(pte)) { const softleaf_t entry = softleaf_from_pte(pte); /* * As KSM pages remain KSM pages until freed, no need to wait * here for migration to end. */ if (softleaf_is_migration(entry)) folio = softleaf_to_folio(entry); } /* return 1 if the page is an normal ksm page or KSM-placed zero page */ found = (folio && folio_test_ksm(folio)) || (pte_present(pte) && is_ksm_zero_pte(pte)); if (found) { *found_addr = addr; goto out_unlock; } } out_unlock: pte_unmap_unlock(start_ptep, ptl); return found; } static const struct mm_walk_ops break_ksm_ops = { .pmd_entry = break_ksm_pmd_entry, .walk_lock = PGWALK_RDLOCK, }; static const struct mm_walk_ops break_ksm_lock_vma_ops = { .pmd_entry = break_ksm_pmd_entry, .walk_lock = PGWALK_WRLOCK, }; /* * Though it's very tempting to unmerge rmap_items from stable tree rather * than check every pte of a given vma, the locking doesn't quite work for * that - an rmap_item is assigned to the stable tree after inserting ksm * page and upping mmap_lock. Nor does it fit with the way we skip dup'ing * rmap_items from parent to child at fork time (so as not to waste time * if exit comes before the next scan reaches it). * * Similarly, although we'd like to remove rmap_items (so updating counts * and freeing memory) when unmerging an area, it's easier to leave that * to the next pass of ksmd - consider, for example, how ksmd might be * in cmp_and_merge_page on one of the rmap_items we would be removing. * * We use break_ksm to break COW on a ksm page by triggering unsharing, * such that the ksm page will get replaced by an exclusive anonymous page. * * We take great care only to touch a ksm page, in a VM_MERGEABLE vma, * in case the application has unmapped and remapped mm,addr meanwhile. * Could a ksm page appear anywhere else? Actually yes, in a VM_PFNMAP * mmap of /dev/mem, where we would not want to touch it. * * FAULT_FLAG_REMOTE/FOLL_REMOTE are because we do this outside the context * of the process that owns 'vma'. We also do not want to enforce * protection keys here anyway. */ static int break_ksm(struct vm_area_struct *vma, unsigned long addr, unsigned long end, bool lock_vma) { vm_fault_t ret = 0; const struct mm_walk_ops *ops = lock_vma ? &break_ksm_lock_vma_ops : &break_ksm_ops; do { int ksm_page; cond_resched(); ksm_page = walk_page_range_vma(vma, addr, end, ops, &addr); if (ksm_page <= 0) return ksm_page; ret = handle_mm_fault(vma, addr, FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE, NULL); } while (!(ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | VM_FAULT_OOM))); /* * We must loop until we no longer find a KSM page because * handle_mm_fault() may back out if there's any difficulty e.g. if * pte accessed bit gets updated concurrently. * * VM_FAULT_SIGBUS could occur if we race with truncation of the * backing file, which also invalidates anonymous pages: that's * okay, that truncation will have unmapped the KSM page for us. * * VM_FAULT_OOM: at the time of writing (late July 2009), setting * aside mem_cgroup limits, VM_FAULT_OOM would only be set if the * current task has TIF_MEMDIE set, and will be OOM killed on return * to user; and ksmd, having no mm, would never be chosen for that. * * But if the mm is in a limited mem_cgroup, then the fault may fail * with VM_FAULT_OOM even if the current task is not TIF_MEMDIE; and * even ksmd can fail in this way - though it's usually breaking ksm * just to undo a merge it made a moment before, so unlikely to oom. * * That's a pity: we might therefore have more kernel pages allocated * than we're counting as nodes in the stable tree; but ksm_do_scan * will retry to break_cow on each pass, so should recover the page * in due course. The important thing is to not let VM_MERGEABLE * be cleared while any such pages might remain in the area. */ return (ret & VM_FAULT_OOM) ? -ENOMEM : 0; } static bool ksm_compatible(const struct file *file, vm_flags_t vm_flags) { if (vm_flags & (VM_SHARED | VM_MAYSHARE | VM_SPECIAL | VM_HUGETLB | VM_DROPPABLE)) return false; /* just ignore the advice */ if (file_is_dax(file)) return false; #ifdef VM_SAO if (vm_flags & VM_SAO) return false; #endif #ifdef VM_SPARC_ADI if (vm_flags & VM_SPARC_ADI) return false; #endif return true; } static bool vma_ksm_compatible(struct vm_area_struct *vma) { return ksm_compatible(vma->vm_file, vma->vm_flags); } static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm, unsigned long addr) { struct vm_area_struct *vma; if (ksm_test_exit(mm)) return NULL; vma = vma_lookup(mm, addr); if (!vma || !(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma) return NULL; return vma; } static void break_cow(struct ksm_rmap_item *rmap_item) { struct mm_struct *mm = rmap_item->mm; unsigned long addr = rmap_item->address; struct vm_area_struct *vma; /* * It is not an accident that whenever we want to break COW * to undo, we also need to drop a reference to the anon_vma. */ put_anon_vma(rmap_item->anon_vma); mmap_read_lock(mm); vma = find_mergeable_vma(mm, addr); if (vma) break_ksm(vma, addr, addr + PAGE_SIZE, false); mmap_read_unlock(mm); } static struct page *get_mergeable_page(struct ksm_rmap_item *rmap_item) { struct mm_struct *mm = rmap_item->mm; unsigned long addr = rmap_item->address; struct vm_area_struct *vma; struct page *page = NULL; struct folio_walk fw; struct folio *folio; mmap_read_lock(mm); vma = find_mergeable_vma(mm, addr); if (!vma) goto out; folio = folio_walk_start(&fw, vma, addr, 0); if (folio) { if (!folio_is_zone_device(folio) && folio_test_anon(folio)) { folio_get(folio); page = fw.page; } folio_walk_end(&fw, vma); } out: if (page) { flush_anon_page(vma, page, addr); flush_dcache_page(page); } mmap_read_unlock(mm); return page; } /* * This helper is used for getting right index into array of tree roots. * When merge_across_nodes knob is set to 1, there are only two rb-trees for * stable and unstable pages from all nodes with roots in index 0. Otherwise, * every node has its own stable and unstable tree. */ static inline int get_kpfn_nid(unsigned long kpfn) { return ksm_merge_across_nodes ? 0 : NUMA(pfn_to_nid(kpfn)); } static struct ksm_stable_node *alloc_stable_node_chain(struct ksm_stable_node *dup, struct rb_root *root) { struct ksm_stable_node *chain = alloc_stable_node(); VM_BUG_ON(is_stable_node_chain(dup)); if (likely(chain)) { INIT_HLIST_HEAD(&chain->hlist); chain->chain_prune_time = jiffies; chain->rmap_hlist_len = STABLE_NODE_CHAIN; #if defined (CONFIG_DEBUG_VM) && defined(CONFIG_NUMA) chain->nid = NUMA_NO_NODE; /* debug */ #endif ksm_stable_node_chains++; /* * Put the stable node chain in the first dimension of * the stable tree and at the same time remove the old * stable node. */ rb_replace_node(&dup->node, &chain->node, root); /* * Move the old stable node to the second dimension * queued in the hlist_dup. The invariant is that all * dup stable_nodes in the chain->hlist point to pages * that are write protected and have the exact same * content. */ stable_node_chain_add_dup(dup, chain); } return chain; } static inline void free_stable_node_chain(struct ksm_stable_node *chain, struct rb_root *root) { rb_erase(&chain->node, root); free_stable_node(chain); ksm_stable_node_chains--; } static void remove_node_from_stable_tree(struct ksm_stable_node *stable_node) { struct ksm_rmap_item *rmap_item; /* check it's not STABLE_NODE_CHAIN or negative */ BUG_ON(stable_node->rmap_hlist_len < 0); hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) { if (rmap_item->hlist.next) { ksm_pages_sharing--; trace_ksm_remove_rmap_item(stable_node->kpfn, rmap_item, rmap_item->mm); } else { ksm_pages_shared--; } rmap_item->mm->ksm_merging_pages--; VM_BUG_ON(stable_node->rmap_hlist_len <= 0); stable_node->rmap_hlist_len--; put_anon_vma(rmap_item->anon_vma); rmap_item->address &= PAGE_MASK; cond_resched(); } /* * We need the second aligned pointer of the migrate_nodes * list_head to stay clear from the rb_parent_color union * (aligned and different than any node) and also different * from &migrate_nodes. This will verify that future list.h changes * don't break STABLE_NODE_DUP_HEAD. Only recent gcc can handle it. */ BUILD_BUG_ON(STABLE_NODE_DUP_HEAD <= &migrate_nodes); BUILD_BUG_ON(STABLE_NODE_DUP_HEAD >= &migrate_nodes + 1); trace_ksm_remove_ksm_page(stable_node->kpfn); if (stable_node->head == &migrate_nodes) list_del(&stable_node->list); else stable_node_dup_del(stable_node); free_stable_node(stable_node); } enum ksm_get_folio_flags { KSM_GET_FOLIO_NOLOCK, KSM_GET_FOLIO_LOCK, KSM_GET_FOLIO_TRYLOCK }; /* * ksm_get_folio: checks if the page indicated by the stable node * is still its ksm page, despite having held no reference to it. * In which case we can trust the content of the page, and it * returns the gotten page; but if the page has now been zapped, * remove the stale node from the stable tree and return NULL. * But beware, the stable node's page might be being migrated. * * You would expect the stable_node to hold a reference to the ksm page. * But if it increments the page's count, swapping out has to wait for * ksmd to come around again before it can free the page, which may take * seconds or even minutes: much too unresponsive. So instead we use a * "keyhole reference": access to the ksm page from the stable node peeps * out through its keyhole to see if that page still holds the right key, * pointing back to this stable node. This relies on freeing a PageAnon * page to reset its page->mapping to NULL, and relies on no other use of * a page to put something that might look like our key in page->mapping. * is on its way to being freed; but it is an anomaly to bear in mind. */ static struct folio *ksm_get_folio(struct ksm_stable_node *stable_node, enum ksm_get_folio_flags flags) { struct folio *folio; void *expected_mapping; unsigned long kpfn; expected_mapping = (void *)((unsigned long)stable_node | FOLIO_MAPPING_KSM); again: kpfn = READ_ONCE(stable_node->kpfn); /* Address dependency. */ folio = pfn_folio(kpfn); if (READ_ONCE(folio->mapping) != expected_mapping) goto stale; /* * We cannot do anything with the page while its refcount is 0. * Usually 0 means free, or tail of a higher-order page: in which * case this node is no longer referenced, and should be freed; * however, it might mean that the page is under page_ref_freeze(). * The __remove_mapping() case is easy, again the node is now stale; * the same is in reuse_ksm_page() case; but if page is swapcache * in folio_migrate_mapping(), it might still be our page, * in which case it's essential to keep the node. */ while (!folio_try_get(folio)) { /* * Another check for folio->mapping != expected_mapping * would work here too. We have chosen to test the * swapcache flag to optimize the common case, when the * folio is or is about to be freed: the swapcache flag * is cleared (under spin_lock_irq) in the ref_freeze * section of __remove_mapping(); but anon folio->mapping * is reset to NULL later, in free_pages_prepare(). */ if (!folio_test_swapcache(folio)) goto stale; cpu_relax(); } if (READ_ONCE(folio->mapping) != expected_mapping) { folio_put(folio); goto stale; } if (flags == KSM_GET_FOLIO_TRYLOCK) { if (!folio_trylock(folio)) { folio_put(folio); return ERR_PTR(-EBUSY); } } else if (flags == KSM_GET_FOLIO_LOCK) folio_lock(folio); if (flags != KSM_GET_FOLIO_NOLOCK) { if (READ_ONCE(folio->mapping) != expected_mapping) { folio_unlock(folio); folio_put(folio); goto stale; } } return folio; stale: /* * We come here from above when folio->mapping or the swapcache flag * suggests that the node is stale; but it might be under migration. * We need smp_rmb(), matching the smp_wmb() in folio_migrate_ksm(), * before checking whether node->kpfn has been changed. */ smp_rmb(); if (READ_ONCE(stable_node->kpfn) != kpfn) goto again; remove_node_from_stable_tree(stable_node); return NULL; } /* * Removing rmap_item from stable or unstable tree. * This function will clean the information from the stable/unstable tree. */ static void remove_rmap_item_from_tree(struct ksm_rmap_item *rmap_item) { if (rmap_item->address & STABLE_FLAG) { struct ksm_stable_node *stable_node; struct folio *folio; stable_node = rmap_item->head; folio = ksm_get_folio(stable_node, KSM_GET_FOLIO_LOCK); if (!folio) goto out; hlist_del(&rmap_item->hlist); folio_unlock(folio); folio_put(folio); if (!hlist_empty(&stable_node->hlist)) ksm_pages_sharing--; else ksm_pages_shared--; rmap_item->mm->ksm_merging_pages--; VM_BUG_ON(stable_node->rmap_hlist_len <= 0); stable_node->rmap_hlist_len--; put_anon_vma(rmap_item->anon_vma); rmap_item->head = NULL; rmap_item->address &= PAGE_MASK; } else if (rmap_item->address & UNSTABLE_FLAG) { unsigned char age; /* * Usually ksmd can and must skip the rb_erase, because * root_unstable_tree was already reset to RB_ROOT. * But be careful when an mm is exiting: do the rb_erase * if this rmap_item was inserted by this scan, rather * than left over from before. */ age = (unsigned char)(ksm_scan.seqnr - rmap_item->address); BUG_ON(age > 1); if (!age) rb_erase(&rmap_item->node, root_unstable_tree + NUMA(rmap_item->nid)); ksm_pages_unshared--; rmap_item->address &= PAGE_MASK; } out: cond_resched(); /* we're called from many long loops */ } static void remove_trailing_rmap_items(struct ksm_rmap_item **rmap_list) { while (*rmap_list) { struct ksm_rmap_item *rmap_item = *rmap_list; *rmap_list = rmap_item->rmap_list; remove_rmap_item_from_tree(rmap_item); free_rmap_item(rmap_item); } } static inline struct ksm_stable_node *folio_stable_node(const struct folio *folio) { return folio_test_ksm(folio) ? folio_raw_mapping(folio) : NULL; } static inline void folio_set_stable_node(struct folio *folio, struct ksm_stable_node *stable_node) { VM_WARN_ON_FOLIO(folio_test_anon(folio) && PageAnonExclusive(&folio->page), folio); folio->mapping = (void *)((unsigned long)stable_node | FOLIO_MAPPING_KSM); } #ifdef CONFIG_SYSFS /* * Only called through the sysfs control interface: */ static int remove_stable_node(struct ksm_stable_node *stable_node) { struct folio *folio; int err; folio = ksm_get_folio(stable_node, KSM_GET_FOLIO_LOCK); if (!folio) { /* * ksm_get_folio did remove_node_from_stable_tree itself. */ return 0; } /* * Page could be still mapped if this races with __mmput() running in * between ksm_exit() and exit_mmap(). Just refuse to let * merge_across_nodes/max_page_sharing be switched. */ err = -EBUSY; if (!folio_mapped(folio)) { /* * The stable node did not yet appear stale to ksm_get_folio(), * since that allows for an unmapped ksm folio to be recognized * right up until it is freed; but the node is safe to remove. * This folio might be in an LRU cache waiting to be freed, * or it might be in the swapcache (perhaps under writeback), * or it might have been removed from swapcache a moment ago. */ folio_set_stable_node(folio, NULL); remove_node_from_stable_tree(stable_node); err = 0; } folio_unlock(folio); folio_put(folio); return err; } static int remove_stable_node_chain(struct ksm_stable_node *stable_node, struct rb_root *root) { struct ksm_stable_node *dup; struct hlist_node *hlist_safe; if (!is_stable_node_chain(stable_node)) { VM_BUG_ON(is_stable_node_dup(stable_node)); if (remove_stable_node(stable_node)) return true; else return false; } hlist_for_each_entry_safe(dup, hlist_safe, &stable_node->hlist, hlist_dup) { VM_BUG_ON(!is_stable_node_dup(dup)); if (remove_stable_node(dup)) return true; } BUG_ON(!hlist_empty(&stable_node->hlist)); free_stable_node_chain(stable_node, root); return false; } static int remove_all_stable_nodes(void) { struct ksm_stable_node *stable_node, *next; int nid; int err = 0; for (nid = 0; nid < ksm_nr_node_ids; nid++) { while (root_stable_tree[nid].rb_node) { stable_node = rb_entry(root_stable_tree[nid].rb_node, struct ksm_stable_node, node); if (remove_stable_node_chain(stable_node, root_stable_tree + nid)) { err = -EBUSY; break; /* proceed to next nid */ } cond_resched(); } } list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) { if (remove_stable_node(stable_node)) err = -EBUSY; cond_resched(); } return err; } static int unmerge_and_remove_all_rmap_items(void) { struct ksm_mm_slot *mm_slot; struct mm_slot *slot; struct mm_struct *mm; struct vm_area_struct *vma; int err = 0; spin_lock(&ksm_mmlist_lock); slot = list_entry(ksm_mm_head.slot.mm_node.next, struct mm_slot, mm_node); ksm_scan.mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot); spin_unlock(&ksm_mmlist_lock); for (mm_slot = ksm_scan.mm_slot; mm_slot != &ksm_mm_head; mm_slot = ksm_scan.mm_slot) { VMA_ITERATOR(vmi, mm_slot->slot.mm, 0); mm = mm_slot->slot.mm; mmap_read_lock(mm); /* * Exit right away if mm is exiting to avoid lockdep issue in * the maple tree */ if (ksm_test_exit(mm)) goto mm_exiting; for_each_vma(vmi, vma) { if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma) continue; err = break_ksm(vma, vma->vm_start, vma->vm_end, false); if (err) goto error; } mm_exiting: remove_trailing_rmap_items(&mm_slot->rmap_list); mmap_read_unlock(mm); spin_lock(&ksm_mmlist_lock); slot = list_entry(mm_slot->slot.mm_node.next, struct mm_slot, mm_node); ksm_scan.mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot); if (ksm_test_exit(mm)) { hash_del(&mm_slot->slot.hash); list_del(&mm_slot->slot.mm_node); spin_unlock(&ksm_mmlist_lock); mm_slot_free(mm_slot_cache, mm_slot); mm_flags_clear(MMF_VM_MERGEABLE, mm); mm_flags_clear(MMF_VM_MERGE_ANY, mm); mmdrop(mm); } else spin_unlock(&ksm_mmlist_lock); } /* Clean up stable nodes, but don't worry if some are still busy */ remove_all_stable_nodes(); ksm_scan.seqnr = 0; return 0; error: mmap_read_unlock(mm); spin_lock(&ksm_mmlist_lock); ksm_scan.mm_slot = &ksm_mm_head; spin_unlock(&ksm_mmlist_lock); return err; } #endif /* CONFIG_SYSFS */ static u32 calc_checksum(struct page *page) { u32 checksum; void *addr = kmap_local_page(page); checksum = xxhash(addr, PAGE_SIZE, 0); kunmap_local(addr); return checksum; } static int write_protect_page(struct vm_area_struct *vma, struct folio *folio, pte_t *orig_pte) { struct mm_struct *mm = vma->vm_mm; DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, 0, 0); int swapped; int err = -EFAULT; struct mmu_notifier_range range; bool anon_exclusive; pte_t entry; if (WARN_ON_ONCE(folio_test_large(folio))) return err; pvmw.address = page_address_in_vma(folio, folio_page(folio, 0), vma); if (pvmw.address == -EFAULT) goto out; mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, pvmw.address, pvmw.address + PAGE_SIZE); mmu_notifier_invalidate_range_start(&range); if (!page_vma_mapped_walk(&pvmw)) goto out_mn; if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?")) goto out_unlock; entry = ptep_get(pvmw.pte); /* * Handle PFN swap PTEs, such as device-exclusive ones, that actually * map pages: give up just like the next folio_walk would. */ if (unlikely(!pte_present(entry))) goto out_unlock; anon_exclusive = PageAnonExclusive(&folio->page); if (pte_write(entry) || pte_dirty(entry) || anon_exclusive || mm_tlb_flush_pending(mm)) { swapped = folio_test_swapcache(folio); flush_cache_page(vma, pvmw.address, folio_pfn(folio)); /* * Ok this is tricky, when get_user_pages_fast() run it doesn't * take any lock, therefore the check that we are going to make * with the pagecount against the mapcount is racy and * O_DIRECT can happen right after the check. * So we clear the pte and flush the tlb before the check * this assure us that no O_DIRECT can happen after the check * or in the middle of the check. * * No need to notify as we are downgrading page table to read * only not changing it to point to a new page. * * See Documentation/mm/mmu_notifier.rst */ entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); /* * Check that no O_DIRECT or similar I/O is in progress on the * page */ if (folio_mapcount(folio) + 1 + swapped != folio_ref_count(folio)) { set_pte_at(mm, pvmw.address, pvmw.pte, entry); goto out_unlock; } /* See folio_try_share_anon_rmap_pte(): clear PTE first. */ if (anon_exclusive && folio_try_share_anon_rmap_pte(folio, &folio->page)) { set_pte_at(mm, pvmw.address, pvmw.pte, entry); goto out_unlock; } if (pte_dirty(entry)) folio_mark_dirty(folio); entry = pte_mkclean(entry); if (pte_write(entry)) entry = pte_wrprotect(entry); set_pte_at(mm, pvmw.address, pvmw.pte, entry); } *orig_pte = entry; err = 0; out_unlock: page_vma_mapped_walk_done(&pvmw); out_mn: mmu_notifier_invalidate_range_end(&range); out: return err; } /** * replace_page - replace page in vma by new ksm page * @vma: vma that holds the pte pointing to page * @page: the page we are replacing by kpage * @kpage: the ksm page we replace page by * @orig_pte: the original value of the pte * * Returns 0 on success, -EFAULT on failure. */ static int replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage, pte_t orig_pte) { struct folio *kfolio = page_folio(kpage); struct mm_struct *mm = vma->vm_mm; struct folio *folio = page_folio(page); pmd_t *pmd; pmd_t pmde; pte_t *ptep; pte_t newpte; spinlock_t *ptl; unsigned long addr; int err = -EFAULT; struct mmu_notifier_range range; addr = page_address_in_vma(folio, page, vma); if (addr == -EFAULT) goto out; pmd = mm_find_pmd(mm, addr); if (!pmd) goto out; /* * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at() * without holding anon_vma lock for write. So when looking for a * genuine pmde (in which to find pte), test present and !THP together. */ pmde = pmdp_get_lockless(pmd); if (!pmd_present(pmde) || pmd_trans_huge(pmde)) goto out; mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr, addr + PAGE_SIZE); mmu_notifier_invalidate_range_start(&range); ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); if (!ptep) goto out_mn; if (!pte_same(ptep_get(ptep), orig_pte)) { pte_unmap_unlock(ptep, ptl); goto out_mn; } VM_BUG_ON_PAGE(PageAnonExclusive(page), page); VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage), kfolio); /* * No need to check ksm_use_zero_pages here: we can only have a * zero_page here if ksm_use_zero_pages was enabled already. */ if (!is_zero_pfn(page_to_pfn(kpage))) { folio_get(kfolio); folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE); newpte = mk_pte(kpage, vma->vm_page_prot); } else { /* * Use pte_mkdirty to mark the zero page mapped by KSM, and then * we can easily track all KSM-placed zero pages by checking if * the dirty bit in zero page's PTE is set. */ newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage), vma->vm_page_prot))); ksm_map_zero_page(mm); /* * We're replacing an anonymous page with a zero page, which is * not anonymous. We need to do proper accounting otherwise we * will get wrong values in /proc, and a BUG message in dmesg * when tearing down the mm. */ dec_mm_counter(mm, MM_ANONPAGES); } flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep))); /* * No need to notify as we are replacing a read only page with another * read only page with the same content. * * See Documentation/mm/mmu_notifier.rst */ ptep_clear_flush(vma, addr, ptep); set_pte_at(mm, addr, ptep, newpte); folio_remove_rmap_pte(folio, page, vma); if (!folio_mapped(folio)) folio_free_swap(folio); folio_put(folio); pte_unmap_unlock(ptep, ptl); err = 0; out_mn: mmu_notifier_invalidate_range_end(&range); out: return err; } /* * try_to_merge_one_page - take two pages and merge them into one * @vma: the vma that holds the pte pointing to page * @page: the PageAnon page that we want to replace with kpage * @kpage: the KSM page that we want to map instead of page, * or NULL the first time when we want to use page as kpage. * * This function returns 0 if the pages were merged, -EFAULT otherwise. */ static int try_to_merge_one_page(struct vm_area_struct *vma, struct page *page, struct page *kpage) { struct folio *folio = page_folio(page); pte_t orig_pte = __pte(0); int err = -EFAULT; if (page == kpage) /* ksm page forked */ return 0; if (!folio_test_anon(folio)) goto out; /* * We need the folio lock to read a stable swapcache flag in * write_protect_page(). We trylock because we don't want to wait * here - we prefer to continue scanning and merging different * pages, then come back to this page when it is unlocked. */ if (!folio_trylock(folio)) goto out; if (folio_test_large(folio)) { if (split_huge_page(page)) goto out_unlock; folio = page_folio(page); } /* * If this anonymous page is mapped only here, its pte may need * to be write-protected. If it's mapped elsewhere, all of its * ptes are necessarily already write-protected. But in either * case, we need to lock and check page_count is not raised. */ if (write_protect_page(vma, folio, &orig_pte) == 0) { if (!kpage) { /* * While we hold folio lock, upgrade folio from * anon to a NULL stable_node with the KSM flag set: * stable_tree_insert() will update stable_node. */ folio_set_stable_node(folio, NULL); folio_mark_accessed(folio); /* * Page reclaim just frees a clean folio with no dirty * ptes: make sure that the ksm page would be swapped. */ if (!folio_test_dirty(folio)) folio_mark_dirty(folio); err = 0; } else if (pages_identical(page, kpage)) err = replace_page(vma, page, kpage, orig_pte); } out_unlock: folio_unlock(folio); out: return err; } /* * This function returns 0 if the pages were merged or if they are * no longer merging candidates (e.g., VMA stale), -EFAULT otherwise. */ static int try_to_merge_with_zero_page(struct ksm_rmap_item *rmap_item, struct page *page) { struct mm_struct *mm = rmap_item->mm; int err = -EFAULT; /* * Same checksum as an empty page. We attempt to merge it with the * appropriate zero page if the user enabled this via sysfs. */ if (ksm_use_zero_pages && (rmap_item->oldchecksum == zero_checksum)) { struct vm_area_struct *vma; mmap_read_lock(mm); vma = find_mergeable_vma(mm, rmap_item->address); if (vma) { err = try_to_merge_one_page(vma, page, ZERO_PAGE(rmap_item->address)); trace_ksm_merge_one_page( page_to_pfn(ZERO_PAGE(rmap_item->address)), rmap_item, mm, err); } else { /* * If the vma is out of date, we do not need to * continue. */ err = 0; } mmap_read_unlock(mm); } return err; } /* * try_to_merge_with_ksm_page - like try_to_merge_two_pages, * but no new kernel page is allocated: kpage must already be a ksm page. * * This function returns 0 if the pages were merged, -EFAULT otherwise. */ static int try_to_merge_with_ksm_page(struct ksm_rmap_item *rmap_item, struct page *page, struct page *kpage) { struct mm_struct *mm = rmap_item->mm; struct vm_area_struct *vma; int err = -EFAULT; mmap_read_lock(mm); vma = find_mergeable_vma(mm, rmap_item->address); if (!vma) goto out; err = try_to_merge_one_page(vma, page, kpage); if (err) goto out; /* Unstable nid is in union with stable anon_vma: remove first */ remove_rmap_item_from_tree(rmap_item); /* Must get reference to anon_vma while still holding mmap_lock */ rmap_item->anon_vma = vma->anon_vma; get_anon_vma(vma->anon_vma); out: mmap_read_unlock(mm); trace_ksm_merge_with_ksm_page(kpage, page_to_pfn(kpage ? kpage : page), rmap_item, mm, err); return err; } /* * try_to_merge_two_pages - take two identical pages and prepare them * to be merged into one page. * * This function returns the kpage if we successfully merged two identical * pages into one ksm page, NULL otherwise. * * Note that this function upgrades page to ksm page: if one of the pages * is already a ksm page, try_to_merge_with_ksm_page should be used. */ static struct folio *try_to_merge_two_pages(struct ksm_rmap_item *rmap_item, struct page *page, struct ksm_rmap_item *tree_rmap_item, struct page *tree_page) { int err; err = try_to_merge_with_ksm_page(rmap_item, page, NULL); if (!err) { err = try_to_merge_with_ksm_page(tree_rmap_item, tree_page, page); /* * If that fails, we have a ksm page with only one pte * pointing to it: so break it. */ if (err) break_cow(rmap_item); } return err ? NULL : page_folio(page); } static __always_inline bool __is_page_sharing_candidate(struct ksm_stable_node *stable_node, int offset) { VM_BUG_ON(stable_node->rmap_hlist_len < 0); /* * Check that at least one mapping still exists, otherwise * there's no much point to merge and share with this * stable_node, as the underlying tree_page of the other * sharer is going to be freed soon. */ return stable_node->rmap_hlist_len && stable_node->rmap_hlist_len + offset < ksm_max_page_sharing; } static __always_inline bool is_page_sharing_candidate(struct ksm_stable_node *stable_node) { return __is_page_sharing_candidate(stable_node, 0); } static struct folio *stable_node_dup(struct ksm_stable_node **_stable_node_dup, struct ksm_stable_node **_stable_node, struct rb_root *root, bool prune_stale_stable_nodes) { struct ksm_stable_node *dup, *found = NULL, *stable_node = *_stable_node; struct hlist_node *hlist_safe; struct folio *folio, *tree_folio = NULL; int found_rmap_hlist_len; if (!prune_stale_stable_nodes || time_before(jiffies, stable_node->chain_prune_time + msecs_to_jiffies( ksm_stable_node_chains_prune_millisecs))) prune_stale_stable_nodes = false; else stable_node->chain_prune_time = jiffies; hlist_for_each_entry_safe(dup, hlist_safe, &stable_node->hlist, hlist_dup) { cond_resched(); /* * We must walk all stable_node_dup to prune the stale * stable nodes during lookup. * * ksm_get_folio can drop the nodes from the * stable_node->hlist if they point to freed pages * (that's why we do a _safe walk). The "dup" * stable_node parameter itself will be freed from * under us if it returns NULL. */ folio = ksm_get_folio(dup, KSM_GET_FOLIO_NOLOCK); if (!folio) continue; /* Pick the best candidate if possible. */ if (!found || (is_page_sharing_candidate(dup) && (!is_page_sharing_candidate(found) || dup->rmap_hlist_len > found_rmap_hlist_len))) { if (found) folio_put(tree_folio); found = dup; found_rmap_hlist_len = found->rmap_hlist_len; tree_folio = folio; /* skip put_page for found candidate */ if (!prune_stale_stable_nodes && is_page_sharing_candidate(found)) break; continue; } folio_put(folio); } if (found) { if (hlist_is_singular_node(&found->hlist_dup, &stable_node->hlist)) { /* * If there's not just one entry it would * corrupt memory, better BUG_ON. In KSM * context with no lock held it's not even * fatal. */ BUG_ON(stable_node->hlist.first->next); /* * There's just one entry and it is below the * deduplication limit so drop the chain. */ rb_replace_node(&stable_node->node, &found->node, root); free_stable_node(stable_node); ksm_stable_node_chains--; ksm_stable_node_dups--; /* * NOTE: the caller depends on the stable_node * to be equal to stable_node_dup if the chain * was collapsed. */ *_stable_node = found; /* * Just for robustness, as stable_node is * otherwise left as a stable pointer, the * compiler shall optimize it away at build * time. */ stable_node = NULL; } else if (stable_node->hlist.first != &found->hlist_dup && __is_page_sharing_candidate(found, 1)) { /* * If the found stable_node dup can accept one * more future merge (in addition to the one * that is underway) and is not at the head of * the chain, put it there so next search will * be quicker in the !prune_stale_stable_nodes * case. * * NOTE: it would be inaccurate to use nr > 1 * instead of checking the hlist.first pointer * directly, because in the * prune_stale_stable_nodes case "nr" isn't * the position of the found dup in the chain, * but the total number of dups in the chain. */ hlist_del(&found->hlist_dup); hlist_add_head(&found->hlist_dup, &stable_node->hlist); } } else { /* Its hlist must be empty if no one found. */ free_stable_node_chain(stable_node, root); } *_stable_node_dup = found; return tree_folio; } /* * Like for ksm_get_folio, this function can free the *_stable_node and * *_stable_node_dup if the returned tree_page is NULL. * * It can also free and overwrite *_stable_node with the found * stable_node_dup if the chain is collapsed (in which case * *_stable_node will be equal to *_stable_node_dup like if the chain * never existed). It's up to the caller to verify tree_page is not * NULL before dereferencing *_stable_node or *_stable_node_dup. * * *_stable_node_dup is really a second output parameter of this * function and will be overwritten in all cases, the caller doesn't * need to initialize it. */ static struct folio *__stable_node_chain(struct ksm_stable_node **_stable_node_dup, struct ksm_stable_node **_stable_node, struct rb_root *root, bool prune_stale_stable_nodes) { struct ksm_stable_node *stable_node = *_stable_node; if (!is_stable_node_chain(stable_node)) { *_stable_node_dup = stable_node; return ksm_get_folio(stable_node, KSM_GET_FOLIO_NOLOCK); } return stable_node_dup(_stable_node_dup, _stable_node, root, prune_stale_stable_nodes); } static __always_inline struct folio *chain_prune(struct ksm_stable_node **s_n_d, struct ksm_stable_node **s_n, struct rb_root *root) { return __stable_node_chain(s_n_d, s_n, root, true); } static __always_inline struct folio *chain(struct ksm_stable_node **s_n_d, struct ksm_stable_node **s_n, struct rb_root *root) { return __stable_node_chain(s_n_d, s_n, root, false); } /* * stable_tree_search - search for page inside the stable tree * * This function checks if there is a page inside the stable tree * with identical content to the page that we are scanning right now. * * This function returns the stable tree node of identical content if found, * -EBUSY if the stable node's page is being migrated, NULL otherwise. */ static struct folio *stable_tree_search(struct page *page) { int nid; struct rb_root *root; struct rb_node **new; struct rb_node *parent; struct ksm_stable_node *stable_node, *stable_node_dup; struct ksm_stable_node *page_node; struct folio *folio; folio = page_folio(page); page_node = folio_stable_node(folio); if (page_node && page_node->head != &migrate_nodes) { /* ksm page forked */ folio_get(folio); return folio; } nid = get_kpfn_nid(folio_pfn(folio)); root = root_stable_tree + nid; again: new = &root->rb_node; parent = NULL; while (*new) { struct folio *tree_folio; int ret; cond_resched(); stable_node = rb_entry(*new, struct ksm_stable_node, node); tree_folio = chain_prune(&stable_node_dup, &stable_node, root); if (!tree_folio) { /* * If we walked over a stale stable_node, * ksm_get_folio() will call rb_erase() and it * may rebalance the tree from under us. So * restart the search from scratch. Returning * NULL would be safe too, but we'd generate * false negative insertions just because some * stable_node was stale. */ goto again; } ret = memcmp_pages(page, &tree_folio->page); folio_put(tree_folio); parent = *new; if (ret < 0) new = &parent->rb_left; else if (ret > 0) new = &parent->rb_right; else { if (page_node) { VM_BUG_ON(page_node->head != &migrate_nodes); /* * If the mapcount of our migrated KSM folio is * at most 1, we can merge it with another * KSM folio where we know that we have space * for one more mapping without exceeding the * ksm_max_page_sharing limit: see * chain_prune(). This way, we can avoid adding * this stable node to the chain. */ if (folio_mapcount(folio) > 1) goto chain_append; } if (!is_page_sharing_candidate(stable_node_dup)) { /* * If the stable_node is a chain and * we got a payload match in memcmp * but we cannot merge the scanned * page in any of the existing * stable_node dups because they're * all full, we need to wait the * scanned page to find itself a match * in the unstable tree to create a * brand new KSM page to add later to * the dups of this stable_node. */ return NULL; } /* * Lock and unlock the stable_node's page (which * might already have been migrated) so that page * migration is sure to notice its raised count. * It would be more elegant to return stable_node * than kpage, but that involves more changes. */ tree_folio = ksm_get_folio(stable_node_dup, KSM_GET_FOLIO_TRYLOCK); if (PTR_ERR(tree_folio) == -EBUSY) return ERR_PTR(-EBUSY); if (unlikely(!tree_folio)) /* * The tree may have been rebalanced, * so re-evaluate parent and new. */ goto again; folio_unlock(tree_folio); if (get_kpfn_nid(stable_node_dup->kpfn) != NUMA(stable_node_dup->nid)) { folio_put(tree_folio); goto replace; } return tree_folio; } } if (!page_node) return NULL; list_del(&page_node->list); DO_NUMA(page_node->nid = nid); rb_link_node(&page_node->node, parent, new); rb_insert_color(&page_node->node, root); out: if (is_page_sharing_candidate(page_node)) { folio_get(folio); return folio; } else return NULL; replace: /* * If stable_node was a chain and chain_prune collapsed it, * stable_node has been updated to be the new regular * stable_node. A collapse of the chain is indistinguishable * from the case there was no chain in the stable * rbtree. Otherwise stable_node is the chain and * stable_node_dup is the dup to replace. */ if (stable_node_dup == stable_node) { VM_BUG_ON(is_stable_node_chain(stable_node_dup)); VM_BUG_ON(is_stable_node_dup(stable_node_dup)); /* there is no chain */ if (page_node) { VM_BUG_ON(page_node->head != &migrate_nodes); list_del(&page_node->list); DO_NUMA(page_node->nid = nid); rb_replace_node(&stable_node_dup->node, &page_node->node, root); if (is_page_sharing_candidate(page_node)) folio_get(folio); else folio = NULL; } else { rb_erase(&stable_node_dup->node, root); folio = NULL; } } else { VM_BUG_ON(!is_stable_node_chain(stable_node)); __stable_node_dup_del(stable_node_dup); if (page_node) { VM_BUG_ON(page_node->head != &migrate_nodes); list_del(&page_node->list); DO_NUMA(page_node->nid = nid); stable_node_chain_add_dup(page_node, stable_node); if (is_page_sharing_candidate(page_node)) folio_get(folio); else folio = NULL; } else { folio = NULL; } } stable_node_dup->head = &migrate_nodes; list_add(&stable_node_dup->list, stable_node_dup->head); return folio; chain_append: /* * If stable_node was a chain and chain_prune collapsed it, * stable_node has been updated to be the new regular * stable_node. A collapse of the chain is indistinguishable * from the case there was no chain in the stable * rbtree. Otherwise stable_node is the chain and * stable_node_dup is the dup to replace. */ if (stable_node_dup == stable_node) { VM_BUG_ON(is_stable_node_dup(stable_node_dup)); /* chain is missing so create it */ stable_node = alloc_stable_node_chain(stable_node_dup, root); if (!stable_node) return NULL; } /* * Add this stable_node dup that was * migrated to the stable_node chain * of the current nid for this page * content. */ VM_BUG_ON(!is_stable_node_dup(stable_node_dup)); VM_BUG_ON(page_node->head != &migrate_nodes); list_del(&page_node->list); DO_NUMA(page_node->nid = nid); stable_node_chain_add_dup(page_node, stable_node); goto out; } /* * stable_tree_insert - insert stable tree node pointing to new ksm page * into the stable tree. * * This function returns the stable tree node just allocated on success, * NULL otherwise. */ static struct ksm_stable_node *stable_tree_insert(struct folio *kfolio) { int nid; unsigned long kpfn; struct rb_root *root; struct rb_node **new; struct rb_node *parent; struct ksm_stable_node *stable_node, *stable_node_dup; bool need_chain = false; kpfn = folio_pfn(kfolio); nid = get_kpfn_nid(kpfn); root = root_stable_tree + nid; again: parent = NULL; new = &root->rb_node; while (*new) { struct folio *tree_folio; int ret; cond_resched(); stable_node = rb_entry(*new, struct ksm_stable_node, node); tree_folio = chain(&stable_node_dup, &stable_node, root); if (!tree_folio) { /* * If we walked over a stale stable_node, * ksm_get_folio() will call rb_erase() and it * may rebalance the tree from under us. So * restart the search from scratch. Returning * NULL would be safe too, but we'd generate * false negative insertions just because some * stable_node was stale. */ goto again; } ret = memcmp_pages(&kfolio->page, &tree_folio->page); folio_put(tree_folio); parent = *new; if (ret < 0) new = &parent->rb_left; else if (ret > 0) new = &parent->rb_right; else { need_chain = true; break; } } stable_node_dup = alloc_stable_node(); if (!stable_node_dup) return NULL; INIT_HLIST_HEAD(&stable_node_dup->hlist); stable_node_dup->kpfn = kpfn; stable_node_dup->rmap_hlist_len = 0; DO_NUMA(stable_node_dup->nid = nid); if (!need_chain) { rb_link_node(&stable_node_dup->node, parent, new); rb_insert_color(&stable_node_dup->node, root); } else { if (!is_stable_node_chain(stable_node)) { struct ksm_stable_node *orig = stable_node; /* chain is missing so create it */ stable_node = alloc_stable_node_chain(orig, root); if (!stable_node) { free_stable_node(stable_node_dup); return NULL; } } stable_node_chain_add_dup(stable_node_dup, stable_node); } folio_set_stable_node(kfolio, stable_node_dup); return stable_node_dup; } /* * unstable_tree_search_insert - search for identical page, * else insert rmap_item into the unstable tree. * * This function searches for a page in the unstable tree identical to the * page currently being scanned; and if no identical page is found in the * tree, we insert rmap_item as a new object into the unstable tree. * * This function returns pointer to rmap_item found to be identical * to the currently scanned page, NULL otherwise. * * This function does both searching and inserting, because they share * the same walking algorithm in an rbtree. */ static struct ksm_rmap_item *unstable_tree_search_insert(struct ksm_rmap_item *rmap_item, struct page *page, struct page **tree_pagep) { struct rb_node **new; struct rb_root *root; struct rb_node *parent = NULL; int nid; nid = get_kpfn_nid(page_to_pfn(page)); root = root_unstable_tree + nid; new = &root->rb_node; while (*new) { struct ksm_rmap_item *tree_rmap_item; struct page *tree_page; int ret; cond_resched(); tree_rmap_item = rb_entry(*new, struct ksm_rmap_item, node); tree_page = get_mergeable_page(tree_rmap_item); if (!tree_page) return NULL; /* * Don't substitute a ksm page for a forked page. */ if (page == tree_page) { put_page(tree_page); return NULL; } ret = memcmp_pages(page, tree_page); parent = *new; if (ret < 0) { put_page(tree_page); new = &parent->rb_left; } else if (ret > 0) { put_page(tree_page); new = &parent->rb_right; } else if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) { /* * If tree_page has been migrated to another NUMA node, * it will be flushed out and put in the right unstable * tree next time: only merge with it when across_nodes. */ put_page(tree_page); return NULL; } else { *tree_pagep = tree_page; return tree_rmap_item; } } rmap_item->address |= UNSTABLE_FLAG; rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK); DO_NUMA(rmap_item->nid = nid); rb_link_node(&rmap_item->node, parent, new); rb_insert_color(&rmap_item->node, root); ksm_pages_unshared++; return NULL; } /* * stable_tree_append - add another rmap_item to the linked list of * rmap_items hanging off a given node of the stable tree, all sharing * the same ksm page. */ static void stable_tree_append(struct ksm_rmap_item *rmap_item, struct ksm_stable_node *stable_node, bool max_page_sharing_bypass) { /* * rmap won't find this mapping if we don't insert the * rmap_item in the right stable_node * duplicate. page_migration could break later if rmap breaks, * so we can as well crash here. We really need to check for * rmap_hlist_len == STABLE_NODE_CHAIN, but we can as well check * for other negative values as an underflow if detected here * for the first time (and not when decreasing rmap_hlist_len) * would be sign of memory corruption in the stable_node. */ BUG_ON(stable_node->rmap_hlist_len < 0); stable_node->rmap_hlist_len++; if (!max_page_sharing_bypass) /* possibly non fatal but unexpected overflow, only warn */ WARN_ON_ONCE(stable_node->rmap_hlist_len > ksm_max_page_sharing); rmap_item->head = stable_node; rmap_item->address |= STABLE_FLAG; hlist_add_head(&rmap_item->hlist, &stable_node->hlist); if (rmap_item->hlist.next) ksm_pages_sharing++; else ksm_pages_shared++; rmap_item->mm->ksm_merging_pages++; } /* * cmp_and_merge_page - first see if page can be merged into the stable tree; * if not, compare checksum to previous and if it's the same, see if page can * be inserted into the unstable tree, or merged with a page already there and * both transferred to the stable tree. * * @page: the page that we are searching identical page to. * @rmap_item: the reverse mapping into the virtual address of this page */ static void cmp_and_merge_page(struct page *page, struct ksm_rmap_item *rmap_item) { struct folio *folio = page_folio(page); struct ksm_rmap_item *tree_rmap_item; struct page *tree_page = NULL; struct ksm_stable_node *stable_node; struct folio *kfolio; unsigned int checksum; int err; bool max_page_sharing_bypass = false; stable_node = folio_stable_node(folio); if (stable_node) { if (stable_node->head != &migrate_nodes && get_kpfn_nid(READ_ONCE(stable_node->kpfn)) != NUMA(stable_node->nid)) { stable_node_dup_del(stable_node); stable_node->head = &migrate_nodes; list_add(&stable_node->list, stable_node->head); } if (stable_node->head != &migrate_nodes && rmap_item->head == stable_node) return; /* * If it's a KSM fork, allow it to go over the sharing limit * without warnings. */ if (!is_page_sharing_candidate(stable_node)) max_page_sharing_bypass = true; } else { remove_rmap_item_from_tree(rmap_item); /* * If the hash value of the page has changed from the last time * we calculated it, this page is changing frequently: therefore we * don't want to insert it in the unstable tree, and we don't want * to waste our time searching for something identical to it there. */ checksum = calc_checksum(page); if (rmap_item->oldchecksum != checksum) { rmap_item->oldchecksum = checksum; return; } if (!try_to_merge_with_zero_page(rmap_item, page)) return; } /* Start by searching for the folio in the stable tree */ kfolio = stable_tree_search(page); if (kfolio == folio && rmap_item->head == stable_node) { folio_put(kfolio); return; } remove_rmap_item_from_tree(rmap_item); if (kfolio) { if (kfolio == ERR_PTR(-EBUSY)) return; err = try_to_merge_with_ksm_page(rmap_item, page, &kfolio->page); if (!err) { /* * The page was successfully merged: * add its rmap_item to the stable tree. */ folio_lock(kfolio); stable_tree_append(rmap_item, folio_stable_node(kfolio), max_page_sharing_bypass); folio_unlock(kfolio); } folio_put(kfolio); return; } tree_rmap_item = unstable_tree_search_insert(rmap_item, page, &tree_page); if (tree_rmap_item) { bool split; kfolio = try_to_merge_two_pages(rmap_item, page, tree_rmap_item, tree_page); /* * If both pages we tried to merge belong to the same compound * page, then we actually ended up increasing the reference * count of the same compound page twice, and split_huge_page * failed. * Here we set a flag if that happened, and we use it later to * try split_huge_page again. Since we call put_page right * afterwards, the reference count will be correct and * split_huge_page should succeed. */ split = PageTransCompound(page) && compound_head(page) == compound_head(tree_page); put_page(tree_page); if (kfolio) { /* * The pages were successfully merged: insert new * node in the stable tree and add both rmap_items. */ folio_lock(kfolio); stable_node = stable_tree_insert(kfolio); if (stable_node) { stable_tree_append(tree_rmap_item, stable_node, false); stable_tree_append(rmap_item, stable_node, false); } folio_unlock(kfolio); /* * If we fail to insert the page into the stable tree, * we will have 2 virtual addresses that are pointing * to a ksm page left outside the stable tree, * in which case we need to break_cow on both. */ if (!stable_node) { break_cow(tree_rmap_item); break_cow(rmap_item); } } else if (split) { /* * We are here if we tried to merge two pages and * failed because they both belonged to the same * compound page. We will split the page now, but no * merging will take place. * We do not want to add the cost of a full lock; if * the page is locked, it is better to skip it and * perhaps try again later. */ if (!folio_trylock(folio)) return; split_huge_page(page); folio = page_folio(page); folio_unlock(folio); } } } static struct ksm_rmap_item *get_next_rmap_item(struct ksm_mm_slot *mm_slot, struct ksm_rmap_item **rmap_list, unsigned long addr) { struct ksm_rmap_item *rmap_item; while (*rmap_list) { rmap_item = *rmap_list; if ((rmap_item->address & PAGE_MASK) == addr) return rmap_item; if (rmap_item->address > addr) break; *rmap_list = rmap_item->rmap_list; remove_rmap_item_from_tree(rmap_item); free_rmap_item(rmap_item); } rmap_item = alloc_rmap_item(); if (rmap_item) { /* It has already been zeroed */ rmap_item->mm = mm_slot->slot.mm; rmap_item->mm->ksm_rmap_items++; rmap_item->address = addr; rmap_item->rmap_list = *rmap_list; *rmap_list = rmap_item; } return rmap_item; } /* * Calculate skip age for the ksm page age. The age determines how often * de-duplicating has already been tried unsuccessfully. If the age is * smaller, the scanning of this page is skipped for less scans. * * @age: rmap_item age of page */ static unsigned int skip_age(rmap_age_t age) { if (age <= 3) return 1; if (age <= 5) return 2; if (age <= 8) return 4; return 8; } /* * Determines if a page should be skipped for the current scan. * * @folio: folio containing the page to check * @rmap_item: associated rmap_item of page */ static bool should_skip_rmap_item(struct folio *folio, struct ksm_rmap_item *rmap_item) { rmap_age_t age; if (!ksm_smart_scan) return false; /* * Never skip pages that are already KSM; pages cmp_and_merge_page() * will essentially ignore them, but we still have to process them * properly. */ if (folio_test_ksm(folio)) return false; age = rmap_item->age; if (age != U8_MAX) rmap_item->age++; /* * Smaller ages are not skipped, they need to get a chance to go * through the different phases of the KSM merging. */ if (age < 3) return false; /* * Are we still allowed to skip? If not, then don't skip it * and determine how much more often we are allowed to skip next. */ if (!rmap_item->remaining_skips) { rmap_item->remaining_skips = skip_age(age); return false; } /* Skip this page */ ksm_pages_skipped++; rmap_item->remaining_skips--; remove_rmap_item_from_tree(rmap_item); return true; } struct ksm_next_page_arg { struct folio *folio; struct page *page; unsigned long addr; }; static int ksm_next_page_pmd_entry(pmd_t *pmdp, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct ksm_next_page_arg *private = walk->private; struct vm_area_struct *vma = walk->vma; pte_t *start_ptep = NULL, *ptep, pte; struct mm_struct *mm = walk->mm; struct folio *folio; struct page *page; spinlock_t *ptl; pmd_t pmd; if (ksm_test_exit(mm)) return 0; cond_resched(); pmd = pmdp_get_lockless(pmdp); if (!pmd_present(pmd)) return 0; if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && pmd_leaf(pmd)) { ptl = pmd_lock(mm, pmdp); pmd = pmdp_get(pmdp); if (!pmd_present(pmd)) { goto not_found_unlock; } else if (pmd_leaf(pmd)) { page = vm_normal_page_pmd(vma, addr, pmd); if (!page) goto not_found_unlock; folio = page_folio(page); if (folio_is_zone_device(folio) || !folio_test_anon(folio)) goto not_found_unlock; page += ((addr & (PMD_SIZE - 1)) >> PAGE_SHIFT); goto found_unlock; } spin_unlock(ptl); } start_ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); if (!start_ptep) return 0; for (ptep = start_ptep; addr < end; ptep++, addr += PAGE_SIZE) { pte = ptep_get(ptep); if (!pte_present(pte)) continue; page = vm_normal_page(vma, addr, pte); if (!page) continue; folio = page_folio(page); if (folio_is_zone_device(folio) || !folio_test_anon(folio)) continue; goto found_unlock; } not_found_unlock: spin_unlock(ptl); if (start_ptep) pte_unmap(start_ptep); return 0; found_unlock: folio_get(folio); spin_unlock(ptl); if (start_ptep) pte_unmap(start_ptep); private->page = page; private->folio = folio; private->addr = addr; return 1; } static struct mm_walk_ops ksm_next_page_ops = { .pmd_entry = ksm_next_page_pmd_entry, .walk_lock = PGWALK_RDLOCK, }; static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page) { struct mm_struct *mm; struct ksm_mm_slot *mm_slot; struct mm_slot *slot; struct vm_area_struct *vma; struct ksm_rmap_item *rmap_item; struct vma_iterator vmi; int nid; if (list_empty(&ksm_mm_head.slot.mm_node)) return NULL; mm_slot = ksm_scan.mm_slot; if (mm_slot == &ksm_mm_head) { advisor_start_scan(); trace_ksm_start_scan(ksm_scan.seqnr, ksm_rmap_items); /* * A number of pages can hang around indefinitely in per-cpu * LRU cache, raised page count preventing write_protect_page * from merging them. Though it doesn't really matter much, * it is puzzling to see some stuck in pages_volatile until * other activity jostles them out, and they also prevented * LTP's KSM test from succeeding deterministically; so drain * them here (here rather than on entry to ksm_do_scan(), * so we don't IPI too often when pages_to_scan is set low). */ lru_add_drain_all(); /* * Whereas stale stable_nodes on the stable_tree itself * get pruned in the regular course of stable_tree_search(), * those moved out to the migrate_nodes list can accumulate: * so prune them once before each full scan. */ if (!ksm_merge_across_nodes) { struct ksm_stable_node *stable_node, *next; struct folio *folio; list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) { folio = ksm_get_folio(stable_node, KSM_GET_FOLIO_NOLOCK); if (folio) folio_put(folio); cond_resched(); } } for (nid = 0; nid < ksm_nr_node_ids; nid++) root_unstable_tree[nid] = RB_ROOT; spin_lock(&ksm_mmlist_lock); slot = list_entry(mm_slot->slot.mm_node.next, struct mm_slot, mm_node); mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot); ksm_scan.mm_slot = mm_slot; spin_unlock(&ksm_mmlist_lock); /* * Although we tested list_empty() above, a racing __ksm_exit * of the last mm on the list may have removed it since then. */ if (mm_slot == &ksm_mm_head) return NULL; next_mm: ksm_scan.address = 0; ksm_scan.rmap_list = &mm_slot->rmap_list; } slot = &mm_slot->slot; mm = slot->mm; vma_iter_init(&vmi, mm, ksm_scan.address); mmap_read_lock(mm); if (ksm_test_exit(mm)) goto no_vmas; for_each_vma(vmi, vma) { if (!(vma->vm_flags & VM_MERGEABLE)) continue; if (ksm_scan.address < vma->vm_start) ksm_scan.address = vma->vm_start; if (!vma->anon_vma) ksm_scan.address = vma->vm_end; while (ksm_scan.address < vma->vm_end) { struct ksm_next_page_arg ksm_next_page_arg; struct page *tmp_page = NULL; struct folio *folio; if (ksm_test_exit(mm)) break; int found; found = walk_page_range_vma(vma, ksm_scan.address, vma->vm_end, &ksm_next_page_ops, &ksm_next_page_arg); if (found > 0) { folio = ksm_next_page_arg.folio; tmp_page = ksm_next_page_arg.page; ksm_scan.address = ksm_next_page_arg.addr; } else { VM_WARN_ON_ONCE(found < 0); ksm_scan.address = vma->vm_end - PAGE_SIZE; } if (tmp_page) { flush_anon_page(vma, tmp_page, ksm_scan.address); flush_dcache_page(tmp_page); rmap_item = get_next_rmap_item(mm_slot, ksm_scan.rmap_list, ksm_scan.address); if (rmap_item) { ksm_scan.rmap_list = &rmap_item->rmap_list; if (should_skip_rmap_item(folio, rmap_item)) { folio_put(folio); goto next_page; } ksm_scan.address += PAGE_SIZE; *page = tmp_page; } else { folio_put(folio); } mmap_read_unlock(mm); return rmap_item; } next_page: ksm_scan.address += PAGE_SIZE; cond_resched(); } } if (ksm_test_exit(mm)) { no_vmas: ksm_scan.address = 0; ksm_scan.rmap_list = &mm_slot->rmap_list; } /* * Nuke all the rmap_items that are above this current rmap: * because there were no VM_MERGEABLE vmas with such addresses. */ remove_trailing_rmap_items(ksm_scan.rmap_list); spin_lock(&ksm_mmlist_lock); slot = list_entry(mm_slot->slot.mm_node.next, struct mm_slot, mm_node); ksm_scan.mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot); if (ksm_scan.address == 0) { /* * We've completed a full scan of all vmas, holding mmap_lock * throughout, and found no VM_MERGEABLE: so do the same as * __ksm_exit does to remove this mm from all our lists now. * This applies either when cleaning up after __ksm_exit * (but beware: we can reach here even before __ksm_exit), * or when all VM_MERGEABLE areas have been unmapped (and * mmap_lock then protects against race with MADV_MERGEABLE). */ hash_del(&mm_slot->slot.hash); list_del(&mm_slot->slot.mm_node); spin_unlock(&ksm_mmlist_lock); mm_slot_free(mm_slot_cache, mm_slot); /* * Only clear MMF_VM_MERGEABLE. We must not clear * MMF_VM_MERGE_ANY, because for those MMF_VM_MERGE_ANY process, * perhaps their mm_struct has just been added to ksm_mm_slot * list, and its process has not yet officially started running * or has not yet performed mmap/brk to allocate anonymous VMAS. */ mm_flags_clear(MMF_VM_MERGEABLE, mm); mmap_read_unlock(mm); mmdrop(mm); } else { mmap_read_unlock(mm); /* * mmap_read_unlock(mm) first because after * spin_unlock(&ksm_mmlist_lock) run, the "mm" may * already have been freed under us by __ksm_exit() * because the "mm_slot" is still hashed and * ksm_scan.mm_slot doesn't point to it anymore. */ spin_unlock(&ksm_mmlist_lock); } /* Repeat until we've completed scanning the whole list */ mm_slot = ksm_scan.mm_slot; if (mm_slot != &ksm_mm_head) goto next_mm; advisor_stop_scan(); trace_ksm_stop_scan(ksm_scan.seqnr, ksm_rmap_items); ksm_scan.seqnr++; return NULL; } /** * ksm_do_scan - the ksm scanner main worker function. * @scan_npages: number of pages we want to scan before we return. */ static void ksm_do_scan(unsigned int scan_npages) { struct ksm_rmap_item *rmap_item; struct page *page; while (scan_npages-- && likely(!freezing(current))) { cond_resched(); rmap_item = scan_get_next_rmap_item(&page); if (!rmap_item) return; cmp_and_merge_page(page, rmap_item); put_page(page); ksm_pages_scanned++; } } static int ksmd_should_run(void) { return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.slot.mm_node); } static int ksm_scan_thread(void *nothing) { unsigned int sleep_ms; set_freezable(); set_user_nice(current, 5); while (!kthread_should_stop()) { mutex_lock(&ksm_thread_mutex); wait_while_offlining(); if (ksmd_should_run()) ksm_do_scan(ksm_thread_pages_to_scan); mutex_unlock(&ksm_thread_mutex); if (ksmd_should_run()) { sleep_ms = READ_ONCE(ksm_thread_sleep_millisecs); wait_event_freezable_timeout(ksm_iter_wait, sleep_ms != READ_ONCE(ksm_thread_sleep_millisecs), msecs_to_jiffies(sleep_ms)); } else { wait_event_freezable(ksm_thread_wait, ksmd_should_run() || kthread_should_stop()); } } return 0; } static bool __ksm_should_add_vma(const struct file *file, vm_flags_t vm_flags) { if (vm_flags & VM_MERGEABLE) return false; return ksm_compatible(file, vm_flags); } static void __ksm_add_vma(struct vm_area_struct *vma) { if (__ksm_should_add_vma(vma->vm_file, vma->vm_flags)) vm_flags_set(vma, VM_MERGEABLE); } static int __ksm_del_vma(struct vm_area_struct *vma) { int err; if (!(vma->vm_flags & VM_MERGEABLE)) return 0; if (vma->anon_vma) { err = break_ksm(vma, vma->vm_start, vma->vm_end, true); if (err) return err; } vm_flags_clear(vma, VM_MERGEABLE); return 0; } /** * ksm_vma_flags - Update VMA flags to mark as mergeable if compatible * * @mm: Proposed VMA's mm_struct * @file: Proposed VMA's file-backed mapping, if any. * @vm_flags: Proposed VMA"s flags. * * Returns: @vm_flags possibly updated to mark mergeable. */ vm_flags_t ksm_vma_flags(struct mm_struct *mm, const struct file *file, vm_flags_t vm_flags) { if (mm_flags_test(MMF_VM_MERGE_ANY, mm) && __ksm_should_add_vma(file, vm_flags)) { vm_flags |= VM_MERGEABLE; /* * Generally, the flags here always include MMF_VM_MERGEABLE. * However, in rare cases, this flag may be cleared by ksmd who * scans a cycle without finding any mergeable vma. */ if (unlikely(!mm_flags_test(MMF_VM_MERGEABLE, mm))) __ksm_enter(mm); } return vm_flags; } static void ksm_add_vmas(struct mm_struct *mm) { struct vm_area_struct *vma; VMA_ITERATOR(vmi, mm, 0); for_each_vma(vmi, vma) __ksm_add_vma(vma); } static int ksm_del_vmas(struct mm_struct *mm) { struct vm_area_struct *vma; int err; VMA_ITERATOR(vmi, mm, 0); for_each_vma(vmi, vma) { err = __ksm_del_vma(vma); if (err) return err; } return 0; } /** * ksm_enable_merge_any - Add mm to mm ksm list and enable merging on all * compatible VMA's * * @mm: Pointer to mm * * Returns 0 on success, otherwise error code */ int ksm_enable_merge_any(struct mm_struct *mm) { int err; if (mm_flags_test(MMF_VM_MERGE_ANY, mm)) return 0; if (!mm_flags_test(MMF_VM_MERGEABLE, mm)) { err = __ksm_enter(mm); if (err) return err; } mm_flags_set(MMF_VM_MERGE_ANY, mm); ksm_add_vmas(mm); return 0; } /** * ksm_disable_merge_any - Disable merging on all compatible VMA's of the mm, * previously enabled via ksm_enable_merge_any(). * * Disabling merging implies unmerging any merged pages, like setting * MADV_UNMERGEABLE would. If unmerging fails, the whole operation fails and * merging on all compatible VMA's remains enabled. * * @mm: Pointer to mm * * Returns 0 on success, otherwise error code */ int ksm_disable_merge_any(struct mm_struct *mm) { int err; if (!mm_flags_test(MMF_VM_MERGE_ANY, mm)) return 0; err = ksm_del_vmas(mm); if (err) { ksm_add_vmas(mm); return err; } mm_flags_clear(MMF_VM_MERGE_ANY, mm); return 0; } int ksm_disable(struct mm_struct *mm) { mmap_assert_write_locked(mm); if (!mm_flags_test(MMF_VM_MERGEABLE, mm)) return 0; if (mm_flags_test(MMF_VM_MERGE_ANY, mm)) return ksm_disable_merge_any(mm); return ksm_del_vmas(mm); } int ksm_madvise(struct vm_area_struct *vma, unsigned long start, unsigned long end, int advice, vm_flags_t *vm_flags) { struct mm_struct *mm = vma->vm_mm; int err; switch (advice) { case MADV_MERGEABLE: if (vma->vm_flags & VM_MERGEABLE) return 0; if (!vma_ksm_compatible(vma)) return 0; if (!mm_flags_test(MMF_VM_MERGEABLE, mm)) { err = __ksm_enter(mm); if (err) return err; } *vm_flags |= VM_MERGEABLE; break; case MADV_UNMERGEABLE: if (!(*vm_flags & VM_MERGEABLE)) return 0; /* just ignore the advice */ if (vma->anon_vma) { err = break_ksm(vma, start, end, true); if (err) return err; } *vm_flags &= ~VM_MERGEABLE; break; } return 0; } EXPORT_SYMBOL_GPL(ksm_madvise); int __ksm_enter(struct mm_struct *mm) { struct ksm_mm_slot *mm_slot; struct mm_slot *slot; int needs_wakeup; mm_slot = mm_slot_alloc(mm_slot_cache); if (!mm_slot) return -ENOMEM; slot = &mm_slot->slot; /* Check ksm_run too? Would need tighter locking */ needs_wakeup = list_empty(&ksm_mm_head.slot.mm_node); spin_lock(&ksm_mmlist_lock); mm_slot_insert(mm_slots_hash, mm, slot); /* * When KSM_RUN_MERGE (or KSM_RUN_STOP), * insert just behind the scanning cursor, to let the area settle * down a little; when fork is followed by immediate exec, we don't * want ksmd to waste time setting up and tearing down an rmap_list. * * But when KSM_RUN_UNMERGE, it's important to insert ahead of its * scanning cursor, otherwise KSM pages in newly forked mms will be * missed: then we might as well insert at the end of the list. */ if (ksm_run & KSM_RUN_UNMERGE) list_add_tail(&slot->mm_node, &ksm_mm_head.slot.mm_node); else list_add_tail(&slot->mm_node, &ksm_scan.mm_slot->slot.mm_node); spin_unlock(&ksm_mmlist_lock); mm_flags_set(MMF_VM_MERGEABLE, mm); mmgrab(mm); if (needs_wakeup) wake_up_interruptible(&ksm_thread_wait); trace_ksm_enter(mm); return 0; } void __ksm_exit(struct mm_struct *mm) { struct ksm_mm_slot *mm_slot = NULL; struct mm_slot *slot; int easy_to_free = 0; /* * This process is exiting: if it's straightforward (as is the * case when ksmd was never running), free mm_slot immediately. * But if it's at the cursor or has rmap_items linked to it, use * mmap_lock to synchronize with any break_cows before pagetables * are freed, and leave the mm_slot on the list for ksmd to free. * Beware: ksm may already have noticed it exiting and freed the slot. */ spin_lock(&ksm_mmlist_lock); slot = mm_slot_lookup(mm_slots_hash, mm); if (!slot) goto unlock; mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot); if (ksm_scan.mm_slot == mm_slot) goto unlock; if (!mm_slot->rmap_list) { hash_del(&slot->hash); list_del(&slot->mm_node); easy_to_free = 1; } else { list_move(&slot->mm_node, &ksm_scan.mm_slot->slot.mm_node); } unlock: spin_unlock(&ksm_mmlist_lock); if (easy_to_free) { mm_slot_free(mm_slot_cache, mm_slot); mm_flags_clear(MMF_VM_MERGE_ANY, mm); mm_flags_clear(MMF_VM_MERGEABLE, mm); mmdrop(mm); } else if (mm_slot) { mmap_write_lock(mm); mmap_write_unlock(mm); } trace_ksm_exit(mm); } struct folio *ksm_might_need_to_copy(struct folio *folio, struct vm_area_struct *vma, unsigned long addr) { struct page *page = folio_page(folio, 0); struct anon_vma *anon_vma = folio_anon_vma(folio); struct folio *new_folio; if (folio_test_large(folio)) return folio; if (folio_test_ksm(folio)) { if (folio_stable_node(folio) && !(ksm_run & KSM_RUN_UNMERGE)) return folio; /* no need to copy it */ } else if (!anon_vma) { return folio; /* no need to copy it */ } else if (folio->index == linear_page_index(vma, addr) && anon_vma->root == vma->anon_vma->root) { return folio; /* still no need to copy it */ } if (PageHWPoison(page)) return ERR_PTR(-EHWPOISON); if (!folio_test_uptodate(folio)) return folio; /* let do_swap_page report the error */ new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); if (new_folio && mem_cgroup_charge(new_folio, vma->vm_mm, GFP_KERNEL)) { folio_put(new_folio); new_folio = NULL; } if (new_folio) { if (copy_mc_user_highpage(folio_page(new_folio, 0), page, addr, vma)) { folio_put(new_folio); return ERR_PTR(-EHWPOISON); } folio_set_dirty(new_folio); __folio_mark_uptodate(new_folio); __folio_set_locked(new_folio); #ifdef CONFIG_SWAP count_vm_event(KSM_SWPIN_COPY); #endif } return new_folio; } void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc) { struct ksm_stable_node *stable_node; struct ksm_rmap_item *rmap_item; int search_new_forks = 0; VM_BUG_ON_FOLIO(!folio_test_ksm(folio), folio); /* * Rely on the page lock to protect against concurrent modifications * to that page's node of the stable tree. */ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); stable_node = folio_stable_node(folio); if (!stable_node) return; again: hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) { struct anon_vma *anon_vma = rmap_item->anon_vma; struct anon_vma_chain *vmac; struct vm_area_struct *vma; cond_resched(); if (!anon_vma_trylock_read(anon_vma)) { if (rwc->try_lock) { rwc->contended = true; return; } anon_vma_lock_read(anon_vma); } anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root, 0, ULONG_MAX) { unsigned long addr; cond_resched(); vma = vmac->vma; /* Ignore the stable/unstable/sqnr flags */ addr = rmap_item->address & PAGE_MASK; if (addr < vma->vm_start || addr >= vma->vm_end) continue; /* * Initially we examine only the vma which covers this * rmap_item; but later, if there is still work to do, * we examine covering vmas in other mms: in case they * were forked from the original since ksmd passed. */ if ((rmap_item->mm == vma->vm_mm) == search_new_forks) continue; if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg)) continue; if (!rwc->rmap_one(folio, vma, addr, rwc->arg)) { anon_vma_unlock_read(anon_vma); return; } if (rwc->done && rwc->done(folio)) { anon_vma_unlock_read(anon_vma); return; } } anon_vma_unlock_read(anon_vma); } if (!search_new_forks++) goto again; } #ifdef CONFIG_MEMORY_FAILURE /* * Collect processes when the error hit an ksm page. */ void collect_procs_ksm(const struct folio *folio, const struct page *page, struct list_head *to_kill, int force_early) { struct ksm_stable_node *stable_node; struct ksm_rmap_item *rmap_item; struct vm_area_struct *vma; struct task_struct *tsk; stable_node = folio_stable_node(folio); if (!stable_node) return; hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) { struct anon_vma *av = rmap_item->anon_vma; anon_vma_lock_read(av); rcu_read_lock(); for_each_process(tsk) { struct anon_vma_chain *vmac; unsigned long addr; struct task_struct *t = task_early_kill(tsk, force_early); if (!t) continue; anon_vma_interval_tree_foreach(vmac, &av->rb_root, 0, ULONG_MAX) { vma = vmac->vma; if (vma->vm_mm == t->mm) { addr = rmap_item->address & PAGE_MASK; add_to_kill_ksm(t, page, vma, to_kill, addr); } } } rcu_read_unlock(); anon_vma_unlock_read(av); } } #endif #ifdef CONFIG_MIGRATION void folio_migrate_ksm(struct folio *newfolio, struct folio *folio) { struct ksm_stable_node *stable_node; VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(!folio_test_locked(newfolio), newfolio); VM_BUG_ON_FOLIO(newfolio->mapping != folio->mapping, newfolio); stable_node = folio_stable_node(folio); if (stable_node) { VM_BUG_ON_FOLIO(stable_node->kpfn != folio_pfn(folio), folio); stable_node->kpfn = folio_pfn(newfolio); /* * newfolio->mapping was set in advance; now we need smp_wmb() * to make sure that the new stable_node->kpfn is visible * to ksm_get_folio() before it can see that folio->mapping * has gone stale (or that the swapcache flag has been cleared). */ smp_wmb(); folio_set_stable_node(folio, NULL); } } #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_MEMORY_HOTREMOVE static void wait_while_offlining(void) { while (ksm_run & KSM_RUN_OFFLINE) { mutex_unlock(&ksm_thread_mutex); wait_on_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE), TASK_UNINTERRUPTIBLE); mutex_lock(&ksm_thread_mutex); } } static bool stable_node_dup_remove_range(struct ksm_stable_node *stable_node, unsigned long start_pfn, unsigned long end_pfn) { if (stable_node->kpfn >= start_pfn && stable_node->kpfn < end_pfn) { /* * Don't ksm_get_folio, page has already gone: * which is why we keep kpfn instead of page* */ remove_node_from_stable_tree(stable_node); return true; } return false; } static bool stable_node_chain_remove_range(struct ksm_stable_node *stable_node, unsigned long start_pfn, unsigned long end_pfn, struct rb_root *root) { struct ksm_stable_node *dup; struct hlist_node *hlist_safe; if (!is_stable_node_chain(stable_node)) { VM_BUG_ON(is_stable_node_dup(stable_node)); return stable_node_dup_remove_range(stable_node, start_pfn, end_pfn); } hlist_for_each_entry_safe(dup, hlist_safe, &stable_node->hlist, hlist_dup) { VM_BUG_ON(!is_stable_node_dup(dup)); stable_node_dup_remove_range(dup, start_pfn, end_pfn); } if (hlist_empty(&stable_node->hlist)) { free_stable_node_chain(stable_node, root); return true; /* notify caller that tree was rebalanced */ } else return false; } static void ksm_check_stable_tree(unsigned long start_pfn, unsigned long end_pfn) { struct ksm_stable_node *stable_node, *next; struct rb_node *node; int nid; for (nid = 0; nid < ksm_nr_node_ids; nid++) { node = rb_first(root_stable_tree + nid); while (node) { stable_node = rb_entry(node, struct ksm_stable_node, node); if (stable_node_chain_remove_range(stable_node, start_pfn, end_pfn, root_stable_tree + nid)) node = rb_first(root_stable_tree + nid); else node = rb_next(node); cond_resched(); } } list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) { if (stable_node->kpfn >= start_pfn && stable_node->kpfn < end_pfn) remove_node_from_stable_tree(stable_node); cond_resched(); } } static int ksm_memory_callback(struct notifier_block *self, unsigned long action, void *arg) { struct memory_notify *mn = arg; switch (action) { case MEM_GOING_OFFLINE: /* * Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items() * and remove_all_stable_nodes() while memory is going offline: * it is unsafe for them to touch the stable tree at this time. * But break_ksm(), rmap lookups and other entry points * which do not need the ksm_thread_mutex are all safe. */ mutex_lock(&ksm_thread_mutex); ksm_run |= KSM_RUN_OFFLINE; mutex_unlock(&ksm_thread_mutex); break; case MEM_OFFLINE: /* * Most of the work is done by page migration; but there might * be a few stable_nodes left over, still pointing to struct * pages which have been offlined: prune those from the tree, * otherwise ksm_get_folio() might later try to access a * non-existent struct page. */ ksm_check_stable_tree(mn->start_pfn, mn->start_pfn + mn->nr_pages); fallthrough; case MEM_CANCEL_OFFLINE: mutex_lock(&ksm_thread_mutex); ksm_run &= ~KSM_RUN_OFFLINE; mutex_unlock(&ksm_thread_mutex); smp_mb(); /* wake_up_bit advises this */ wake_up_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE)); break; } return NOTIFY_OK; } #else static void wait_while_offlining(void) { } #endif /* CONFIG_MEMORY_HOTREMOVE */ #ifdef CONFIG_PROC_FS /* * The process is mergeable only if any VMA is currently * applicable to KSM. * * The mmap lock must be held in read mode. */ bool ksm_process_mergeable(struct mm_struct *mm) { struct vm_area_struct *vma; mmap_assert_locked(mm); VMA_ITERATOR(vmi, mm, 0); for_each_vma(vmi, vma) if (vma->vm_flags & VM_MERGEABLE) return true; return false; } long ksm_process_profit(struct mm_struct *mm) { return (long)(mm->ksm_merging_pages + mm_ksm_zero_pages(mm)) * PAGE_SIZE - mm->ksm_rmap_items * sizeof(struct ksm_rmap_item); } #endif /* CONFIG_PROC_FS */ #ifdef CONFIG_SYSFS /* * This all compiles without CONFIG_SYSFS, but is a waste of space. */ #define KSM_ATTR_RO(_name) \ static struct kobj_attribute _name##_attr = __ATTR_RO(_name) #define KSM_ATTR(_name) \ static struct kobj_attribute _name##_attr = __ATTR_RW(_name) static ssize_t sleep_millisecs_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_thread_sleep_millisecs); } static ssize_t sleep_millisecs_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { unsigned int msecs; int err; err = kstrtouint(buf, 10, &msecs); if (err) return -EINVAL; ksm_thread_sleep_millisecs = msecs; wake_up_interruptible(&ksm_iter_wait); return count; } KSM_ATTR(sleep_millisecs); static ssize_t pages_to_scan_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_thread_pages_to_scan); } static ssize_t pages_to_scan_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { unsigned int nr_pages; int err; if (ksm_advisor != KSM_ADVISOR_NONE) return -EINVAL; err = kstrtouint(buf, 10, &nr_pages); if (err) return -EINVAL; ksm_thread_pages_to_scan = nr_pages; return count; } KSM_ATTR(pages_to_scan); static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_run); } static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { unsigned int flags; int err; err = kstrtouint(buf, 10, &flags); if (err) return -EINVAL; if (flags > KSM_RUN_UNMERGE) return -EINVAL; /* * KSM_RUN_MERGE sets ksmd running, and 0 stops it running. * KSM_RUN_UNMERGE stops it running and unmerges all rmap_items, * breaking COW to free the pages_shared (but leaves mm_slots * on the list for when ksmd may be set running again). */ mutex_lock(&ksm_thread_mutex); wait_while_offlining(); if (ksm_run != flags) { ksm_run = flags; if (flags & KSM_RUN_UNMERGE) { set_current_oom_origin(); err = unmerge_and_remove_all_rmap_items(); clear_current_oom_origin(); if (err) { ksm_run = KSM_RUN_STOP; count = err; } } } mutex_unlock(&ksm_thread_mutex); if (flags & KSM_RUN_MERGE) wake_up_interruptible(&ksm_thread_wait); return count; } KSM_ATTR(run); #ifdef CONFIG_NUMA static ssize_t merge_across_nodes_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_merge_across_nodes); } static ssize_t merge_across_nodes_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; unsigned long knob; err = kstrtoul(buf, 10, &knob); if (err) return err; if (knob > 1) return -EINVAL; mutex_lock(&ksm_thread_mutex); wait_while_offlining(); if (ksm_merge_across_nodes != knob) { if (ksm_pages_shared || remove_all_stable_nodes()) err = -EBUSY; else if (root_stable_tree == one_stable_tree) { struct rb_root *buf; /* * This is the first time that we switch away from the * default of merging across nodes: must now allocate * a buffer to hold as many roots as may be needed. * Allocate stable and unstable together: * MAXSMP NODES_SHIFT 10 will use 16kB. */ buf = kcalloc(nr_node_ids + nr_node_ids, sizeof(*buf), GFP_KERNEL); /* Let us assume that RB_ROOT is NULL is zero */ if (!buf) err = -ENOMEM; else { root_stable_tree = buf; root_unstable_tree = buf + nr_node_ids; /* Stable tree is empty but not the unstable */ root_unstable_tree[0] = one_unstable_tree[0]; } } if (!err) { ksm_merge_across_nodes = knob; ksm_nr_node_ids = knob ? 1 : nr_node_ids; } } mutex_unlock(&ksm_thread_mutex); return err ? err : count; } KSM_ATTR(merge_across_nodes); #endif static ssize_t use_zero_pages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_use_zero_pages); } static ssize_t use_zero_pages_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; bool value; err = kstrtobool(buf, &value); if (err) return -EINVAL; ksm_use_zero_pages = value; return count; } KSM_ATTR(use_zero_pages); static ssize_t max_page_sharing_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_max_page_sharing); } static ssize_t max_page_sharing_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; int knob; err = kstrtoint(buf, 10, &knob); if (err) return err; /* * When a KSM page is created it is shared by 2 mappings. This * being a signed comparison, it implicitly verifies it's not * negative. */ if (knob < 2) return -EINVAL; if (READ_ONCE(ksm_max_page_sharing) == knob) return count; mutex_lock(&ksm_thread_mutex); wait_while_offlining(); if (ksm_max_page_sharing != knob) { if (ksm_pages_shared || remove_all_stable_nodes()) err = -EBUSY; else ksm_max_page_sharing = knob; } mutex_unlock(&ksm_thread_mutex); return err ? err : count; } KSM_ATTR(max_page_sharing); static ssize_t pages_scanned_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_pages_scanned); } KSM_ATTR_RO(pages_scanned); static ssize_t pages_shared_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_pages_shared); } KSM_ATTR_RO(pages_shared); static ssize_t pages_sharing_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_pages_sharing); } KSM_ATTR_RO(pages_sharing); static ssize_t pages_unshared_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_pages_unshared); } KSM_ATTR_RO(pages_unshared); static ssize_t pages_volatile_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { long ksm_pages_volatile; ksm_pages_volatile = ksm_rmap_items - ksm_pages_shared - ksm_pages_sharing - ksm_pages_unshared; /* * It was not worth any locking to calculate that statistic, * but it might therefore sometimes be negative: conceal that. */ if (ksm_pages_volatile < 0) ksm_pages_volatile = 0; return sysfs_emit(buf, "%ld\n", ksm_pages_volatile); } KSM_ATTR_RO(pages_volatile); static ssize_t pages_skipped_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_pages_skipped); } KSM_ATTR_RO(pages_skipped); static ssize_t ksm_zero_pages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%ld\n", atomic_long_read(&ksm_zero_pages)); } KSM_ATTR_RO(ksm_zero_pages); static ssize_t general_profit_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { long general_profit; general_profit = (ksm_pages_sharing + atomic_long_read(&ksm_zero_pages)) * PAGE_SIZE - ksm_rmap_items * sizeof(struct ksm_rmap_item); return sysfs_emit(buf, "%ld\n", general_profit); } KSM_ATTR_RO(general_profit); static ssize_t stable_node_dups_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_stable_node_dups); } KSM_ATTR_RO(stable_node_dups); static ssize_t stable_node_chains_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_stable_node_chains); } KSM_ATTR_RO(stable_node_chains); static ssize_t stable_node_chains_prune_millisecs_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_stable_node_chains_prune_millisecs); } static ssize_t stable_node_chains_prune_millisecs_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { unsigned int msecs; int err; err = kstrtouint(buf, 10, &msecs); if (err) return -EINVAL; ksm_stable_node_chains_prune_millisecs = msecs; return count; } KSM_ATTR(stable_node_chains_prune_millisecs); static ssize_t full_scans_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_scan.seqnr); } KSM_ATTR_RO(full_scans); static ssize_t smart_scan_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_smart_scan); } static ssize_t smart_scan_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; bool value; err = kstrtobool(buf, &value); if (err) return -EINVAL; ksm_smart_scan = value; return count; } KSM_ATTR(smart_scan); static ssize_t advisor_mode_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { const char *output; if (ksm_advisor == KSM_ADVISOR_SCAN_TIME) output = "none [scan-time]"; else output = "[none] scan-time"; return sysfs_emit(buf, "%s\n", output); } static ssize_t advisor_mode_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { enum ksm_advisor_type curr_advisor = ksm_advisor; if (sysfs_streq("scan-time", buf)) ksm_advisor = KSM_ADVISOR_SCAN_TIME; else if (sysfs_streq("none", buf)) ksm_advisor = KSM_ADVISOR_NONE; else return -EINVAL; /* Set advisor default values */ if (curr_advisor != ksm_advisor) set_advisor_defaults(); return count; } KSM_ATTR(advisor_mode); static ssize_t advisor_max_cpu_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%u\n", ksm_advisor_max_cpu); } static ssize_t advisor_max_cpu_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; unsigned long value; err = kstrtoul(buf, 10, &value); if (err) return -EINVAL; ksm_advisor_max_cpu = value; return count; } KSM_ATTR(advisor_max_cpu); static ssize_t advisor_min_pages_to_scan_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_advisor_min_pages_to_scan); } static ssize_t advisor_min_pages_to_scan_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; unsigned long value; err = kstrtoul(buf, 10, &value); if (err) return -EINVAL; ksm_advisor_min_pages_to_scan = value; return count; } KSM_ATTR(advisor_min_pages_to_scan); static ssize_t advisor_max_pages_to_scan_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_advisor_max_pages_to_scan); } static ssize_t advisor_max_pages_to_scan_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; unsigned long value; err = kstrtoul(buf, 10, &value); if (err) return -EINVAL; ksm_advisor_max_pages_to_scan = value; return count; } KSM_ATTR(advisor_max_pages_to_scan); static ssize_t advisor_target_scan_time_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%lu\n", ksm_advisor_target_scan_time); } static ssize_t advisor_target_scan_time_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; unsigned long value; err = kstrtoul(buf, 10, &value); if (err) return -EINVAL; if (value < 1) return -EINVAL; ksm_advisor_target_scan_time = value; return count; } KSM_ATTR(advisor_target_scan_time); static struct attribute *ksm_attrs[] = { &sleep_millisecs_attr.attr, &pages_to_scan_attr.attr, &run_attr.attr, &pages_scanned_attr.attr, &pages_shared_attr.attr, &pages_sharing_attr.attr, &pages_unshared_attr.attr, &pages_volatile_attr.attr, &pages_skipped_attr.attr, &ksm_zero_pages_attr.attr, &full_scans_attr.attr, #ifdef CONFIG_NUMA &merge_across_nodes_attr.attr, #endif &max_page_sharing_attr.attr, &stable_node_chains_attr.attr, &stable_node_dups_attr.attr, &stable_node_chains_prune_millisecs_attr.attr, &use_zero_pages_attr.attr, &general_profit_attr.attr, &smart_scan_attr.attr, &advisor_mode_attr.attr, &advisor_max_cpu_attr.attr, &advisor_min_pages_to_scan_attr.attr, &advisor_max_pages_to_scan_attr.attr, &advisor_target_scan_time_attr.attr, NULL, }; static const struct attribute_group ksm_attr_group = { .attrs = ksm_attrs, .name = "ksm", }; #endif /* CONFIG_SYSFS */ static int __init ksm_init(void) { struct task_struct *ksm_thread; int err; /* The correct value depends on page size and endianness */ zero_checksum = calc_checksum(ZERO_PAGE(0)); /* Default to false for backwards compatibility */ ksm_use_zero_pages = false; err = ksm_slab_init(); if (err) goto out; ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd"); if (IS_ERR(ksm_thread)) { pr_err("ksm: creating kthread failed\n"); err = PTR_ERR(ksm_thread); goto out_free; } #ifdef CONFIG_SYSFS err = sysfs_create_group(mm_kobj, &ksm_attr_group); if (err) { pr_err("ksm: register sysfs failed\n"); kthread_stop(ksm_thread); goto out_free; } #else ksm_run = KSM_RUN_MERGE; /* no way for user to start it */ #endif /* CONFIG_SYSFS */ #ifdef CONFIG_MEMORY_HOTREMOVE /* There is no significance to this priority 100 */ hotplug_memory_notifier(ksm_memory_callback, KSM_CALLBACK_PRI); #endif return 0; out_free: ksm_slab_free(); out: return err; } subsys_initcall(ksm_init); |
| 2 2 2 5 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 | // SPDX-License-Identifier: GPL-2.0 /* drivers/rtc/rtc-goldfish.c * * Copyright (C) 2007 Google, Inc. * Copyright (C) 2017 Imagination Technologies Ltd. */ #include <linux/io.h> #include <linux/module.h> #include <linux/of.h> #include <linux/platform_device.h> #include <linux/rtc.h> #include <linux/goldfish.h> #include <clocksource/timer-goldfish.h> struct goldfish_rtc { void __iomem *base; int irq; struct rtc_device *rtc; }; static int goldfish_rtc_read_alarm(struct device *dev, struct rtc_wkalrm *alrm) { u64 rtc_alarm; u64 rtc_alarm_low; u64 rtc_alarm_high; void __iomem *base; struct goldfish_rtc *rtcdrv; rtcdrv = dev_get_drvdata(dev); base = rtcdrv->base; rtc_alarm_low = gf_ioread32(base + TIMER_ALARM_LOW); rtc_alarm_high = gf_ioread32(base + TIMER_ALARM_HIGH); rtc_alarm = (rtc_alarm_high << 32) | rtc_alarm_low; do_div(rtc_alarm, NSEC_PER_SEC); memset(alrm, 0, sizeof(struct rtc_wkalrm)); rtc_time64_to_tm(rtc_alarm, &alrm->time); if (gf_ioread32(base + TIMER_ALARM_STATUS)) alrm->enabled = 1; else alrm->enabled = 0; return 0; } static int goldfish_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *alrm) { struct goldfish_rtc *rtcdrv; u64 rtc_alarm64; u64 rtc_status_reg; void __iomem *base; rtcdrv = dev_get_drvdata(dev); base = rtcdrv->base; if (alrm->enabled) { rtc_alarm64 = rtc_tm_to_time64(&alrm->time) * NSEC_PER_SEC; gf_iowrite32((rtc_alarm64 >> 32), base + TIMER_ALARM_HIGH); gf_iowrite32(rtc_alarm64, base + TIMER_ALARM_LOW); gf_iowrite32(1, base + TIMER_IRQ_ENABLED); } else { /* * if this function was called with enabled=0 * then it could mean that the application is * trying to cancel an ongoing alarm */ rtc_status_reg = gf_ioread32(base + TIMER_ALARM_STATUS); if (rtc_status_reg) gf_iowrite32(1, base + TIMER_CLEAR_ALARM); } return 0; } static int goldfish_rtc_alarm_irq_enable(struct device *dev, unsigned int enabled) { void __iomem *base; struct goldfish_rtc *rtcdrv; rtcdrv = dev_get_drvdata(dev); base = rtcdrv->base; if (enabled) gf_iowrite32(1, base + TIMER_IRQ_ENABLED); else gf_iowrite32(0, base + TIMER_IRQ_ENABLED); return 0; } static irqreturn_t goldfish_rtc_interrupt(int irq, void *dev_id) { struct goldfish_rtc *rtcdrv = dev_id; void __iomem *base = rtcdrv->base; gf_iowrite32(1, base + TIMER_CLEAR_INTERRUPT); rtc_update_irq(rtcdrv->rtc, 1, RTC_IRQF | RTC_AF); return IRQ_HANDLED; } static int goldfish_rtc_read_time(struct device *dev, struct rtc_time *tm) { struct goldfish_rtc *rtcdrv; void __iomem *base; u64 time_high; u64 time_low; u64 time; rtcdrv = dev_get_drvdata(dev); base = rtcdrv->base; time_low = gf_ioread32(base + TIMER_TIME_LOW); time_high = gf_ioread32(base + TIMER_TIME_HIGH); time = (time_high << 32) | time_low; do_div(time, NSEC_PER_SEC); rtc_time64_to_tm(time, tm); return 0; } static int goldfish_rtc_set_time(struct device *dev, struct rtc_time *tm) { struct goldfish_rtc *rtcdrv; void __iomem *base; u64 now64; rtcdrv = dev_get_drvdata(dev); base = rtcdrv->base; now64 = rtc_tm_to_time64(tm) * NSEC_PER_SEC; gf_iowrite32((now64 >> 32), base + TIMER_TIME_HIGH); gf_iowrite32(now64, base + TIMER_TIME_LOW); return 0; } static const struct rtc_class_ops goldfish_rtc_ops = { .read_time = goldfish_rtc_read_time, .set_time = goldfish_rtc_set_time, .read_alarm = goldfish_rtc_read_alarm, .set_alarm = goldfish_rtc_set_alarm, .alarm_irq_enable = goldfish_rtc_alarm_irq_enable }; static int goldfish_rtc_probe(struct platform_device *pdev) { struct goldfish_rtc *rtcdrv; int err; rtcdrv = devm_kzalloc(&pdev->dev, sizeof(*rtcdrv), GFP_KERNEL); if (!rtcdrv) return -ENOMEM; platform_set_drvdata(pdev, rtcdrv); rtcdrv->base = devm_platform_ioremap_resource(pdev, 0); if (IS_ERR(rtcdrv->base)) return PTR_ERR(rtcdrv->base); rtcdrv->irq = platform_get_irq(pdev, 0); if (rtcdrv->irq < 0) return -ENODEV; rtcdrv->rtc = devm_rtc_allocate_device(&pdev->dev); if (IS_ERR(rtcdrv->rtc)) return PTR_ERR(rtcdrv->rtc); rtcdrv->rtc->ops = &goldfish_rtc_ops; rtcdrv->rtc->range_max = U64_MAX / NSEC_PER_SEC; err = devm_request_irq(&pdev->dev, rtcdrv->irq, goldfish_rtc_interrupt, 0, pdev->name, rtcdrv); if (err) return err; return devm_rtc_register_device(rtcdrv->rtc); } static const struct of_device_id goldfish_rtc_of_match[] = { { .compatible = "google,goldfish-rtc", }, {}, }; MODULE_DEVICE_TABLE(of, goldfish_rtc_of_match); static struct platform_driver goldfish_rtc = { .probe = goldfish_rtc_probe, .driver = { .name = "goldfish_rtc", .of_match_table = goldfish_rtc_of_match, } }; module_platform_driver(goldfish_rtc); MODULE_DESCRIPTION("Android Goldfish Real Time Clock driver"); MODULE_LICENSE("GPL v2"); |
| 1 1 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3773 3774 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 3841 3842 3843 3844 3845 3846 3847 3848 3849 3850 3851 3852 3853 3854 3855 3856 3857 3858 3859 3860 3861 3862 3863 3864 3865 3866 3867 3868 3869 3870 3871 3872 3873 3874 3875 3876 3877 3878 3879 3880 3881 3882 3883 3884 3885 3886 3887 3888 3889 3890 3891 3892 3893 3894 3895 3896 3897 3898 3899 3900 3901 3902 3903 3904 3905 3906 3907 3908 3909 3910 3911 3912 3913 3914 3915 3916 3917 3918 3919 3920 3921 3922 3923 3924 3925 3926 3927 3928 3929 3930 3931 3932 3933 3934 3935 3936 3937 3938 3939 3940 3941 3942 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 3953 3954 3955 3956 3957 3958 3959 3960 3961 3962 3963 3964 3965 3966 3967 3968 3969 3970 3971 3972 3973 3974 3975 3976 3977 3978 3979 3980 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009 4010 4011 4012 4013 4014 4015 4016 4017 4018 4019 4020 4021 4022 4023 4024 4025 4026 4027 4028 4029 4030 4031 4032 4033 4034 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046 4047 4048 4049 4050 4051 4052 4053 4054 4055 4056 4057 4058 4059 4060 4061 4062 4063 4064 4065 4066 4067 4068 4069 4070 4071 4072 4073 4074 4075 4076 4077 4078 4079 4080 4081 4082 4083 4084 4085 4086 4087 4088 4089 4090 4091 4092 4093 4094 4095 4096 4097 4098 4099 4100 4101 4102 4103 4104 4105 4106 4107 4108 4109 4110 4111 4112 4113 4114 4115 4116 4117 4118 4119 4120 4121 4122 4123 4124 4125 4126 4127 4128 4129 4130 4131 4132 4133 4134 4135 4136 4137 4138 4139 4140 4141 4142 4143 4144 4145 4146 4147 4148 4149 4150 4151 4152 4153 4154 4155 4156 4157 4158 4159 4160 4161 4162 4163 4164 4165 4166 4167 4168 4169 4170 4171 4172 4173 4174 4175 4176 4177 4178 4179 4180 4181 4182 4183 4184 4185 4186 4187 4188 4189 4190 4191 4192 4193 4194 4195 4196 4197 4198 4199 4200 4201 4202 4203 4204 4205 4206 4207 4208 4209 4210 4211 4212 4213 4214 4215 4216 4217 4218 4219 4220 4221 4222 4223 4224 4225 4226 4227 4228 4229 4230 4231 4232 4233 4234 4235 4236 4237 4238 4239 | // SPDX-License-Identifier: GPL-2.0 /* * linux/fs/ext4/namei.c * * Copyright (C) 1992, 1993, 1994, 1995 * Remy Card (card@masi.ibp.fr) * Laboratoire MASI - Institut Blaise Pascal * Universite Pierre et Marie Curie (Paris VI) * * from * * linux/fs/minix/namei.c * * Copyright (C) 1991, 1992 Linus Torvalds * * Big-endian to little-endian byte-swapping/bitmaps by * David S. Miller (davem@caip.rutgers.edu), 1995 * Directory entry file type support and forward compatibility hooks * for B-tree directories by Theodore Ts'o (tytso@mit.edu), 1998 * Hash Tree Directory indexing (c) * Daniel Phillips, 2001 * Hash Tree Directory indexing porting * Christopher Li, 2002 * Hash Tree Directory indexing cleanup * Theodore Ts'o, 2002 */ #include <linux/fs.h> #include <linux/pagemap.h> #include <linux/time.h> #include <linux/fcntl.h> #include <linux/stat.h> #include <linux/string.h> #include <linux/quotaops.h> #include <linux/buffer_head.h> #include <linux/bio.h> #include <linux/iversion.h> #include <linux/unicode.h> #include "ext4.h" #include "ext4_jbd2.h" #include "xattr.h" #include "acl.h" #include <trace/events/ext4.h> /* * define how far ahead to read directories while searching them. */ #define NAMEI_RA_CHUNKS 2 #define NAMEI_RA_BLOCKS 4 #define NAMEI_RA_SIZE (NAMEI_RA_CHUNKS * NAMEI_RA_BLOCKS) static struct buffer_head *ext4_append(handle_t *handle, struct inode *inode, ext4_lblk_t *block) { struct ext4_map_blocks map; struct buffer_head *bh; int err; if (unlikely(EXT4_SB(inode->i_sb)->s_max_dir_size_kb && ((inode->i_size >> 10) >= EXT4_SB(inode->i_sb)->s_max_dir_size_kb))) return ERR_PTR(-ENOSPC); *block = inode->i_size >> inode->i_sb->s_blocksize_bits; map.m_lblk = *block; map.m_len = 1; /* * We're appending new directory block. Make sure the block is not * allocated yet, otherwise we will end up corrupting the * directory. */ err = ext4_map_blocks(NULL, inode, &map, 0); if (err < 0) return ERR_PTR(err); if (err) { EXT4_ERROR_INODE(inode, "Logical block already allocated"); return ERR_PTR(-EFSCORRUPTED); } bh = ext4_bread(handle, inode, *block, EXT4_GET_BLOCKS_CREATE); if (IS_ERR(bh)) return bh; inode->i_size += inode->i_sb->s_blocksize; EXT4_I(inode)->i_disksize = inode->i_size; err = ext4_mark_inode_dirty(handle, inode); if (err) goto out; BUFFER_TRACE(bh, "get_write_access"); err = ext4_journal_get_write_access(handle, inode->i_sb, bh, EXT4_JTR_NONE); if (err) goto out; return bh; out: brelse(bh); ext4_std_error(inode->i_sb, err); return ERR_PTR(err); } static int ext4_dx_csum_verify(struct inode *inode, struct ext4_dir_entry *dirent); /* * Hints to ext4_read_dirblock regarding whether we expect a directory * block being read to be an index block, or a block containing * directory entries (and if the latter, whether it was found via a * logical block in an htree index block). This is used to control * what sort of sanity checkinig ext4_read_dirblock() will do on the * directory block read from the storage device. EITHER will means * the caller doesn't know what kind of directory block will be read, * so no specific verification will be done. */ typedef enum { EITHER, INDEX, DIRENT, DIRENT_HTREE } dirblock_type_t; #define ext4_read_dirblock(inode, block, type) \ __ext4_read_dirblock((inode), (block), (type), __func__, __LINE__) static struct buffer_head *__ext4_read_dirblock(struct inode *inode, ext4_lblk_t block, dirblock_type_t type, const char *func, unsigned int line) { struct buffer_head *bh; struct ext4_dir_entry *dirent; int is_dx_block = 0; if (block >= inode->i_size >> inode->i_blkbits) { ext4_error_inode(inode, func, line, block, "Attempting to read directory block (%u) that is past i_size (%llu)", block, inode->i_size); return ERR_PTR(-EFSCORRUPTED); } if (ext4_simulate_fail(inode->i_sb, EXT4_SIM_DIRBLOCK_EIO)) bh = ERR_PTR(-EIO); else bh = ext4_bread(NULL, inode, block, 0); if (IS_ERR(bh)) { __ext4_warning(inode->i_sb, func, line, "inode #%lu: lblock %lu: comm %s: " "error %ld reading directory block", inode->i_ino, (unsigned long)block, current->comm, PTR_ERR(bh)); return bh; } /* The first directory block must not be a hole. */ if (!bh && (type == INDEX || type == DIRENT_HTREE || block == 0)) { ext4_error_inode(inode, func, line, block, "Directory hole found for htree %s block %u", (type == INDEX) ? "index" : "leaf", block); return ERR_PTR(-EFSCORRUPTED); } if (!bh) return NULL; dirent = (struct ext4_dir_entry *) bh->b_data; /* Determine whether or not we have an index block */ if (is_dx(inode)) { if (block == 0) is_dx_block = 1; else if (ext4_rec_len_from_disk(dirent->rec_len, inode->i_sb->s_blocksize) == inode->i_sb->s_blocksize) is_dx_block = 1; } if (!is_dx_block && type == INDEX) { ext4_error_inode(inode, func, line, block, "directory leaf block found instead of index block"); brelse(bh); return ERR_PTR(-EFSCORRUPTED); } if (!ext4_has_feature_metadata_csum(inode->i_sb) || buffer_verified(bh)) return bh; /* * An empty leaf block can get mistaken for a index block; for * this reason, we can only check the index checksum when the * caller is sure it should be an index block. */ if (is_dx_block && type == INDEX) { if (ext4_dx_csum_verify(inode, dirent) && !ext4_simulate_fail(inode->i_sb, EXT4_SIM_DIRBLOCK_CRC)) set_buffer_verified(bh); else { ext4_error_inode_err(inode, func, line, block, EFSBADCRC, "Directory index failed checksum"); brelse(bh); return ERR_PTR(-EFSBADCRC); } } if (!is_dx_block) { if (ext4_dirblock_csum_verify(inode, bh) && !ext4_simulate_fail(inode->i_sb, EXT4_SIM_DIRBLOCK_CRC)) set_buffer_verified(bh); else { ext4_error_inode_err(inode, func, line, block, EFSBADCRC, "Directory block failed checksum"); brelse(bh); return ERR_PTR(-EFSBADCRC); } } return bh; } #ifdef DX_DEBUG #define dxtrace(command) command #else #define dxtrace(command) #endif struct fake_dirent { __le32 inode; __le16 rec_len; u8 name_len; u8 file_type; }; struct dx_countlimit { __le16 limit; __le16 count; }; struct dx_entry { __le32 hash; __le32 block; }; /* * dx_root_info is laid out so that if it should somehow get overlaid by a * dirent the two low bits of the hash version will be zero. Therefore, the * hash version mod 4 should never be 0. Sincerely, the paranoia department. */ struct dx_root { struct fake_dirent dot; char dot_name[4]; struct fake_dirent dotdot; char dotdot_name[4]; struct dx_root_info { __le32 reserved_zero; u8 hash_version; u8 info_length; /* 8 */ u8 indirect_levels; u8 unused_flags; } info; struct dx_entry entries[]; }; struct dx_node { struct fake_dirent fake; struct dx_entry entries[]; }; struct dx_frame { struct buffer_head *bh; struct dx_entry *entries; struct dx_entry *at; }; struct dx_map_entry { u32 hash; u16 offs; u16 size; }; /* * This goes at the end of each htree block. */ struct dx_tail { u32 dt_reserved; __le32 dt_checksum; /* crc32c(uuid+inum+dirblock) */ }; static struct buffer_head * ext4_dx_find_entry(struct inode *dir, struct ext4_filename *fname, struct ext4_dir_entry_2 **res_dir); static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname, struct inode *dir, struct inode *inode); /* checksumming functions */ void ext4_initialize_dirent_tail(struct buffer_head *bh, unsigned int blocksize) { struct ext4_dir_entry_tail *t = EXT4_DIRENT_TAIL(bh->b_data, blocksize); memset(t, 0, sizeof(struct ext4_dir_entry_tail)); t->det_rec_len = ext4_rec_len_to_disk( sizeof(struct ext4_dir_entry_tail), blocksize); t->det_reserved_ft = EXT4_FT_DIR_CSUM; } /* Walk through a dirent block to find a checksum "dirent" at the tail */ static struct ext4_dir_entry_tail *get_dirent_tail(struct inode *inode, struct buffer_head *bh) { struct ext4_dir_entry_tail *t; int blocksize = EXT4_BLOCK_SIZE(inode->i_sb); #ifdef PARANOID struct ext4_dir_entry *d, *top; d = (struct ext4_dir_entry *)bh->b_data; top = (struct ext4_dir_entry *)(bh->b_data + (blocksize - sizeof(struct ext4_dir_entry_tail))); while (d < top && ext4_rec_len_from_disk(d->rec_len, blocksize)) d = (struct ext4_dir_entry *)(((void *)d) + ext4_rec_len_from_disk(d->rec_len, blocksize)); if (d != top) return NULL; t = (struct ext4_dir_entry_tail *)d; #else t = EXT4_DIRENT_TAIL(bh->b_data, EXT4_BLOCK_SIZE(inode->i_sb)); #endif if (t->det_reserved_zero1 || (ext4_rec_len_from_disk(t->det_rec_len, blocksize) != sizeof(struct ext4_dir_entry_tail)) || t->det_reserved_zero2 || t->det_reserved_ft != EXT4_FT_DIR_CSUM) return NULL; return t; } static __le32 ext4_dirblock_csum(struct inode *inode, void *dirent, int size) { struct ext4_inode_info *ei = EXT4_I(inode); __u32 csum; csum = ext4_chksum(ei->i_csum_seed, (__u8 *)dirent, size); return cpu_to_le32(csum); } #define warn_no_space_for_csum(inode) \ __warn_no_space_for_csum((inode), __func__, __LINE__) static void __warn_no_space_for_csum(struct inode *inode, const char *func, unsigned int line) { __ext4_warning_inode(inode, func, line, "No space for directory leaf checksum. Please run e2fsck -D."); } int ext4_dirblock_csum_verify(struct inode *inode, struct buffer_head *bh) { struct ext4_dir_entry_tail *t; if (!ext4_has_feature_metadata_csum(inode->i_sb)) return 1; t = get_dirent_tail(inode, bh); if (!t) { warn_no_space_for_csum(inode); return 0; } if (t->det_checksum != ext4_dirblock_csum(inode, bh->b_data, (char *)t - bh->b_data)) return 0; return 1; } static void ext4_dirblock_csum_set(struct inode *inode, struct buffer_head *bh) { struct ext4_dir_entry_tail *t; if (!ext4_has_feature_metadata_csum(inode->i_sb)) return; t = get_dirent_tail(inode, bh); if (!t) { warn_no_space_for_csum(inode); return; } t->det_checksum = ext4_dirblock_csum(inode, bh->b_data, (char *)t - bh->b_data); } int ext4_handle_dirty_dirblock(handle_t *handle, struct inode *inode, struct buffer_head *bh) { ext4_dirblock_csum_set(inode, bh); return ext4_handle_dirty_metadata(handle, inode, bh); } static struct dx_countlimit *get_dx_countlimit(struct inode *inode, struct ext4_dir_entry *dirent, int *offset) { struct ext4_dir_entry *dp; struct dx_root_info *root; int count_offset; int blocksize = EXT4_BLOCK_SIZE(inode->i_sb); unsigned int rlen = ext4_rec_len_from_disk(dirent->rec_len, blocksize); if (rlen == blocksize) count_offset = 8; else if (rlen == 12) { dp = (struct ext4_dir_entry *)(((void *)dirent) + 12); if (ext4_rec_len_from_disk(dp->rec_len, blocksize) != blocksize - 12) return NULL; root = (struct dx_root_info *)(((void *)dp + 12)); if (root->reserved_zero || root->info_length != sizeof(struct dx_root_info)) return NULL; count_offset = 32; } else return NULL; if (offset) *offset = count_offset; return (struct dx_countlimit *)(((void *)dirent) + count_offset); } static __le32 ext4_dx_csum(struct inode *inode, struct ext4_dir_entry *dirent, int count_offset, int count, struct dx_tail *t) { struct ext4_inode_info *ei = EXT4_I(inode); __u32 csum; int size; __u32 dummy_csum = 0; int offset = offsetof(struct dx_tail, dt_checksum); size = count_offset + (count * sizeof(struct dx_entry)); csum = ext4_chksum(ei->i_csum_seed, (__u8 *)dirent, size); csum = ext4_chksum(csum, (__u8 *)t, offset); csum = ext4_chksum(csum, (__u8 *)&dummy_csum, sizeof(dummy_csum)); return cpu_to_le32(csum); } static int ext4_dx_csum_verify(struct inode *inode, struct ext4_dir_entry *dirent) { struct dx_countlimit *c; struct dx_tail *t; int count_offset, limit, count; if (!ext4_has_feature_metadata_csum(inode->i_sb)) return 1; c = get_dx_countlimit(inode, dirent, &count_offset); if (!c) { EXT4_ERROR_INODE(inode, "dir seems corrupt? Run e2fsck -D."); return 0; } limit = le16_to_cpu(c->limit); count = le16_to_cpu(c->count); if (count_offset + (limit * sizeof(struct dx_entry)) > EXT4_BLOCK_SIZE(inode->i_sb) - sizeof(struct dx_tail)) { warn_no_space_for_csum(inode); return 0; } t = (struct dx_tail *)(((struct dx_entry *)c) + limit); if (t->dt_checksum != ext4_dx_csum(inode, dirent, count_offset, count, t)) return 0; return 1; } static void ext4_dx_csum_set(struct inode *inode, struct ext4_dir_entry *dirent) { struct dx_countlimit *c; struct dx_tail *t; int count_offset, limit, count; if (!ext4_has_feature_metadata_csum(inode->i_sb)) return; c = get_dx_countlimit(inode, dirent, &count_offset); if (!c) { EXT4_ERROR_INODE(inode, "dir seems corrupt? Run e2fsck -D."); return; } limit = le16_to_cpu(c->limit); count = le16_to_cpu(c->count); if (count_offset + (limit * sizeof(struct dx_entry)) > EXT4_BLOCK_SIZE(inode->i_sb) - sizeof(struct dx_tail)) { warn_no_space_for_csum(inode); return; } t = (struct dx_tail *)(((struct dx_entry *)c) + limit); t->dt_checksum = ext4_dx_csum(inode, dirent, count_offset, count, t); } static inline int ext4_handle_dirty_dx_node(handle_t *handle, struct inode *inode, struct buffer_head *bh) { ext4_dx_csum_set(inode, (struct ext4_dir_entry *)bh->b_data); return ext4_handle_dirty_metadata(handle, inode, bh); } /* * p is at least 6 bytes before the end of page */ static inline struct ext4_dir_entry_2 * ext4_next_entry(struct ext4_dir_entry_2 *p, unsigned long blocksize) { return (struct ext4_dir_entry_2 *)((char *)p + ext4_rec_len_from_disk(p->rec_len, blocksize)); } /* * Future: use high four bits of block for coalesce-on-delete flags * Mask them off for now. */ static inline ext4_lblk_t dx_get_block(struct dx_entry *entry) { return le32_to_cpu(entry->block) & 0x0fffffff; } static inline void dx_set_block(struct dx_entry *entry, ext4_lblk_t value) { entry->block = cpu_to_le32(value); } static inline unsigned dx_get_hash(struct dx_entry *entry) { return le32_to_cpu(entry->hash); } static inline void dx_set_hash(struct dx_entry *entry, unsigned value) { entry->hash = cpu_to_le32(value); } static inline unsigned dx_get_count(struct dx_entry *entries) { return le16_to_cpu(((struct dx_countlimit *) entries)->count); } static inline unsigned dx_get_limit(struct dx_entry *entries) { return le16_to_cpu(((struct dx_countlimit *) entries)->limit); } static inline void dx_set_count(struct dx_entry *entries, unsigned value) { ((struct dx_countlimit *) entries)->count = cpu_to_le16(value); } static inline void dx_set_limit(struct dx_entry *entries, unsigned value) { ((struct dx_countlimit *) entries)->limit = cpu_to_le16(value); } static inline unsigned dx_root_limit(struct inode *dir, unsigned infosize) { unsigned int entry_space = dir->i_sb->s_blocksize - ext4_dir_rec_len(1, NULL) - ext4_dir_rec_len(2, NULL) - infosize; if (ext4_has_feature_metadata_csum(dir->i_sb)) entry_space -= sizeof(struct dx_tail); return entry_space / sizeof(struct dx_entry); } static inline unsigned dx_node_limit(struct inode *dir) { unsigned int entry_space = dir->i_sb->s_blocksize - ext4_dir_rec_len(0, dir); if (ext4_has_feature_metadata_csum(dir->i_sb)) entry_space -= sizeof(struct dx_tail); return entry_space / sizeof(struct dx_entry); } /* * Debug */ #ifdef DX_DEBUG static void dx_show_index(char * label, struct dx_entry *entries) { int i, n = dx_get_count (entries); printk(KERN_DEBUG "%s index", label); for (i = 0; i < n; i++) { printk(KERN_CONT " %x->%lu", i ? dx_get_hash(entries + i) : 0, (unsigned long)dx_get_block(entries + i)); } printk(KERN_CONT "\n"); } struct stats { unsigned names; unsigned space; unsigned bcount; }; static struct stats dx_show_leaf(struct inode *dir, struct dx_hash_info *hinfo, struct ext4_dir_entry_2 *de, int size, int show_names) { unsigned names = 0, space = 0; char *base = (char *) de; struct dx_hash_info h = *hinfo; printk("names: "); while ((char *) de < base + size) { if (de->inode) { if (show_names) { #ifdef CONFIG_FS_ENCRYPTION int len; char *name; struct fscrypt_str fname_crypto_str = FSTR_INIT(NULL, 0); int res = 0; name = de->name; len = de->name_len; if (!IS_ENCRYPTED(dir)) { /* Directory is not encrypted */ (void) ext4fs_dirhash(dir, de->name, de->name_len, &h); printk("%*.s:(U)%x.%u ", len, name, h.hash, (unsigned) ((char *) de - base)); } else { struct fscrypt_str de_name = FSTR_INIT(name, len); /* Directory is encrypted */ res = fscrypt_fname_alloc_buffer( len, &fname_crypto_str); if (res) printk(KERN_WARNING "Error " "allocating crypto " "buffer--skipping " "crypto\n"); res = fscrypt_fname_disk_to_usr(dir, 0, 0, &de_name, &fname_crypto_str); if (res) { printk(KERN_WARNING "Error " "converting filename " "from disk to usr" "\n"); name = "??"; len = 2; } else { name = fname_crypto_str.name; len = fname_crypto_str.len; } if (IS_CASEFOLDED(dir)) h.hash = EXT4_DIRENT_HASH(de); else (void) ext4fs_dirhash(dir, de->name, de->name_len, &h); printk("%*.s:(E)%x.%u ", len, name, h.hash, (unsigned) ((char *) de - base)); fscrypt_fname_free_buffer( &fname_crypto_str); } #else int len = de->name_len; char *name = de->name; (void) ext4fs_dirhash(dir, de->name, de->name_len, &h); printk("%*.s:%x.%u ", len, name, h.hash, (unsigned) ((char *) de - base)); #endif } space += ext4_dir_rec_len(de->name_len, dir); names++; } de = ext4_next_entry(de, size); } printk(KERN_CONT "(%i)\n", names); return (struct stats) { names, space, 1 }; } struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir, struct dx_entry *entries, int levels) { unsigned blocksize = dir->i_sb->s_blocksize; unsigned count = dx_get_count(entries), names = 0, space = 0, i; unsigned bcount = 0; struct buffer_head *bh; printk("%i indexed blocks...\n", count); for (i = 0; i < count; i++, entries++) { ext4_lblk_t block = dx_get_block(entries); ext4_lblk_t hash = i ? dx_get_hash(entries): 0; u32 range = i < count - 1? (dx_get_hash(entries + 1) - hash): ~hash; struct stats stats; printk("%s%3u:%03u hash %8x/%8x ",levels?"":" ", i, block, hash, range); bh = ext4_bread(NULL,dir, block, 0); if (!bh || IS_ERR(bh)) continue; stats = levels? dx_show_entries(hinfo, dir, ((struct dx_node *) bh->b_data)->entries, levels - 1): dx_show_leaf(dir, hinfo, (struct ext4_dir_entry_2 *) bh->b_data, blocksize, 0); names += stats.names; space += stats.space; bcount += stats.bcount; brelse(bh); } if (bcount) printk(KERN_DEBUG "%snames %u, fullness %u (%u%%)\n", levels ? "" : " ", names, space/bcount, (space/bcount)*100/blocksize); return (struct stats) { names, space, bcount}; } /* * Linear search cross check */ static inline void htree_rep_invariant_check(struct dx_entry *at, struct dx_entry *target, u32 hash, unsigned int n) { while (n--) { dxtrace(printk(KERN_CONT ",")); if (dx_get_hash(++at) > hash) { at--; break; } } ASSERT(at == target - 1); } #else /* DX_DEBUG */ static inline void htree_rep_invariant_check(struct dx_entry *at, struct dx_entry *target, u32 hash, unsigned int n) { } #endif /* DX_DEBUG */ /* * Probe for a directory leaf block to search. * * dx_probe can return ERR_BAD_DX_DIR, which means there was a format * error in the directory index, and the caller should fall back to * searching the directory normally. The callers of dx_probe **MUST** * check for this error code, and make sure it never gets reflected * back to userspace. */ static struct dx_frame * dx_probe(struct ext4_filename *fname, struct inode *dir, struct dx_hash_info *hinfo, struct dx_frame *frame_in) { unsigned count, indirect, level, i; struct dx_entry *at, *entries, *p, *q, *m; struct dx_root *root; struct dx_frame *frame = frame_in; struct dx_frame *ret_err = ERR_PTR(ERR_BAD_DX_DIR); u32 hash; ext4_lblk_t block; ext4_lblk_t blocks[EXT4_HTREE_LEVEL]; memset(frame_in, 0, EXT4_HTREE_LEVEL * sizeof(frame_in[0])); frame->bh = ext4_read_dirblock(dir, 0, INDEX); if (IS_ERR(frame->bh)) return (struct dx_frame *) frame->bh; root = (struct dx_root *) frame->bh->b_data; if (root->info.hash_version != DX_HASH_TEA && root->info.hash_version != DX_HASH_HALF_MD4 && root->info.hash_version != DX_HASH_LEGACY && root->info.hash_version != DX_HASH_SIPHASH) { ext4_warning_inode(dir, "Unrecognised inode hash code %u", root->info.hash_version); goto fail; } if (ext4_hash_in_dirent(dir)) { if (root->info.hash_version != DX_HASH_SIPHASH) { ext4_warning_inode(dir, "Hash in dirent, but hash is not SIPHASH"); goto fail; } } else { if (root->info.hash_version == DX_HASH_SIPHASH) { ext4_warning_inode(dir, "Hash code is SIPHASH, but hash not in dirent"); goto fail; } } if (fname) hinfo = &fname->hinfo; hinfo->hash_version = root->info.hash_version; if (hinfo->hash_version <= DX_HASH_TEA) hinfo->hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned; hinfo->seed = EXT4_SB(dir->i_sb)->s_hash_seed; /* hash is already computed for encrypted casefolded directory */ if (fname && fname_name(fname) && !(IS_ENCRYPTED(dir) && IS_CASEFOLDED(dir))) { int ret = ext4fs_dirhash(dir, fname_name(fname), fname_len(fname), hinfo); if (ret < 0) { ret_err = ERR_PTR(ret); goto fail; } } hash = hinfo->hash; if (root->info.unused_flags & 1) { ext4_warning_inode(dir, "Unimplemented hash flags: %#06x", root->info.unused_flags); goto fail; } indirect = root->info.indirect_levels; if (indirect >= ext4_dir_htree_level(dir->i_sb)) { ext4_warning(dir->i_sb, "Directory (ino: %lu) htree depth %#06x exceed" "supported value", dir->i_ino, ext4_dir_htree_level(dir->i_sb)); if (ext4_dir_htree_level(dir->i_sb) < EXT4_HTREE_LEVEL) { ext4_warning(dir->i_sb, "Enable large directory " "feature to access it"); } goto fail; } entries = (struct dx_entry *)(((char *)&root->info) + root->info.info_length); if (dx_get_limit(entries) != dx_root_limit(dir, root->info.info_length)) { ext4_warning_inode(dir, "dx entry: limit %u != root limit %u", dx_get_limit(entries), dx_root_limit(dir, root->info.info_length)); goto fail; } dxtrace(printk("Look up %x", hash)); level = 0; blocks[0] = 0; while (1) { count = dx_get_count(entries); if (!count || count > dx_get_limit(entries)) { ext4_warning_inode(dir, "dx entry: count %u beyond limit %u", count, dx_get_limit(entries)); goto fail; } p = entries + 1; q = entries + count - 1; while (p <= q) { m = p + (q - p) / 2; dxtrace(printk(KERN_CONT ".")); if (dx_get_hash(m) > hash) q = m - 1; else p = m + 1; } htree_rep_invariant_check(entries, p, hash, count - 1); at = p - 1; dxtrace(printk(KERN_CONT " %x->%u\n", at == entries ? 0 : dx_get_hash(at), dx_get_block(at))); frame->entries = entries; frame->at = at; block = dx_get_block(at); for (i = 0; i <= level; i++) { if (blocks[i] == block) { ext4_warning_inode(dir, "dx entry: tree cycle block %u points back to block %u", blocks[level], block); goto fail; } } if (++level > indirect) return frame; blocks[level] = block; frame++; frame->bh = ext4_read_dirblock(dir, block, INDEX); if (IS_ERR(frame->bh)) { ret_err = (struct dx_frame *) frame->bh; frame->bh = NULL; goto fail; } entries = ((struct dx_node *) frame->bh->b_data)->entries; if (dx_get_limit(entries) != dx_node_limit(dir)) { ext4_warning_inode(dir, "dx entry: limit %u != node limit %u", dx_get_limit(entries), dx_node_limit(dir)); goto fail; } } fail: while (frame >= frame_in) { brelse(frame->bh); frame--; } if (ret_err == ERR_PTR(ERR_BAD_DX_DIR)) ext4_warning_inode(dir, "Corrupt directory, running e2fsck is recommended"); return ret_err; } static void dx_release(struct dx_frame *frames) { struct dx_root_info *info; int i; unsigned int indirect_levels; if (frames[0].bh == NULL) return; info = &((struct dx_root *)frames[0].bh->b_data)->info; /* save local copy, "info" may be freed after brelse() */ indirect_levels = info->indirect_levels; for (i = 0; i <= indirect_levels; i++) { if (frames[i].bh == NULL) break; brelse(frames[i].bh); frames[i].bh = NULL; } } /* * This function increments the frame pointer to search the next leaf * block, and reads in the necessary intervening nodes if the search * should be necessary. Whether or not the search is necessary is * controlled by the hash parameter. If the hash value is even, then * the search is only continued if the next block starts with that * hash value. This is used if we are searching for a specific file. * * If the hash value is HASH_NB_ALWAYS, then always go to the next block. * * This function returns 1 if the caller should continue to search, * or 0 if it should not. If there is an error reading one of the * index blocks, it will a negative error code. * * If start_hash is non-null, it will be filled in with the starting * hash of the next page. */ static int ext4_htree_next_block(struct inode *dir, __u32 hash, struct dx_frame *frame, struct dx_frame *frames, __u32 *start_hash) { struct dx_frame *p; struct buffer_head *bh; int num_frames = 0; __u32 bhash; p = frame; /* * Find the next leaf page by incrementing the frame pointer. * If we run out of entries in the interior node, loop around and * increment pointer in the parent node. When we break out of * this loop, num_frames indicates the number of interior * nodes need to be read. */ while (1) { if (++(p->at) < p->entries + dx_get_count(p->entries)) break; if (p == frames) return 0; num_frames++; p--; } /* * If the hash is 1, then continue only if the next page has a * continuation hash of any value. This is used for readdir * handling. Otherwise, check to see if the hash matches the * desired continuation hash. If it doesn't, return since * there's no point to read in the successive index pages. */ bhash = dx_get_hash(p->at); if (start_hash) *start_hash = bhash; if ((hash & 1) == 0) { if ((bhash & ~1) != hash) return 0; } /* * If the hash is HASH_NB_ALWAYS, we always go to the next * block so no check is necessary */ while (num_frames--) { bh = ext4_read_dirblock(dir, dx_get_block(p->at), INDEX); if (IS_ERR(bh)) return PTR_ERR(bh); p++; brelse(p->bh); p->bh = bh; p->at = p->entries = ((struct dx_node *) bh->b_data)->entries; } return 1; } /* * This function fills a red-black tree with information from a * directory block. It returns the number directory entries loaded * into the tree. If there is an error it is returned in err. */ static int htree_dirblock_to_tree(struct file *dir_file, struct inode *dir, ext4_lblk_t block, struct dx_hash_info *hinfo, __u32 start_hash, __u32 start_minor_hash) { struct buffer_head *bh; struct ext4_dir_entry_2 *de, *top; int err = 0, count = 0; struct fscrypt_str fname_crypto_str = FSTR_INIT(NULL, 0), tmp_str; int csum = ext4_has_feature_metadata_csum(dir->i_sb); dxtrace(printk(KERN_INFO "In htree dirblock_to_tree: block %lu\n", (unsigned long)block)); bh = ext4_read_dirblock(dir, block, DIRENT_HTREE); if (IS_ERR(bh)) return PTR_ERR(bh); de = (struct ext4_dir_entry_2 *) bh->b_data; /* csum entries are not larger in the casefolded encrypted case */ top = (struct ext4_dir_entry_2 *) ((char *) de + dir->i_sb->s_blocksize - ext4_dir_rec_len(0, csum ? NULL : dir)); /* Check if the directory is encrypted */ if (IS_ENCRYPTED(dir)) { err = fscrypt_prepare_readdir(dir); if (err < 0) { brelse(bh); return err; } err = fscrypt_fname_alloc_buffer(EXT4_NAME_LEN, &fname_crypto_str); if (err < 0) { brelse(bh); return err; } } for (; de < top; de = ext4_next_entry(de, dir->i_sb->s_blocksize)) { if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data, bh->b_size, EXT4_LBLK_TO_B(dir, block) + ((char *)de - bh->b_data))) { /* silently ignore the rest of the block */ break; } if (ext4_hash_in_dirent(dir)) { if (de->name_len && de->inode) { hinfo->hash = EXT4_DIRENT_HASH(de); hinfo->minor_hash = EXT4_DIRENT_MINOR_HASH(de); } else { hinfo->hash = 0; hinfo->minor_hash = 0; } } else { err = ext4fs_dirhash(dir, de->name, de->name_len, hinfo); if (err < 0) { count = err; goto errout; } } if ((hinfo->hash < start_hash) || ((hinfo->hash == start_hash) && (hinfo->minor_hash < start_minor_hash))) continue; if (de->inode == 0) continue; if (!IS_ENCRYPTED(dir)) { tmp_str.name = de->name; tmp_str.len = de->name_len; err = ext4_htree_store_dirent(dir_file, hinfo->hash, hinfo->minor_hash, de, &tmp_str); } else { int save_len = fname_crypto_str.len; struct fscrypt_str de_name = FSTR_INIT(de->name, de->name_len); /* Directory is encrypted */ err = fscrypt_fname_disk_to_usr(dir, hinfo->hash, hinfo->minor_hash, &de_name, &fname_crypto_str); if (err) { count = err; goto errout; } err = ext4_htree_store_dirent(dir_file, hinfo->hash, hinfo->minor_hash, de, &fname_crypto_str); fname_crypto_str.len = save_len; } if (err != 0) { count = err; goto errout; } count++; } errout: brelse(bh); fscrypt_fname_free_buffer(&fname_crypto_str); return count; } /* * This function fills a red-black tree with information from a * directory. We start scanning the directory in hash order, starting * at start_hash and start_minor_hash. * * This function returns the number of entries inserted into the tree, * or a negative error code. */ int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash, __u32 start_minor_hash, __u32 *next_hash) { struct dx_hash_info hinfo; struct ext4_dir_entry_2 *de; struct dx_frame frames[EXT4_HTREE_LEVEL], *frame; struct inode *dir; ext4_lblk_t block; int count = 0; int ret, err; __u32 hashval; struct fscrypt_str tmp_str; dxtrace(printk(KERN_DEBUG "In htree_fill_tree, start hash: %x:%x\n", start_hash, start_minor_hash)); dir = file_inode(dir_file); if (!(ext4_test_inode_flag(dir, EXT4_INODE_INDEX))) { if (ext4_hash_in_dirent(dir)) hinfo.hash_version = DX_HASH_SIPHASH; else hinfo.hash_version = EXT4_SB(dir->i_sb)->s_def_hash_version; if (hinfo.hash_version <= DX_HASH_TEA) hinfo.hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned; hinfo.seed = EXT4_SB(dir->i_sb)->s_hash_seed; if (ext4_has_inline_data(dir)) { int has_inline_data = 1; count = ext4_inlinedir_to_tree(dir_file, dir, 0, &hinfo, start_hash, start_minor_hash, &has_inline_data); if (has_inline_data) { *next_hash = ~0; return count; } } count = htree_dirblock_to_tree(dir_file, dir, 0, &hinfo, start_hash, start_minor_hash); *next_hash = ~0; return count; } hinfo.hash = start_hash; hinfo.minor_hash = 0; frame = dx_probe(NULL, dir, &hinfo, frames); if (IS_ERR(frame)) return PTR_ERR(frame); /* Add '.' and '..' from the htree header */ if (!start_hash && !start_minor_hash) { de = (struct ext4_dir_entry_2 *) frames[0].bh->b_data; tmp_str.name = de->name; tmp_str.len = de->name_len; err = ext4_htree_store_dirent(dir_file, 0, 0, de, &tmp_str); if (err != 0) goto errout; count++; } if (start_hash < 2 || (start_hash ==2 && start_minor_hash==0)) { de = (struct ext4_dir_entry_2 *) frames[0].bh->b_data; de = ext4_next_entry(de, dir->i_sb->s_blocksize); tmp_str.name = de->name; tmp_str.len = de->name_len; err = ext4_htree_store_dirent(dir_file, 2, 0, de, &tmp_str); if (err != 0) goto errout; count++; } while (1) { if (fatal_signal_pending(current)) { err = -ERESTARTSYS; goto errout; } cond_resched(); block = dx_get_block(frame->at); ret = htree_dirblock_to_tree(dir_file, dir, block, &hinfo, start_hash, start_minor_hash); if (ret < 0) { err = ret; goto errout; } count += ret; hashval = ~0; ret = ext4_htree_next_block(dir, HASH_NB_ALWAYS, frame, frames, &hashval); *next_hash = hashval; if (ret < 0) { err = ret; goto errout; } /* * Stop if: (a) there are no more entries, or * (b) we have inserted at least one entry and the * next hash value is not a continuation */ if ((ret == 0) || (count && ((hashval & 1) == 0))) break; } dx_release(frames); dxtrace(printk(KERN_DEBUG "Fill tree: returned %d entries, " "next hash: %x\n", count, *next_hash)); return count; errout: dx_release(frames); return (err); } static inline int search_dirblock(struct buffer_head *bh, struct inode *dir, struct ext4_filename *fname, unsigned int offset, struct ext4_dir_entry_2 **res_dir) { return ext4_search_dir(bh, bh->b_data, dir->i_sb->s_blocksize, dir, fname, offset, res_dir); } /* * Directory block splitting, compacting */ /* * Create map of hash values, offsets, and sizes, stored at end of block. * Returns number of entries mapped. */ static int dx_make_map(struct inode *dir, struct buffer_head *bh, struct dx_hash_info *hinfo, struct dx_map_entry *map_tail) { int count = 0; struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)bh->b_data; unsigned int buflen = bh->b_size; char *base = bh->b_data; struct dx_hash_info h = *hinfo; int blocksize = EXT4_BLOCK_SIZE(dir->i_sb); if (ext4_has_feature_metadata_csum(dir->i_sb)) buflen -= sizeof(struct ext4_dir_entry_tail); while ((char *) de < base + buflen) { if (ext4_check_dir_entry(dir, NULL, de, bh, base, buflen, ((char *)de) - base)) return -EFSCORRUPTED; if (de->name_len && de->inode) { if (ext4_hash_in_dirent(dir)) h.hash = EXT4_DIRENT_HASH(de); else { int err = ext4fs_dirhash(dir, de->name, de->name_len, &h); if (err < 0) return err; } map_tail--; map_tail->hash = h.hash; map_tail->offs = ((char *) de - base)>>2; map_tail->size = ext4_rec_len_from_disk(de->rec_len, blocksize); count++; cond_resched(); } de = ext4_next_entry(de, blocksize); } return count; } /* Sort map by hash value */ static void dx_sort_map (struct dx_map_entry *map, unsigned count) { struct dx_map_entry *p, *q, *top = map + count - 1; int more; /* Combsort until bubble sort doesn't suck */ while (count > 2) { count = count*10/13; if (count - 9 < 2) /* 9, 10 -> 11 */ count = 11; for (p = top, q = p - count; q >= map; p--, q--) if (p->hash < q->hash) swap(*p, *q); } /* Garden variety bubble sort */ do { more = 0; q = top; while (q-- > map) { if (q[1].hash >= q[0].hash) continue; swap(*(q+1), *q); more = 1; } } while(more); } static void dx_insert_block(struct dx_frame *frame, u32 hash, ext4_lblk_t block) { struct dx_entry *entries = frame->entries; struct dx_entry *old = frame->at, *new = old + 1; int count = dx_get_count(entries); ASSERT(count < dx_get_limit(entries)); ASSERT(old < entries + count); memmove(new + 1, new, (char *)(entries + count) - (char *)(new)); dx_set_hash(new, hash); dx_set_block(new, block); dx_set_count(entries, count + 1); } #if IS_ENABLED(CONFIG_UNICODE) int ext4_fname_setup_ci_filename(struct inode *dir, const struct qstr *iname, struct ext4_filename *name) { struct qstr *cf_name = &name->cf_name; unsigned char *buf; struct dx_hash_info *hinfo = &name->hinfo; int len; if (!IS_CASEFOLDED(dir) || (IS_ENCRYPTED(dir) && !fscrypt_has_encryption_key(dir))) { cf_name->name = NULL; return 0; } buf = kmalloc(EXT4_NAME_LEN, GFP_NOFS); if (!buf) return -ENOMEM; len = utf8_casefold(dir->i_sb->s_encoding, iname, buf, EXT4_NAME_LEN); if (len <= 0) { kfree(buf); buf = NULL; } cf_name->name = buf; cf_name->len = (unsigned) len; if (!IS_ENCRYPTED(dir)) return 0; hinfo->hash_version = DX_HASH_SIPHASH; hinfo->seed = NULL; if (cf_name->name) return ext4fs_dirhash(dir, cf_name->name, cf_name->len, hinfo); else return ext4fs_dirhash(dir, iname->name, iname->len, hinfo); } #endif /* * Test whether a directory entry matches the filename being searched for. * * Return: %true if the directory entry matches, otherwise %false. */ static bool ext4_match(struct inode *parent, const struct ext4_filename *fname, struct ext4_dir_entry_2 *de) { struct fscrypt_name f; if (!de->inode) return false; f.usr_fname = fname->usr_fname; f.disk_name = fname->disk_name; #ifdef CONFIG_FS_ENCRYPTION f.crypto_buf = fname->crypto_buf; #endif #if IS_ENABLED(CONFIG_UNICODE) if (IS_CASEFOLDED(parent) && (!IS_ENCRYPTED(parent) || fscrypt_has_encryption_key(parent))) { /* * Just checking IS_ENCRYPTED(parent) below is not * sufficient to decide whether one can use the hash for * skipping the string comparison, because the key might * have been added right after * ext4_fname_setup_ci_filename(). In this case, a hash * mismatch will be a false negative. Therefore, make * sure cf_name was properly initialized before * considering the calculated hash. */ if (sb_no_casefold_compat_fallback(parent->i_sb) && IS_ENCRYPTED(parent) && fname->cf_name.name && (fname->hinfo.hash != EXT4_DIRENT_HASH(de) || fname->hinfo.minor_hash != EXT4_DIRENT_MINOR_HASH(de))) return false; /* * Treat comparison errors as not a match. The * only case where it happens is on a disk * corruption or ENOMEM. */ return generic_ci_match(parent, fname->usr_fname, &fname->cf_name, de->name, de->name_len) > 0; } #endif return fscrypt_match_name(&f, de->name, de->name_len); } /* * Returns 0 if not found, -EFSCORRUPTED on failure, and 1 on success */ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size, struct inode *dir, struct ext4_filename *fname, unsigned int offset, struct ext4_dir_entry_2 **res_dir) { struct ext4_dir_entry_2 * de; char * dlimit; int de_len; de = (struct ext4_dir_entry_2 *)search_buf; dlimit = search_buf + buf_size; while ((char *) de < dlimit - EXT4_BASE_DIR_LEN) { /* this code is executed quadratically often */ /* do minimal checking `by hand' */ if (de->name + de->name_len <= dlimit && ext4_match(dir, fname, de)) { /* found a match - just to be sure, do * a full check */ if (ext4_check_dir_entry(dir, NULL, de, bh, search_buf, buf_size, offset)) return -EFSCORRUPTED; *res_dir = de; return 1; } /* prevent looping on a bad block */ de_len = ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize); if (de_len <= 0) return -EFSCORRUPTED; offset += de_len; de = (struct ext4_dir_entry_2 *) ((char *) de + de_len); } return 0; } static int is_dx_internal_node(struct inode *dir, ext4_lblk_t block, struct ext4_dir_entry *de) { struct super_block *sb = dir->i_sb; if (!is_dx(dir)) return 0; if (block == 0) return 1; if (de->inode == 0 && ext4_rec_len_from_disk(de->rec_len, sb->s_blocksize) == sb->s_blocksize) return 1; return 0; } /* * __ext4_find_entry() * * finds an entry in the specified directory with the wanted name. It * returns the cache buffer in which the entry was found, and the entry * itself (as a parameter - res_dir). It does NOT read the inode of the * entry - you'll have to do that yourself if you want to. * * The returned buffer_head has ->b_count elevated. The caller is expected * to brelse() it when appropriate. */ static struct buffer_head *__ext4_find_entry(struct inode *dir, struct ext4_filename *fname, struct ext4_dir_entry_2 **res_dir, int *inlined) { struct super_block *sb; struct buffer_head *bh_use[NAMEI_RA_SIZE]; struct buffer_head *bh, *ret = NULL; ext4_lblk_t start, block; const u8 *name = fname->usr_fname->name; size_t ra_max = 0; /* Number of bh's in the readahead buffer, bh_use[] */ size_t ra_ptr = 0; /* Current index into readahead buffer */ ext4_lblk_t nblocks; int i, namelen, retval; *res_dir = NULL; sb = dir->i_sb; namelen = fname->usr_fname->len; if (namelen > EXT4_NAME_LEN) return NULL; if (ext4_has_inline_data(dir)) { int has_inline_data = 1; ret = ext4_find_inline_entry(dir, fname, res_dir, &has_inline_data); if (inlined) *inlined = has_inline_data; if (has_inline_data || IS_ERR(ret)) goto cleanup_and_exit; } if ((namelen <= 2) && (name[0] == '.') && (name[1] == '.' || name[1] == '\0')) { /* * "." or ".." will only be in the first block * NFS may look up ".."; "." should be handled by the VFS */ block = start = 0; nblocks = 1; goto restart; } if (is_dx(dir)) { ret = ext4_dx_find_entry(dir, fname, res_dir); /* * On success, or if the error was file not found, * return. Otherwise, fall back to doing a search the * old fashioned way. */ if (IS_ERR(ret) && PTR_ERR(ret) == ERR_BAD_DX_DIR) dxtrace(printk(KERN_DEBUG "ext4_find_entry: dx failed, " "falling back\n")); else if (!sb_no_casefold_compat_fallback(dir->i_sb) && *res_dir == NULL && IS_CASEFOLDED(dir)) dxtrace(printk(KERN_DEBUG "ext4_find_entry: casefold " "failed, falling back\n")); else goto cleanup_and_exit; ret = NULL; } nblocks = dir->i_size >> EXT4_BLOCK_SIZE_BITS(sb); if (!nblocks) { ret = NULL; goto cleanup_and_exit; } start = EXT4_I(dir)->i_dir_start_lookup; if (start >= nblocks) start = 0; block = start; restart: do { /* * We deal with the read-ahead logic here. */ cond_resched(); if (ra_ptr >= ra_max) { /* Refill the readahead buffer */ ra_ptr = 0; if (block < start) ra_max = start - block; else ra_max = nblocks - block; ra_max = min(ra_max, ARRAY_SIZE(bh_use)); retval = ext4_bread_batch(dir, block, ra_max, false /* wait */, bh_use); if (retval) { ret = ERR_PTR(retval); ra_max = 0; goto cleanup_and_exit; } } if ((bh = bh_use[ra_ptr++]) == NULL) goto next; wait_on_buffer(bh); if (!buffer_uptodate(bh)) { EXT4_ERROR_INODE_ERR(dir, EIO, "reading directory lblock %lu", (unsigned long) block); brelse(bh); ret = ERR_PTR(-EIO); goto cleanup_and_exit; } if (!buffer_verified(bh) && !is_dx_internal_node(dir, block, (struct ext4_dir_entry *)bh->b_data) && !ext4_dirblock_csum_verify(dir, bh)) { EXT4_ERROR_INODE_ERR(dir, EFSBADCRC, "checksumming directory " "block %lu", (unsigned long)block); brelse(bh); ret = ERR_PTR(-EFSBADCRC); goto cleanup_and_exit; } set_buffer_verified(bh); i = search_dirblock(bh, dir, fname, EXT4_LBLK_TO_B(dir, block), res_dir); if (i == 1) { EXT4_I(dir)->i_dir_start_lookup = block; ret = bh; goto cleanup_and_exit; } else { brelse(bh); if (i < 0) { ret = ERR_PTR(i); goto cleanup_and_exit; } } next: if (++block >= nblocks) block = 0; } while (block != start); /* * If the directory has grown while we were searching, then * search the last part of the directory before giving up. */ block = nblocks; nblocks = dir->i_size >> EXT4_BLOCK_SIZE_BITS(sb); if (block < nblocks) { start = 0; goto restart; } cleanup_and_exit: /* Clean up the read-ahead blocks */ for (; ra_ptr < ra_max; ra_ptr++) brelse(bh_use[ra_ptr]); return ret; } static struct buffer_head *ext4_find_entry(struct inode *dir, const struct qstr *d_name, struct ext4_dir_entry_2 **res_dir, int *inlined) { int err; struct ext4_filename fname; struct buffer_head *bh; err = ext4_fname_setup_filename(dir, d_name, 1, &fname); if (err == -ENOENT) return NULL; if (err) return ERR_PTR(err); bh = __ext4_find_entry(dir, &fname, res_dir, inlined); ext4_fname_free_filename(&fname); return bh; } static struct buffer_head *ext4_lookup_entry(struct inode *dir, struct dentry *dentry, struct ext4_dir_entry_2 **res_dir) { int err; struct ext4_filename fname; struct buffer_head *bh; err = ext4_fname_prepare_lookup(dir, dentry, &fname); if (err == -ENOENT) return NULL; if (err) return ERR_PTR(err); bh = __ext4_find_entry(dir, &fname, res_dir, NULL); ext4_fname_free_filename(&fname); return bh; } static struct buffer_head * ext4_dx_find_entry(struct inode *dir, struct ext4_filename *fname, struct ext4_dir_entry_2 **res_dir) { struct dx_frame frames[EXT4_HTREE_LEVEL], *frame; struct buffer_head *bh; ext4_lblk_t block; int retval; #ifdef CONFIG_FS_ENCRYPTION *res_dir = NULL; #endif frame = dx_probe(fname, dir, NULL, frames); if (IS_ERR(frame)) return ERR_CAST(frame); do { block = dx_get_block(frame->at); bh = ext4_read_dirblock(dir, block, DIRENT_HTREE); if (IS_ERR(bh)) goto errout; retval = search_dirblock(bh, dir, fname, EXT4_LBLK_TO_B(dir, block), res_dir); if (retval == 1) goto success; brelse(bh); if (retval < 0) { bh = ERR_PTR(ERR_BAD_DX_DIR); goto errout; } /* Check to see if we should continue to search */ retval = ext4_htree_next_block(dir, fname->hinfo.hash, frame, frames, NULL); if (retval < 0) { ext4_warning_inode(dir, "error %d reading directory index block", retval); bh = ERR_PTR(retval); goto errout; } } while (retval == 1); bh = NULL; errout: dxtrace(printk(KERN_DEBUG "%s not found\n", fname->usr_fname->name)); success: dx_release(frames); return bh; } static struct dentry *ext4_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags) { struct inode *inode; struct ext4_dir_entry_2 *de = NULL; struct buffer_head *bh; if (dentry->d_name.len > EXT4_NAME_LEN) return ERR_PTR(-ENAMETOOLONG); bh = ext4_lookup_entry(dir, dentry, &de); if (IS_ERR(bh)) return ERR_CAST(bh); inode = NULL; if (bh) { __u32 ino = le32_to_cpu(de->inode); brelse(bh); if (!ext4_valid_inum(dir->i_sb, ino)) { EXT4_ERROR_INODE(dir, "bad inode number: %u", ino); return ERR_PTR(-EFSCORRUPTED); } if (unlikely(ino == dir->i_ino)) { EXT4_ERROR_INODE(dir, "'%pd' linked to parent dir", dentry); return ERR_PTR(-EFSCORRUPTED); } inode = ext4_iget(dir->i_sb, ino, EXT4_IGET_NORMAL); if (inode == ERR_PTR(-ESTALE)) { EXT4_ERROR_INODE(dir, "deleted inode referenced: %u", ino); return ERR_PTR(-EFSCORRUPTED); } if (!IS_ERR(inode) && IS_ENCRYPTED(dir) && (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode)) && !fscrypt_has_permitted_context(dir, inode)) { ext4_warning(inode->i_sb, "Inconsistent encryption contexts: %lu/%lu", dir->i_ino, inode->i_ino); iput(inode); return ERR_PTR(-EPERM); } } if (IS_ENABLED(CONFIG_UNICODE) && !inode && IS_CASEFOLDED(dir)) { /* Eventually we want to call d_add_ci(dentry, NULL) * for negative dentries in the encoding case as * well. For now, prevent the negative dentry * from being cached. */ return NULL; } return d_splice_alias(inode, dentry); } struct dentry *ext4_get_parent(struct dentry *child) { __u32 ino; struct ext4_dir_entry_2 * de = NULL; struct buffer_head *bh; bh = ext4_find_entry(d_inode(child), &dotdot_name, &de, NULL); if (IS_ERR(bh)) return ERR_CAST(bh); if (!bh) return ERR_PTR(-ENOENT); ino = le32_to_cpu(de->inode); brelse(bh); if (!ext4_valid_inum(child->d_sb, ino)) { EXT4_ERROR_INODE(d_inode(child), "bad parent inode number: %u", ino); return ERR_PTR(-EFSCORRUPTED); } return d_obtain_alias(ext4_iget(child->d_sb, ino, EXT4_IGET_NORMAL)); } /* * Move count entries from end of map between two memory locations. * Returns pointer to last entry moved. */ static struct ext4_dir_entry_2 * dx_move_dirents(struct inode *dir, char *from, char *to, struct dx_map_entry *map, int count, unsigned blocksize) { unsigned rec_len = 0; while (count--) { struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *) (from + (map->offs<<2)); rec_len = ext4_dir_rec_len(de->name_len, dir); memcpy (to, de, rec_len); ((struct ext4_dir_entry_2 *) to)->rec_len = ext4_rec_len_to_disk(rec_len, blocksize); /* wipe dir_entry excluding the rec_len field */ de->inode = 0; memset(&de->name_len, 0, ext4_rec_len_from_disk(de->rec_len, blocksize) - offsetof(struct ext4_dir_entry_2, name_len)); map++; to += rec_len; } return (struct ext4_dir_entry_2 *) (to - rec_len); } /* * Compact each dir entry in the range to the minimal rec_len. * Returns pointer to last entry in range. */ static struct ext4_dir_entry_2 *dx_pack_dirents(struct inode *dir, char *base, unsigned int blocksize) { struct ext4_dir_entry_2 *next, *to, *prev, *de = (struct ext4_dir_entry_2 *) base; unsigned rec_len = 0; prev = to = de; while ((char*)de < base + blocksize) { next = ext4_next_entry(de, blocksize); if (de->inode && de->name_len) { rec_len = ext4_dir_rec_len(de->name_len, dir); if (de > to) memmove(to, de, rec_len); to->rec_len = ext4_rec_len_to_disk(rec_len, blocksize); prev = to; to = (struct ext4_dir_entry_2 *) (((char *) to) + rec_len); } de = next; } return prev; } /* * Split a full leaf block to make room for a new dir entry. * Allocate a new block, and move entries so that they are approx. equally full. * Returns pointer to de in block into which the new entry will be inserted. */ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir, struct buffer_head **bh,struct dx_frame *frame, struct dx_hash_info *hinfo) { unsigned blocksize = dir->i_sb->s_blocksize; unsigned continued; int count; struct buffer_head *bh2; ext4_lblk_t newblock; u32 hash2; struct dx_map_entry *map; char *data1 = (*bh)->b_data, *data2; unsigned split, move, size; struct ext4_dir_entry_2 *de = NULL, *de2; int csum_size = 0; int err = 0, i; if (ext4_has_feature_metadata_csum(dir->i_sb)) csum_size = sizeof(struct ext4_dir_entry_tail); bh2 = ext4_append(handle, dir, &newblock); if (IS_ERR(bh2)) { brelse(*bh); *bh = NULL; return ERR_CAST(bh2); } BUFFER_TRACE(*bh, "get_write_access"); err = ext4_journal_get_write_access(handle, dir->i_sb, *bh, EXT4_JTR_NONE); if (err) goto journal_error; BUFFER_TRACE(frame->bh, "get_write_access"); err = ext4_journal_get_write_access(handle, dir->i_sb, frame->bh, EXT4_JTR_NONE); if (err) goto journal_error; data2 = bh2->b_data; /* create map in the end of data2 block */ map = (struct dx_map_entry *) (data2 + blocksize); count = dx_make_map(dir, *bh, hinfo, map); if (count < 0) { err = count; goto journal_error; } map -= count; dx_sort_map(map, count); /* Ensure that neither split block is over half full */ size = 0; move = 0; for (i = count-1; i >= 0; i--) { /* is more than half of this entry in 2nd half of the block? */ if (size + map[i].size/2 > blocksize/2) break; size += map[i].size; move++; } /* * map index at which we will split * * If the sum of active entries didn't exceed half the block size, just * split it in half by count; each resulting block will have at least * half the space free. */ if (i >= 0) split = count - move; else split = count/2; if (WARN_ON_ONCE(split == 0)) { /* Should never happen, but avoid out-of-bounds access below */ ext4_error_inode_block(dir, (*bh)->b_blocknr, 0, "bad indexed directory? hash=%08x:%08x count=%d move=%u", hinfo->hash, hinfo->minor_hash, count, move); err = -EFSCORRUPTED; goto out; } hash2 = map[split].hash; continued = hash2 == map[split - 1].hash; dxtrace(printk(KERN_INFO "Split block %lu at %x, %i/%i\n", (unsigned long)dx_get_block(frame->at), hash2, split, count-split)); /* Fancy dance to stay within two buffers */ de2 = dx_move_dirents(dir, data1, data2, map + split, count - split, blocksize); de = dx_pack_dirents(dir, data1, blocksize); de->rec_len = ext4_rec_len_to_disk(data1 + (blocksize - csum_size) - (char *) de, blocksize); de2->rec_len = ext4_rec_len_to_disk(data2 + (blocksize - csum_size) - (char *) de2, blocksize); if (csum_size) { ext4_initialize_dirent_tail(*bh, blocksize); ext4_initialize_dirent_tail(bh2, blocksize); } dxtrace(dx_show_leaf(dir, hinfo, (struct ext4_dir_entry_2 *) data1, blocksize, 1)); dxtrace(dx_show_leaf(dir, hinfo, (struct ext4_dir_entry_2 *) data2, blocksize, 1)); /* Which block gets the new entry? */ if (hinfo->hash >= hash2) { swap(*bh, bh2); de = de2; } dx_insert_block(frame, hash2 + continued, newblock); err = ext4_handle_dirty_dirblock(handle, dir, bh2); if (err) goto journal_error; err = ext4_handle_dirty_dx_node(handle, dir, frame->bh); if (err) goto journal_error; brelse(bh2); dxtrace(dx_show_index("frame", frame->entries)); return de; journal_error: ext4_std_error(dir->i_sb, err); out: brelse(*bh); brelse(bh2); *bh = NULL; return ERR_PTR(err); } int ext4_find_dest_de(struct inode *dir, struct buffer_head *bh, void *buf, int buf_size, struct ext4_filename *fname, struct ext4_dir_entry_2 **dest_de) { struct ext4_dir_entry_2 *de; unsigned short reclen = ext4_dir_rec_len(fname_len(fname), dir); int nlen, rlen; unsigned int offset = 0; char *top; de = buf; top = buf + buf_size - reclen; while ((char *) de <= top) { if (ext4_check_dir_entry(dir, NULL, de, bh, buf, buf_size, offset)) return -EFSCORRUPTED; if (ext4_match(dir, fname, de)) return -EEXIST; nlen = ext4_dir_rec_len(de->name_len, dir); rlen = ext4_rec_len_from_disk(de->rec_len, buf_size); if ((de->inode ? rlen - nlen : rlen) >= reclen) break; de = (struct ext4_dir_entry_2 *)((char *)de + rlen); offset += rlen; } if ((char *) de > top) return -ENOSPC; *dest_de = de; return 0; } void ext4_insert_dentry(struct inode *dir, struct inode *inode, struct ext4_dir_entry_2 *de, int buf_size, struct ext4_filename *fname) { int nlen, rlen; nlen = ext4_dir_rec_len(de->name_len, dir); rlen = ext4_rec_len_from_disk(de->rec_len, buf_size); if (de->inode) { struct ext4_dir_entry_2 *de1 = (struct ext4_dir_entry_2 *)((char *)de + nlen); de1->rec_len = ext4_rec_len_to_disk(rlen - nlen, buf_size); de->rec_len = ext4_rec_len_to_disk(nlen, buf_size); de = de1; } de->file_type = EXT4_FT_UNKNOWN; de->inode = cpu_to_le32(inode->i_ino); ext4_set_de_type(inode->i_sb, de, inode->i_mode); de->name_len = fname_len(fname); memcpy(de->name, fname_name(fname), fname_len(fname)); if (ext4_hash_in_dirent(dir)) { struct dx_hash_info *hinfo = &fname->hinfo; EXT4_DIRENT_HASHES(de)->hash = cpu_to_le32(hinfo->hash); EXT4_DIRENT_HASHES(de)->minor_hash = cpu_to_le32(hinfo->minor_hash); } } /* * Add a new entry into a directory (leaf) block. If de is non-NULL, * it points to a directory entry which is guaranteed to be large * enough for new directory entry. If de is NULL, then * add_dirent_to_buf will attempt search the directory block for * space. It will return -ENOSPC if no space is available, and -EIO * and -EEXIST if directory entry already exists. */ static int add_dirent_to_buf(handle_t *handle, struct ext4_filename *fname, struct inode *dir, struct inode *inode, struct ext4_dir_entry_2 *de, struct buffer_head *bh) { unsigned int blocksize = dir->i_sb->s_blocksize; int csum_size = 0; int err, err2; if (ext4_has_feature_metadata_csum(inode->i_sb)) csum_size = sizeof(struct ext4_dir_entry_tail); if (!de) { err = ext4_find_dest_de(dir, bh, bh->b_data, blocksize - csum_size, fname, &de); if (err) return err; } BUFFER_TRACE(bh, "get_write_access"); err = ext4_journal_get_write_access(handle, dir->i_sb, bh, EXT4_JTR_NONE); if (err) { ext4_std_error(dir->i_sb, err); return err; } /* By now the buffer is marked for journaling */ ext4_insert_dentry(dir, inode, de, blocksize, fname); /* * XXX shouldn't update any times until successful * completion of syscall, but too many callers depend * on this. * * XXX similarly, too many callers depend on * ext4_new_inode() setting the times, but error * recovery deletes the inode, so the worst that can * happen is that the times are slightly out of date * and/or different from the directory change time. */ inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir)); ext4_update_dx_flag(dir); inode_inc_iversion(dir); err2 = ext4_mark_inode_dirty(handle, dir); BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata"); err = ext4_handle_dirty_dirblock(handle, dir, bh); if (err) ext4_std_error(dir->i_sb, err); return err ? err : err2; } static bool ext4_check_dx_root(struct inode *dir, struct dx_root *root) { struct fake_dirent *fde; const char *error_msg; unsigned int rlen; unsigned int blocksize = dir->i_sb->s_blocksize; char *blockend = (char *)root + dir->i_sb->s_blocksize; fde = &root->dot; if (unlikely(fde->name_len != 1)) { error_msg = "invalid name_len for '.'"; goto corrupted; } if (unlikely(strncmp(root->dot_name, ".", fde->name_len))) { error_msg = "invalid name for '.'"; goto corrupted; } rlen = ext4_rec_len_from_disk(fde->rec_len, blocksize); if (unlikely((char *)fde + rlen >= blockend)) { error_msg = "invalid rec_len for '.'"; goto corrupted; } fde = &root->dotdot; if (unlikely(fde->name_len != 2)) { error_msg = "invalid name_len for '..'"; goto corrupted; } if (unlikely(strncmp(root->dotdot_name, "..", fde->name_len))) { error_msg = "invalid name for '..'"; goto corrupted; } rlen = ext4_rec_len_from_disk(fde->rec_len, blocksize); if (unlikely((char *)fde + rlen >= blockend)) { error_msg = "invalid rec_len for '..'"; goto corrupted; } return true; corrupted: EXT4_ERROR_INODE(dir, "Corrupt dir, %s, running e2fsck is recommended", error_msg); return false; } /* * This converts a one block unindexed directory to a 3 block indexed * directory, and adds the dentry to the indexed directory. */ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname, struct inode *dir, struct inode *inode, struct buffer_head *bh) { struct buffer_head *bh2; struct dx_root *root; struct dx_frame frames[EXT4_HTREE_LEVEL], *frame; struct dx_entry *entries; struct ext4_dir_entry_2 *de, *de2; char *data2, *top; unsigned len; int retval; unsigned blocksize; ext4_lblk_t block; struct fake_dirent *fde; int csum_size = 0; if (ext4_has_feature_metadata_csum(inode->i_sb)) csum_size = sizeof(struct ext4_dir_entry_tail); blocksize = dir->i_sb->s_blocksize; dxtrace(printk(KERN_DEBUG "Creating index: inode %lu\n", dir->i_ino)); BUFFER_TRACE(bh, "get_write_access"); retval = ext4_journal_get_write_access(handle, dir->i_sb, bh, EXT4_JTR_NONE); if (retval) { ext4_std_error(dir->i_sb, retval); brelse(bh); return retval; } root = (struct dx_root *) bh->b_data; if (!ext4_check_dx_root(dir, root)) { brelse(bh); return -EFSCORRUPTED; } /* The 0th block becomes the root, move the dirents out */ fde = &root->dotdot; de = (struct ext4_dir_entry_2 *)((char *)fde + ext4_rec_len_from_disk(fde->rec_len, blocksize)); len = ((char *) root) + (blocksize - csum_size) - (char *) de; /* Allocate new block for the 0th block's dirents */ bh2 = ext4_append(handle, dir, &block); if (IS_ERR(bh2)) { brelse(bh); return PTR_ERR(bh2); } ext4_set_inode_flag(dir, EXT4_INODE_INDEX); data2 = bh2->b_data; memcpy(data2, de, len); memset(de, 0, len); /* wipe old data */ de = (struct ext4_dir_entry_2 *) data2; top = data2 + len; while ((char *)(de2 = ext4_next_entry(de, blocksize)) < top) { if (ext4_check_dir_entry(dir, NULL, de, bh2, data2, len, (char *)de - data2)) { brelse(bh2); brelse(bh); return -EFSCORRUPTED; } de = de2; } de->rec_len = ext4_rec_len_to_disk(data2 + (blocksize - csum_size) - (char *) de, blocksize); if (csum_size) ext4_initialize_dirent_tail(bh2, blocksize); /* Initialize the root; the dot dirents already exist */ de = (struct ext4_dir_entry_2 *) (&root->dotdot); de->rec_len = ext4_rec_len_to_disk( blocksize - ext4_dir_rec_len(2, NULL), blocksize); memset (&root->info, 0, sizeof(root->info)); root->info.info_length = sizeof(root->info); if (ext4_hash_in_dirent(dir)) root->info.hash_version = DX_HASH_SIPHASH; else root->info.hash_version = EXT4_SB(dir->i_sb)->s_def_hash_version; entries = root->entries; dx_set_block(entries, 1); dx_set_count(entries, 1); dx_set_limit(entries, dx_root_limit(dir, sizeof(root->info))); /* Initialize as for dx_probe */ fname->hinfo.hash_version = root->info.hash_version; if (fname->hinfo.hash_version <= DX_HASH_TEA) fname->hinfo.hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned; fname->hinfo.seed = EXT4_SB(dir->i_sb)->s_hash_seed; /* casefolded encrypted hashes are computed on fname setup */ if (!ext4_hash_in_dirent(dir)) { int err = ext4fs_dirhash(dir, fname_name(fname), fname_len(fname), &fname->hinfo); if (err < 0) { brelse(bh2); brelse(bh); return err; } } memset(frames, 0, sizeof(frames)); frame = frames; frame->entries = entries; frame->at = entries; frame->bh = bh; retval = ext4_handle_dirty_dx_node(handle, dir, frame->bh); if (retval) goto out_frames; retval = ext4_handle_dirty_dirblock(handle, dir, bh2); if (retval) goto out_frames; de = do_split(handle,dir, &bh2, frame, &fname->hinfo); if (IS_ERR(de)) { retval = PTR_ERR(de); goto out_frames; } retval = add_dirent_to_buf(handle, fname, dir, inode, de, bh2); out_frames: /* * Even if the block split failed, we have to properly write * out all the changes we did so far. Otherwise we can end up * with corrupted filesystem. */ if (retval) ext4_mark_inode_dirty(handle, dir); dx_release(frames); brelse(bh2); return retval; } /* * ext4_add_entry() * * adds a file entry to the specified directory, using the same * semantics as ext4_find_entry(). It returns NULL if it failed. * * NOTE!! The inode part of 'de' is left at 0 - which means you * may not sleep between calling this and putting something into * the entry, as someone else might have used it while you slept. */ static int ext4_add_entry(handle_t *handle, struct dentry *dentry, struct inode *inode) { struct inode *dir = d_inode(dentry->d_parent); struct buffer_head *bh = NULL; struct ext4_dir_entry_2 *de; struct super_block *sb; struct ext4_filename fname; int retval; int dx_fallback=0; unsigned blocksize; ext4_lblk_t block, blocks; int csum_size = 0; if (ext4_has_feature_metadata_csum(inode->i_sb)) csum_size = sizeof(struct ext4_dir_entry_tail); sb = dir->i_sb; blocksize = sb->s_blocksize; if (fscrypt_is_nokey_name(dentry)) return -ENOKEY; if (!generic_ci_validate_strict_name(dir, &dentry->d_name)) return -EINVAL; retval = ext4_fname_setup_filename(dir, &dentry->d_name, 0, &fname); if (retval) return retval; if (ext4_has_inline_data(dir)) { retval = ext4_try_add_inline_entry(handle, &fname, dir, inode); if (retval < 0) goto out; if (retval == 1) { retval = 0; goto out; } } if (is_dx(dir)) { retval = ext4_dx_add_entry(handle, &fname, dir, inode); if (!retval || (retval != ERR_BAD_DX_DIR)) goto out; /* Can we just ignore htree data? */ if (ext4_has_feature_metadata_csum(sb)) { EXT4_ERROR_INODE(dir, "Directory has corrupted htree index."); retval = -EFSCORRUPTED; goto out; } ext4_clear_inode_flag(dir, EXT4_INODE_INDEX); dx_fallback++; retval = ext4_mark_inode_dirty(handle, dir); if (unlikely(retval)) goto out; } blocks = dir->i_size >> sb->s_blocksize_bits; for (block = 0; block < blocks; block++) { bh = ext4_read_dirblock(dir, block, DIRENT); if (bh == NULL) { bh = ext4_bread(handle, dir, block, EXT4_GET_BLOCKS_CREATE); goto add_to_new_block; } if (IS_ERR(bh)) { retval = PTR_ERR(bh); bh = NULL; goto out; } retval = add_dirent_to_buf(handle, &fname, dir, inode, NULL, bh); if (retval != -ENOSPC) goto out; if (blocks == 1 && !dx_fallback && ext4_has_feature_dir_index(sb)) { retval = make_indexed_dir(handle, &fname, dir, inode, bh); bh = NULL; /* make_indexed_dir releases bh */ goto out; } brelse(bh); } bh = ext4_append(handle, dir, &block); add_to_new_block: if (IS_ERR(bh)) { retval = PTR_ERR(bh); bh = NULL; goto out; } de = (struct ext4_dir_entry_2 *) bh->b_data; de->inode = 0; de->rec_len = ext4_rec_len_to_disk(blocksize - csum_size, blocksize); if (csum_size) ext4_initialize_dirent_tail(bh, blocksize); retval = add_dirent_to_buf(handle, &fname, dir, inode, de, bh); out: ext4_fname_free_filename(&fname); brelse(bh); if (retval == 0) ext4_set_inode_state(inode, EXT4_STATE_NEWENTRY); return retval; } /* * Returns 0 for success, or a negative error value */ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname, struct inode *dir, struct inode *inode) { struct dx_frame frames[EXT4_HTREE_LEVEL], *frame; struct dx_entry *entries, *at; struct buffer_head *bh; struct super_block *sb = dir->i_sb; struct ext4_dir_entry_2 *de; int restart; int err; again: restart = 0; frame = dx_probe(fname, dir, NULL, frames); if (IS_ERR(frame)) return PTR_ERR(frame); entries = frame->entries; at = frame->at; bh = ext4_read_dirblock(dir, dx_get_block(frame->at), DIRENT_HTREE); if (IS_ERR(bh)) { err = PTR_ERR(bh); bh = NULL; goto cleanup; } BUFFER_TRACE(bh, "get_write_access"); err = ext4_journal_get_write_access(handle, sb, bh, EXT4_JTR_NONE); if (err) goto journal_error; err = add_dirent_to_buf(handle, fname, dir, inode, NULL, bh); if (err != -ENOSPC) goto cleanup; err = 0; /* Block full, should compress but for now just split */ dxtrace(printk(KERN_DEBUG "using %u of %u node entries\n", dx_get_count(entries), dx_get_limit(entries))); /* Need to split index? */ if (dx_get_count(entries) == dx_get_limit(entries)) { ext4_lblk_t newblock; int levels = frame - frames + 1; unsigned int icount; int add_level = 1; struct dx_entry *entries2; struct dx_node *node2; struct buffer_head *bh2; while (frame > frames) { if (dx_get_count((frame - 1)->entries) < dx_get_limit((frame - 1)->entries)) { add_level = 0; break; } frame--; /* split higher index block */ at = frame->at; entries = frame->entries; restart = 1; } if (add_level && levels == ext4_dir_htree_level(sb)) { ext4_warning(sb, "Directory (ino: %lu) index full, " "reach max htree level :%d", dir->i_ino, levels); if (ext4_dir_htree_level(sb) < EXT4_HTREE_LEVEL) { ext4_warning(sb, "Large directory feature is " "not enabled on this " "filesystem"); } err = -ENOSPC; goto cleanup; } icount = dx_get_count(entries); bh2 = ext4_append(handle, dir, &newblock); if (IS_ERR(bh2)) { err = PTR_ERR(bh2); goto cleanup; } node2 = (struct dx_node *)(bh2->b_data); entries2 = node2->entries; memset(&node2->fake, 0, sizeof(struct fake_dirent)); node2->fake.rec_len = ext4_rec_len_to_disk(sb->s_blocksize, sb->s_blocksize); BUFFER_TRACE(frame->bh, "get_write_access"); err = ext4_journal_get_write_access(handle, sb, frame->bh, EXT4_JTR_NONE); if (err) { brelse(bh2); goto journal_error; } if (!add_level) { unsigned icount1 = icount/2, icount2 = icount - icount1; unsigned hash2 = dx_get_hash(entries + icount1); dxtrace(printk(KERN_DEBUG "Split index %i/%i\n", icount1, icount2)); BUFFER_TRACE(frame->bh, "get_write_access"); /* index root */ err = ext4_journal_get_write_access(handle, sb, (frame - 1)->bh, EXT4_JTR_NONE); if (err) { brelse(bh2); goto journal_error; } memcpy((char *) entries2, (char *) (entries + icount1), icount2 * sizeof(struct dx_entry)); dx_set_count(entries, icount1); dx_set_count(entries2, icount2); dx_set_limit(entries2, dx_node_limit(dir)); /* Which index block gets the new entry? */ if (at - entries >= icount1) { frame->at = at - entries - icount1 + entries2; frame->entries = entries = entries2; swap(frame->bh, bh2); } dx_insert_block((frame - 1), hash2, newblock); dxtrace(dx_show_index("node", frame->entries)); dxtrace(dx_show_index("node", ((struct dx_node *) bh2->b_data)->entries)); err = ext4_handle_dirty_dx_node(handle, dir, bh2); if (err) { brelse(bh2); goto journal_error; } brelse (bh2); err = ext4_handle_dirty_dx_node(handle, dir, (frame - 1)->bh); if (err) goto journal_error; err = ext4_handle_dirty_dx_node(handle, dir, frame->bh); if (restart || err) goto journal_error; } else { struct dx_root *dxroot; memcpy((char *) entries2, (char *) entries, icount * sizeof(struct dx_entry)); dx_set_limit(entries2, dx_node_limit(dir)); /* Set up root */ dx_set_count(entries, 1); dx_set_block(entries + 0, newblock); dxroot = (struct dx_root *)frames[0].bh->b_data; dxroot->info.indirect_levels += 1; dxtrace(printk(KERN_DEBUG "Creating %d level index...\n", dxroot->info.indirect_levels)); err = ext4_handle_dirty_dx_node(handle, dir, frame->bh); if (err) { brelse(bh2); goto journal_error; } err = ext4_handle_dirty_dx_node(handle, dir, bh2); brelse(bh2); restart = 1; goto journal_error; } } de = do_split(handle, dir, &bh, frame, &fname->hinfo); if (IS_ERR(de)) { err = PTR_ERR(de); goto cleanup; } err = add_dirent_to_buf(handle, fname, dir, inode, de, bh); goto cleanup; journal_error: ext4_std_error(dir->i_sb, err); /* this is a no-op if err == 0 */ cleanup: brelse(bh); dx_release(frames); /* @restart is true means htree-path has been changed, we need to * repeat dx_probe() to find out valid htree-path */ if (restart && err == 0) goto again; return err; } /* * ext4_generic_delete_entry deletes a directory entry by merging it * with the previous entry */ int ext4_generic_delete_entry(struct inode *dir, struct ext4_dir_entry_2 *de_del, struct buffer_head *bh, void *entry_buf, int buf_size, int csum_size) { struct ext4_dir_entry_2 *de, *pde; unsigned int blocksize = dir->i_sb->s_blocksize; int i; i = 0; pde = NULL; de = entry_buf; while (i < buf_size - csum_size) { if (ext4_check_dir_entry(dir, NULL, de, bh, entry_buf, buf_size, i)) return -EFSCORRUPTED; if (de == de_del) { if (pde) { pde->rec_len = ext4_rec_len_to_disk( ext4_rec_len_from_disk(pde->rec_len, blocksize) + ext4_rec_len_from_disk(de->rec_len, blocksize), blocksize); /* wipe entire dir_entry */ memset(de, 0, ext4_rec_len_from_disk(de->rec_len, blocksize)); } else { /* wipe dir_entry excluding the rec_len field */ de->inode = 0; memset(&de->name_len, 0, ext4_rec_len_from_disk(de->rec_len, blocksize) - offsetof(struct ext4_dir_entry_2, name_len)); } inode_inc_iversion(dir); return 0; } i += ext4_rec_len_from_disk(de->rec_len, blocksize); pde = de; de = ext4_next_entry(de, blocksize); } return -ENOENT; } static int ext4_delete_entry(handle_t *handle, struct inode *dir, struct ext4_dir_entry_2 *de_del, struct buffer_head *bh) { int err, csum_size = 0; if (ext4_has_inline_data(dir)) { int has_inline_data = 1; err = ext4_delete_inline_entry(handle, dir, de_del, bh, &has_inline_data); if (has_inline_data) return err; } if (ext4_has_feature_metadata_csum(dir->i_sb)) csum_size = sizeof(struct ext4_dir_entry_tail); BUFFER_TRACE(bh, "get_write_access"); err = ext4_journal_get_write_access(handle, dir->i_sb, bh, EXT4_JTR_NONE); if (unlikely(err)) goto out; err = ext4_generic_delete_entry(dir, de_del, bh, bh->b_data, dir->i_sb->s_blocksize, csum_size); if (err) goto out; BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata"); err = ext4_handle_dirty_dirblock(handle, dir, bh); if (unlikely(err)) goto out; return 0; out: if (err != -ENOENT) ext4_std_error(dir->i_sb, err); return err; } /* * Set directory link count to 1 if nlinks > EXT4_LINK_MAX, or if nlinks == 2 * since this indicates that nlinks count was previously 1 to avoid overflowing * the 16-bit i_links_count field on disk. Directories with i_nlink == 1 mean * that subdirectory link counts are not being maintained accurately. * * The caller has already checked for i_nlink overflow in case the DIR_LINK * feature is not enabled and returned -EMLINK. The is_dx() check is a proxy * for checking S_ISDIR(inode) (since the INODE_INDEX feature will not be set * on regular files) and to avoid creating huge/slow non-HTREE directories. */ static void ext4_inc_count(struct inode *inode) { inc_nlink(inode); if (is_dx(inode) && (inode->i_nlink > EXT4_LINK_MAX || inode->i_nlink == 2)) set_nlink(inode, 1); } /* * If a directory had nlink == 1, then we should let it be 1. This indicates * directory has >EXT4_LINK_MAX subdirs. */ static void ext4_dec_count(struct inode *inode) { if (!S_ISDIR(inode->i_mode) || inode->i_nlink > 2) drop_nlink(inode); } /* * Add non-directory inode to a directory. On success, the inode reference is * consumed by dentry is instantiation. This is also indicated by clearing of * *inodep pointer. On failure, the caller is responsible for dropping the * inode reference in the safe context. */ static int ext4_add_nondir(handle_t *handle, struct dentry *dentry, struct inode **inodep) { struct inode *dir = d_inode(dentry->d_parent); struct inode *inode = *inodep; int err = ext4_add_entry(handle, dentry, inode); if (!err) { err = ext4_mark_inode_dirty(handle, inode); if (IS_DIRSYNC(dir)) ext4_handle_sync(handle); d_instantiate_new(dentry, inode); *inodep = NULL; return err; } drop_nlink(inode); ext4_mark_inode_dirty(handle, inode); ext4_orphan_add(handle, inode); unlock_new_inode(inode); return err; } /* * By the time this is called, we already have created * the directory cache entry for the new file, but it * is so far negative - it has no inode. * * If the create succeeds, we fill in the inode information * with d_instantiate(). */ static int ext4_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, umode_t mode, bool excl) { handle_t *handle; struct inode *inode; int err, credits, retries = 0; err = dquot_initialize(dir); if (err) return err; credits = (EXT4_DATA_TRANS_BLOCKS(dir->i_sb) + EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3); retry: inode = ext4_new_inode_start_handle(idmap, dir, mode, &dentry->d_name, 0, NULL, EXT4_HT_DIR, credits); handle = ext4_journal_current_handle(); err = PTR_ERR(inode); if (!IS_ERR(inode)) { inode->i_op = &ext4_file_inode_operations; inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); err = ext4_add_nondir(handle, dentry, &inode); if (!err) ext4_fc_track_create(handle, dentry); } if (handle) ext4_journal_stop(handle); if (!IS_ERR_OR_NULL(inode)) iput(inode); if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries)) goto retry; return err; } static int ext4_mknod(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, umode_t mode, dev_t rdev) { handle_t *handle; struct inode *inode; int err, credits, retries = 0; err = dquot_initialize(dir); if (err) return err; credits = (EXT4_DATA_TRANS_BLOCKS(dir->i_sb) + EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3); retry: inode = ext4_new_inode_start_handle(idmap, dir, mode, &dentry->d_name, 0, NULL, EXT4_HT_DIR, credits); handle = ext4_journal_current_handle(); err = PTR_ERR(inode); if (!IS_ERR(inode)) { init_special_inode(inode, inode->i_mode, rdev); inode->i_op = &ext4_special_inode_operations; err = ext4_add_nondir(handle, dentry, &inode); if (!err) ext4_fc_track_create(handle, dentry); } if (handle) ext4_journal_stop(handle); if (!IS_ERR_OR_NULL(inode)) iput(inode); if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries)) goto retry; return err; } static int ext4_tmpfile(struct mnt_idmap *idmap, struct inode *dir, struct file *file, umode_t mode) { handle_t *handle; struct inode *inode; int err, retries = 0; err = dquot_initialize(dir); if (err) return err; retry: inode = ext4_new_inode_start_handle(idmap, dir, mode, NULL, 0, NULL, EXT4_HT_DIR, EXT4_MAXQUOTAS_TRANS_BLOCKS(dir->i_sb) + 4 + EXT4_XATTR_TRANS_BLOCKS); handle = ext4_journal_current_handle(); err = PTR_ERR(inode); if (!IS_ERR(inode)) { inode->i_op = &ext4_file_inode_operations; inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); d_tmpfile(file, inode); err = ext4_orphan_add(handle, inode); if (err) goto err_unlock_inode; mark_inode_dirty(inode); unlock_new_inode(inode); } if (handle) ext4_journal_stop(handle); if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries)) goto retry; return finish_open_simple(file, err); err_unlock_inode: ext4_journal_stop(handle); unlock_new_inode(inode); return err; } int ext4_init_dirblock(handle_t *handle, struct inode *inode, struct buffer_head *bh, unsigned int parent_ino, void *inline_buf, int inline_size) { struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *) bh->b_data; size_t blocksize = bh->b_size; int csum_size = 0, header_size; if (ext4_has_feature_metadata_csum(inode->i_sb)) csum_size = sizeof(struct ext4_dir_entry_tail); de->inode = cpu_to_le32(inode->i_ino); de->name_len = 1; de->rec_len = ext4_rec_len_to_disk(ext4_dir_rec_len(de->name_len, NULL), blocksize); memcpy(de->name, ".", 2); ext4_set_de_type(inode->i_sb, de, S_IFDIR); de = ext4_next_entry(de, blocksize); de->inode = cpu_to_le32(parent_ino); de->name_len = 2; memcpy(de->name, "..", 3); ext4_set_de_type(inode->i_sb, de, S_IFDIR); if (inline_buf) { de->rec_len = ext4_rec_len_to_disk( ext4_dir_rec_len(de->name_len, NULL), blocksize); de = ext4_next_entry(de, blocksize); header_size = (char *)de - bh->b_data; memcpy((void *)de, inline_buf, inline_size); ext4_update_final_de(bh->b_data, inline_size + header_size, blocksize - csum_size); } else { de->rec_len = ext4_rec_len_to_disk(blocksize - (csum_size + ext4_dir_rec_len(1, NULL)), blocksize); } if (csum_size) ext4_initialize_dirent_tail(bh, blocksize); BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata"); set_buffer_uptodate(bh); set_buffer_verified(bh); return ext4_handle_dirty_dirblock(handle, inode, bh); } int ext4_init_new_dir(handle_t *handle, struct inode *dir, struct inode *inode) { struct buffer_head *dir_block = NULL; ext4_lblk_t block = 0; int err; if (ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) { err = ext4_try_create_inline_dir(handle, dir, inode); if (err < 0 && err != -ENOSPC) goto out; if (!err) goto out; } set_nlink(inode, 2); inode->i_size = 0; dir_block = ext4_append(handle, inode, &block); if (IS_ERR(dir_block)) return PTR_ERR(dir_block); err = ext4_init_dirblock(handle, inode, dir_block, dir->i_ino, NULL, 0); out: brelse(dir_block); return err; } static struct dentry *ext4_mkdir(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, umode_t mode) { handle_t *handle; struct inode *inode; int err, err2 = 0, credits, retries = 0; if (EXT4_DIR_LINK_MAX(dir)) return ERR_PTR(-EMLINK); err = dquot_initialize(dir); if (err) return ERR_PTR(err); credits = (EXT4_DATA_TRANS_BLOCKS(dir->i_sb) + EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3); retry: inode = ext4_new_inode_start_handle(idmap, dir, S_IFDIR | mode, &dentry->d_name, 0, NULL, EXT4_HT_DIR, credits); handle = ext4_journal_current_handle(); err = PTR_ERR(inode); if (IS_ERR(inode)) goto out_stop; inode->i_op = &ext4_dir_inode_operations; inode->i_fop = &ext4_dir_operations; err = ext4_init_new_dir(handle, dir, inode); if (err) goto out_clear_inode; err = ext4_mark_inode_dirty(handle, inode); if (!err) err = ext4_add_entry(handle, dentry, inode); if (err) { out_clear_inode: clear_nlink(inode); ext4_orphan_add(handle, inode); unlock_new_inode(inode); err2 = ext4_mark_inode_dirty(handle, inode); if (unlikely(err2)) err = err2; ext4_journal_stop(handle); iput(inode); goto out_retry; } ext4_inc_count(dir); ext4_update_dx_flag(dir); err = ext4_mark_inode_dirty(handle, dir); if (err) goto out_clear_inode; d_instantiate_new(dentry, inode); ext4_fc_track_create(handle, dentry); if (IS_DIRSYNC(dir)) ext4_handle_sync(handle); out_stop: if (handle) ext4_journal_stop(handle); out_retry: if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries)) goto retry; return ERR_PTR(err); } /* * routine to check that the specified directory is empty (for rmdir) */ bool ext4_empty_dir(struct inode *inode) { unsigned int offset; struct buffer_head *bh; struct ext4_dir_entry_2 *de; struct super_block *sb; if (ext4_has_inline_data(inode)) { int has_inline_data = 1; int ret; ret = empty_inline_dir(inode, &has_inline_data); if (has_inline_data) return ret; } sb = inode->i_sb; if (inode->i_size < ext4_dir_rec_len(1, NULL) + ext4_dir_rec_len(2, NULL)) { EXT4_ERROR_INODE(inode, "invalid size"); return false; } bh = ext4_read_dirblock(inode, 0, EITHER); if (IS_ERR(bh)) return false; de = (struct ext4_dir_entry_2 *) bh->b_data; if (ext4_check_dir_entry(inode, NULL, de, bh, bh->b_data, bh->b_size, 0) || le32_to_cpu(de->inode) != inode->i_ino || de->name_len != 1 || de->name[0] != '.') { ext4_warning_inode(inode, "directory missing '.'"); brelse(bh); return false; } offset = ext4_rec_len_from_disk(de->rec_len, sb->s_blocksize); de = ext4_next_entry(de, sb->s_blocksize); if (ext4_check_dir_entry(inode, NULL, de, bh, bh->b_data, bh->b_size, offset) || le32_to_cpu(de->inode) == 0 || de->name_len != 2 || de->name[0] != '.' || de->name[1] != '.') { ext4_warning_inode(inode, "directory missing '..'"); brelse(bh); return false; } offset += ext4_rec_len_from_disk(de->rec_len, sb->s_blocksize); while (offset < inode->i_size) { if (!(offset & (sb->s_blocksize - 1))) { unsigned int lblock; brelse(bh); lblock = offset >> EXT4_BLOCK_SIZE_BITS(sb); bh = ext4_read_dirblock(inode, lblock, EITHER); if (bh == NULL) { offset += sb->s_blocksize; continue; } if (IS_ERR(bh)) return false; } de = (struct ext4_dir_entry_2 *) (bh->b_data + (offset & (sb->s_blocksize - 1))); if (ext4_check_dir_entry(inode, NULL, de, bh, bh->b_data, bh->b_size, offset) || le32_to_cpu(de->inode)) { brelse(bh); return false; } offset += ext4_rec_len_from_disk(de->rec_len, sb->s_blocksize); } brelse(bh); return true; } static int ext4_rmdir(struct inode *dir, struct dentry *dentry) { int retval; struct inode *inode; struct buffer_head *bh; struct ext4_dir_entry_2 *de = NULL; handle_t *handle = NULL; retval = ext4_emergency_state(dir->i_sb); if (unlikely(retval)) return retval; /* Initialize quotas before so that eventual writes go in * separate transaction */ retval = dquot_initialize(dir); if (retval) return retval; retval = dquot_initialize(d_inode(dentry)); if (retval) return retval; retval = -ENOENT; bh = ext4_find_entry(dir, &dentry->d_name, &de, NULL); if (IS_ERR(bh)) return PTR_ERR(bh); if (!bh) goto end_rmdir; inode = d_inode(dentry); retval = -EFSCORRUPTED; if (le32_to_cpu(de->inode) != inode->i_ino) goto end_rmdir; retval = -ENOTEMPTY; if (!ext4_empty_dir(inode)) goto end_rmdir; handle = ext4_journal_start(dir, EXT4_HT_DIR, EXT4_DATA_TRANS_BLOCKS(dir->i_sb)); if (IS_ERR(handle)) { retval = PTR_ERR(handle); handle = NULL; goto end_rmdir; } if (IS_DIRSYNC(dir)) ext4_handle_sync(handle); retval = ext4_delete_entry(handle, dir, de, bh); if (retval) goto end_rmdir; if (!EXT4_DIR_LINK_EMPTY(inode)) ext4_warning_inode(inode, "empty directory '%.*s' has too many links (%u)", dentry->d_name.len, dentry->d_name.name, inode->i_nlink); inode_inc_iversion(inode); clear_nlink(inode); /* There's no need to set i_disksize: the fact that i_nlink is * zero will ensure that the right thing happens during any * recovery. */ inode->i_size = 0; ext4_orphan_add(handle, inode); inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir)); inode_set_ctime_current(inode); retval = ext4_mark_inode_dirty(handle, inode); if (retval) goto end_rmdir; ext4_dec_count(dir); ext4_update_dx_flag(dir); ext4_fc_track_unlink(handle, dentry); retval = ext4_mark_inode_dirty(handle, dir); /* VFS negative dentries are incompatible with Encoding and * Case-insensitiveness. Eventually we'll want avoid * invalidating the dentries here, alongside with returning the * negative dentries at ext4_lookup(), when it is better * supported by the VFS for the CI case. */ if (IS_ENABLED(CONFIG_UNICODE) && IS_CASEFOLDED(dir)) d_invalidate(dentry); end_rmdir: brelse(bh); if (handle) ext4_journal_stop(handle); return retval; } int __ext4_unlink(struct inode *dir, const struct qstr *d_name, struct inode *inode, struct dentry *dentry /* NULL during fast_commit recovery */) { int retval = -ENOENT; struct buffer_head *bh; struct ext4_dir_entry_2 *de = NULL; handle_t *handle; int skip_remove_dentry = 0; /* * Keep this outside the transaction; it may have to set up the * directory's encryption key, which isn't GFP_NOFS-safe. */ bh = ext4_find_entry(dir, d_name, &de, NULL); if (IS_ERR(bh)) return PTR_ERR(bh); if (!bh) return -ENOENT; if (le32_to_cpu(de->inode) != inode->i_ino) { /* * It's okay if we find dont find dentry which matches * the inode. That's because it might have gotten * renamed to a different inode number */ if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) skip_remove_dentry = 1; else goto out_bh; } handle = ext4_journal_start(dir, EXT4_HT_DIR, EXT4_DATA_TRANS_BLOCKS(dir->i_sb)); if (IS_ERR(handle)) { retval = PTR_ERR(handle); goto out_bh; } if (IS_DIRSYNC(dir)) ext4_handle_sync(handle); if (!skip_remove_dentry) { retval = ext4_delete_entry(handle, dir, de, bh); if (retval) goto out_handle; inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir)); ext4_update_dx_flag(dir); retval = ext4_mark_inode_dirty(handle, dir); if (retval) goto out_handle; } else { retval = 0; } if (inode->i_nlink == 0) ext4_warning_inode(inode, "Deleting file '%.*s' with no links", d_name->len, d_name->name); else drop_nlink(inode); if (!inode->i_nlink) ext4_orphan_add(handle, inode); inode_set_ctime_current(inode); retval = ext4_mark_inode_dirty(handle, inode); if (dentry && !retval) ext4_fc_track_unlink(handle, dentry); out_handle: ext4_journal_stop(handle); out_bh: brelse(bh); return retval; } static int ext4_unlink(struct inode *dir, struct dentry *dentry) { int retval; retval = ext4_emergency_state(dir->i_sb); if (unlikely(retval)) return retval; trace_ext4_unlink_enter(dir, dentry); /* * Initialize quotas before so that eventual writes go * in separate transaction */ retval = dquot_initialize(dir); if (retval) goto out_trace; retval = dquot_initialize(d_inode(dentry)); if (retval) goto out_trace; retval = __ext4_unlink(dir, &dentry->d_name, d_inode(dentry), dentry); /* VFS negative dentries are incompatible with Encoding and * Case-insensitiveness. Eventually we'll want avoid * invalidating the dentries here, alongside with returning the * negative dentries at ext4_lookup(), when it is better * supported by the VFS for the CI case. */ if (IS_ENABLED(CONFIG_UNICODE) && IS_CASEFOLDED(dir)) d_invalidate(dentry); out_trace: trace_ext4_unlink_exit(dentry, retval); return retval; } static int ext4_init_symlink_block(handle_t *handle, struct inode *inode, struct fscrypt_str *disk_link) { struct buffer_head *bh; char *kaddr; int err = 0; bh = ext4_bread(handle, inode, 0, EXT4_GET_BLOCKS_CREATE); if (IS_ERR(bh)) return PTR_ERR(bh); BUFFER_TRACE(bh, "get_write_access"); err = ext4_journal_get_write_access(handle, inode->i_sb, bh, EXT4_JTR_NONE); if (err) goto out; kaddr = (char *)bh->b_data; memcpy(kaddr, disk_link->name, disk_link->len); inode->i_size = disk_link->len - 1; EXT4_I(inode)->i_disksize = inode->i_size; err = ext4_handle_dirty_metadata(handle, inode, bh); out: brelse(bh); return err; } static int ext4_symlink(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, const char *symname) { handle_t *handle; struct inode *inode; int err, len = strlen(symname); int credits; struct fscrypt_str disk_link; int retries = 0; err = ext4_emergency_state(dir->i_sb); if (unlikely(err)) return err; err = fscrypt_prepare_symlink(dir, symname, len, dir->i_sb->s_blocksize, &disk_link); if (err) return err; err = dquot_initialize(dir); if (err) return err; /* * EXT4_INDEX_EXTRA_TRANS_BLOCKS for addition of entry into the * directory. +3 for inode, inode bitmap, group descriptor allocation. * EXT4_DATA_TRANS_BLOCKS for the data block allocation and * modification. */ credits = EXT4_DATA_TRANS_BLOCKS(dir->i_sb) + EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3; retry: inode = ext4_new_inode_start_handle(idmap, dir, S_IFLNK|S_IRWXUGO, &dentry->d_name, 0, NULL, EXT4_HT_DIR, credits); handle = ext4_journal_current_handle(); if (IS_ERR(inode)) { if (handle) ext4_journal_stop(handle); err = PTR_ERR(inode); goto out_retry; } if (IS_ENCRYPTED(inode)) { err = fscrypt_encrypt_symlink(inode, symname, len, &disk_link); if (err) goto err_drop_inode; inode->i_op = &ext4_encrypted_symlink_inode_operations; } else { if ((disk_link.len > EXT4_N_BLOCKS * 4)) { inode->i_op = &ext4_symlink_inode_operations; } else { inode->i_op = &ext4_fast_symlink_inode_operations; } } if ((disk_link.len > EXT4_N_BLOCKS * 4)) { /* alloc symlink block and fill it */ err = ext4_init_symlink_block(handle, inode, &disk_link); if (err) goto err_drop_inode; } else { /* clear the extent format for fast symlink */ ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS); memcpy((char *)&EXT4_I(inode)->i_data, disk_link.name, disk_link.len); inode->i_size = disk_link.len - 1; EXT4_I(inode)->i_disksize = inode->i_size; if (!IS_ENCRYPTED(inode)) inode_set_cached_link(inode, (char *)&EXT4_I(inode)->i_data, inode->i_size); } err = ext4_add_nondir(handle, dentry, &inode); if (handle) ext4_journal_stop(handle); iput(inode); goto out_retry; err_drop_inode: clear_nlink(inode); ext4_mark_inode_dirty(handle, inode); ext4_orphan_add(handle, inode); unlock_new_inode(inode); if (handle) ext4_journal_stop(handle); iput(inode); out_retry: if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries)) goto retry; if (disk_link.name != (unsigned char *)symname) kfree(disk_link.name); return err; } int __ext4_link(struct inode *dir, struct inode *inode, struct dentry *dentry) { handle_t *handle; int err, retries = 0; retry: handle = ext4_journal_start(dir, EXT4_HT_DIR, (EXT4_DATA_TRANS_BLOCKS(dir->i_sb) + EXT4_INDEX_EXTRA_TRANS_BLOCKS) + 1); if (IS_ERR(handle)) return PTR_ERR(handle); if (IS_DIRSYNC(dir)) ext4_handle_sync(handle); inode_set_ctime_current(inode); ext4_inc_count(inode); ihold(inode); err = ext4_add_entry(handle, dentry, inode); if (!err) { err = ext4_mark_inode_dirty(handle, inode); /* this can happen only for tmpfile being * linked the first time */ if (inode->i_nlink == 1) ext4_orphan_del(handle, inode); d_instantiate(dentry, inode); ext4_fc_track_link(handle, dentry); } else { drop_nlink(inode); iput(inode); } ext4_journal_stop(handle); if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries)) goto retry; return err; } static int ext4_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry) { struct inode *inode = d_inode(old_dentry); int err; if (inode->i_nlink >= EXT4_LINK_MAX) return -EMLINK; err = fscrypt_prepare_link(old_dentry, dir, dentry); if (err) return err; if ((ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) && (!projid_eq(EXT4_I(dir)->i_projid, EXT4_I(old_dentry->d_inode)->i_projid))) return -EXDEV; err = dquot_initialize(dir); if (err) return err; return __ext4_link(dir, inode, dentry); } /* * Try to find buffer head where contains the parent block. * It should be the inode block if it is inlined or the 1st block * if it is a normal dir. */ static struct buffer_head *ext4_get_first_dir_block(handle_t *handle, struct inode *inode, int *retval, struct ext4_dir_entry_2 **parent_de, int *inlined) { struct buffer_head *bh; if (!ext4_has_inline_data(inode)) { struct ext4_dir_entry_2 *de; unsigned int offset; bh = ext4_read_dirblock(inode, 0, EITHER); if (IS_ERR(bh)) { *retval = PTR_ERR(bh); return NULL; } de = (struct ext4_dir_entry_2 *) bh->b_data; if (ext4_check_dir_entry(inode, NULL, de, bh, bh->b_data, bh->b_size, 0) || le32_to_cpu(de->inode) != inode->i_ino || de->name_len != 1 || de->name[0] != '.') { EXT4_ERROR_INODE(inode, "directory missing '.'"); brelse(bh); *retval = -EFSCORRUPTED; return NULL; } offset = ext4_rec_len_from_disk(de->rec_len, inode->i_sb->s_blocksize); de = ext4_next_entry(de, inode->i_sb->s_blocksize); if (ext4_check_dir_entry(inode, NULL, de, bh, bh->b_data, bh->b_size, offset) || le32_to_cpu(de->inode) == 0 || de->name_len != 2 || de->name[0] != '.' || de->name[1] != '.') { EXT4_ERROR_INODE(inode, "directory missing '..'"); brelse(bh); *retval = -EFSCORRUPTED; return NULL; } *parent_de = de; return bh; } *inlined = 1; return ext4_get_first_inline_block(inode, parent_de, retval); } struct ext4_renament { struct inode *dir; struct dentry *dentry; struct inode *inode; bool is_dir; int dir_nlink_delta; /* entry for "dentry" */ struct buffer_head *bh; struct ext4_dir_entry_2 *de; int inlined; /* entry for ".." in inode if it's a directory */ struct buffer_head *dir_bh; struct ext4_dir_entry_2 *parent_de; int dir_inlined; }; static int ext4_rename_dir_prepare(handle_t *handle, struct ext4_renament *ent, bool is_cross) { int retval; ent->is_dir = true; if (!is_cross) return 0; ent->dir_bh = ext4_get_first_dir_block(handle, ent->inode, &retval, &ent->parent_de, &ent->dir_inlined); if (!ent->dir_bh) return retval; if (le32_to_cpu(ent->parent_de->inode) != ent->dir->i_ino) return -EFSCORRUPTED; BUFFER_TRACE(ent->dir_bh, "get_write_access"); return ext4_journal_get_write_access(handle, ent->dir->i_sb, ent->dir_bh, EXT4_JTR_NONE); } static int ext4_rename_dir_finish(handle_t *handle, struct ext4_renament *ent, unsigned dir_ino) { int retval; if (!ent->dir_bh) return 0; ent->parent_de->inode = cpu_to_le32(dir_ino); BUFFER_TRACE(ent->dir_bh, "call ext4_handle_dirty_metadata"); if (!ent->dir_inlined) { if (is_dx(ent->inode)) { retval = ext4_handle_dirty_dx_node(handle, ent->inode, ent->dir_bh); } else { retval = ext4_handle_dirty_dirblock(handle, ent->inode, ent->dir_bh); } } else { retval = ext4_mark_inode_dirty(handle, ent->inode); } if (retval) { ext4_std_error(ent->dir->i_sb, retval); return retval; } return 0; } static int ext4_setent(handle_t *handle, struct ext4_renament *ent, unsigned ino, unsigned file_type) { int retval, retval2; BUFFER_TRACE(ent->bh, "get write access"); retval = ext4_journal_get_write_access(handle, ent->dir->i_sb, ent->bh, EXT4_JTR_NONE); if (retval) return retval; ent->de->inode = cpu_to_le32(ino); if (ext4_has_feature_filetype(ent->dir->i_sb)) ent->de->file_type = file_type; inode_inc_iversion(ent->dir); inode_set_mtime_to_ts(ent->dir, inode_set_ctime_current(ent->dir)); retval = ext4_mark_inode_dirty(handle, ent->dir); BUFFER_TRACE(ent->bh, "call ext4_handle_dirty_metadata"); if (!ent->inlined) { retval2 = ext4_handle_dirty_dirblock(handle, ent->dir, ent->bh); if (unlikely(retval2)) { ext4_std_error(ent->dir->i_sb, retval2); return retval2; } } return retval; } static void ext4_resetent(handle_t *handle, struct ext4_renament *ent, unsigned ino, unsigned file_type) { struct ext4_renament old = *ent; int retval = 0; /* * old->de could have moved from under us during make indexed dir, * so the old->de may no longer valid and need to find it again * before reset old inode info. */ old.bh = ext4_find_entry(old.dir, &old.dentry->d_name, &old.de, &old.inlined); if (IS_ERR(old.bh)) retval = PTR_ERR(old.bh); if (!old.bh) retval = -ENOENT; if (retval) { ext4_std_error(old.dir->i_sb, retval); return; } ext4_setent(handle, &old, ino, file_type); brelse(old.bh); } static int ext4_find_delete_entry(handle_t *handle, struct inode *dir, const struct qstr *d_name) { int retval = -ENOENT; struct buffer_head *bh; struct ext4_dir_entry_2 *de = NULL; bh = ext4_find_entry(dir, d_name, &de, NULL); if (IS_ERR(bh)) return PTR_ERR(bh); if (bh) { retval = ext4_delete_entry(handle, dir, de, bh); brelse(bh); } return retval; } static void ext4_rename_delete(handle_t *handle, struct ext4_renament *ent, int force_reread) { int retval; /* * ent->de could have moved from under us during htree split, so make * sure that we are deleting the right entry. We might also be pointing * to a stale entry in the unused part of ent->bh so just checking inum * and the name isn't enough. */ if (le32_to_cpu(ent->de->inode) != ent->inode->i_ino || ent->de->name_len != ent->dentry->d_name.len || strncmp(ent->de->name, ent->dentry->d_name.name, ent->de->name_len) || force_reread) { retval = ext4_find_delete_entry(handle, ent->dir, &ent->dentry->d_name); } else { retval = ext4_delete_entry(handle, ent->dir, ent->de, ent->bh); if (retval == -ENOENT) { retval = ext4_find_delete_entry(handle, ent->dir, &ent->dentry->d_name); } } if (retval) { ext4_warning_inode(ent->dir, "Deleting old file: nlink %d, error=%d", ent->dir->i_nlink, retval); } } static void ext4_update_dir_count(handle_t *handle, struct ext4_renament *ent) { if (ent->dir_nlink_delta) { if (ent->dir_nlink_delta == -1) ext4_dec_count(ent->dir); else ext4_inc_count(ent->dir); ext4_mark_inode_dirty(handle, ent->dir); } } static struct inode *ext4_whiteout_for_rename(struct mnt_idmap *idmap, struct ext4_renament *ent, int credits, handle_t **h) { struct inode *wh; handle_t *handle; int retries = 0; /* * for inode block, sb block, group summaries, * and inode bitmap */ credits += (EXT4_MAXQUOTAS_TRANS_BLOCKS(ent->dir->i_sb) + EXT4_XATTR_TRANS_BLOCKS + 4); retry: wh = ext4_new_inode_start_handle(idmap, ent->dir, S_IFCHR | WHITEOUT_MODE, &ent->dentry->d_name, 0, NULL, EXT4_HT_DIR, credits); handle = ext4_journal_current_handle(); if (IS_ERR(wh)) { if (handle) ext4_journal_stop(handle); if (PTR_ERR(wh) == -ENOSPC && ext4_should_retry_alloc(ent->dir->i_sb, &retries)) goto retry; } else { *h = handle; init_special_inode(wh, wh->i_mode, WHITEOUT_DEV); wh->i_op = &ext4_special_inode_operations; } return wh; } /* * Anybody can rename anything with this: the permission checks are left to the * higher-level routines. * * n.b. old_{dentry,inode) refers to the source dentry/inode * while new_{dentry,inode) refers to the destination dentry/inode * This comes from rename(const char *oldpath, const char *newpath) */ static int ext4_rename(struct mnt_idmap *idmap, struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry, unsigned int flags) { handle_t *handle = NULL; struct ext4_renament old = { .dir = old_dir, .dentry = old_dentry, .inode = d_inode(old_dentry), }; struct ext4_renament new = { .dir = new_dir, .dentry = new_dentry, .inode = d_inode(new_dentry), }; int force_reread; int retval; struct inode *whiteout = NULL; int credits; u8 old_file_type; if (new.inode && new.inode->i_nlink == 0) { EXT4_ERROR_INODE(new.inode, "target of rename is already freed"); return -EFSCORRUPTED; } if ((ext4_test_inode_flag(new_dir, EXT4_INODE_PROJINHERIT)) && (!projid_eq(EXT4_I(new_dir)->i_projid, EXT4_I(old_dentry->d_inode)->i_projid))) return -EXDEV; retval = dquot_initialize(old.dir); if (retval) return retval; retval = dquot_initialize(old.inode); if (retval) return retval; retval = dquot_initialize(new.dir); if (retval) return retval; /* Initialize quotas before so that eventual writes go * in separate transaction */ if (new.inode) { retval = dquot_initialize(new.inode); if (retval) return retval; } old.bh = ext4_find_entry(old.dir, &old.dentry->d_name, &old.de, &old.inlined); if (IS_ERR(old.bh)) return PTR_ERR(old.bh); /* * Check for inode number is _not_ due to possible IO errors. * We might rmdir the source, keep it as pwd of some process * and merrily kill the link to whatever was created under the * same name. Goodbye sticky bit ;-< */ retval = -ENOENT; if (!old.bh || le32_to_cpu(old.de->inode) != old.inode->i_ino) goto release_bh; new.bh = ext4_find_entry(new.dir, &new.dentry->d_name, &new.de, &new.inlined); if (IS_ERR(new.bh)) { retval = PTR_ERR(new.bh); new.bh = NULL; goto release_bh; } if (new.bh) { if (!new.inode) { brelse(new.bh); new.bh = NULL; } } if (new.inode && !test_opt(new.dir->i_sb, NO_AUTO_DA_ALLOC)) ext4_alloc_da_blocks(old.inode); credits = (2 * EXT4_DATA_TRANS_BLOCKS(old.dir->i_sb) + EXT4_INDEX_EXTRA_TRANS_BLOCKS + 2); if (!(flags & RENAME_WHITEOUT)) { handle = ext4_journal_start(old.dir, EXT4_HT_DIR, credits); if (IS_ERR(handle)) { retval = PTR_ERR(handle); goto release_bh; } } else { whiteout = ext4_whiteout_for_rename(idmap, &old, credits, &handle); if (IS_ERR(whiteout)) { retval = PTR_ERR(whiteout); goto release_bh; } } old_file_type = old.de->file_type; if (IS_DIRSYNC(old.dir) || IS_DIRSYNC(new.dir)) ext4_handle_sync(handle); if (S_ISDIR(old.inode->i_mode)) { if (new.inode) { retval = -ENOTEMPTY; if (!ext4_empty_dir(new.inode)) goto end_rename; } else { retval = -EMLINK; if (new.dir != old.dir && EXT4_DIR_LINK_MAX(new.dir)) goto end_rename; } retval = ext4_rename_dir_prepare(handle, &old, new.dir != old.dir); if (retval) goto end_rename; } /* * If we're renaming a file within an inline_data dir and adding or * setting the new dirent causes a conversion from inline_data to * extents/blockmap, we need to force the dirent delete code to * re-read the directory, or else we end up trying to delete a dirent * from what is now the extent tree root (or a block map). */ force_reread = (new.dir->i_ino == old.dir->i_ino && ext4_test_inode_flag(new.dir, EXT4_INODE_INLINE_DATA)); if (whiteout) { /* * Do this before adding a new entry, so the old entry is sure * to be still pointing to the valid old entry. */ retval = ext4_setent(handle, &old, whiteout->i_ino, EXT4_FT_CHRDEV); if (retval) goto end_rename; retval = ext4_mark_inode_dirty(handle, whiteout); if (unlikely(retval)) goto end_rename; } if (!new.bh) { retval = ext4_add_entry(handle, new.dentry, old.inode); if (retval) goto end_rename; } else { retval = ext4_setent(handle, &new, old.inode->i_ino, old_file_type); if (retval) goto end_rename; } if (force_reread) force_reread = !ext4_test_inode_flag(new.dir, EXT4_INODE_INLINE_DATA); /* * Like most other Unix systems, set the ctime for inodes on a * rename. */ inode_set_ctime_current(old.inode); retval = ext4_mark_inode_dirty(handle, old.inode); if (unlikely(retval)) goto end_rename; if (!whiteout) { /* * ok, that's it */ ext4_rename_delete(handle, &old, force_reread); } if (new.inode) { ext4_dec_count(new.inode); inode_set_ctime_current(new.inode); } inode_set_mtime_to_ts(old.dir, inode_set_ctime_current(old.dir)); ext4_update_dx_flag(old.dir); if (old.is_dir) { retval = ext4_rename_dir_finish(handle, &old, new.dir->i_ino); if (retval) goto end_rename; ext4_dec_count(old.dir); if (new.inode) { /* checked ext4_empty_dir above, can't have another * parent, ext4_dec_count() won't work for many-linked * dirs */ clear_nlink(new.inode); } else { ext4_inc_count(new.dir); ext4_update_dx_flag(new.dir); retval = ext4_mark_inode_dirty(handle, new.dir); if (unlikely(retval)) goto end_rename; } } retval = ext4_mark_inode_dirty(handle, old.dir); if (unlikely(retval)) goto end_rename; if (old.is_dir) { /* * We disable fast commits here that's because the * replay code is not yet capable of changing dot dot * dirents in directories. */ ext4_fc_mark_ineligible(old.inode->i_sb, EXT4_FC_REASON_RENAME_DIR, handle); } else { struct super_block *sb = old.inode->i_sb; if (new.inode) ext4_fc_track_unlink(handle, new.dentry); if (test_opt2(sb, JOURNAL_FAST_COMMIT) && !(EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY) && !(ext4_test_mount_flag(sb, EXT4_MF_FC_INELIGIBLE))) { __ext4_fc_track_link(handle, old.inode, new.dentry); __ext4_fc_track_unlink(handle, old.inode, old.dentry); if (whiteout) __ext4_fc_track_create(handle, whiteout, old.dentry); } } if (new.inode) { retval = ext4_mark_inode_dirty(handle, new.inode); if (unlikely(retval)) goto end_rename; if (!new.inode->i_nlink) ext4_orphan_add(handle, new.inode); } retval = 0; end_rename: if (whiteout) { if (retval) { ext4_resetent(handle, &old, old.inode->i_ino, old_file_type); drop_nlink(whiteout); ext4_mark_inode_dirty(handle, whiteout); ext4_orphan_add(handle, whiteout); } unlock_new_inode(whiteout); ext4_journal_stop(handle); iput(whiteout); } else { ext4_journal_stop(handle); } release_bh: brelse(old.dir_bh); brelse(old.bh); brelse(new.bh); return retval; } static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry) { handle_t *handle = NULL; struct ext4_renament old = { .dir = old_dir, .dentry = old_dentry, .inode = d_inode(old_dentry), }; struct ext4_renament new = { .dir = new_dir, .dentry = new_dentry, .inode = d_inode(new_dentry), }; u8 new_file_type; int retval; if ((ext4_test_inode_flag(new_dir, EXT4_INODE_PROJINHERIT) && !projid_eq(EXT4_I(new_dir)->i_projid, EXT4_I(old_dentry->d_inode)->i_projid)) || (ext4_test_inode_flag(old_dir, EXT4_INODE_PROJINHERIT) && !projid_eq(EXT4_I(old_dir)->i_projid, EXT4_I(new_dentry->d_inode)->i_projid))) return -EXDEV; retval = dquot_initialize(old.dir); if (retval) return retval; retval = dquot_initialize(new.dir); if (retval) return retval; old.bh = ext4_find_entry(old.dir, &old.dentry->d_name, &old.de, &old.inlined); if (IS_ERR(old.bh)) return PTR_ERR(old.bh); /* * Check for inode number is _not_ due to possible IO errors. * We might rmdir the source, keep it as pwd of some process * and merrily kill the link to whatever was created under the * same name. Goodbye sticky bit ;-< */ retval = -ENOENT; if (!old.bh || le32_to_cpu(old.de->inode) != old.inode->i_ino) goto end_rename; new.bh = ext4_find_entry(new.dir, &new.dentry->d_name, &new.de, &new.inlined); if (IS_ERR(new.bh)) { retval = PTR_ERR(new.bh); new.bh = NULL; goto end_rename; } /* RENAME_EXCHANGE case: old *and* new must both exist */ if (!new.bh || le32_to_cpu(new.de->inode) != new.inode->i_ino) goto end_rename; handle = ext4_journal_start(old.dir, EXT4_HT_DIR, (2 * EXT4_DATA_TRANS_BLOCKS(old.dir->i_sb) + 2 * EXT4_INDEX_EXTRA_TRANS_BLOCKS + 2)); if (IS_ERR(handle)) { retval = PTR_ERR(handle); handle = NULL; goto end_rename; } if (IS_DIRSYNC(old.dir) || IS_DIRSYNC(new.dir)) ext4_handle_sync(handle); if (S_ISDIR(old.inode->i_mode)) { retval = ext4_rename_dir_prepare(handle, &old, new.dir != old.dir); if (retval) goto end_rename; } if (S_ISDIR(new.inode->i_mode)) { retval = ext4_rename_dir_prepare(handle, &new, new.dir != old.dir); if (retval) goto end_rename; } /* * Other than the special case of overwriting a directory, parents' * nlink only needs to be modified if this is a cross directory rename. */ if (old.dir != new.dir && old.is_dir != new.is_dir) { old.dir_nlink_delta = old.is_dir ? -1 : 1; new.dir_nlink_delta = -old.dir_nlink_delta; retval = -EMLINK; if ((old.dir_nlink_delta > 0 && EXT4_DIR_LINK_MAX(old.dir)) || (new.dir_nlink_delta > 0 && EXT4_DIR_LINK_MAX(new.dir))) goto end_rename; } new_file_type = new.de->file_type; retval = ext4_setent(handle, &new, old.inode->i_ino, old.de->file_type); if (retval) goto end_rename; retval = ext4_setent(handle, &old, new.inode->i_ino, new_file_type); if (retval) goto end_rename; /* * Like most other Unix systems, set the ctime for inodes on a * rename. */ inode_set_ctime_current(old.inode); inode_set_ctime_current(new.inode); retval = ext4_mark_inode_dirty(handle, old.inode); if (unlikely(retval)) goto end_rename; retval = ext4_mark_inode_dirty(handle, new.inode); if (unlikely(retval)) goto end_rename; ext4_fc_mark_ineligible(new.inode->i_sb, EXT4_FC_REASON_CROSS_RENAME, handle); if (old.dir_bh) { retval = ext4_rename_dir_finish(handle, &old, new.dir->i_ino); if (retval) goto end_rename; } if (new.dir_bh) { retval = ext4_rename_dir_finish(handle, &new, old.dir->i_ino); if (retval) goto end_rename; } ext4_update_dir_count(handle, &old); ext4_update_dir_count(handle, &new); retval = 0; end_rename: brelse(old.dir_bh); brelse(new.dir_bh); brelse(old.bh); brelse(new.bh); if (handle) ext4_journal_stop(handle); return retval; } static int ext4_rename2(struct mnt_idmap *idmap, struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry, unsigned int flags) { int err; err = ext4_emergency_state(old_dir->i_sb); if (unlikely(err)) return err; if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT)) return -EINVAL; err = fscrypt_prepare_rename(old_dir, old_dentry, new_dir, new_dentry, flags); if (err) return err; if (flags & RENAME_EXCHANGE) { return ext4_cross_rename(old_dir, old_dentry, new_dir, new_dentry); } return ext4_rename(idmap, old_dir, old_dentry, new_dir, new_dentry, flags); } /* * directories can handle most operations... */ const struct inode_operations ext4_dir_inode_operations = { .create = ext4_create, .lookup = ext4_lookup, .link = ext4_link, .unlink = ext4_unlink, .symlink = ext4_symlink, .mkdir = ext4_mkdir, .rmdir = ext4_rmdir, .mknod = ext4_mknod, .tmpfile = ext4_tmpfile, .rename = ext4_rename2, .setattr = ext4_setattr, .getattr = ext4_getattr, .listxattr = ext4_listxattr, .get_inode_acl = ext4_get_acl, .set_acl = ext4_set_acl, .fiemap = ext4_fiemap, .fileattr_get = ext4_fileattr_get, .fileattr_set = ext4_fileattr_set, }; const struct inode_operations ext4_special_inode_operations = { .setattr = ext4_setattr, .getattr = ext4_getattr, .listxattr = ext4_listxattr, .get_inode_acl = ext4_get_acl, .set_acl = ext4_set_acl, }; |
| 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 | // SPDX-License-Identifier: GPL-2.0 /* * Functions related to setting various queue properties from drivers */ #include <linux/kernel.h> #include <linux/module.h> #include <linux/init.h> #include <linux/bio.h> #include <linux/blk-integrity.h> #include <linux/pagemap.h> #include <linux/backing-dev-defs.h> #include <linux/gcd.h> #include <linux/lcm.h> #include <linux/jiffies.h> #include <linux/gfp.h> #include <linux/dma-mapping.h> #include <linux/t10-pi.h> #include <linux/crc64.h> #include "blk.h" #include "blk-rq-qos.h" #include "blk-wbt.h" void blk_queue_rq_timeout(struct request_queue *q, unsigned int timeout) { WRITE_ONCE(q->rq_timeout, timeout); } EXPORT_SYMBOL_GPL(blk_queue_rq_timeout); /** * blk_set_stacking_limits - set default limits for stacking devices * @lim: the queue_limits structure to reset * * Prepare queue limits for applying limits from underlying devices using * blk_stack_limits(). */ void blk_set_stacking_limits(struct queue_limits *lim) { memset(lim, 0, sizeof(*lim)); lim->logical_block_size = SECTOR_SIZE; lim->physical_block_size = SECTOR_SIZE; lim->io_min = SECTOR_SIZE; lim->discard_granularity = SECTOR_SIZE; lim->dma_alignment = SECTOR_SIZE - 1; lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK; /* Inherit limits from component devices */ lim->max_segments = USHRT_MAX; lim->max_discard_segments = USHRT_MAX; lim->max_hw_sectors = UINT_MAX; lim->max_segment_size = UINT_MAX; lim->max_sectors = UINT_MAX; lim->max_dev_sectors = UINT_MAX; lim->max_write_zeroes_sectors = UINT_MAX; lim->max_hw_wzeroes_unmap_sectors = UINT_MAX; lim->max_user_wzeroes_unmap_sectors = UINT_MAX; lim->max_hw_zone_append_sectors = UINT_MAX; lim->max_user_discard_sectors = UINT_MAX; lim->atomic_write_hw_max = UINT_MAX; } EXPORT_SYMBOL(blk_set_stacking_limits); void blk_apply_bdi_limits(struct backing_dev_info *bdi, struct queue_limits *lim) { u64 io_opt = lim->io_opt; /* * For read-ahead of large files to be effective, we need to read ahead * at least twice the optimal I/O size. For rotational devices that do * not report an optimal I/O size (e.g. ATA HDDs), use the maximum I/O * size to avoid falling back to the (rather inefficient) small default * read-ahead size. * * There is no hardware limitation for the read-ahead size and the user * might have increased the read-ahead size through sysfs, so don't ever * decrease it. */ if (!io_opt && (lim->features & BLK_FEAT_ROTATIONAL)) io_opt = (u64)lim->max_sectors << SECTOR_SHIFT; bdi->ra_pages = max3(bdi->ra_pages, io_opt * 2 >> PAGE_SHIFT, VM_READAHEAD_PAGES); bdi->io_pages = lim->max_sectors >> PAGE_SECTORS_SHIFT; } static int blk_validate_zoned_limits(struct queue_limits *lim) { if (!(lim->features & BLK_FEAT_ZONED)) { if (WARN_ON_ONCE(lim->max_open_zones) || WARN_ON_ONCE(lim->max_active_zones) || WARN_ON_ONCE(lim->zone_write_granularity) || WARN_ON_ONCE(lim->max_zone_append_sectors)) return -EINVAL; return 0; } if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_BLK_DEV_ZONED))) return -EINVAL; /* * Given that active zones include open zones, the maximum number of * open zones cannot be larger than the maximum number of active zones. */ if (lim->max_active_zones && lim->max_open_zones > lim->max_active_zones) return -EINVAL; if (lim->zone_write_granularity < lim->logical_block_size) lim->zone_write_granularity = lim->logical_block_size; /* * The Zone Append size is limited by the maximum I/O size and the zone * size given that it can't span zones. * * If no max_hw_zone_append_sectors limit is provided, the block layer * will emulated it, else we're also bound by the hardware limit. */ lim->max_zone_append_sectors = min_not_zero(lim->max_hw_zone_append_sectors, min(lim->chunk_sectors, lim->max_hw_sectors)); return 0; } /* * Maximum size of I/O that needs a block layer integrity buffer. Limited * by the number of intervals for which we can fit the integrity buffer into * the buffer size. Because the buffer is a single segment it is also limited * by the maximum segment size. */ static inline unsigned int max_integrity_io_size(struct queue_limits *lim) { return min_t(unsigned int, lim->max_segment_size, (BLK_INTEGRITY_MAX_SIZE / lim->integrity.metadata_size) << lim->integrity.interval_exp); } static int blk_validate_integrity_limits(struct queue_limits *lim) { struct blk_integrity *bi = &lim->integrity; if (!bi->metadata_size) { if (bi->csum_type != BLK_INTEGRITY_CSUM_NONE || bi->tag_size || ((bi->flags & BLK_INTEGRITY_REF_TAG))) { pr_warn("invalid PI settings.\n"); return -EINVAL; } bi->flags |= BLK_INTEGRITY_NOGENERATE | BLK_INTEGRITY_NOVERIFY; return 0; } if (!IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY)) { pr_warn("integrity support disabled.\n"); return -EINVAL; } if (bi->csum_type == BLK_INTEGRITY_CSUM_NONE && (bi->flags & BLK_INTEGRITY_REF_TAG)) { pr_warn("ref tag not support without checksum.\n"); return -EINVAL; } if (bi->pi_offset + bi->pi_tuple_size > bi->metadata_size) { pr_warn("pi_offset (%u) + pi_tuple_size (%u) exceeds metadata_size (%u)\n", bi->pi_offset, bi->pi_tuple_size, bi->metadata_size); return -EINVAL; } switch (bi->csum_type) { case BLK_INTEGRITY_CSUM_NONE: if (bi->pi_tuple_size) { pr_warn("pi_tuple_size must be 0 when checksum type is none\n"); return -EINVAL; } break; case BLK_INTEGRITY_CSUM_CRC: case BLK_INTEGRITY_CSUM_IP: if (bi->pi_tuple_size != sizeof(struct t10_pi_tuple)) { pr_warn("pi_tuple_size mismatch for T10 PI: expected %zu, got %u\n", sizeof(struct t10_pi_tuple), bi->pi_tuple_size); return -EINVAL; } break; case BLK_INTEGRITY_CSUM_CRC64: if (bi->pi_tuple_size != sizeof(struct crc64_pi_tuple)) { pr_warn("pi_tuple_size mismatch for CRC64 PI: expected %zu, got %u\n", sizeof(struct crc64_pi_tuple), bi->pi_tuple_size); return -EINVAL; } break; } if (!bi->interval_exp) { bi->interval_exp = ilog2(lim->logical_block_size); } else if (bi->interval_exp < SECTOR_SHIFT || bi->interval_exp > ilog2(lim->logical_block_size)) { pr_warn("invalid interval_exp %u\n", bi->interval_exp); return -EINVAL; } /* * The PI generation / validation helpers do not expect intervals to * straddle multiple bio_vecs. Enforce alignment so that those are * never generated, and that each buffer is aligned as expected. */ if (bi->csum_type) { lim->dma_alignment = max(lim->dma_alignment, (1U << bi->interval_exp) - 1); } /* * The block layer automatically adds integrity data for bios that don't * already have it. Limit the I/O size so that a single maximum size * metadata segment can cover the integrity data for the entire I/O. */ lim->max_sectors = min(lim->max_sectors, max_integrity_io_size(lim) >> SECTOR_SHIFT); return 0; } /* * Returns max guaranteed bytes which we can fit in a bio. * * We request that an atomic_write is ITER_UBUF iov_iter (so a single vector), * so we assume that we can fit in at least PAGE_SIZE in a segment, apart from * the first and last segments. */ static unsigned int blk_queue_max_guaranteed_bio(struct queue_limits *lim) { unsigned int max_segments = min(BIO_MAX_VECS, lim->max_segments); unsigned int length; length = min(max_segments, 2) * lim->logical_block_size; if (max_segments > 2) length += (max_segments - 2) * PAGE_SIZE; return length; } static void blk_atomic_writes_update_limits(struct queue_limits *lim) { unsigned int unit_limit = min(lim->max_hw_sectors << SECTOR_SHIFT, blk_queue_max_guaranteed_bio(lim)); unit_limit = rounddown_pow_of_two(unit_limit); lim->atomic_write_max_sectors = min(lim->atomic_write_hw_max >> SECTOR_SHIFT, lim->max_hw_sectors); lim->atomic_write_unit_min = min(lim->atomic_write_hw_unit_min, unit_limit); lim->atomic_write_unit_max = min(lim->atomic_write_hw_unit_max, unit_limit); lim->atomic_write_boundary_sectors = lim->atomic_write_hw_boundary >> SECTOR_SHIFT; } /* * Test whether any boundary is aligned with any chunk size. Stacked * devices store any stripe size in t->chunk_sectors. */ static bool blk_valid_atomic_writes_boundary(unsigned int chunk_sectors, unsigned int boundary_sectors) { if (!chunk_sectors || !boundary_sectors) return true; if (boundary_sectors > chunk_sectors && boundary_sectors % chunk_sectors) return false; if (chunk_sectors > boundary_sectors && chunk_sectors % boundary_sectors) return false; return true; } static void blk_validate_atomic_write_limits(struct queue_limits *lim) { unsigned int boundary_sectors; unsigned int atomic_write_hw_max_sectors = lim->atomic_write_hw_max >> SECTOR_SHIFT; if (!(lim->features & BLK_FEAT_ATOMIC_WRITES)) goto unsupported; /* UINT_MAX indicates stacked limits in initial state */ if (lim->atomic_write_hw_max == UINT_MAX) goto unsupported; if (!lim->atomic_write_hw_max) goto unsupported; if (WARN_ON_ONCE(!is_power_of_2(lim->atomic_write_hw_unit_min))) goto unsupported; if (WARN_ON_ONCE(!is_power_of_2(lim->atomic_write_hw_unit_max))) goto unsupported; if (WARN_ON_ONCE(lim->atomic_write_hw_unit_min > lim->atomic_write_hw_unit_max)) goto unsupported; if (WARN_ON_ONCE(lim->atomic_write_hw_unit_max > lim->atomic_write_hw_max)) goto unsupported; if (WARN_ON_ONCE(lim->chunk_sectors && atomic_write_hw_max_sectors > lim->chunk_sectors)) goto unsupported; boundary_sectors = lim->atomic_write_hw_boundary >> SECTOR_SHIFT; if (boundary_sectors) { if (WARN_ON_ONCE(lim->atomic_write_hw_max > lim->atomic_write_hw_boundary)) goto unsupported; if (WARN_ON_ONCE(!blk_valid_atomic_writes_boundary( lim->chunk_sectors, boundary_sectors))) goto unsupported; /* * The boundary size just needs to be a multiple of unit_max * (and not necessarily a power-of-2), so this following check * could be relaxed in future. * Furthermore, if needed, unit_max could even be reduced so * that it is compliant with a !power-of-2 boundary. */ if (!is_power_of_2(boundary_sectors)) goto unsupported; } blk_atomic_writes_update_limits(lim); return; unsupported: lim->atomic_write_max_sectors = 0; lim->atomic_write_boundary_sectors = 0; lim->atomic_write_unit_min = 0; lim->atomic_write_unit_max = 0; } /* * Check that the limits in lim are valid, initialize defaults for unset * values, and cap values based on others where needed. */ int blk_validate_limits(struct queue_limits *lim) { unsigned int max_hw_sectors; unsigned int logical_block_sectors; unsigned long seg_size; int err; /* * Unless otherwise specified, default to 512 byte logical blocks and a * physical block size equal to the logical block size. */ if (!lim->logical_block_size) lim->logical_block_size = SECTOR_SIZE; else if (blk_validate_block_size(lim->logical_block_size)) { pr_warn("Invalid logical block size (%d)\n", lim->logical_block_size); return -EINVAL; } if (lim->physical_block_size < lim->logical_block_size) { lim->physical_block_size = lim->logical_block_size; } else if (!is_power_of_2(lim->physical_block_size)) { pr_warn("Invalid physical block size (%d)\n", lim->physical_block_size); return -EINVAL; } /* * The minimum I/O size defaults to the physical block size unless * explicitly overridden. */ if (lim->io_min < lim->physical_block_size) lim->io_min = lim->physical_block_size; /* * The optimal I/O size may not be aligned to physical block size * (because it may be limited by dma engines which have no clue about * block size of the disks attached to them), so we round it down here. */ lim->io_opt = round_down(lim->io_opt, lim->physical_block_size); /* * max_hw_sectors has a somewhat weird default for historical reason, * but driver really should set their own instead of relying on this * value. * * The block layer relies on the fact that every driver can * handle at lest a page worth of data per I/O, and needs the value * aligned to the logical block size. */ if (!lim->max_hw_sectors) lim->max_hw_sectors = BLK_SAFE_MAX_SECTORS; if (WARN_ON_ONCE(lim->max_hw_sectors < PAGE_SECTORS)) return -EINVAL; logical_block_sectors = lim->logical_block_size >> SECTOR_SHIFT; if (WARN_ON_ONCE(logical_block_sectors > lim->max_hw_sectors)) return -EINVAL; lim->max_hw_sectors = round_down(lim->max_hw_sectors, logical_block_sectors); /* * The actual max_sectors value is a complex beast and also takes the * max_dev_sectors value (set by SCSI ULPs) and a user configurable * value into account. The ->max_sectors value is always calculated * from these, so directly setting it won't have any effect. */ max_hw_sectors = min_not_zero(lim->max_hw_sectors, lim->max_dev_sectors); if (lim->max_user_sectors) { if (lim->max_user_sectors < BLK_MIN_SEGMENT_SIZE / SECTOR_SIZE) return -EINVAL; lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors); } else if (lim->io_opt > (BLK_DEF_MAX_SECTORS_CAP << SECTOR_SHIFT)) { lim->max_sectors = min(max_hw_sectors, lim->io_opt >> SECTOR_SHIFT); } else if (lim->io_min > (BLK_DEF_MAX_SECTORS_CAP << SECTOR_SHIFT)) { lim->max_sectors = min(max_hw_sectors, lim->io_min >> SECTOR_SHIFT); } else { lim->max_sectors = min(max_hw_sectors, BLK_DEF_MAX_SECTORS_CAP); } lim->max_sectors = round_down(lim->max_sectors, logical_block_sectors); /* * Random default for the maximum number of segments. Driver should not * rely on this and set their own. */ if (!lim->max_segments) lim->max_segments = BLK_MAX_SEGMENTS; if (lim->max_hw_wzeroes_unmap_sectors && lim->max_hw_wzeroes_unmap_sectors != lim->max_write_zeroes_sectors) return -EINVAL; lim->max_wzeroes_unmap_sectors = min(lim->max_hw_wzeroes_unmap_sectors, lim->max_user_wzeroes_unmap_sectors); lim->max_discard_sectors = min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors); /* * When discard is not supported, discard_granularity should be reported * as 0 to userspace. */ if (lim->max_discard_sectors) lim->discard_granularity = max(lim->discard_granularity, lim->physical_block_size); else lim->discard_granularity = 0; if (!lim->max_discard_segments) lim->max_discard_segments = 1; /* * By default there is no limit on the segment boundary alignment, * but if there is one it can't be smaller than the page size as * that would break all the normal I/O patterns. */ if (!lim->seg_boundary_mask) lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK; if (WARN_ON_ONCE(lim->seg_boundary_mask < BLK_MIN_SEGMENT_SIZE - 1)) return -EINVAL; /* * Stacking device may have both virtual boundary and max segment * size limit, so allow this setting now, and long-term the two * might need to move out of stacking limits since we have immutable * bvec and lower layer bio splitting is supposed to handle the two * correctly. */ if (lim->virt_boundary_mask) { if (!lim->max_segment_size) lim->max_segment_size = UINT_MAX; } else { /* * The maximum segment size has an odd historic 64k default that * drivers probably should override. Just like the I/O size we * require drivers to at least handle a full page per segment. */ if (!lim->max_segment_size) lim->max_segment_size = BLK_MAX_SEGMENT_SIZE; if (WARN_ON_ONCE(lim->max_segment_size < BLK_MIN_SEGMENT_SIZE)) return -EINVAL; } /* setup max segment size for building new segment in fast path */ if (lim->seg_boundary_mask > lim->max_segment_size - 1) seg_size = lim->max_segment_size; else seg_size = lim->seg_boundary_mask + 1; lim->max_fast_segment_size = min_t(unsigned int, seg_size, PAGE_SIZE); /* * We require drivers to at least do logical block aligned I/O, but * historically could not check for that due to the separate calls * to set the limits. Once the transition is finished the check * below should be narrowed down to check the logical block size. */ if (!lim->dma_alignment) lim->dma_alignment = SECTOR_SIZE - 1; if (WARN_ON_ONCE(lim->dma_alignment > PAGE_SIZE)) return -EINVAL; if (lim->alignment_offset) { lim->alignment_offset &= (lim->physical_block_size - 1); lim->flags &= ~BLK_FLAG_MISALIGNED; } if (!(lim->features & BLK_FEAT_WRITE_CACHE)) lim->features &= ~BLK_FEAT_FUA; blk_validate_atomic_write_limits(lim); err = blk_validate_integrity_limits(lim); if (err) return err; return blk_validate_zoned_limits(lim); } EXPORT_SYMBOL_GPL(blk_validate_limits); /* * Set the default limits for a newly allocated queue. @lim contains the * initial limits set by the driver, which could be no limit in which case * all fields are cleared to zero. */ int blk_set_default_limits(struct queue_limits *lim) { /* * Most defaults are set by capping the bounds in blk_validate_limits, * but these limits are special and need an explicit initialization to * the max value here. */ lim->max_user_discard_sectors = UINT_MAX; lim->max_user_wzeroes_unmap_sectors = UINT_MAX; return blk_validate_limits(lim); } /** * queue_limits_commit_update - commit an atomic update of queue limits * @q: queue to update * @lim: limits to apply * * Apply the limits in @lim that were obtained from queue_limits_start_update() * and updated by the caller to @q. The caller must have frozen the queue or * ensure that there are no outstanding I/Os by other means. * * Returns 0 if successful, else a negative error code. */ int queue_limits_commit_update(struct request_queue *q, struct queue_limits *lim) { int error; lockdep_assert_held(&q->limits_lock); error = blk_validate_limits(lim); if (error) goto out_unlock; #ifdef CONFIG_BLK_INLINE_ENCRYPTION if (q->crypto_profile && lim->integrity.tag_size) { pr_warn("blk-integrity: Integrity and hardware inline encryption are not supported together.\n"); error = -EINVAL; goto out_unlock; } #endif q->limits = *lim; if (q->disk) blk_apply_bdi_limits(q->disk->bdi, lim); out_unlock: mutex_unlock(&q->limits_lock); return error; } EXPORT_SYMBOL_GPL(queue_limits_commit_update); /** * queue_limits_commit_update_frozen - commit an atomic update of queue limits * @q: queue to update * @lim: limits to apply * * Apply the limits in @lim that were obtained from queue_limits_start_update() * and updated with the new values by the caller to @q. Freezes the queue * before the update and unfreezes it after. * * Returns 0 if successful, else a negative error code. */ int queue_limits_commit_update_frozen(struct request_queue *q, struct queue_limits *lim) { unsigned int memflags; int ret; memflags = blk_mq_freeze_queue(q); ret = queue_limits_commit_update(q, lim); blk_mq_unfreeze_queue(q, memflags); return ret; } EXPORT_SYMBOL_GPL(queue_limits_commit_update_frozen); /** * queue_limits_set - apply queue limits to queue * @q: queue to update * @lim: limits to apply * * Apply the limits in @lim that were freshly initialized to @q. * To update existing limits use queue_limits_start_update() and * queue_limits_commit_update() instead. * * Returns 0 if successful, else a negative error code. */ int queue_limits_set(struct request_queue *q, struct queue_limits *lim) { mutex_lock(&q->limits_lock); return queue_limits_commit_update(q, lim); } EXPORT_SYMBOL_GPL(queue_limits_set); static int queue_limit_alignment_offset(const struct queue_limits *lim, sector_t sector) { unsigned int granularity = max(lim->physical_block_size, lim->io_min); unsigned int alignment = sector_div(sector, granularity >> SECTOR_SHIFT) << SECTOR_SHIFT; return (granularity + lim->alignment_offset - alignment) % granularity; } static unsigned int queue_limit_discard_alignment( const struct queue_limits *lim, sector_t sector) { unsigned int alignment, granularity, offset; if (!lim->max_discard_sectors) return 0; /* Why are these in bytes, not sectors? */ alignment = lim->discard_alignment >> SECTOR_SHIFT; granularity = lim->discard_granularity >> SECTOR_SHIFT; /* Offset of the partition start in 'granularity' sectors */ offset = sector_div(sector, granularity); /* And why do we do this modulus *again* in blkdev_issue_discard()? */ offset = (granularity + alignment - offset) % granularity; /* Turn it back into bytes, gaah */ return offset << SECTOR_SHIFT; } static unsigned int blk_round_down_sectors(unsigned int sectors, unsigned int lbs) { sectors = round_down(sectors, lbs >> SECTOR_SHIFT); if (sectors < PAGE_SIZE >> SECTOR_SHIFT) sectors = PAGE_SIZE >> SECTOR_SHIFT; return sectors; } /* Check if second and later bottom devices are compliant */ static bool blk_stack_atomic_writes_tail(struct queue_limits *t, struct queue_limits *b) { /* We're not going to support different boundary sizes.. yet */ if (t->atomic_write_hw_boundary != b->atomic_write_hw_boundary) return false; /* Can't support this */ if (t->atomic_write_hw_unit_min > b->atomic_write_hw_unit_max) return false; /* Or this */ if (t->atomic_write_hw_unit_max < b->atomic_write_hw_unit_min) return false; t->atomic_write_hw_max = min(t->atomic_write_hw_max, b->atomic_write_hw_max); t->atomic_write_hw_unit_min = max(t->atomic_write_hw_unit_min, b->atomic_write_hw_unit_min); t->atomic_write_hw_unit_max = min(t->atomic_write_hw_unit_max, b->atomic_write_hw_unit_max); return true; } static void blk_stack_atomic_writes_chunk_sectors(struct queue_limits *t) { unsigned int chunk_bytes; if (!t->chunk_sectors) return; /* * If chunk sectors is so large that its value in bytes overflows * UINT_MAX, then just shift it down so it definitely will fit. * We don't support atomic writes of such a large size anyway. */ if (check_shl_overflow(t->chunk_sectors, SECTOR_SHIFT, &chunk_bytes)) chunk_bytes = t->chunk_sectors; /* * Find values for limits which work for chunk size. * b->atomic_write_hw_unit_{min, max} may not be aligned with chunk * size, as the chunk size is not restricted to a power-of-2. * So we need to find highest power-of-2 which works for the chunk * size. * As an example scenario, we could have t->unit_max = 16K and * t->chunk_sectors = 24KB. For this case, reduce t->unit_max to a * value aligned with both limits, i.e. 8K in this example. */ t->atomic_write_hw_unit_max = min(t->atomic_write_hw_unit_max, max_pow_of_two_factor(chunk_bytes)); t->atomic_write_hw_unit_min = min(t->atomic_write_hw_unit_min, t->atomic_write_hw_unit_max); t->atomic_write_hw_max = min(t->atomic_write_hw_max, chunk_bytes); } /* Check stacking of first bottom device */ static bool blk_stack_atomic_writes_head(struct queue_limits *t, struct queue_limits *b) { if (!blk_valid_atomic_writes_boundary(t->chunk_sectors, b->atomic_write_hw_boundary >> SECTOR_SHIFT)) return false; t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max; t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min; t->atomic_write_hw_max = b->atomic_write_hw_max; t->atomic_write_hw_boundary = b->atomic_write_hw_boundary; return true; } static void blk_stack_atomic_writes_limits(struct queue_limits *t, struct queue_limits *b, sector_t start) { if (!(b->features & BLK_FEAT_ATOMIC_WRITES)) goto unsupported; if (!b->atomic_write_hw_unit_min) goto unsupported; if (!blk_atomic_write_start_sect_aligned(start, b)) goto unsupported; /* UINT_MAX indicates no stacking of bottom devices yet */ if (t->atomic_write_hw_max == UINT_MAX) { if (!blk_stack_atomic_writes_head(t, b)) goto unsupported; } else { if (!blk_stack_atomic_writes_tail(t, b)) goto unsupported; } blk_stack_atomic_writes_chunk_sectors(t); return; unsupported: t->atomic_write_hw_max = 0; t->atomic_write_hw_unit_max = 0; t->atomic_write_hw_unit_min = 0; t->atomic_write_hw_boundary = 0; } /** * blk_stack_limits - adjust queue_limits for stacked devices * @t: the stacking driver limits (top device) * @b: the underlying queue limits (bottom, component device) * @start: first data sector within component device * * Description: * This function is used by stacking drivers like MD and DM to ensure * that all component devices have compatible block sizes and * alignments. The stacking driver must provide a queue_limits * struct (top) and then iteratively call the stacking function for * all component (bottom) devices. The stacking function will * attempt to combine the values and ensure proper alignment. * * Returns 0 if the top and bottom queue_limits are compatible. The * top device's block sizes and alignment offsets may be adjusted to * ensure alignment with the bottom device. If no compatible sizes * and alignments exist, -1 is returned and the resulting top * queue_limits will have the misaligned flag set to indicate that * the alignment_offset is undefined. */ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, sector_t start) { unsigned int top, bottom, alignment; int ret = 0; t->features |= (b->features & BLK_FEAT_INHERIT_MASK); /* * Some feaures need to be supported both by the stacking driver and all * underlying devices. The stacking driver sets these flags before * stacking the limits, and this will clear the flags if any of the * underlying devices does not support it. */ if (!(b->features & BLK_FEAT_NOWAIT)) t->features &= ~BLK_FEAT_NOWAIT; if (!(b->features & BLK_FEAT_POLL)) t->features &= ~BLK_FEAT_POLL; t->flags |= (b->flags & BLK_FLAG_MISALIGNED); t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors); t->max_user_sectors = min_not_zero(t->max_user_sectors, b->max_user_sectors); t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors); t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors); t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors, b->max_write_zeroes_sectors); t->max_user_wzeroes_unmap_sectors = min(t->max_user_wzeroes_unmap_sectors, b->max_user_wzeroes_unmap_sectors); t->max_hw_wzeroes_unmap_sectors = min(t->max_hw_wzeroes_unmap_sectors, b->max_hw_wzeroes_unmap_sectors); t->max_hw_zone_append_sectors = min(t->max_hw_zone_append_sectors, b->max_hw_zone_append_sectors); t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask, b->seg_boundary_mask); t->virt_boundary_mask = min_not_zero(t->virt_boundary_mask, b->virt_boundary_mask); t->max_segments = min_not_zero(t->max_segments, b->max_segments); t->max_discard_segments = min_not_zero(t->max_discard_segments, b->max_discard_segments); t->max_integrity_segments = min_not_zero(t->max_integrity_segments, b->max_integrity_segments); t->max_segment_size = min_not_zero(t->max_segment_size, b->max_segment_size); alignment = queue_limit_alignment_offset(b, start); /* Bottom device has different alignment. Check that it is * compatible with the current top alignment. */ if (t->alignment_offset != alignment) { top = max(t->physical_block_size, t->io_min) + t->alignment_offset; bottom = max(b->physical_block_size, b->io_min) + alignment; /* Verify that top and bottom intervals line up */ if (max(top, bottom) % min(top, bottom)) { t->flags |= BLK_FLAG_MISALIGNED; ret = -1; } } t->logical_block_size = max(t->logical_block_size, b->logical_block_size); t->physical_block_size = max(t->physical_block_size, b->physical_block_size); t->io_min = max(t->io_min, b->io_min); t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); t->dma_alignment = max(t->dma_alignment, b->dma_alignment); /* Set non-power-of-2 compatible chunk_sectors boundary */ if (b->chunk_sectors) t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors); /* Physical block size a multiple of the logical block size? */ if (t->physical_block_size & (t->logical_block_size - 1)) { t->physical_block_size = t->logical_block_size; t->flags |= BLK_FLAG_MISALIGNED; ret = -1; } /* Minimum I/O a multiple of the physical block size? */ if (t->io_min & (t->physical_block_size - 1)) { t->io_min = t->physical_block_size; t->flags |= BLK_FLAG_MISALIGNED; ret = -1; } /* Optimal I/O a multiple of the physical block size? */ if (t->io_opt & (t->physical_block_size - 1)) { t->io_opt = 0; t->flags |= BLK_FLAG_MISALIGNED; ret = -1; } /* chunk_sectors a multiple of the physical block size? */ if (t->chunk_sectors % (t->physical_block_size >> SECTOR_SHIFT)) { t->chunk_sectors = 0; t->flags |= BLK_FLAG_MISALIGNED; ret = -1; } /* Find lowest common alignment_offset */ t->alignment_offset = lcm_not_zero(t->alignment_offset, alignment) % max(t->physical_block_size, t->io_min); /* Verify that new alignment_offset is on a logical block boundary */ if (t->alignment_offset & (t->logical_block_size - 1)) { t->flags |= BLK_FLAG_MISALIGNED; ret = -1; } t->max_sectors = blk_round_down_sectors(t->max_sectors, t->logical_block_size); t->max_hw_sectors = blk_round_down_sectors(t->max_hw_sectors, t->logical_block_size); t->max_dev_sectors = blk_round_down_sectors(t->max_dev_sectors, t->logical_block_size); /* Discard alignment and granularity */ if (b->discard_granularity) { alignment = queue_limit_discard_alignment(b, start); t->max_discard_sectors = min_not_zero(t->max_discard_sectors, b->max_discard_sectors); t->max_hw_discard_sectors = min_not_zero(t->max_hw_discard_sectors, b->max_hw_discard_sectors); t->discard_granularity = max(t->discard_granularity, b->discard_granularity); t->discard_alignment = lcm_not_zero(t->discard_alignment, alignment) % t->discard_granularity; } t->max_secure_erase_sectors = min_not_zero(t->max_secure_erase_sectors, b->max_secure_erase_sectors); t->zone_write_granularity = max(t->zone_write_granularity, b->zone_write_granularity); if (!(t->features & BLK_FEAT_ZONED)) { t->zone_write_granularity = 0; t->max_zone_append_sectors = 0; } blk_stack_atomic_writes_limits(t, b, start); return ret; } EXPORT_SYMBOL(blk_stack_limits); /** * queue_limits_stack_bdev - adjust queue_limits for stacked devices * @t: the stacking driver limits (top device) * @bdev: the underlying block device (bottom) * @offset: offset to beginning of data within component device * @pfx: prefix to use for warnings logged * * Description: * This function is used by stacking drivers like MD and DM to ensure * that all component devices have compatible block sizes and * alignments. The stacking driver must provide a queue_limits * struct (top) and then iteratively call the stacking function for * all component (bottom) devices. The stacking function will * attempt to combine the values and ensure proper alignment. */ void queue_limits_stack_bdev(struct queue_limits *t, struct block_device *bdev, sector_t offset, const char *pfx) { if (blk_stack_limits(t, bdev_limits(bdev), get_start_sect(bdev) + offset)) pr_notice("%s: Warning: Device %pg is misaligned\n", pfx, bdev); } EXPORT_SYMBOL_GPL(queue_limits_stack_bdev); /** * queue_limits_stack_integrity - stack integrity profile * @t: target queue limits * @b: base queue limits * * Check if the integrity profile in the @b can be stacked into the * target @t. Stacking is possible if either: * * a) does not have any integrity information stacked into it yet * b) the integrity profile in @b is identical to the one in @t * * If @b can be stacked into @t, return %true. Else return %false and clear the * integrity information in @t. */ bool queue_limits_stack_integrity(struct queue_limits *t, struct queue_limits *b) { struct blk_integrity *ti = &t->integrity; struct blk_integrity *bi = &b->integrity; if (!IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY)) return true; if (ti->flags & BLK_INTEGRITY_STACKED) { if (ti->metadata_size != bi->metadata_size) goto incompatible; if (ti->interval_exp != bi->interval_exp) goto incompatible; if (ti->tag_size != bi->tag_size) goto incompatible; if (ti->csum_type != bi->csum_type) goto incompatible; if (ti->pi_tuple_size != bi->pi_tuple_size) goto incompatible; if ((ti->flags & BLK_INTEGRITY_REF_TAG) != (bi->flags & BLK_INTEGRITY_REF_TAG)) goto incompatible; } else { ti->flags = BLK_INTEGRITY_STACKED; ti->flags |= (bi->flags & BLK_INTEGRITY_DEVICE_CAPABLE) | (bi->flags & BLK_INTEGRITY_REF_TAG); ti->csum_type = bi->csum_type; ti->pi_tuple_size = bi->pi_tuple_size; ti->metadata_size = bi->metadata_size; ti->pi_offset = bi->pi_offset; ti->interval_exp = bi->interval_exp; ti->tag_size = bi->tag_size; } return true; incompatible: memset(ti, 0, sizeof(*ti)); return false; } EXPORT_SYMBOL_GPL(queue_limits_stack_integrity); /** * blk_set_queue_depth - tell the block layer about the device queue depth * @q: the request queue for the device * @depth: queue depth * */ void blk_set_queue_depth(struct request_queue *q, unsigned int depth) { q->queue_depth = depth; rq_qos_queue_depth_changed(q); } EXPORT_SYMBOL(blk_set_queue_depth); int bdev_alignment_offset(struct block_device *bdev) { struct request_queue *q = bdev_get_queue(bdev); if (q->limits.flags & BLK_FLAG_MISALIGNED) return -1; if (bdev_is_partition(bdev)) return queue_limit_alignment_offset(&q->limits, bdev->bd_start_sect); return q->limits.alignment_offset; } EXPORT_SYMBOL_GPL(bdev_alignment_offset); unsigned int bdev_discard_alignment(struct block_device *bdev) { struct request_queue *q = bdev_get_queue(bdev); if (bdev_is_partition(bdev)) return queue_limit_discard_alignment(&q->limits, bdev->bd_start_sect); return q->limits.discard_alignment; } EXPORT_SYMBOL_GPL(bdev_discard_alignment); |
| 9 2 4 1 13 4 11 9 8 6 5 5 4 2 2 3 5 54 27 42 49 9 15 56 25 127 38 115 103 9 133 29 25 120 121 136 129 82 114 185 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _NET_NF_TABLES_H #define _NET_NF_TABLES_H #include <linux/unaligned.h> #include <linux/list.h> #include <linux/netfilter.h> #include <linux/netfilter/nfnetlink.h> #include <linux/netfilter/x_tables.h> #include <linux/netfilter/nf_tables.h> #include <linux/u64_stats_sync.h> #include <linux/rhashtable.h> #include <net/netfilter/nf_flow_table.h> #include <net/netlink.h> #include <net/flow_offload.h> #include <net/netns/generic.h> #define NFT_MAX_HOOKS (NF_INET_INGRESS + 1) struct module; #define NFT_JUMP_STACK_SIZE 16 enum { NFT_PKTINFO_L4PROTO = (1 << 0), NFT_PKTINFO_INNER = (1 << 1), NFT_PKTINFO_INNER_FULL = (1 << 2), }; struct nft_pktinfo { struct sk_buff *skb; const struct nf_hook_state *state; u8 flags; u8 tprot; u16 fragoff; u16 thoff; u16 inneroff; }; static inline struct sock *nft_sk(const struct nft_pktinfo *pkt) { return pkt->state->sk; } static inline unsigned int nft_thoff(const struct nft_pktinfo *pkt) { return pkt->thoff; } static inline struct net *nft_net(const struct nft_pktinfo *pkt) { return pkt->state->net; } static inline unsigned int nft_hook(const struct nft_pktinfo *pkt) { return pkt->state->hook; } static inline u8 nft_pf(const struct nft_pktinfo *pkt) { return pkt->state->pf; } static inline const struct net_device *nft_in(const struct nft_pktinfo *pkt) { return pkt->state->in; } static inline const struct net_device *nft_out(const struct nft_pktinfo *pkt) { return pkt->state->out; } static inline void nft_set_pktinfo(struct nft_pktinfo *pkt, struct sk_buff *skb, const struct nf_hook_state *state) { pkt->skb = skb; pkt->state = state; } static inline void nft_set_pktinfo_unspec(struct nft_pktinfo *pkt) { pkt->flags = 0; pkt->tprot = 0; pkt->thoff = 0; pkt->fragoff = 0; } /** * struct nft_verdict - nf_tables verdict * * @code: nf_tables/netfilter verdict code * @chain: destination chain for NFT_JUMP/NFT_GOTO */ struct nft_verdict { u32 code; struct nft_chain *chain; }; struct nft_data { union { u32 data[4]; struct nft_verdict verdict; }; } __attribute__((aligned(__alignof__(u64)))); #define NFT_REG32_NUM 20 /** * struct nft_regs - nf_tables register set * * @data: data registers * @verdict: verdict register * * The first four data registers alias to the verdict register. */ struct nft_regs { union { u32 data[NFT_REG32_NUM]; struct nft_verdict verdict; }; }; struct nft_regs_track { struct { const struct nft_expr *selector; const struct nft_expr *bitwise; u8 num_reg; } regs[NFT_REG32_NUM]; const struct nft_expr *cur; const struct nft_expr *last; }; /* Store/load an u8, u16 or u64 integer to/from the u32 data register. * * Note, when using concatenations, register allocation happens at 32-bit * level. So for store instruction, pad the rest part with zero to avoid * garbage values. */ static inline void nft_reg_store8(u32 *dreg, u8 val) { *dreg = 0; *(u8 *)dreg = val; } static inline u8 nft_reg_load8(const u32 *sreg) { return *(u8 *)sreg; } static inline void nft_reg_store16(u32 *dreg, u16 val) { *dreg = 0; *(u16 *)dreg = val; } static inline void nft_reg_store_be16(u32 *dreg, __be16 val) { nft_reg_store16(dreg, (__force __u16)val); } static inline u16 nft_reg_load16(const u32 *sreg) { return *(u16 *)sreg; } static inline __be16 nft_reg_load_be16(const u32 *sreg) { return (__force __be16)nft_reg_load16(sreg); } static inline __be32 nft_reg_load_be32(const u32 *sreg) { return *(__force __be32 *)sreg; } static inline void nft_reg_store64(u64 *dreg, u64 val) { put_unaligned(val, dreg); } static inline u64 nft_reg_load64(const u32 *sreg) { return get_unaligned((u64 *)sreg); } static inline void nft_data_copy(u32 *dst, const struct nft_data *src, unsigned int len) { if (len % NFT_REG32_SIZE) dst[len / NFT_REG32_SIZE] = 0; memcpy(dst, src, len); } /** * struct nft_ctx - nf_tables rule/set context * * @net: net namespace * @table: the table the chain is contained in * @chain: the chain the rule is contained in * @nla: netlink attributes * @portid: netlink portID of the original message * @seq: netlink sequence number * @flags: modifiers to new request * @family: protocol family * @level: depth of the chains * @report: notify via unicast netlink message * @reg_inited: bitmap of initialised registers */ struct nft_ctx { struct net *net; struct nft_table *table; struct nft_chain *chain; const struct nlattr * const *nla; u32 portid; u32 seq; u16 flags; u8 family; u8 level; bool report; DECLARE_BITMAP(reg_inited, NFT_REG32_NUM); }; enum nft_data_desc_flags { NFT_DATA_DESC_SETELEM = (1 << 0), }; struct nft_data_desc { enum nft_data_types type; unsigned int size; unsigned int len; unsigned int flags; }; int nft_data_init(const struct nft_ctx *ctx, struct nft_data *data, struct nft_data_desc *desc, const struct nlattr *nla); void nft_data_hold(const struct nft_data *data, enum nft_data_types type); void nft_data_release(const struct nft_data *data, enum nft_data_types type); int nft_data_dump(struct sk_buff *skb, int attr, const struct nft_data *data, enum nft_data_types type, unsigned int len); static inline enum nft_data_types nft_dreg_to_type(enum nft_registers reg) { return reg == NFT_REG_VERDICT ? NFT_DATA_VERDICT : NFT_DATA_VALUE; } static inline enum nft_registers nft_type_to_reg(enum nft_data_types type) { return type == NFT_DATA_VERDICT ? NFT_REG_VERDICT : NFT_REG_1 * NFT_REG_SIZE / NFT_REG32_SIZE; } int nft_parse_u32_check(const struct nlattr *attr, int max, u32 *dest); int nft_dump_register(struct sk_buff *skb, unsigned int attr, unsigned int reg); int nft_parse_register_load(const struct nft_ctx *ctx, const struct nlattr *attr, u8 *sreg, u32 len); int nft_parse_register_store(const struct nft_ctx *ctx, const struct nlattr *attr, u8 *dreg, const struct nft_data *data, enum nft_data_types type, unsigned int len); /** * struct nft_userdata - user defined data associated with an object * * @len: length of the data * @data: content * * The presence of user data is indicated in an object specific fashion, * so a length of zero can't occur and the value "len" indicates data * of length len + 1. */ struct nft_userdata { u8 len; unsigned char data[]; }; /* placeholder structure for opaque set element backend representation. */ struct nft_elem_priv { }; /** * struct nft_set_elem - generic representation of set elements * * @key: element key * @key_end: closing element key * @data: element data * @priv: element private data and extensions */ struct nft_set_elem { union { u32 buf[NFT_DATA_VALUE_MAXLEN / sizeof(u32)]; struct nft_data val; } key; union { u32 buf[NFT_DATA_VALUE_MAXLEN / sizeof(u32)]; struct nft_data val; } key_end; union { u32 buf[NFT_DATA_VALUE_MAXLEN / sizeof(u32)]; struct nft_data val; } data; struct nft_elem_priv *priv; }; static inline void *nft_elem_priv_cast(const struct nft_elem_priv *priv) { return (void *)priv; } /** * enum nft_iter_type - nftables set iterator type * * @NFT_ITER_UNSPEC: unspecified, to catch errors * @NFT_ITER_READ: read-only iteration over set elements * @NFT_ITER_UPDATE: iteration under mutex to update set element state */ enum nft_iter_type { NFT_ITER_UNSPEC, NFT_ITER_READ, NFT_ITER_UPDATE, }; struct nft_set; struct nft_set_iter { u8 genmask; enum nft_iter_type type:8; unsigned int count; unsigned int skip; int err; int (*fn)(const struct nft_ctx *ctx, struct nft_set *set, const struct nft_set_iter *iter, struct nft_elem_priv *elem_priv); }; /** * struct nft_set_desc - description of set elements * * @ktype: key type * @klen: key length * @dtype: data type * @dlen: data length * @objtype: object type * @size: number of set elements * @policy: set policy * @gc_int: garbage collector interval * @timeout: element timeout * @field_len: length of each field in concatenation, bytes * @field_count: number of concatenated fields in element * @expr: set must support for expressions */ struct nft_set_desc { u32 ktype; unsigned int klen; u32 dtype; unsigned int dlen; u32 objtype; unsigned int size; u32 policy; u32 gc_int; u64 timeout; u8 field_len[NFT_REG32_COUNT]; u8 field_count; bool expr; }; /** * enum nft_set_class - performance class * * @NFT_SET_CLASS_O_1: constant, O(1) * @NFT_SET_CLASS_O_LOG_N: logarithmic, O(log N) * @NFT_SET_CLASS_O_N: linear, O(N) */ enum nft_set_class { NFT_SET_CLASS_O_1, NFT_SET_CLASS_O_LOG_N, NFT_SET_CLASS_O_N, }; /** * struct nft_set_estimate - estimation of memory and performance * characteristics * * @size: required memory * @lookup: lookup performance class * @space: memory class */ struct nft_set_estimate { u64 size; enum nft_set_class lookup; enum nft_set_class space; }; #define NFT_EXPR_MAXATTR 16 #define NFT_EXPR_SIZE(size) (sizeof(struct nft_expr) + \ ALIGN(size, __alignof__(struct nft_expr))) /** * struct nft_expr - nf_tables expression * * @ops: expression ops * @data: expression private data */ struct nft_expr { const struct nft_expr_ops *ops; unsigned char data[] __attribute__((aligned(__alignof__(u64)))); }; static inline void *nft_expr_priv(const struct nft_expr *expr) { return (void *)expr->data; } struct nft_expr_info; int nft_expr_inner_parse(const struct nft_ctx *ctx, const struct nlattr *nla, struct nft_expr_info *info); int nft_expr_clone(struct nft_expr *dst, struct nft_expr *src, gfp_t gfp); void nft_expr_destroy(const struct nft_ctx *ctx, struct nft_expr *expr); int nft_expr_dump(struct sk_buff *skb, unsigned int attr, const struct nft_expr *expr, bool reset); bool nft_expr_reduce_bitwise(struct nft_regs_track *track, const struct nft_expr *expr); struct nft_set_ext; /** * struct nft_set_ops - nf_tables set operations * * @lookup: look up an element within the set * @update: update an element if exists, add it if doesn't exist * @delete: delete an element * @insert: insert new element into set * @activate: activate new element in the next generation * @deactivate: lookup for element and deactivate it in the next generation * @flush: deactivate element in the next generation * @remove: remove element from set * @walk: iterate over all set elements * @get: get set elements * @ksize: kernel set size * @usize: userspace set size * @adjust_maxsize: delta to adjust maximum set size * @commit: commit set elements * @abort: abort set elements * @privsize: function to return size of set private data * @estimate: estimate the required memory size and the lookup complexity class * @init: initialize private data of new set instance * @destroy: destroy private data of set instance * @gc_init: initialize garbage collection * @elemsize: element private size * * Operations lookup, update and delete have simpler interfaces, are faster * and currently only used in the packet path. All the rest are slower, * control plane functions. */ struct nft_set_ops { const struct nft_set_ext * (*lookup)(const struct net *net, const struct nft_set *set, const u32 *key); const struct nft_set_ext * (*update)(struct nft_set *set, const u32 *key, const struct nft_expr *expr, struct nft_regs *regs); bool (*delete)(const struct nft_set *set, const u32 *key); int (*insert)(const struct net *net, const struct nft_set *set, const struct nft_set_elem *elem, struct nft_elem_priv **priv); void (*activate)(const struct net *net, const struct nft_set *set, struct nft_elem_priv *elem_priv); struct nft_elem_priv * (*deactivate)(const struct net *net, const struct nft_set *set, const struct nft_set_elem *elem); void (*flush)(const struct net *net, const struct nft_set *set, struct nft_elem_priv *priv); void (*remove)(const struct net *net, const struct nft_set *set, struct nft_elem_priv *elem_priv); void (*walk)(const struct nft_ctx *ctx, struct nft_set *set, struct nft_set_iter *iter); struct nft_elem_priv * (*get)(const struct net *net, const struct nft_set *set, const struct nft_set_elem *elem, unsigned int flags); u32 (*ksize)(u32 size); u32 (*usize)(u32 size); u32 (*adjust_maxsize)(const struct nft_set *set); void (*commit)(struct nft_set *set); void (*abort)(const struct nft_set *set); u64 (*privsize)(const struct nlattr * const nla[], const struct nft_set_desc *desc); bool (*estimate)(const struct nft_set_desc *desc, u32 features, struct nft_set_estimate *est); int (*init)(const struct nft_set *set, const struct nft_set_desc *desc, const struct nlattr * const nla[]); void (*destroy)(const struct nft_ctx *ctx, const struct nft_set *set); void (*gc_init)(const struct nft_set *set); unsigned int elemsize; }; /** * struct nft_set_type - nf_tables set type * * @ops: set ops for this type * @features: features supported by the implementation */ struct nft_set_type { const struct nft_set_ops ops; u32 features; }; #define to_set_type(o) container_of(o, struct nft_set_type, ops) struct nft_set_elem_expr { u8 size; unsigned char data[] __attribute__((aligned(__alignof__(struct nft_expr)))); }; #define nft_setelem_expr_at(__elem_expr, __offset) \ ((struct nft_expr *)&__elem_expr->data[__offset]) #define nft_setelem_expr_foreach(__expr, __elem_expr, __size) \ for (__expr = nft_setelem_expr_at(__elem_expr, 0), __size = 0; \ __size < (__elem_expr)->size; \ __size += (__expr)->ops->size, __expr = ((void *)(__expr)) + (__expr)->ops->size) #define NFT_SET_EXPR_MAX 2 /** * struct nft_set - nf_tables set instance * * @list: table set list node * @bindings: list of set bindings * @refs: internal refcounting for async set destruction * @table: table this set belongs to * @net: netnamespace this set belongs to * @name: name of the set * @handle: unique handle of the set * @ktype: key type (numeric type defined by userspace, not used in the kernel) * @dtype: data type (verdict or numeric type defined by userspace) * @objtype: object type (see NFT_OBJECT_* definitions) * @size: maximum set size * @field_len: length of each field in concatenation, bytes * @field_count: number of concatenated fields in element * @in_update_walk: true during ->walk() in transaction phase * @use: number of rules references to this set * @nelems: number of elements * @ndeact: number of deactivated elements queued for removal * @timeout: default timeout value in jiffies * @gc_int: garbage collection interval in msecs * @policy: set parameterization (see enum nft_set_policies) * @udlen: user data length * @udata: user data * @pending_update: list of pending update set element * @ops: set ops * @flags: set flags * @dead: set will be freed, never cleared * @genmask: generation mask * @klen: key length * @dlen: data length * @num_exprs: numbers of exprs * @exprs: stateful expression * @catchall_list: list of catch-all set element * @data: private set data */ struct nft_set { struct list_head list; struct list_head bindings; refcount_t refs; struct nft_table *table; possible_net_t net; char *name; u64 handle; u32 ktype; u32 dtype; u32 objtype; u32 size; u8 field_len[NFT_REG32_COUNT]; u8 field_count; bool in_update_walk; u32 use; atomic_t nelems; u32 ndeact; u64 timeout; u32 gc_int; u16 policy; u16 udlen; unsigned char *udata; struct list_head pending_update; /* runtime data below here */ const struct nft_set_ops *ops ____cacheline_aligned; u16 flags:13, dead:1, genmask:2; u8 klen; u8 dlen; u8 num_exprs; struct nft_expr *exprs[NFT_SET_EXPR_MAX]; struct list_head catchall_list; unsigned char data[] __attribute__((aligned(__alignof__(u64)))); }; static inline bool nft_set_is_anonymous(const struct nft_set *set) { return set->flags & NFT_SET_ANONYMOUS; } static inline void *nft_set_priv(const struct nft_set *set) { return (void *)set->data; } static inline enum nft_data_types nft_set_datatype(const struct nft_set *set) { return set->dtype == NFT_DATA_VERDICT ? NFT_DATA_VERDICT : NFT_DATA_VALUE; } static inline bool nft_set_gc_is_pending(const struct nft_set *s) { return refcount_read(&s->refs) != 1; } static inline struct nft_set *nft_set_container_of(const void *priv) { return (void *)priv - offsetof(struct nft_set, data); } struct nft_set *nft_set_lookup_global(const struct net *net, const struct nft_table *table, const struct nlattr *nla_set_name, const struct nlattr *nla_set_id, u8 genmask); struct nft_set_ext *nft_set_catchall_lookup(const struct net *net, const struct nft_set *set); static inline unsigned long nft_set_gc_interval(const struct nft_set *set) { u32 gc_int = READ_ONCE(set->gc_int); return gc_int ? msecs_to_jiffies(gc_int) : HZ; } /** * struct nft_set_binding - nf_tables set binding * * @list: set bindings list node * @chain: chain containing the rule bound to the set * @flags: set action flags * * A set binding contains all information necessary for validation * of new elements added to a bound set. */ struct nft_set_binding { struct list_head list; const struct nft_chain *chain; u32 flags; }; enum nft_trans_phase; void nf_tables_activate_set(const struct nft_ctx *ctx, struct nft_set *set); void nf_tables_deactivate_set(const struct nft_ctx *ctx, struct nft_set *set, struct nft_set_binding *binding, enum nft_trans_phase phase); int nf_tables_bind_set(const struct nft_ctx *ctx, struct nft_set *set, struct nft_set_binding *binding); void nf_tables_destroy_set(const struct nft_ctx *ctx, struct nft_set *set); /** * enum nft_set_extensions - set extension type IDs * * @NFT_SET_EXT_KEY: element key * @NFT_SET_EXT_KEY_END: upper bound element key, for ranges * @NFT_SET_EXT_DATA: mapping data * @NFT_SET_EXT_FLAGS: element flags * @NFT_SET_EXT_TIMEOUT: element timeout * @NFT_SET_EXT_USERDATA: user data associated with the element * @NFT_SET_EXT_EXPRESSIONS: expressions associated with the element * @NFT_SET_EXT_OBJREF: stateful object reference associated with element * @NFT_SET_EXT_NUM: number of extension types */ enum nft_set_extensions { NFT_SET_EXT_KEY, NFT_SET_EXT_KEY_END, NFT_SET_EXT_DATA, NFT_SET_EXT_FLAGS, NFT_SET_EXT_TIMEOUT, NFT_SET_EXT_USERDATA, NFT_SET_EXT_EXPRESSIONS, NFT_SET_EXT_OBJREF, NFT_SET_EXT_NUM }; /** * struct nft_set_ext_type - set extension type * * @len: fixed part length of the extension * @align: alignment requirements of the extension */ struct nft_set_ext_type { u8 len; u8 align; }; extern const struct nft_set_ext_type nft_set_ext_types[]; /** * struct nft_set_ext_tmpl - set extension template * * @len: length of extension area * @offset: offsets of individual extension types * @ext_len: length of the expected extension(used to sanity check) */ struct nft_set_ext_tmpl { u16 len; u8 offset[NFT_SET_EXT_NUM]; u8 ext_len[NFT_SET_EXT_NUM]; }; /** * struct nft_set_ext - set extensions * * @genmask: generation mask, but also flags (see NFT_SET_ELEM_DEAD_BIT) * @offset: offsets of individual extension types * @data: beginning of extension data * * This structure must be aligned to word size, otherwise atomic bitops * on genmask field can cause alignment failure on some archs. */ struct nft_set_ext { u8 genmask; u8 offset[NFT_SET_EXT_NUM]; char data[]; } __aligned(BITS_PER_LONG / 8); static inline void nft_set_ext_prepare(struct nft_set_ext_tmpl *tmpl) { memset(tmpl, 0, sizeof(*tmpl)); tmpl->len = sizeof(struct nft_set_ext); } static inline int nft_set_ext_add_length(struct nft_set_ext_tmpl *tmpl, u8 id, unsigned int len) { tmpl->len = ALIGN(tmpl->len, nft_set_ext_types[id].align); if (tmpl->len > U8_MAX) return -EINVAL; tmpl->offset[id] = tmpl->len; tmpl->ext_len[id] = nft_set_ext_types[id].len + len; tmpl->len += tmpl->ext_len[id]; return 0; } static inline int nft_set_ext_add(struct nft_set_ext_tmpl *tmpl, u8 id) { return nft_set_ext_add_length(tmpl, id, 0); } static inline void nft_set_ext_init(struct nft_set_ext *ext, const struct nft_set_ext_tmpl *tmpl) { memcpy(ext->offset, tmpl->offset, sizeof(ext->offset)); } static inline bool __nft_set_ext_exists(const struct nft_set_ext *ext, u8 id) { return !!ext->offset[id]; } static inline bool nft_set_ext_exists(const struct nft_set_ext *ext, u8 id) { return ext && __nft_set_ext_exists(ext, id); } static inline void *nft_set_ext(const struct nft_set_ext *ext, u8 id) { return (void *)ext + ext->offset[id]; } static inline struct nft_data *nft_set_ext_key(const struct nft_set_ext *ext) { return nft_set_ext(ext, NFT_SET_EXT_KEY); } static inline struct nft_data *nft_set_ext_key_end(const struct nft_set_ext *ext) { return nft_set_ext(ext, NFT_SET_EXT_KEY_END); } static inline struct nft_data *nft_set_ext_data(const struct nft_set_ext *ext) { return nft_set_ext(ext, NFT_SET_EXT_DATA); } static inline u8 *nft_set_ext_flags(const struct nft_set_ext *ext) { return nft_set_ext(ext, NFT_SET_EXT_FLAGS); } struct nft_timeout { u64 timeout; u64 expiration; }; static inline struct nft_timeout *nft_set_ext_timeout(const struct nft_set_ext *ext) { return nft_set_ext(ext, NFT_SET_EXT_TIMEOUT); } static inline struct nft_userdata *nft_set_ext_userdata(const struct nft_set_ext *ext) { return nft_set_ext(ext, NFT_SET_EXT_USERDATA); } static inline struct nft_set_elem_expr *nft_set_ext_expr(const struct nft_set_ext *ext) { return nft_set_ext(ext, NFT_SET_EXT_EXPRESSIONS); } static inline bool __nft_set_elem_expired(const struct nft_set_ext *ext, u64 tstamp) { if (!nft_set_ext_exists(ext, NFT_SET_EXT_TIMEOUT) || READ_ONCE(nft_set_ext_timeout(ext)->timeout) == 0) return false; return time_after_eq64(tstamp, READ_ONCE(nft_set_ext_timeout(ext)->expiration)); } static inline bool nft_set_elem_expired(const struct nft_set_ext *ext) { return __nft_set_elem_expired(ext, get_jiffies_64()); } static inline struct nft_set_ext *nft_set_elem_ext(const struct nft_set *set, const struct nft_elem_priv *elem_priv) { return (void *)elem_priv + set->ops->elemsize; } static inline struct nft_object **nft_set_ext_obj(const struct nft_set_ext *ext) { return nft_set_ext(ext, NFT_SET_EXT_OBJREF); } struct nft_expr *nft_set_elem_expr_alloc(const struct nft_ctx *ctx, const struct nft_set *set, const struct nlattr *attr); struct nft_elem_priv *nft_set_elem_init(const struct nft_set *set, const struct nft_set_ext_tmpl *tmpl, const u32 *key, const u32 *key_end, const u32 *data, u64 timeout, u64 expiration, gfp_t gfp); int nft_set_elem_expr_clone(const struct nft_ctx *ctx, struct nft_set *set, struct nft_expr *expr_array[]); void nft_set_elem_destroy(const struct nft_set *set, const struct nft_elem_priv *elem_priv, bool destroy_expr); void nf_tables_set_elem_destroy(const struct nft_ctx *ctx, const struct nft_set *set, const struct nft_elem_priv *elem_priv); struct nft_expr_ops; /** * struct nft_expr_type - nf_tables expression type * * @select_ops: function to select nft_expr_ops * @release_ops: release nft_expr_ops * @ops: default ops, used when no select_ops functions is present * @inner_ops: inner ops, used for inner packet operation * @list: used internally * @name: Identifier * @owner: module reference * @policy: netlink attribute policy * @maxattr: highest netlink attribute number * @family: address family for AF-specific types * @flags: expression type flags */ struct nft_expr_type { const struct nft_expr_ops *(*select_ops)(const struct nft_ctx *, const struct nlattr * const tb[]); void (*release_ops)(const struct nft_expr_ops *ops); const struct nft_expr_ops *ops; const struct nft_expr_ops *inner_ops; struct list_head list; const char *name; struct module *owner; const struct nla_policy *policy; unsigned int maxattr; u8 family; u8 flags; }; #define NFT_EXPR_STATEFUL 0x1 #define NFT_EXPR_GC 0x2 enum nft_trans_phase { NFT_TRANS_PREPARE, NFT_TRANS_PREPARE_ERROR, NFT_TRANS_ABORT, NFT_TRANS_COMMIT, NFT_TRANS_RELEASE }; struct nft_flow_rule; struct nft_offload_ctx; /** * struct nft_expr_ops - nf_tables expression operations * * @eval: Expression evaluation function * @clone: Expression clone function * @size: full expression size, including private data size * @init: initialization function * @activate: activate expression in the next generation * @deactivate: deactivate expression in next generation * @destroy: destruction function, called after synchronize_rcu * @destroy_clone: destruction clone function * @dump: function to dump parameters * @validate: validate expression, called during loop detection * @reduce: reduce expression * @gc: garbage collection expression * @offload: hardware offload expression * @offload_action: function to report true/false to allocate one slot or not in the flow * offload array * @offload_stats: function to synchronize hardware stats via updating the counter expression * @type: expression type * @data: extra data to attach to this expression operation */ struct nft_expr_ops { void (*eval)(const struct nft_expr *expr, struct nft_regs *regs, const struct nft_pktinfo *pkt); int (*clone)(struct nft_expr *dst, const struct nft_expr *src, gfp_t gfp); unsigned int size; int (*init)(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nlattr * const tb[]); void (*activate)(const struct nft_ctx *ctx, const struct nft_expr *expr); void (*deactivate)(const struct nft_ctx *ctx, const struct nft_expr *expr, enum nft_trans_phase phase); void (*destroy)(const struct nft_ctx *ctx, const struct nft_expr *expr); void (*destroy_clone)(const struct nft_ctx *ctx, const struct nft_expr *expr); int (*dump)(struct sk_buff *skb, const struct nft_expr *expr, bool reset); int (*validate)(const struct nft_ctx *ctx, const struct nft_expr *expr); bool (*reduce)(struct nft_regs_track *track, const struct nft_expr *expr); bool (*gc)(struct net *net, const struct nft_expr *expr); int (*offload)(struct nft_offload_ctx *ctx, struct nft_flow_rule *flow, const struct nft_expr *expr); bool (*offload_action)(const struct nft_expr *expr); void (*offload_stats)(struct nft_expr *expr, const struct flow_stats *stats); const struct nft_expr_type *type; void *data; }; /** * struct nft_rule - nf_tables rule * * @list: used internally * @handle: rule handle * @genmask: generation mask * @dlen: length of expression data * @udata: user data is appended to the rule * @data: expression data */ struct nft_rule { struct list_head list; u64 handle:42, genmask:2, dlen:12, udata:1; unsigned char data[] __attribute__((aligned(__alignof__(struct nft_expr)))); }; static inline struct nft_expr *nft_expr_first(const struct nft_rule *rule) { return (struct nft_expr *)&rule->data[0]; } static inline struct nft_expr *nft_expr_next(const struct nft_expr *expr) { return ((void *)expr) + expr->ops->size; } static inline struct nft_expr *nft_expr_last(const struct nft_rule *rule) { return (struct nft_expr *)&rule->data[rule->dlen]; } static inline bool nft_expr_more(const struct nft_rule *rule, const struct nft_expr *expr) { return expr != nft_expr_last(rule) && expr->ops; } static inline struct nft_userdata *nft_userdata(const struct nft_rule *rule) { return (void *)&rule->data[rule->dlen]; } void nft_rule_expr_activate(const struct nft_ctx *ctx, struct nft_rule *rule); void nft_rule_expr_deactivate(const struct nft_ctx *ctx, struct nft_rule *rule, enum nft_trans_phase phase); void nf_tables_rule_destroy(const struct nft_ctx *ctx, struct nft_rule *rule); static inline void nft_set_elem_update_expr(const struct nft_set_ext *ext, struct nft_regs *regs, const struct nft_pktinfo *pkt) { struct nft_set_elem_expr *elem_expr; struct nft_expr *expr; u32 size; if (__nft_set_ext_exists(ext, NFT_SET_EXT_EXPRESSIONS)) { elem_expr = nft_set_ext_expr(ext); nft_setelem_expr_foreach(expr, elem_expr, size) { expr->ops->eval(expr, regs, pkt); if (regs->verdict.code == NFT_BREAK) return; } } } /* * The last pointer isn't really necessary, but the compiler isn't able to * determine that the result of nft_expr_last() is always the same since it * can't assume that the dlen value wasn't changed within calls in the loop. */ #define nft_rule_for_each_expr(expr, last, rule) \ for ((expr) = nft_expr_first(rule), (last) = nft_expr_last(rule); \ (expr) != (last); \ (expr) = nft_expr_next(expr)) #define NFT_CHAIN_POLICY_UNSET U8_MAX struct nft_rule_dp { u64 is_last:1, dlen:12, handle:42; /* for tracing */ unsigned char data[] __attribute__((aligned(__alignof__(struct nft_expr)))); }; struct nft_rule_dp_last { struct nft_rule_dp end; /* end of nft_rule_blob marker */ struct rcu_head h; /* call_rcu head */ struct nft_rule_blob *blob; /* ptr to free via call_rcu */ const struct nft_chain *chain; /* for nftables tracing */ }; static inline const struct nft_rule_dp *nft_rule_next(const struct nft_rule_dp *rule) { return (void *)rule + sizeof(*rule) + rule->dlen; } struct nft_rule_blob { unsigned long size; unsigned char data[] __attribute__((aligned(__alignof__(struct nft_rule_dp)))); }; enum nft_chain_types { NFT_CHAIN_T_DEFAULT = 0, NFT_CHAIN_T_ROUTE, NFT_CHAIN_T_NAT, NFT_CHAIN_T_MAX }; /** * struct nft_chain_validate_state - validation state * * If a chain is encountered again during table validation it is * possible to avoid revalidation provided the calling context is * compatible. This structure stores relevant calling context of * previous validations. * * @hook_mask: the hook numbers and locations the chain is linked to * @depth: the deepest call chain level the chain is linked to */ struct nft_chain_validate_state { u8 hook_mask[NFT_CHAIN_T_MAX]; u8 depth; }; /** * struct nft_chain - nf_tables chain * * @blob_gen_0: rule blob pointer to the current generation * @blob_gen_1: rule blob pointer to the future generation * @rules: list of rules in the chain * @list: used internally * @rhlhead: used internally * @table: table that this chain belongs to * @handle: chain handle * @use: number of jump references to this chain * @flags: bitmask of enum NFTA_CHAIN_FLAGS * @bound: bind or not * @genmask: generation mask * @name: name of the chain * @udlen: user data length * @udata: user data in the chain * @blob_next: rule blob pointer to the next in the chain * @vstate: validation state */ struct nft_chain { struct nft_rule_blob __rcu *blob_gen_0; struct nft_rule_blob __rcu *blob_gen_1; struct list_head rules; struct list_head list; struct rhlist_head rhlhead; struct nft_table *table; u64 handle; u32 use; u8 flags:5, bound:1, genmask:2; char *name; u16 udlen; u8 *udata; /* Only used during control plane commit phase: */ struct nft_rule_blob *blob_next; struct nft_chain_validate_state vstate; }; int nft_chain_validate(const struct nft_ctx *ctx, struct nft_chain *chain); int nft_setelem_validate(const struct nft_ctx *ctx, struct nft_set *set, const struct nft_set_iter *iter, struct nft_elem_priv *elem_priv); int nft_set_catchall_validate(const struct nft_ctx *ctx, struct nft_set *set); int nf_tables_bind_chain(const struct nft_ctx *ctx, struct nft_chain *chain); void nf_tables_unbind_chain(const struct nft_ctx *ctx, struct nft_chain *chain); /** * struct nft_chain_type - nf_tables chain type info * * @name: name of the type * @type: numeric identifier * @family: address family * @owner: module owner * @hook_mask: mask of valid hooks * @hooks: array of hook functions * @ops_register: base chain register function * @ops_unregister: base chain unregister function */ struct nft_chain_type { const char *name; enum nft_chain_types type; int family; struct module *owner; unsigned int hook_mask; nf_hookfn *hooks[NFT_MAX_HOOKS]; int (*ops_register)(struct net *net, const struct nf_hook_ops *ops); void (*ops_unregister)(struct net *net, const struct nf_hook_ops *ops); }; int nft_chain_validate_dependency(const struct nft_chain *chain, enum nft_chain_types type); int nft_chain_validate_hooks(const struct nft_chain *chain, unsigned int hook_flags); static inline bool nft_chain_binding(const struct nft_chain *chain) { return chain->flags & NFT_CHAIN_BINDING; } static inline bool nft_chain_is_bound(struct nft_chain *chain) { return (chain->flags & NFT_CHAIN_BINDING) && chain->bound; } int nft_chain_add(struct nft_table *table, struct nft_chain *chain); void nft_chain_del(struct nft_chain *chain); void nf_tables_chain_destroy(struct nft_chain *chain); struct nft_stats { u64 bytes; u64 pkts; struct u64_stats_sync syncp; }; struct nft_hook { struct list_head list; struct list_head ops_list; struct rcu_head rcu; char ifname[IFNAMSIZ]; u8 ifnamelen; }; struct nf_hook_ops *nft_hook_find_ops(const struct nft_hook *hook, const struct net_device *dev); struct nf_hook_ops *nft_hook_find_ops_rcu(const struct nft_hook *hook, const struct net_device *dev); /** * struct nft_base_chain - nf_tables base chain * * @ops: netfilter hook ops * @hook_list: list of netfilter hooks (for NFPROTO_NETDEV family) * @type: chain type * @policy: default policy * @flags: indicate the base chain disabled or not * @stats: per-cpu chain stats * @chain: the chain * @flow_block: flow block (for hardware offload) */ struct nft_base_chain { struct nf_hook_ops ops; struct list_head hook_list; const struct nft_chain_type *type; u8 policy; u8 flags; struct nft_stats __percpu *stats; struct nft_chain chain; struct flow_block flow_block; }; static inline struct nft_base_chain *nft_base_chain(const struct nft_chain *chain) { return container_of(chain, struct nft_base_chain, chain); } static inline bool nft_is_base_chain(const struct nft_chain *chain) { return chain->flags & NFT_CHAIN_BASE; } unsigned int nft_do_chain(struct nft_pktinfo *pkt, void *priv); static inline bool nft_use_inc(u32 *use) { if (*use == UINT_MAX) return false; (*use)++; return true; } static inline void nft_use_dec(u32 *use) { WARN_ON_ONCE((*use)-- == 0); } /* For error and abort path: restore use counter to previous state. */ static inline void nft_use_inc_restore(u32 *use) { WARN_ON_ONCE(!nft_use_inc(use)); } #define nft_use_dec_restore nft_use_dec /** * struct nft_table - nf_tables table * * @list: used internally * @chains_ht: chains in the table * @chains: same, for stable walks * @sets: sets in the table * @objects: stateful objects in the table * @flowtables: flow tables in the table * @hgenerator: handle generator state * @handle: table handle * @use: number of chain references to this table * @family:address family * @flags: table flag (see enum nft_table_flags) * @genmask: generation mask * @nlpid: netlink port ID * @name: name of the table * @udlen: length of the user data * @udata: user data * @validate_state: internal, set when transaction adds jumps */ struct nft_table { struct list_head list; struct rhltable chains_ht; struct list_head chains; struct list_head sets; struct list_head objects; struct list_head flowtables; u64 hgenerator; u64 handle; u32 use; u16 family:6, flags:8, genmask:2; u32 nlpid; char *name; u16 udlen; u8 *udata; u8 validate_state; }; static inline bool nft_table_has_owner(const struct nft_table *table) { return table->flags & NFT_TABLE_F_OWNER; } static inline bool nft_table_is_orphan(const struct nft_table *table) { return (table->flags & (NFT_TABLE_F_OWNER | NFT_TABLE_F_PERSIST)) == NFT_TABLE_F_PERSIST; } static inline bool nft_base_chain_netdev(int family, u32 hooknum) { return family == NFPROTO_NETDEV || (family == NFPROTO_INET && hooknum == NF_INET_INGRESS); } void nft_register_chain_type(const struct nft_chain_type *); void nft_unregister_chain_type(const struct nft_chain_type *); int nft_register_expr(struct nft_expr_type *); void nft_unregister_expr(struct nft_expr_type *); int nft_verdict_dump(struct sk_buff *skb, int type, const struct nft_verdict *v); /** * struct nft_object_hash_key - key to lookup nft_object * * @name: name of the stateful object to look up * @table: table the object belongs to */ struct nft_object_hash_key { const char *name; const struct nft_table *table; }; /** * struct nft_object - nf_tables stateful object * * @list: table stateful object list node * @rhlhead: nft_objname_ht node * @key: keys that identify this object * @genmask: generation mask * @use: number of references to this stateful object * @handle: unique object handle * @udlen: length of user data * @udata: user data * @ops: object operations * @data: object data, layout depends on type */ struct nft_object { struct list_head list; struct rhlist_head rhlhead; struct nft_object_hash_key key; u32 genmask:2; u32 use; u64 handle; u16 udlen; u8 *udata; /* runtime data below here */ const struct nft_object_ops *ops ____cacheline_aligned; unsigned char data[] __attribute__((aligned(__alignof__(u64)))); }; static inline void *nft_obj_data(const struct nft_object *obj) { return (void *)obj->data; } #define nft_expr_obj(expr) *((struct nft_object **)nft_expr_priv(expr)) struct nft_object *nft_obj_lookup(const struct net *net, const struct nft_table *table, const struct nlattr *nla, u32 objtype, u8 genmask); void nft_obj_notify(struct net *net, const struct nft_table *table, struct nft_object *obj, u32 portid, u32 seq, int event, u16 flags, int family, int report, gfp_t gfp); /** * struct nft_object_type - stateful object type * * @select_ops: function to select nft_object_ops * @ops: default ops, used when no select_ops functions is present * @list: list node in list of object types * @type: stateful object numeric type * @owner: module owner * @maxattr: maximum netlink attribute * @family: address family for AF-specific object types * @policy: netlink attribute policy */ struct nft_object_type { const struct nft_object_ops *(*select_ops)(const struct nft_ctx *, const struct nlattr * const tb[]); const struct nft_object_ops *ops; struct list_head list; u32 type; unsigned int maxattr; u8 family; struct module *owner; const struct nla_policy *policy; }; /** * struct nft_object_ops - stateful object operations * * @eval: stateful object evaluation function * @size: stateful object size * @init: initialize object from netlink attributes * @destroy: release existing stateful object * @dump: netlink dump stateful object * @update: update stateful object * @type: pointer to object type */ struct nft_object_ops { void (*eval)(struct nft_object *obj, struct nft_regs *regs, const struct nft_pktinfo *pkt); unsigned int size; int (*init)(const struct nft_ctx *ctx, const struct nlattr *const tb[], struct nft_object *obj); void (*destroy)(const struct nft_ctx *ctx, struct nft_object *obj); int (*dump)(struct sk_buff *skb, struct nft_object *obj, bool reset); void (*update)(struct nft_object *obj, struct nft_object *newobj); const struct nft_object_type *type; }; int nft_register_obj(struct nft_object_type *obj_type); void nft_unregister_obj(struct nft_object_type *obj_type); #define NFT_NETDEVICE_MAX 256 /** * struct nft_flowtable - nf_tables flow table * * @list: flow table list node in table list * @table: the table the flow table is contained in * @name: name of this flow table * @hooknum: hook number * @ops_len: number of hooks in array * @genmask: generation mask * @use: number of references to this flow table * @handle: unique object handle * @hook_list: hook list for hooks per net_device in flowtables * @data: rhashtable and garbage collector */ struct nft_flowtable { struct list_head list; struct nft_table *table; char *name; int hooknum; int ops_len; u32 genmask:2; u32 use; u64 handle; /* runtime data below here */ struct list_head hook_list ____cacheline_aligned; struct nf_flowtable data; }; struct nft_flowtable *nft_flowtable_lookup(const struct net *net, const struct nft_table *table, const struct nlattr *nla, u8 genmask); void nf_tables_deactivate_flowtable(const struct nft_ctx *ctx, struct nft_flowtable *flowtable, enum nft_trans_phase phase); void nft_register_flowtable_type(struct nf_flowtable_type *type); void nft_unregister_flowtable_type(struct nf_flowtable_type *type); /** * struct nft_traceinfo - nft tracing information and state * * @trace: other struct members are initialised * @nf_trace: copy of skb->nf_trace before rule evaluation * @type: event type (enum nft_trace_types) * @skbid: hash of skb to be used as trace id * @packet_dumped: packet headers sent in a previous traceinfo message * @basechain: base chain currently processed */ struct nft_traceinfo { bool trace; bool nf_trace; bool packet_dumped; enum nft_trace_types type:8; u32 skbid; const struct nft_base_chain *basechain; }; void nft_trace_init(struct nft_traceinfo *info, const struct nft_pktinfo *pkt, const struct nft_chain *basechain); void nft_trace_notify(const struct nft_pktinfo *pkt, const struct nft_verdict *verdict, const struct nft_rule_dp *rule, struct nft_traceinfo *info); #define MODULE_ALIAS_NFT_CHAIN(family, name) \ MODULE_ALIAS("nft-chain-" __stringify(family) "-" name) #define MODULE_ALIAS_NFT_AF_EXPR(family, name) \ MODULE_ALIAS("nft-expr-" __stringify(family) "-" name) #define MODULE_ALIAS_NFT_EXPR(name) \ MODULE_ALIAS("nft-expr-" name) #define MODULE_ALIAS_NFT_OBJ(type) \ MODULE_ALIAS("nft-obj-" __stringify(type)) #if IS_ENABLED(CONFIG_NF_TABLES) /* * The gencursor defines two generations, the currently active and the * next one. Objects contain a bitmask of 2 bits specifying the generations * they're active in. A set bit means they're inactive in the generation * represented by that bit. * * New objects start out as inactive in the current and active in the * next generation. When committing the ruleset the bitmask is cleared, * meaning they're active in all generations. When removing an object, * it is set inactive in the next generation. After committing the ruleset, * the objects are removed. */ static inline unsigned int nft_gencursor_next(const struct net *net) { return net->nft.gencursor + 1 == 1 ? 1 : 0; } static inline u8 nft_genmask_next(const struct net *net) { return 1 << nft_gencursor_next(net); } static inline u8 nft_genmask_cur(const struct net *net) { /* Use READ_ONCE() to prevent refetching the value for atomicity */ return 1 << READ_ONCE(net->nft.gencursor); } #define NFT_GENMASK_ANY ((1 << 0) | (1 << 1)) /* * Generic transaction helpers */ /* Check if this object is currently active. */ #define nft_is_active(__net, __obj) \ (((__obj)->genmask & nft_genmask_cur(__net)) == 0) /* Check if this object is active in the next generation. */ #define nft_is_active_next(__net, __obj) \ (((__obj)->genmask & nft_genmask_next(__net)) == 0) /* This object becomes active in the next generation. */ #define nft_activate_next(__net, __obj) \ (__obj)->genmask = nft_genmask_cur(__net) /* This object becomes inactive in the next generation. */ #define nft_deactivate_next(__net, __obj) \ (__obj)->genmask = nft_genmask_next(__net) /* After committing the ruleset, clear the stale generation bit. */ #define nft_clear(__net, __obj) \ (__obj)->genmask &= ~nft_genmask_next(__net) #define nft_active_genmask(__obj, __genmask) \ !((__obj)->genmask & __genmask) /* * Set element transaction helpers */ static inline bool nft_set_elem_active(const struct nft_set_ext *ext, u8 genmask) { return !(ext->genmask & genmask); } static inline void nft_set_elem_change_active(const struct net *net, const struct nft_set *set, struct nft_set_ext *ext) { ext->genmask ^= nft_genmask_next(net); } #endif /* IS_ENABLED(CONFIG_NF_TABLES) */ #define NFT_SET_ELEM_DEAD_MASK (1 << 2) #if defined(__LITTLE_ENDIAN_BITFIELD) #define NFT_SET_ELEM_DEAD_BIT 2 #elif defined(__BIG_ENDIAN_BITFIELD) #define NFT_SET_ELEM_DEAD_BIT (BITS_PER_LONG - BITS_PER_BYTE + 2) #else #error #endif static inline void nft_set_elem_dead(struct nft_set_ext *ext) { unsigned long *word = (unsigned long *)ext; BUILD_BUG_ON(offsetof(struct nft_set_ext, genmask) != 0); set_bit(NFT_SET_ELEM_DEAD_BIT, word); } static inline int nft_set_elem_is_dead(const struct nft_set_ext *ext) { unsigned long *word = (unsigned long *)ext; BUILD_BUG_ON(offsetof(struct nft_set_ext, genmask) != 0); return test_bit(NFT_SET_ELEM_DEAD_BIT, word); } /** * struct nft_trans - nf_tables object update in transaction * * @list: used internally * @net: struct net * @table: struct nft_table the object resides in * @msg_type: message type * @seq: netlink sequence number * @flags: modifiers to new request * @report: notify via unicast netlink message * @put_net: net needs to be put * * This is the information common to all objects in the transaction, * this must always be the first member of derived sub-types. */ struct nft_trans { struct list_head list; struct net *net; struct nft_table *table; int msg_type; u32 seq; u16 flags; u8 report:1; u8 put_net:1; }; /** * struct nft_trans_binding - nf_tables object with binding support in transaction * @nft_trans: base structure, MUST be first member * @binding_list: list of objects with possible bindings * * This is the base type used by objects that can be bound to a chain. */ struct nft_trans_binding { struct nft_trans nft_trans; struct list_head binding_list; }; struct nft_trans_rule { struct nft_trans nft_trans; struct nft_rule *rule; struct nft_chain *chain; struct nft_flow_rule *flow; u32 rule_id; bool bound; }; #define nft_trans_container_rule(trans) \ container_of(trans, struct nft_trans_rule, nft_trans) #define nft_trans_rule(trans) \ nft_trans_container_rule(trans)->rule #define nft_trans_flow_rule(trans) \ nft_trans_container_rule(trans)->flow #define nft_trans_rule_id(trans) \ nft_trans_container_rule(trans)->rule_id #define nft_trans_rule_bound(trans) \ nft_trans_container_rule(trans)->bound #define nft_trans_rule_chain(trans) \ nft_trans_container_rule(trans)->chain struct nft_trans_set { struct nft_trans_binding nft_trans_binding; struct list_head list_trans_newset; struct nft_set *set; u32 set_id; u32 gc_int; u64 timeout; bool update; bool bound; u32 size; }; #define nft_trans_container_set(t) \ container_of(t, struct nft_trans_set, nft_trans_binding.nft_trans) #define nft_trans_set(trans) \ nft_trans_container_set(trans)->set #define nft_trans_set_id(trans) \ nft_trans_container_set(trans)->set_id #define nft_trans_set_bound(trans) \ nft_trans_container_set(trans)->bound #define nft_trans_set_update(trans) \ nft_trans_container_set(trans)->update #define nft_trans_set_timeout(trans) \ nft_trans_container_set(trans)->timeout #define nft_trans_set_gc_int(trans) \ nft_trans_container_set(trans)->gc_int #define nft_trans_set_size(trans) \ nft_trans_container_set(trans)->size struct nft_trans_chain { struct nft_trans_binding nft_trans_binding; struct nft_chain *chain; char *name; struct nft_stats __percpu *stats; u8 policy; bool update; bool bound; u32 chain_id; struct nft_base_chain *basechain; struct list_head hook_list; }; #define nft_trans_container_chain(t) \ container_of(t, struct nft_trans_chain, nft_trans_binding.nft_trans) #define nft_trans_chain(trans) \ nft_trans_container_chain(trans)->chain #define nft_trans_chain_update(trans) \ nft_trans_container_chain(trans)->update #define nft_trans_chain_name(trans) \ nft_trans_container_chain(trans)->name #define nft_trans_chain_stats(trans) \ nft_trans_container_chain(trans)->stats #define nft_trans_chain_policy(trans) \ nft_trans_container_chain(trans)->policy #define nft_trans_chain_bound(trans) \ nft_trans_container_chain(trans)->bound #define nft_trans_chain_id(trans) \ nft_trans_container_chain(trans)->chain_id #define nft_trans_basechain(trans) \ nft_trans_container_chain(trans)->basechain #define nft_trans_chain_hooks(trans) \ nft_trans_container_chain(trans)->hook_list struct nft_trans_table { struct nft_trans nft_trans; bool update; }; #define nft_trans_container_table(trans) \ container_of(trans, struct nft_trans_table, nft_trans) #define nft_trans_table_update(trans) \ nft_trans_container_table(trans)->update enum nft_trans_elem_flags { NFT_TRANS_UPD_TIMEOUT = (1 << 0), NFT_TRANS_UPD_EXPIRATION = (1 << 1), }; struct nft_elem_update { u64 timeout; u64 expiration; u8 flags; }; struct nft_trans_one_elem { struct nft_elem_priv *priv; struct nft_elem_update *update; }; struct nft_trans_elem { struct nft_trans nft_trans; struct nft_set *set; bool bound; unsigned int nelems; struct nft_trans_one_elem elems[] __counted_by(nelems); }; #define nft_trans_container_elem(t) \ container_of(t, struct nft_trans_elem, nft_trans) #define nft_trans_elem_set(trans) \ nft_trans_container_elem(trans)->set #define nft_trans_elem_set_bound(trans) \ nft_trans_container_elem(trans)->bound struct nft_trans_obj { struct nft_trans nft_trans; struct nft_object *obj; struct nft_object *newobj; bool update; }; #define nft_trans_container_obj(t) \ container_of(t, struct nft_trans_obj, nft_trans) #define nft_trans_obj(trans) \ nft_trans_container_obj(trans)->obj #define nft_trans_obj_newobj(trans) \ nft_trans_container_obj(trans)->newobj #define nft_trans_obj_update(trans) \ nft_trans_container_obj(trans)->update struct nft_trans_flowtable { struct nft_trans nft_trans; struct nft_flowtable *flowtable; struct list_head hook_list; u32 flags; bool update; }; #define nft_trans_container_flowtable(t) \ container_of(t, struct nft_trans_flowtable, nft_trans) #define nft_trans_flowtable(trans) \ nft_trans_container_flowtable(trans)->flowtable #define nft_trans_flowtable_update(trans) \ nft_trans_container_flowtable(trans)->update #define nft_trans_flowtable_hooks(trans) \ nft_trans_container_flowtable(trans)->hook_list #define nft_trans_flowtable_flags(trans) \ nft_trans_container_flowtable(trans)->flags #define NFT_TRANS_GC_BATCHCOUNT 256 struct nft_trans_gc { struct list_head list; struct net *net; struct nft_set *set; u32 seq; u16 count; struct nft_elem_priv *priv[NFT_TRANS_GC_BATCHCOUNT]; struct rcu_head rcu; }; static inline void nft_ctx_update(struct nft_ctx *ctx, const struct nft_trans *trans) { switch (trans->msg_type) { case NFT_MSG_NEWRULE: case NFT_MSG_DELRULE: case NFT_MSG_DESTROYRULE: ctx->chain = nft_trans_rule_chain(trans); break; case NFT_MSG_NEWCHAIN: case NFT_MSG_DELCHAIN: case NFT_MSG_DESTROYCHAIN: ctx->chain = nft_trans_chain(trans); break; default: ctx->chain = NULL; break; } ctx->net = trans->net; ctx->table = trans->table; ctx->family = trans->table->family; ctx->report = trans->report; ctx->flags = trans->flags; ctx->seq = trans->seq; } struct nft_trans_gc *nft_trans_gc_alloc(struct nft_set *set, unsigned int gc_seq, gfp_t gfp); void nft_trans_gc_destroy(struct nft_trans_gc *trans); struct nft_trans_gc *nft_trans_gc_queue_async(struct nft_trans_gc *gc, unsigned int gc_seq, gfp_t gfp); void nft_trans_gc_queue_async_done(struct nft_trans_gc *gc); struct nft_trans_gc *nft_trans_gc_queue_sync(struct nft_trans_gc *gc, gfp_t gfp); void nft_trans_gc_queue_sync_done(struct nft_trans_gc *trans); void nft_trans_gc_elem_add(struct nft_trans_gc *gc, void *priv); struct nft_trans_gc *nft_trans_gc_catchall_async(struct nft_trans_gc *gc, unsigned int gc_seq); struct nft_trans_gc *nft_trans_gc_catchall_sync(struct nft_trans_gc *gc); void nft_setelem_data_deactivate(const struct net *net, const struct nft_set *set, struct nft_elem_priv *elem_priv); int __init nft_chain_filter_init(void); void nft_chain_filter_fini(void); void __init nft_chain_route_init(void); void nft_chain_route_fini(void); void nf_tables_trans_destroy_flush_work(struct net *net); int nf_msecs_to_jiffies64(const struct nlattr *nla, u64 *result); __be64 nf_jiffies64_to_msecs(u64 input); #ifdef CONFIG_MODULES __printf(2, 3) int nft_request_module(struct net *net, const char *fmt, ...); #else static inline int nft_request_module(struct net *net, const char *fmt, ...) { return -ENOENT; } #endif struct nftables_pernet { struct list_head tables; struct list_head commit_list; struct list_head destroy_list; struct list_head commit_set_list; struct list_head binding_list; struct list_head module_list; struct list_head notify_list; struct mutex commit_mutex; u64 table_handle; u64 tstamp; unsigned int gc_seq; u8 validate_state; struct work_struct destroy_work; }; extern unsigned int nf_tables_net_id; static inline struct nftables_pernet *nft_pernet(const struct net *net) { return net_generic(net, nf_tables_net_id); } static inline u64 nft_net_tstamp(const struct net *net) { return nft_pernet(net)->tstamp; } #define __NFT_REDUCE_READONLY 1UL #define NFT_REDUCE_READONLY (void *)__NFT_REDUCE_READONLY void nft_reg_track_update(struct nft_regs_track *track, const struct nft_expr *expr, u8 dreg, u8 len); void nft_reg_track_cancel(struct nft_regs_track *track, u8 dreg, u8 len); void __nft_reg_track_cancel(struct nft_regs_track *track, u8 dreg); static inline bool nft_reg_track_cmp(struct nft_regs_track *track, const struct nft_expr *expr, u8 dreg) { return track->regs[dreg].selector && track->regs[dreg].selector->ops == expr->ops && track->regs[dreg].num_reg == 0; } #endif /* _NET_NF_TABLES_H */ |
| 23 23 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 | /* SPDX-License-Identifier: GPL-2.0-or-later */ #ifndef __SOUND_INFO_H #define __SOUND_INFO_H /* * Header file for info interface * Copyright (c) by Jaroslav Kysela <perex@perex.cz> */ #include <linux/poll.h> #include <linux/seq_file.h> #include <sound/core.h> /* buffer for information */ struct snd_info_buffer { char *buffer; /* pointer to begin of buffer */ unsigned int curr; /* current position in buffer */ unsigned int size; /* current size */ unsigned int len; /* total length of buffer */ int stop; /* stop flag */ int error; /* error code */ }; #define SNDRV_INFO_CONTENT_TEXT 0 #define SNDRV_INFO_CONTENT_DATA 1 struct snd_info_entry; struct snd_info_entry_text { void (*read)(struct snd_info_entry *entry, struct snd_info_buffer *buffer); void (*write)(struct snd_info_entry *entry, struct snd_info_buffer *buffer); }; struct snd_info_entry_ops { int (*open)(struct snd_info_entry *entry, unsigned short mode, void **file_private_data); int (*release)(struct snd_info_entry *entry, unsigned short mode, void *file_private_data); ssize_t (*read)(struct snd_info_entry *entry, void *file_private_data, struct file *file, char __user *buf, size_t count, loff_t pos); ssize_t (*write)(struct snd_info_entry *entry, void *file_private_data, struct file *file, const char __user *buf, size_t count, loff_t pos); loff_t (*llseek)(struct snd_info_entry *entry, void *file_private_data, struct file *file, loff_t offset, int orig); __poll_t (*poll)(struct snd_info_entry *entry, void *file_private_data, struct file *file, poll_table *wait); int (*ioctl)(struct snd_info_entry *entry, void *file_private_data, struct file *file, unsigned int cmd, unsigned long arg); int (*mmap)(struct snd_info_entry *entry, void *file_private_data, struct inode *inode, struct file *file, struct vm_area_struct *vma); }; struct snd_info_entry { const char *name; umode_t mode; long size; unsigned short content; union { struct snd_info_entry_text text; const struct snd_info_entry_ops *ops; } c; struct snd_info_entry *parent; struct module *module; void *private_data; void (*private_free)(struct snd_info_entry *entry); struct proc_dir_entry *p; struct mutex access; struct list_head children; struct list_head list; }; #if defined(CONFIG_SND_OSSEMUL) && defined(CONFIG_SND_PROC_FS) int snd_info_minor_register(void); #else #define snd_info_minor_register() 0 #endif #ifdef CONFIG_SND_PROC_FS extern struct snd_info_entry *snd_seq_root; #ifdef CONFIG_SND_OSSEMUL extern struct snd_info_entry *snd_oss_root; void snd_card_info_read_oss(struct snd_info_buffer *buffer); #else #define snd_oss_root NULL static inline void snd_card_info_read_oss(struct snd_info_buffer *buffer) {} #endif /** * snd_iprintf - printf on the procfs buffer * @buf: the procfs buffer * @fmt: the printf format * * Outputs the string on the procfs buffer just like printf(). * * Return: zero for success, or a negative error code. */ #define snd_iprintf(buf, fmt, args...) \ seq_printf((struct seq_file *)(buf)->buffer, fmt, ##args) int snd_info_init(void); int snd_info_done(void); int snd_info_get_line(struct snd_info_buffer *buffer, char *line, int len); const char *snd_info_get_str(char *dest, const char *src, int len); struct snd_info_entry *snd_info_create_module_entry(struct module *module, const char *name, struct snd_info_entry *parent); struct snd_info_entry *snd_info_create_card_entry(struct snd_card *card, const char *name, struct snd_info_entry *parent); void snd_info_free_entry(struct snd_info_entry *entry); int snd_info_card_create(struct snd_card *card); int snd_info_card_register(struct snd_card *card); int snd_info_card_free(struct snd_card *card); void snd_info_card_disconnect(struct snd_card *card); void snd_info_card_id_change(struct snd_card *card); int snd_info_register(struct snd_info_entry *entry); /* for card drivers */ static inline int snd_card_proc_new(struct snd_card *card, const char *name, struct snd_info_entry **entryp) { *entryp = snd_info_create_card_entry(card, name, card->proc_root); return *entryp ? 0 : -ENOMEM; } static inline void snd_info_set_text_ops(struct snd_info_entry *entry, void *private_data, void (*read)(struct snd_info_entry *, struct snd_info_buffer *)) { entry->private_data = private_data; entry->c.text.read = read; } int snd_card_rw_proc_new(struct snd_card *card, const char *name, void *private_data, void (*read)(struct snd_info_entry *, struct snd_info_buffer *), void (*write)(struct snd_info_entry *entry, struct snd_info_buffer *buffer)); int snd_info_check_reserved_words(const char *str); #else #define snd_seq_root NULL #define snd_oss_root NULL static inline int snd_iprintf(struct snd_info_buffer *buffer, char *fmt, ...) { return 0; } static inline int snd_info_init(void) { return 0; } static inline int snd_info_done(void) { return 0; } static inline int snd_info_get_line(struct snd_info_buffer *buffer, char *line, int len) { return 0; } static inline char *snd_info_get_str(char *dest, char *src, int len) { return NULL; } static inline struct snd_info_entry *snd_info_create_module_entry(struct module *module, const char *name, struct snd_info_entry *parent) { return NULL; } static inline struct snd_info_entry *snd_info_create_card_entry(struct snd_card *card, const char *name, struct snd_info_entry *parent) { return NULL; } static inline void snd_info_free_entry(struct snd_info_entry *entry) { ; } static inline int snd_info_card_create(struct snd_card *card) { return 0; } static inline int snd_info_card_register(struct snd_card *card) { return 0; } static inline int snd_info_card_free(struct snd_card *card) { return 0; } static inline void snd_info_card_disconnect(struct snd_card *card) { } static inline void snd_info_card_id_change(struct snd_card *card) { } static inline int snd_info_register(struct snd_info_entry *entry) { return 0; } static inline int snd_card_proc_new(struct snd_card *card, const char *name, struct snd_info_entry **entryp) { return -EINVAL; } static inline void snd_info_set_text_ops(struct snd_info_entry *entry __attribute__((unused)), void *private_data, void (*read)(struct snd_info_entry *, struct snd_info_buffer *)) {} static inline int snd_card_rw_proc_new(struct snd_card *card, const char *name, void *private_data, void (*read)(struct snd_info_entry *, struct snd_info_buffer *), void (*write)(struct snd_info_entry *entry, struct snd_info_buffer *buffer)) { return 0; } static inline int snd_info_check_reserved_words(const char *str) { return 1; } #endif /** * snd_card_ro_proc_new - Create a read-only text proc file entry for the card * @card: the card instance * @name: the file name * @private_data: the arbitrary private data * @read: the read callback * * This proc file entry will be registered via snd_card_register() call, and * it will be removed automatically at the card removal, too. */ static inline int snd_card_ro_proc_new(struct snd_card *card, const char *name, void *private_data, void (*read)(struct snd_info_entry *, struct snd_info_buffer *)) { return snd_card_rw_proc_new(card, name, private_data, read, NULL); } /* * OSS info part */ #if defined(CONFIG_SND_OSSEMUL) && defined(CONFIG_SND_PROC_FS) #define SNDRV_OSS_INFO_DEV_AUDIO 0 #define SNDRV_OSS_INFO_DEV_SYNTH 1 #define SNDRV_OSS_INFO_DEV_MIDI 2 #define SNDRV_OSS_INFO_DEV_TIMERS 4 #define SNDRV_OSS_INFO_DEV_MIXERS 5 #define SNDRV_OSS_INFO_DEV_COUNT 6 int snd_oss_info_register(int dev, int num, char *string); #define snd_oss_info_unregister(dev, num) snd_oss_info_register(dev, num, NULL) #endif /* CONFIG_SND_OSSEMUL && CONFIG_SND_PROC_FS */ #endif /* __SOUND_INFO_H */ |
| 188 6 3 3 6 6 6 6 6 6 5 6 6 712 711 710 52 714 52 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2008 Red Hat, Inc., Eric Paris <eparis@redhat.com> */ #include <linux/dcache.h> #include <linux/fs.h> #include <linux/gfp.h> #include <linux/init.h> #include <linux/module.h> #include <linux/mount.h> #include <linux/srcu.h> #include <linux/fsnotify_backend.h> #include "fsnotify.h" /* * Clear all of the marks on an inode when it is being evicted from core */ void __fsnotify_inode_delete(struct inode *inode) { fsnotify_clear_marks_by_inode(inode); } EXPORT_SYMBOL_GPL(__fsnotify_inode_delete); void __fsnotify_vfsmount_delete(struct vfsmount *mnt) { fsnotify_clear_marks_by_mount(mnt); } void __fsnotify_mntns_delete(struct mnt_namespace *mntns) { fsnotify_clear_marks_by_mntns(mntns); } /** * fsnotify_unmount_inodes - an sb is unmounting. handle any watched inodes. * @sb: superblock being unmounted. * * Called during unmount with no locks held, so needs to be safe against * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block. */ static void fsnotify_unmount_inodes(struct super_block *sb) { struct inode *inode, *iput_inode = NULL; spin_lock(&sb->s_inode_list_lock); list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { /* * We cannot __iget() an inode in state I_FREEING, * I_WILL_FREE, or I_NEW which is fine because by that point * the inode cannot have any associated watches. */ spin_lock(&inode->i_lock); if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE | I_NEW)) { spin_unlock(&inode->i_lock); continue; } /* * If i_count is zero, the inode cannot have any watches and * doing an __iget/iput with SB_ACTIVE clear would actually * evict all inodes with zero i_count from icache which is * unnecessarily violent and may in fact be illegal to do. * However, we should have been called /after/ evict_inodes * removed all zero refcount inodes, in any case. Test to * be sure. */ if (!icount_read(inode)) { spin_unlock(&inode->i_lock); continue; } __iget(inode); spin_unlock(&inode->i_lock); spin_unlock(&sb->s_inode_list_lock); iput(iput_inode); /* for each watch, send FS_UNMOUNT and then remove it */ fsnotify_inode(inode, FS_UNMOUNT); fsnotify_inode_delete(inode); iput_inode = inode; cond_resched(); spin_lock(&sb->s_inode_list_lock); } spin_unlock(&sb->s_inode_list_lock); iput(iput_inode); } void fsnotify_sb_delete(struct super_block *sb) { struct fsnotify_sb_info *sbinfo = fsnotify_sb_info(sb); /* Were any marks ever added to any object on this sb? */ if (!sbinfo) return; fsnotify_unmount_inodes(sb); fsnotify_clear_marks_by_sb(sb); /* Wait for outstanding object references from connectors */ wait_var_event(fsnotify_sb_watched_objects(sb), !atomic_long_read(fsnotify_sb_watched_objects(sb))); WARN_ON(fsnotify_sb_has_priority_watchers(sb, FSNOTIFY_PRIO_CONTENT)); WARN_ON(fsnotify_sb_has_priority_watchers(sb, FSNOTIFY_PRIO_PRE_CONTENT)); } void fsnotify_sb_free(struct super_block *sb) { kfree(sb->s_fsnotify_info); } /* * Given an inode, first check if we care what happens to our children. Inotify * and dnotify both tell their parents about events. If we care about any event * on a child we run all of our children and set a dentry flag saying that the * parent cares. Thus when an event happens on a child it can quickly tell * if there is a need to find a parent and send the event to the parent. */ void fsnotify_set_children_dentry_flags(struct inode *inode) { struct dentry *alias; if (!S_ISDIR(inode->i_mode)) return; spin_lock(&inode->i_lock); /* run all of the dentries associated with this inode. Since this is a * directory, there damn well better only be one item on this list */ hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) { struct dentry *child; /* run all of the children of the original inode and fix their * d_flags to indicate parental interest (their parent is the * original inode) */ spin_lock(&alias->d_lock); hlist_for_each_entry(child, &alias->d_children, d_sib) { if (!child->d_inode) continue; spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED); child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED; spin_unlock(&child->d_lock); } spin_unlock(&alias->d_lock); } spin_unlock(&inode->i_lock); } /* * Lazily clear false positive PARENT_WATCHED flag for child whose parent had * stopped watching children. */ static void fsnotify_clear_child_dentry_flag(struct inode *pinode, struct dentry *dentry) { spin_lock(&dentry->d_lock); /* * d_lock is a sufficient barrier to prevent observing a non-watched * parent state from before the fsnotify_set_children_dentry_flags() * or fsnotify_update_flags() call that had set PARENT_WATCHED. */ if (!fsnotify_inode_watches_children(pinode)) dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED; spin_unlock(&dentry->d_lock); } /* Are inode/sb/mount interested in parent and name info with this event? */ static bool fsnotify_event_needs_parent(struct inode *inode, __u32 mnt_mask, __u32 mask) { __u32 marks_mask = 0; /* We only send parent/name to inode/sb/mount for events on non-dir */ if (mask & FS_ISDIR) return false; /* * All events that are possible on child can also may be reported with * parent/name info to inode/sb/mount. Otherwise, a watching parent * could result in events reported with unexpected name info to sb/mount. */ BUILD_BUG_ON(FS_EVENTS_POSS_ON_CHILD & ~FS_EVENTS_POSS_TO_PARENT); /* Did either inode/sb/mount subscribe for events with parent/name? */ marks_mask |= fsnotify_parent_needed_mask( READ_ONCE(inode->i_fsnotify_mask)); marks_mask |= fsnotify_parent_needed_mask( READ_ONCE(inode->i_sb->s_fsnotify_mask)); marks_mask |= fsnotify_parent_needed_mask(mnt_mask); /* Did they subscribe for this event with parent/name info? */ return mask & marks_mask; } /* Are there any inode/mount/sb objects that watch for these events? */ static inline __u32 fsnotify_object_watched(struct inode *inode, __u32 mnt_mask, __u32 mask) { __u32 marks_mask = READ_ONCE(inode->i_fsnotify_mask) | mnt_mask | READ_ONCE(inode->i_sb->s_fsnotify_mask); return mask & marks_mask & ALL_FSNOTIFY_EVENTS; } /* Report pre-content event with optional range info */ int fsnotify_pre_content(const struct path *path, const loff_t *ppos, size_t count) { struct file_range range; /* Report page aligned range only when pos is known */ if (!ppos) return fsnotify_path(path, FS_PRE_ACCESS); range.path = path; range.pos = PAGE_ALIGN_DOWN(*ppos); range.count = PAGE_ALIGN(*ppos + count) - range.pos; return fsnotify_parent(path->dentry, FS_PRE_ACCESS, &range, FSNOTIFY_EVENT_FILE_RANGE); } /* * Notify this dentry's parent about a child's events with child name info * if parent is watching or if inode/sb/mount are interested in events with * parent and name info. * * Notify only the child without name info if parent is not watching and * inode/sb/mount are not interested in events with parent and name info. */ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data, int data_type) { const struct path *path = fsnotify_data_path(data, data_type); __u32 mnt_mask = path ? READ_ONCE(real_mount(path->mnt)->mnt_fsnotify_mask) : 0; struct inode *inode = d_inode(dentry); struct dentry *parent; bool parent_watched = dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED; bool parent_needed, parent_interested; __u32 p_mask; struct inode *p_inode = NULL; struct name_snapshot name; struct qstr *file_name = NULL; int ret = 0; /* Optimize the likely case of nobody watching this path */ if (likely(!parent_watched && !fsnotify_object_watched(inode, mnt_mask, mask))) return 0; parent = NULL; parent_needed = fsnotify_event_needs_parent(inode, mnt_mask, mask); if (!parent_watched && !parent_needed) goto notify; /* Does parent inode care about events on children? */ parent = dget_parent(dentry); p_inode = parent->d_inode; p_mask = fsnotify_inode_watches_children(p_inode); if (unlikely(parent_watched && !p_mask)) fsnotify_clear_child_dentry_flag(p_inode, dentry); /* * Include parent/name in notification either if some notification * groups require parent info or the parent is interested in this event. * The parent interest in ACCESS/MODIFY events does not apply to special * files, where read/write are not on the filesystem of the parent and * events can provide an undesirable side-channel for information * exfiltration. */ parent_interested = mask & p_mask & ALL_FSNOTIFY_EVENTS && !(data_type == FSNOTIFY_EVENT_PATH && d_is_special(dentry) && (mask & (FS_ACCESS | FS_MODIFY))); if (parent_needed || parent_interested) { /* When notifying parent, child should be passed as data */ WARN_ON_ONCE(inode != fsnotify_data_inode(data, data_type)); /* Notify both parent and child with child name info */ take_dentry_name_snapshot(&name, dentry); file_name = &name.name; if (parent_interested) mask |= FS_EVENT_ON_CHILD; } notify: ret = fsnotify(mask, data, data_type, p_inode, file_name, inode, 0); if (file_name) release_dentry_name_snapshot(&name); dput(parent); return ret; } EXPORT_SYMBOL_GPL(__fsnotify_parent); static int fsnotify_handle_inode_event(struct fsnotify_group *group, struct fsnotify_mark *inode_mark, u32 mask, const void *data, int data_type, struct inode *dir, const struct qstr *name, u32 cookie) { const struct path *path = fsnotify_data_path(data, data_type); struct inode *inode = fsnotify_data_inode(data, data_type); const struct fsnotify_ops *ops = group->ops; if (WARN_ON_ONCE(!ops->handle_inode_event)) return 0; if (WARN_ON_ONCE(!inode && !dir)) return 0; if ((inode_mark->flags & FSNOTIFY_MARK_FLAG_EXCL_UNLINK) && path && d_unlinked(path->dentry)) return 0; /* Check interest of this mark in case event was sent with two marks */ if (!(mask & inode_mark->mask & ALL_FSNOTIFY_EVENTS)) return 0; return ops->handle_inode_event(inode_mark, mask, inode, dir, name, cookie); } static int fsnotify_handle_event(struct fsnotify_group *group, __u32 mask, const void *data, int data_type, struct inode *dir, const struct qstr *name, u32 cookie, struct fsnotify_iter_info *iter_info) { struct fsnotify_mark *inode_mark = fsnotify_iter_inode_mark(iter_info); struct fsnotify_mark *parent_mark = fsnotify_iter_parent_mark(iter_info); int ret; if (WARN_ON_ONCE(fsnotify_iter_sb_mark(iter_info)) || WARN_ON_ONCE(fsnotify_iter_vfsmount_mark(iter_info))) return 0; /* * For FS_RENAME, 'dir' is old dir and 'data' is new dentry. * The only ->handle_inode_event() backend that supports FS_RENAME is * dnotify, where it means file was renamed within same parent. */ if (mask & FS_RENAME) { struct dentry *moved = fsnotify_data_dentry(data, data_type); if (dir != moved->d_parent->d_inode) return 0; } if (parent_mark) { ret = fsnotify_handle_inode_event(group, parent_mark, mask, data, data_type, dir, name, 0); if (ret) return ret; } if (!inode_mark) return 0; /* * Some events can be sent on both parent dir and child marks (e.g. * FS_ATTRIB). If both parent dir and child are watching, report the * event once to parent dir with name (if interested) and once to child * without name (if interested). * * In any case regardless whether the parent is watching or not, the * child watcher is expecting an event without the FS_EVENT_ON_CHILD * flag. The file name is expected if and only if this is a directory * event. */ mask &= ~FS_EVENT_ON_CHILD; if (!(mask & ALL_FSNOTIFY_DIRENT_EVENTS)) { dir = NULL; name = NULL; } return fsnotify_handle_inode_event(group, inode_mark, mask, data, data_type, dir, name, cookie); } static int send_to_group(__u32 mask, const void *data, int data_type, struct inode *dir, const struct qstr *file_name, u32 cookie, struct fsnotify_iter_info *iter_info) { struct fsnotify_group *group = NULL; __u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS); __u32 marks_mask = 0; __u32 marks_ignore_mask = 0; bool is_dir = mask & FS_ISDIR; struct fsnotify_mark *mark; int type; if (!iter_info->report_mask) return 0; /* clear ignored on inode modification */ if (mask & FS_MODIFY) { fsnotify_foreach_iter_mark_type(iter_info, mark, type) { if (!(mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)) mark->ignore_mask = 0; } } /* Are any of the group marks interested in this event? */ fsnotify_foreach_iter_mark_type(iter_info, mark, type) { group = mark->group; marks_mask |= mark->mask; marks_ignore_mask |= fsnotify_effective_ignore_mask(mark, is_dir, type); } pr_debug("%s: group=%p mask=%x marks_mask=%x marks_ignore_mask=%x data=%p data_type=%d dir=%p cookie=%d\n", __func__, group, mask, marks_mask, marks_ignore_mask, data, data_type, dir, cookie); if (!(test_mask & marks_mask & ~marks_ignore_mask)) return 0; if (group->ops->handle_event) { return group->ops->handle_event(group, mask, data, data_type, dir, file_name, cookie, iter_info); } return fsnotify_handle_event(group, mask, data, data_type, dir, file_name, cookie, iter_info); } static struct fsnotify_mark *fsnotify_first_mark(struct fsnotify_mark_connector *const *connp) { struct fsnotify_mark_connector *conn; struct hlist_node *node = NULL; conn = srcu_dereference(*connp, &fsnotify_mark_srcu); if (conn) node = srcu_dereference(conn->list.first, &fsnotify_mark_srcu); return hlist_entry_safe(node, struct fsnotify_mark, obj_list); } static struct fsnotify_mark *fsnotify_next_mark(struct fsnotify_mark *mark) { struct hlist_node *node = NULL; if (mark) node = srcu_dereference(mark->obj_list.next, &fsnotify_mark_srcu); return hlist_entry_safe(node, struct fsnotify_mark, obj_list); } /* * iter_info is a multi head priority queue of marks. * Pick a subset of marks from queue heads, all with the same group * and set the report_mask to a subset of the selected marks. * Returns false if there are no more groups to iterate. */ static bool fsnotify_iter_select_report_types( struct fsnotify_iter_info *iter_info) { struct fsnotify_group *max_prio_group = NULL; struct fsnotify_mark *mark; int type; /* Choose max prio group among groups of all queue heads */ fsnotify_foreach_iter_type(type) { mark = iter_info->marks[type]; if (mark && fsnotify_compare_groups(max_prio_group, mark->group) > 0) max_prio_group = mark->group; } if (!max_prio_group) return false; /* Set the report mask for marks from same group as max prio group */ iter_info->current_group = max_prio_group; iter_info->report_mask = 0; fsnotify_foreach_iter_type(type) { mark = iter_info->marks[type]; if (mark && mark->group == iter_info->current_group) { /* * FSNOTIFY_ITER_TYPE_PARENT indicates that this inode * is watching children and interested in this event, * which is an event possible on child. * But is *this mark* watching children? */ if (type == FSNOTIFY_ITER_TYPE_PARENT && !(mark->mask & FS_EVENT_ON_CHILD) && !(fsnotify_ignore_mask(mark) & FS_EVENT_ON_CHILD)) continue; fsnotify_iter_set_report_type(iter_info, type); } } return true; } /* * Pop from iter_info multi head queue, the marks that belong to the group of * current iteration step. */ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info) { struct fsnotify_mark *mark; int type; /* * We cannot use fsnotify_foreach_iter_mark_type() here because we * may need to advance a mark of type X that belongs to current_group * but was not selected for reporting. */ fsnotify_foreach_iter_type(type) { mark = iter_info->marks[type]; if (mark && mark->group == iter_info->current_group) iter_info->marks[type] = fsnotify_next_mark(iter_info->marks[type]); } } /* * fsnotify - This is the main call to fsnotify. * * The VFS calls into hook specific functions in linux/fsnotify.h. * Those functions then in turn call here. Here will call out to all of the * registered fsnotify_group. Those groups can then use the notification event * in whatever means they feel necessary. * * @mask: event type and flags * @data: object that event happened on * @data_type: type of object for fanotify_data_XXX() accessors * @dir: optional directory associated with event - * if @file_name is not NULL, this is the directory that * @file_name is relative to * @file_name: optional file name associated with event * @inode: optional inode associated with event - * If @dir and @inode are both non-NULL, event may be * reported to both. * @cookie: inotify rename cookie */ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir, const struct qstr *file_name, struct inode *inode, u32 cookie) { const struct path *path = fsnotify_data_path(data, data_type); struct super_block *sb = fsnotify_data_sb(data, data_type); const struct fsnotify_mnt *mnt_data = fsnotify_data_mnt(data, data_type); struct fsnotify_sb_info *sbinfo = sb ? fsnotify_sb_info(sb) : NULL; struct fsnotify_iter_info iter_info = {}; struct mount *mnt = NULL; struct inode *inode2 = NULL; struct dentry *moved; int inode2_type; int ret = 0; __u32 test_mask, marks_mask = 0; if (path) mnt = real_mount(path->mnt); if (!inode) { /* Dirent event - report on TYPE_INODE to dir */ inode = dir; /* For FS_RENAME, inode is old_dir and inode2 is new_dir */ if (mask & FS_RENAME) { moved = fsnotify_data_dentry(data, data_type); inode2 = moved->d_parent->d_inode; inode2_type = FSNOTIFY_ITER_TYPE_INODE2; } } else if (mask & FS_EVENT_ON_CHILD) { /* * Event on child - report on TYPE_PARENT to dir if it is * watching children and on TYPE_INODE to child. */ inode2 = dir; inode2_type = FSNOTIFY_ITER_TYPE_PARENT; } /* * Optimization: srcu_read_lock() has a memory barrier which can * be expensive. It protects walking the *_fsnotify_marks lists. * However, if we do not walk the lists, we do not have to do * SRCU because we have no references to any objects and do not * need SRCU to keep them "alive". */ if ((!sbinfo || !sbinfo->sb_marks) && (!mnt || !mnt->mnt_fsnotify_marks) && (!inode || !inode->i_fsnotify_marks) && (!inode2 || !inode2->i_fsnotify_marks) && (!mnt_data || !mnt_data->ns->n_fsnotify_marks)) return 0; if (sb) marks_mask |= READ_ONCE(sb->s_fsnotify_mask); if (mnt) marks_mask |= READ_ONCE(mnt->mnt_fsnotify_mask); if (inode) marks_mask |= READ_ONCE(inode->i_fsnotify_mask); if (inode2) marks_mask |= READ_ONCE(inode2->i_fsnotify_mask); if (mnt_data) marks_mask |= READ_ONCE(mnt_data->ns->n_fsnotify_mask); /* * If this is a modify event we may need to clear some ignore masks. * In that case, the object with ignore masks will have the FS_MODIFY * event in its mask. * Otherwise, return if none of the marks care about this type of event. */ test_mask = (mask & ALL_FSNOTIFY_EVENTS); if (!(test_mask & marks_mask)) return 0; iter_info.srcu_idx = srcu_read_lock(&fsnotify_mark_srcu); if (sbinfo) { iter_info.marks[FSNOTIFY_ITER_TYPE_SB] = fsnotify_first_mark(&sbinfo->sb_marks); } if (mnt) { iter_info.marks[FSNOTIFY_ITER_TYPE_VFSMOUNT] = fsnotify_first_mark(&mnt->mnt_fsnotify_marks); } if (inode) { iter_info.marks[FSNOTIFY_ITER_TYPE_INODE] = fsnotify_first_mark(&inode->i_fsnotify_marks); } if (inode2) { iter_info.marks[inode2_type] = fsnotify_first_mark(&inode2->i_fsnotify_marks); } if (mnt_data) { iter_info.marks[FSNOTIFY_ITER_TYPE_MNTNS] = fsnotify_first_mark(&mnt_data->ns->n_fsnotify_marks); } /* * We need to merge inode/vfsmount/sb mark lists so that e.g. inode mark * ignore masks are properly reflected for mount/sb mark notifications. * That's why this traversal is so complicated... */ while (fsnotify_iter_select_report_types(&iter_info)) { ret = send_to_group(mask, data, data_type, dir, file_name, cookie, &iter_info); if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS)) goto out; fsnotify_iter_next(&iter_info); } ret = 0; out: srcu_read_unlock(&fsnotify_mark_srcu, iter_info.srcu_idx); return ret; } EXPORT_SYMBOL_GPL(fsnotify); #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS /* * At open time we check fsnotify_sb_has_priority_watchers(), call the open perm * hook and set the FMODE_NONOTIFY_ mode bits accordignly. * Later, fsnotify permission hooks do not check if there are permission event * watches, but that there were permission event watches at open time. */ int fsnotify_open_perm_and_set_mode(struct file *file) { struct dentry *dentry = file->f_path.dentry, *parent; struct super_block *sb = dentry->d_sb; __u32 mnt_mask, p_mask = 0; /* Is it a file opened by fanotify? */ if (FMODE_FSNOTIFY_NONE(file->f_mode)) return 0; /* * Permission events is a super set of pre-content events, so if there * are no permission event watchers, there are also no pre-content event * watchers and this is implied from the single FMODE_NONOTIFY_PERM bit. */ if (likely(!fsnotify_sb_has_priority_watchers(sb, FSNOTIFY_PRIO_CONTENT))) { file_set_fsnotify_mode(file, FMODE_NONOTIFY_PERM); return 0; } /* * OK, there are some permission event watchers. Check if anybody is * watching for permission events on *this* file. */ mnt_mask = READ_ONCE(real_mount(file->f_path.mnt)->mnt_fsnotify_mask); p_mask = fsnotify_object_watched(d_inode(dentry), mnt_mask, ALL_FSNOTIFY_PERM_EVENTS); if (dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED) { parent = dget_parent(dentry); p_mask |= fsnotify_inode_watches_children(d_inode(parent)); dput(parent); } /* * Legacy FAN_ACCESS_PERM events have very high performance overhead, * so unlikely to be used in the wild. If they are used there will be * no optimizations at all. */ if (unlikely(p_mask & FS_ACCESS_PERM)) { /* Enable all permission and pre-content events */ file_set_fsnotify_mode(file, 0); goto open_perm; } /* * Pre-content events are only supported on regular files. * If there are pre-content event watchers and no permission access * watchers, set FMODE_NONOTIFY | FMODE_NONOTIFY_PERM to indicate that. * That is the common case with HSM service. */ if (d_is_reg(dentry) && (p_mask & FSNOTIFY_PRE_CONTENT_EVENTS)) { file_set_fsnotify_mode(file, FMODE_NONOTIFY | FMODE_NONOTIFY_PERM); goto open_perm; } /* Nobody watching permission and pre-content events on this file */ file_set_fsnotify_mode(file, FMODE_NONOTIFY_PERM); open_perm: /* * Send open perm events depending on object masks and regardless of * FMODE_NONOTIFY_PERM. */ if (file->f_flags & __FMODE_EXEC && p_mask & FS_OPEN_EXEC_PERM) { int ret = fsnotify_path(&file->f_path, FS_OPEN_EXEC_PERM); if (ret) return ret; } if (p_mask & FS_OPEN_PERM) return fsnotify_path(&file->f_path, FS_OPEN_PERM); return 0; } #endif void fsnotify_mnt(__u32 mask, struct mnt_namespace *ns, struct vfsmount *mnt) { struct fsnotify_mnt data = { .ns = ns, .mnt_id = real_mount(mnt)->mnt_id_unique, }; if (WARN_ON_ONCE(!ns)) return; /* * This is an optimization as well as making sure fsnotify_init() has * been called. */ if (!ns->n_fsnotify_marks) return; fsnotify(mask, &data, FSNOTIFY_EVENT_MNT, NULL, NULL, NULL, 0); } static __init int fsnotify_init(void) { int ret; BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 26); ret = init_srcu_struct(&fsnotify_mark_srcu); if (ret) panic("initializing fsnotify_mark_srcu"); fsnotify_mark_connector_cachep = KMEM_CACHE(fsnotify_mark_connector, SLAB_PANIC); return 0; } core_initcall(fsnotify_init); |
| 337 281 339 274 143 185 23 22 116 112 101 39 23 41 15 5 5 5 5 5 5 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 | // SPDX-License-Identifier: GPL-2.0-only /* * IPv6 library code, needed by static components when full IPv6 support is * not configured or static. */ #include <linux/export.h> #include <net/ipv6.h> #include <net/ipv6_stubs.h> #include <net/addrconf.h> #include <net/ip.h> /* if ipv6 module registers this function is used by xfrm to force all * sockets to relookup their nodes - this is fairly expensive, be * careful */ void (*__fib6_flush_trees)(struct net *); EXPORT_SYMBOL(__fib6_flush_trees); #define IPV6_ADDR_SCOPE_TYPE(scope) ((scope) << 16) static inline unsigned int ipv6_addr_scope2type(unsigned int scope) { switch (scope) { case IPV6_ADDR_SCOPE_NODELOCAL: return (IPV6_ADDR_SCOPE_TYPE(IPV6_ADDR_SCOPE_NODELOCAL) | IPV6_ADDR_LOOPBACK); case IPV6_ADDR_SCOPE_LINKLOCAL: return (IPV6_ADDR_SCOPE_TYPE(IPV6_ADDR_SCOPE_LINKLOCAL) | IPV6_ADDR_LINKLOCAL); case IPV6_ADDR_SCOPE_SITELOCAL: return (IPV6_ADDR_SCOPE_TYPE(IPV6_ADDR_SCOPE_SITELOCAL) | IPV6_ADDR_SITELOCAL); } return IPV6_ADDR_SCOPE_TYPE(scope); } int __ipv6_addr_type(const struct in6_addr *addr) { __be32 st; st = addr->s6_addr32[0]; /* Consider all addresses with the first three bits different of 000 and 111 as unicasts. */ if ((st & htonl(0xE0000000)) != htonl(0x00000000) && (st & htonl(0xE0000000)) != htonl(0xE0000000)) return (IPV6_ADDR_UNICAST | IPV6_ADDR_SCOPE_TYPE(IPV6_ADDR_SCOPE_GLOBAL)); if ((st & htonl(0xFF000000)) == htonl(0xFF000000)) { /* multicast */ /* addr-select 3.1 */ return (IPV6_ADDR_MULTICAST | ipv6_addr_scope2type(IPV6_ADDR_MC_SCOPE(addr))); } if ((st & htonl(0xFFC00000)) == htonl(0xFE800000)) return (IPV6_ADDR_LINKLOCAL | IPV6_ADDR_UNICAST | IPV6_ADDR_SCOPE_TYPE(IPV6_ADDR_SCOPE_LINKLOCAL)); /* addr-select 3.1 */ if ((st & htonl(0xFFC00000)) == htonl(0xFEC00000)) return (IPV6_ADDR_SITELOCAL | IPV6_ADDR_UNICAST | IPV6_ADDR_SCOPE_TYPE(IPV6_ADDR_SCOPE_SITELOCAL)); /* addr-select 3.1 */ if ((st & htonl(0xFE000000)) == htonl(0xFC000000)) return (IPV6_ADDR_UNICAST | IPV6_ADDR_SCOPE_TYPE(IPV6_ADDR_SCOPE_GLOBAL)); /* RFC 4193 */ if ((addr->s6_addr32[0] | addr->s6_addr32[1]) == 0) { if (addr->s6_addr32[2] == 0) { if (addr->s6_addr32[3] == 0) return IPV6_ADDR_ANY; if (addr->s6_addr32[3] == htonl(0x00000001)) return (IPV6_ADDR_LOOPBACK | IPV6_ADDR_UNICAST | IPV6_ADDR_SCOPE_TYPE(IPV6_ADDR_SCOPE_LINKLOCAL)); /* addr-select 3.4 */ return (IPV6_ADDR_COMPATv4 | IPV6_ADDR_UNICAST | IPV6_ADDR_SCOPE_TYPE(IPV6_ADDR_SCOPE_GLOBAL)); /* addr-select 3.3 */ } if (addr->s6_addr32[2] == htonl(0x0000ffff)) return (IPV6_ADDR_MAPPED | IPV6_ADDR_SCOPE_TYPE(IPV6_ADDR_SCOPE_GLOBAL)); /* addr-select 3.3 */ } return (IPV6_ADDR_UNICAST | IPV6_ADDR_SCOPE_TYPE(IPV6_ADDR_SCOPE_GLOBAL)); /* addr-select 3.4 */ } EXPORT_SYMBOL(__ipv6_addr_type); static ATOMIC_NOTIFIER_HEAD(inet6addr_chain); static BLOCKING_NOTIFIER_HEAD(inet6addr_validator_chain); int register_inet6addr_notifier(struct notifier_block *nb) { return atomic_notifier_chain_register(&inet6addr_chain, nb); } EXPORT_SYMBOL(register_inet6addr_notifier); int unregister_inet6addr_notifier(struct notifier_block *nb) { return atomic_notifier_chain_unregister(&inet6addr_chain, nb); } EXPORT_SYMBOL(unregister_inet6addr_notifier); int inet6addr_notifier_call_chain(unsigned long val, void *v) { return atomic_notifier_call_chain(&inet6addr_chain, val, v); } EXPORT_SYMBOL(inet6addr_notifier_call_chain); int register_inet6addr_validator_notifier(struct notifier_block *nb) { return blocking_notifier_chain_register(&inet6addr_validator_chain, nb); } EXPORT_SYMBOL(register_inet6addr_validator_notifier); int unregister_inet6addr_validator_notifier(struct notifier_block *nb) { return blocking_notifier_chain_unregister(&inet6addr_validator_chain, nb); } EXPORT_SYMBOL(unregister_inet6addr_validator_notifier); int inet6addr_validator_notifier_call_chain(unsigned long val, void *v) { return blocking_notifier_call_chain(&inet6addr_validator_chain, val, v); } EXPORT_SYMBOL(inet6addr_validator_notifier_call_chain); static struct dst_entry *eafnosupport_ipv6_dst_lookup_flow(struct net *net, const struct sock *sk, struct flowi6 *fl6, const struct in6_addr *final_dst) { return ERR_PTR(-EAFNOSUPPORT); } static int eafnosupport_ipv6_route_input(struct sk_buff *skb) { return -EAFNOSUPPORT; } static struct fib6_table *eafnosupport_fib6_get_table(struct net *net, u32 id) { return NULL; } static int eafnosupport_fib6_table_lookup(struct net *net, struct fib6_table *table, int oif, struct flowi6 *fl6, struct fib6_result *res, int flags) { return -EAFNOSUPPORT; } static int eafnosupport_fib6_lookup(struct net *net, int oif, struct flowi6 *fl6, struct fib6_result *res, int flags) { return -EAFNOSUPPORT; } static void eafnosupport_fib6_select_path(const struct net *net, struct fib6_result *res, struct flowi6 *fl6, int oif, bool have_oif_match, const struct sk_buff *skb, int strict) { } static u32 eafnosupport_ip6_mtu_from_fib6(const struct fib6_result *res, const struct in6_addr *daddr, const struct in6_addr *saddr) { return 0; } static int eafnosupport_fib6_nh_init(struct net *net, struct fib6_nh *fib6_nh, struct fib6_config *cfg, gfp_t gfp_flags, struct netlink_ext_ack *extack) { NL_SET_ERR_MSG(extack, "IPv6 support not enabled in kernel"); return -EAFNOSUPPORT; } static int eafnosupport_ip6_del_rt(struct net *net, struct fib6_info *rt, bool skip_notify) { return -EAFNOSUPPORT; } static int eafnosupport_ipv6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, int (*output)(struct net *, struct sock *, struct sk_buff *)) { kfree_skb(skb); return -EAFNOSUPPORT; } static struct net_device *eafnosupport_ipv6_dev_find(struct net *net, const struct in6_addr *addr, struct net_device *dev) { return ERR_PTR(-EAFNOSUPPORT); } const struct ipv6_stub *ipv6_stub __read_mostly = &(struct ipv6_stub) { .ipv6_dst_lookup_flow = eafnosupport_ipv6_dst_lookup_flow, .ipv6_route_input = eafnosupport_ipv6_route_input, .fib6_get_table = eafnosupport_fib6_get_table, .fib6_table_lookup = eafnosupport_fib6_table_lookup, .fib6_lookup = eafnosupport_fib6_lookup, .fib6_select_path = eafnosupport_fib6_select_path, .ip6_mtu_from_fib6 = eafnosupport_ip6_mtu_from_fib6, .fib6_nh_init = eafnosupport_fib6_nh_init, .ip6_del_rt = eafnosupport_ip6_del_rt, .ipv6_fragment = eafnosupport_ipv6_fragment, .ipv6_dev_find = eafnosupport_ipv6_dev_find, }; EXPORT_SYMBOL_GPL(ipv6_stub); /* IPv6 Wildcard Address and Loopback Address defined by RFC2553 */ const struct in6_addr in6addr_loopback __aligned(BITS_PER_LONG/8) = IN6ADDR_LOOPBACK_INIT; EXPORT_SYMBOL(in6addr_loopback); const struct in6_addr in6addr_any __aligned(BITS_PER_LONG/8) = IN6ADDR_ANY_INIT; EXPORT_SYMBOL(in6addr_any); const struct in6_addr in6addr_linklocal_allnodes __aligned(BITS_PER_LONG/8) = IN6ADDR_LINKLOCAL_ALLNODES_INIT; EXPORT_SYMBOL(in6addr_linklocal_allnodes); const struct in6_addr in6addr_linklocal_allrouters __aligned(BITS_PER_LONG/8) = IN6ADDR_LINKLOCAL_ALLROUTERS_INIT; EXPORT_SYMBOL(in6addr_linklocal_allrouters); const struct in6_addr in6addr_interfacelocal_allnodes __aligned(BITS_PER_LONG/8) = IN6ADDR_INTERFACELOCAL_ALLNODES_INIT; EXPORT_SYMBOL(in6addr_interfacelocal_allnodes); const struct in6_addr in6addr_interfacelocal_allrouters __aligned(BITS_PER_LONG/8) = IN6ADDR_INTERFACELOCAL_ALLROUTERS_INIT; EXPORT_SYMBOL(in6addr_interfacelocal_allrouters); const struct in6_addr in6addr_sitelocal_allrouters __aligned(BITS_PER_LONG/8) = IN6ADDR_SITELOCAL_ALLROUTERS_INIT; EXPORT_SYMBOL(in6addr_sitelocal_allrouters); static void snmp6_free_dev(struct inet6_dev *idev) { kfree(idev->stats.icmpv6msgdev); kfree(idev->stats.icmpv6dev); free_percpu(idev->stats.ipv6); } static void in6_dev_finish_destroy_rcu(struct rcu_head *head) { struct inet6_dev *idev = container_of(head, struct inet6_dev, rcu); snmp6_free_dev(idev); kfree(idev); } /* Nobody refers to this device, we may destroy it. */ void in6_dev_finish_destroy(struct inet6_dev *idev) { struct net_device *dev = idev->dev; WARN_ON(!list_empty(&idev->addr_list)); WARN_ON(rcu_access_pointer(idev->mc_list)); WARN_ON(timer_pending(&idev->rs_timer)); #ifdef NET_REFCNT_DEBUG pr_debug("%s: %s\n", __func__, dev ? dev->name : "NIL"); #endif netdev_put(dev, &idev->dev_tracker); if (!idev->dead) { pr_warn("Freeing alive inet6 device %p\n", idev); return; } call_rcu(&idev->rcu, in6_dev_finish_destroy_rcu); } EXPORT_SYMBOL(in6_dev_finish_destroy); |
| 4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* user-type.h: User-defined key type * * Copyright (C) 2005 Red Hat, Inc. All Rights Reserved. * Written by David Howells (dhowells@redhat.com) */ #ifndef _KEYS_USER_TYPE_H #define _KEYS_USER_TYPE_H #include <linux/key.h> #include <linux/rcupdate.h> #ifdef CONFIG_KEYS /*****************************************************************************/ /* * the payload for a key of type "user" or "logon" * - once filled in and attached to a key: * - the payload struct is invariant may not be changed, only replaced * - the payload must be read with RCU procedures or with the key semaphore * held * - the payload may only be replaced with the key semaphore write-locked * - the key's data length is the size of the actual data, not including the * payload wrapper */ struct user_key_payload { struct rcu_head rcu; /* RCU destructor */ unsigned short datalen; /* length of this data */ char data[] __aligned(__alignof__(u64)); /* actual data */ }; extern struct key_type key_type_user; extern struct key_type key_type_logon; struct key_preparsed_payload; extern int user_preparse(struct key_preparsed_payload *prep); extern void user_free_preparse(struct key_preparsed_payload *prep); extern int user_update(struct key *key, struct key_preparsed_payload *prep); extern void user_revoke(struct key *key); extern void user_destroy(struct key *key); extern void user_describe(const struct key *user, struct seq_file *m); extern long user_read(const struct key *key, char *buffer, size_t buflen); static inline const struct user_key_payload *user_key_payload_rcu(const struct key *key) { return (struct user_key_payload *)dereference_key_rcu(key); } static inline struct user_key_payload *user_key_payload_locked(const struct key *key) { return (struct user_key_payload *)dereference_key_locked((struct key *)key); } #endif /* CONFIG_KEYS */ #endif /* _KEYS_USER_TYPE_H */ |
| 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 | /* * llc_s_ev.c - Defines SAP component events * * The followed event functions are SAP component events which are described * in 802.2 LLC protocol standard document. * * Copyright (c) 1997 by Procom Technology, Inc. * 2001-2003 by Arnaldo Carvalho de Melo <acme@conectiva.com.br> * * This program can be redistributed or modified under the terms of the * GNU General Public License as published by the Free Software Foundation. * This program is distributed without any warranty or implied warranty * of merchantability or fitness for a particular purpose. * * See the GNU General Public License for more details. */ #include <linux/socket.h> #include <net/sock.h> #include <net/llc_if.h> #include <net/llc_s_ev.h> #include <net/llc_pdu.h> int llc_sap_ev_activation_req(struct llc_sap *sap, struct sk_buff *skb) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); return ev->type == LLC_SAP_EV_TYPE_SIMPLE && ev->prim_type == LLC_SAP_EV_ACTIVATION_REQ ? 0 : 1; } int llc_sap_ev_rx_ui(struct llc_sap *sap, struct sk_buff *skb) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); struct llc_pdu_un *pdu = llc_pdu_un_hdr(skb); return ev->type == LLC_SAP_EV_TYPE_PDU && LLC_PDU_IS_CMD(pdu) && LLC_PDU_TYPE_IS_U(pdu) && LLC_U_PDU_CMD(pdu) == LLC_1_PDU_CMD_UI ? 0 : 1; } int llc_sap_ev_unitdata_req(struct llc_sap *sap, struct sk_buff *skb) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); return ev->type == LLC_SAP_EV_TYPE_PRIM && ev->prim == LLC_DATAUNIT_PRIM && ev->prim_type == LLC_PRIM_TYPE_REQ ? 0 : 1; } int llc_sap_ev_xid_req(struct llc_sap *sap, struct sk_buff *skb) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); return ev->type == LLC_SAP_EV_TYPE_PRIM && ev->prim == LLC_XID_PRIM && ev->prim_type == LLC_PRIM_TYPE_REQ ? 0 : 1; } int llc_sap_ev_rx_xid_c(struct llc_sap *sap, struct sk_buff *skb) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); struct llc_pdu_un *pdu = llc_pdu_un_hdr(skb); return ev->type == LLC_SAP_EV_TYPE_PDU && LLC_PDU_IS_CMD(pdu) && LLC_PDU_TYPE_IS_U(pdu) && LLC_U_PDU_CMD(pdu) == LLC_1_PDU_CMD_XID ? 0 : 1; } int llc_sap_ev_rx_xid_r(struct llc_sap *sap, struct sk_buff *skb) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); struct llc_pdu_un *pdu = llc_pdu_un_hdr(skb); return ev->type == LLC_SAP_EV_TYPE_PDU && LLC_PDU_IS_RSP(pdu) && LLC_PDU_TYPE_IS_U(pdu) && LLC_U_PDU_RSP(pdu) == LLC_1_PDU_CMD_XID ? 0 : 1; } int llc_sap_ev_test_req(struct llc_sap *sap, struct sk_buff *skb) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); return ev->type == LLC_SAP_EV_TYPE_PRIM && ev->prim == LLC_TEST_PRIM && ev->prim_type == LLC_PRIM_TYPE_REQ ? 0 : 1; } int llc_sap_ev_rx_test_c(struct llc_sap *sap, struct sk_buff *skb) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); struct llc_pdu_un *pdu = llc_pdu_un_hdr(skb); return ev->type == LLC_SAP_EV_TYPE_PDU && LLC_PDU_IS_CMD(pdu) && LLC_PDU_TYPE_IS_U(pdu) && LLC_U_PDU_CMD(pdu) == LLC_1_PDU_CMD_TEST ? 0 : 1; } int llc_sap_ev_rx_test_r(struct llc_sap *sap, struct sk_buff *skb) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); struct llc_pdu_un *pdu = llc_pdu_un_hdr(skb); return ev->type == LLC_SAP_EV_TYPE_PDU && LLC_PDU_IS_RSP(pdu) && LLC_PDU_TYPE_IS_U(pdu) && LLC_U_PDU_RSP(pdu) == LLC_1_PDU_CMD_TEST ? 0 : 1; } int llc_sap_ev_deactivation_req(struct llc_sap *sap, struct sk_buff *skb) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); return ev->type == LLC_SAP_EV_TYPE_SIMPLE && ev->prim_type == LLC_SAP_EV_DEACTIVATION_REQ ? 0 : 1; } |
| 2 3 3 3 3 3 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Force feedback support for memoryless devices * * Copyright (c) 2006 Anssi Hannula <anssi.hannula@gmail.com> * Copyright (c) 2006 Dmitry Torokhov <dtor@mail.ru> */ /* #define DEBUG */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/export.h> #include <linux/slab.h> #include <linux/input.h> #include <linux/module.h> #include <linux/mutex.h> #include <linux/spinlock.h> #include <linux/jiffies.h> #include <linux/fixp-arith.h> MODULE_LICENSE("GPL"); MODULE_AUTHOR("Anssi Hannula <anssi.hannula@gmail.com>"); MODULE_DESCRIPTION("Force feedback support for memoryless devices"); /* Number of effects handled with memoryless devices */ #define FF_MEMLESS_EFFECTS 16 /* Envelope update interval in ms */ #define FF_ENVELOPE_INTERVAL 50 #define FF_EFFECT_STARTED 0 #define FF_EFFECT_PLAYING 1 #define FF_EFFECT_ABORTING 2 struct ml_effect_state { struct ff_effect *effect; unsigned long flags; /* effect state (STARTED, PLAYING, etc) */ int count; /* loop count of the effect */ unsigned long play_at; /* start time */ unsigned long stop_at; /* stop time */ unsigned long adj_at; /* last time the effect was sent */ }; struct ml_device { void *private; struct ml_effect_state states[FF_MEMLESS_EFFECTS]; int gain; struct timer_list timer; struct input_dev *dev; int (*play_effect)(struct input_dev *dev, void *data, struct ff_effect *effect); }; static const struct ff_envelope *get_envelope(const struct ff_effect *effect) { static const struct ff_envelope empty_envelope; switch (effect->type) { case FF_PERIODIC: return &effect->u.periodic.envelope; case FF_CONSTANT: return &effect->u.constant.envelope; default: return &empty_envelope; } } /* * Check for the next time envelope requires an update on memoryless devices */ static unsigned long calculate_next_time(struct ml_effect_state *state) { const struct ff_envelope *envelope = get_envelope(state->effect); unsigned long attack_stop, fade_start, next_fade; if (envelope->attack_length) { attack_stop = state->play_at + msecs_to_jiffies(envelope->attack_length); if (time_before(state->adj_at, attack_stop)) return state->adj_at + msecs_to_jiffies(FF_ENVELOPE_INTERVAL); } if (state->effect->replay.length) { if (envelope->fade_length) { /* check when fading should start */ fade_start = state->stop_at - msecs_to_jiffies(envelope->fade_length); if (time_before(state->adj_at, fade_start)) return fade_start; /* already fading, advance to next checkpoint */ next_fade = state->adj_at + msecs_to_jiffies(FF_ENVELOPE_INTERVAL); if (time_before(next_fade, state->stop_at)) return next_fade; } return state->stop_at; } return state->play_at; } static void ml_schedule_timer(struct ml_device *ml) { struct ml_effect_state *state; unsigned long now = jiffies; unsigned long earliest = 0; unsigned long next_at; int events = 0; int i; pr_debug("calculating next timer\n"); for (i = 0; i < FF_MEMLESS_EFFECTS; i++) { state = &ml->states[i]; if (!test_bit(FF_EFFECT_STARTED, &state->flags)) continue; if (test_bit(FF_EFFECT_PLAYING, &state->flags)) next_at = calculate_next_time(state); else next_at = state->play_at; if (time_before_eq(now, next_at) && (++events == 1 || time_before(next_at, earliest))) earliest = next_at; } if (!events) { pr_debug("no actions\n"); timer_delete(&ml->timer); } else { pr_debug("timer set\n"); mod_timer(&ml->timer, earliest); } } /* * Apply an envelope to a value */ static int apply_envelope(struct ml_effect_state *state, int value, struct ff_envelope *envelope) { struct ff_effect *effect = state->effect; unsigned long now = jiffies; int time_from_level; int time_of_envelope; int envelope_level; int difference; if (envelope->attack_length && time_before(now, state->play_at + msecs_to_jiffies(envelope->attack_length))) { pr_debug("value = 0x%x, attack_level = 0x%x\n", value, envelope->attack_level); time_from_level = jiffies_to_msecs(now - state->play_at); time_of_envelope = envelope->attack_length; envelope_level = min_t(u16, envelope->attack_level, 0x7fff); } else if (envelope->fade_length && effect->replay.length && time_after(now, state->stop_at - msecs_to_jiffies(envelope->fade_length)) && time_before(now, state->stop_at)) { time_from_level = jiffies_to_msecs(state->stop_at - now); time_of_envelope = envelope->fade_length; envelope_level = min_t(u16, envelope->fade_level, 0x7fff); } else return value; difference = abs(value) - envelope_level; pr_debug("difference = %d\n", difference); pr_debug("time_from_level = 0x%x\n", time_from_level); pr_debug("time_of_envelope = 0x%x\n", time_of_envelope); difference = difference * time_from_level / time_of_envelope; pr_debug("difference = %d\n", difference); return value < 0 ? -(difference + envelope_level) : (difference + envelope_level); } /* * Return the type the effect has to be converted into (memless devices) */ static int get_compatible_type(struct ff_device *ff, int effect_type) { if (test_bit(effect_type, ff->ffbit)) return effect_type; if (effect_type == FF_PERIODIC && test_bit(FF_RUMBLE, ff->ffbit)) return FF_RUMBLE; pr_err("invalid type in get_compatible_type()\n"); return 0; } /* * Only left/right direction should be used (under/over 0x8000) for * forward/reverse motor direction (to keep calculation fast & simple). */ static u16 ml_calculate_direction(u16 direction, u16 force, u16 new_direction, u16 new_force) { if (!force) return new_direction; if (!new_force) return direction; return (((u32)(direction >> 1) * force + (new_direction >> 1) * new_force) / (force + new_force)) << 1; } #define FRAC_N 8 static inline s16 fixp_new16(s16 a) { return ((s32)a) >> (16 - FRAC_N); } static inline s16 fixp_mult(s16 a, s16 b) { a = ((s32)a * 0x100) / 0x7fff; return ((s32)(a * b)) >> FRAC_N; } /* * Combine two effects and apply gain. */ static void ml_combine_effects(struct ff_effect *effect, struct ml_effect_state *state, int gain) { struct ff_effect *new = state->effect; unsigned int strong, weak, i; int x, y; s16 level; switch (new->type) { case FF_CONSTANT: i = new->direction * 360 / 0xffff; level = fixp_new16(apply_envelope(state, new->u.constant.level, &new->u.constant.envelope)); x = fixp_mult(fixp_sin16(i), level) * gain / 0xffff; y = fixp_mult(-fixp_cos16(i), level) * gain / 0xffff; /* * here we abuse ff_ramp to hold x and y of constant force * If in future any driver wants something else than x and y * in s8, this should be changed to something more generic */ effect->u.ramp.start_level = clamp_val(effect->u.ramp.start_level + x, -0x80, 0x7f); effect->u.ramp.end_level = clamp_val(effect->u.ramp.end_level + y, -0x80, 0x7f); break; case FF_RUMBLE: strong = (u32)new->u.rumble.strong_magnitude * gain / 0xffff; weak = (u32)new->u.rumble.weak_magnitude * gain / 0xffff; if (effect->u.rumble.strong_magnitude + strong) effect->direction = ml_calculate_direction( effect->direction, effect->u.rumble.strong_magnitude, new->direction, strong); else if (effect->u.rumble.weak_magnitude + weak) effect->direction = ml_calculate_direction( effect->direction, effect->u.rumble.weak_magnitude, new->direction, weak); else effect->direction = 0; effect->u.rumble.strong_magnitude = min(strong + effect->u.rumble.strong_magnitude, 0xffffU); effect->u.rumble.weak_magnitude = min(weak + effect->u.rumble.weak_magnitude, 0xffffU); break; case FF_PERIODIC: i = apply_envelope(state, abs(new->u.periodic.magnitude), &new->u.periodic.envelope); /* here we also scale it 0x7fff => 0xffff */ i = i * gain / 0x7fff; if (effect->u.rumble.strong_magnitude + i) effect->direction = ml_calculate_direction( effect->direction, effect->u.rumble.strong_magnitude, new->direction, i); else effect->direction = 0; effect->u.rumble.strong_magnitude = min(i + effect->u.rumble.strong_magnitude, 0xffffU); effect->u.rumble.weak_magnitude = min(i + effect->u.rumble.weak_magnitude, 0xffffU); break; default: pr_err("invalid type in ml_combine_effects()\n"); break; } } /* * Because memoryless devices have only one effect per effect type active * at one time we have to combine multiple effects into one */ static int ml_get_combo_effect(struct ml_device *ml, unsigned long *effect_handled, struct ff_effect *combo_effect) { struct ff_effect *effect; struct ml_effect_state *state; int effect_type; int i; memset(combo_effect, 0, sizeof(struct ff_effect)); for (i = 0; i < FF_MEMLESS_EFFECTS; i++) { if (__test_and_set_bit(i, effect_handled)) continue; state = &ml->states[i]; effect = state->effect; if (!test_bit(FF_EFFECT_STARTED, &state->flags)) continue; if (time_before(jiffies, state->play_at)) continue; /* * here we have started effects that are either * currently playing (and may need be aborted) * or need to start playing. */ effect_type = get_compatible_type(ml->dev->ff, effect->type); if (combo_effect->type != effect_type) { if (combo_effect->type != 0) { __clear_bit(i, effect_handled); continue; } combo_effect->type = effect_type; } if (__test_and_clear_bit(FF_EFFECT_ABORTING, &state->flags)) { __clear_bit(FF_EFFECT_PLAYING, &state->flags); __clear_bit(FF_EFFECT_STARTED, &state->flags); } else if (effect->replay.length && time_after_eq(jiffies, state->stop_at)) { __clear_bit(FF_EFFECT_PLAYING, &state->flags); if (--state->count <= 0) { __clear_bit(FF_EFFECT_STARTED, &state->flags); } else { state->play_at = jiffies + msecs_to_jiffies(effect->replay.delay); state->stop_at = state->play_at + msecs_to_jiffies(effect->replay.length); } } else { __set_bit(FF_EFFECT_PLAYING, &state->flags); state->adj_at = jiffies; ml_combine_effects(combo_effect, state, ml->gain); } } return combo_effect->type != 0; } static void ml_play_effects(struct ml_device *ml) { struct ff_effect effect; DECLARE_BITMAP(handled_bm, FF_MEMLESS_EFFECTS); memset(handled_bm, 0, sizeof(handled_bm)); while (ml_get_combo_effect(ml, handled_bm, &effect)) ml->play_effect(ml->dev, ml->private, &effect); ml_schedule_timer(ml); } static void ml_effect_timer(struct timer_list *t) { struct ml_device *ml = timer_container_of(ml, t, timer); struct input_dev *dev = ml->dev; pr_debug("timer: updating effects\n"); guard(spinlock_irqsave)(&dev->event_lock); ml_play_effects(ml); } /* * Sets requested gain for FF effects. Called with dev->event_lock held. */ static void ml_ff_set_gain(struct input_dev *dev, u16 gain) { struct ml_device *ml = dev->ff->private; int i; ml->gain = gain; for (i = 0; i < FF_MEMLESS_EFFECTS; i++) __clear_bit(FF_EFFECT_PLAYING, &ml->states[i].flags); ml_play_effects(ml); } /* * Start/stop specified FF effect. Called with dev->event_lock held. */ static int ml_ff_playback(struct input_dev *dev, int effect_id, int value) { struct ml_device *ml = dev->ff->private; struct ml_effect_state *state = &ml->states[effect_id]; if (value > 0) { pr_debug("initiated play\n"); __set_bit(FF_EFFECT_STARTED, &state->flags); state->count = value; state->play_at = jiffies + msecs_to_jiffies(state->effect->replay.delay); state->stop_at = state->play_at + msecs_to_jiffies(state->effect->replay.length); state->adj_at = state->play_at; } else { pr_debug("initiated stop\n"); if (test_bit(FF_EFFECT_PLAYING, &state->flags)) __set_bit(FF_EFFECT_ABORTING, &state->flags); else __clear_bit(FF_EFFECT_STARTED, &state->flags); } ml_play_effects(ml); return 0; } static int ml_ff_upload(struct input_dev *dev, struct ff_effect *effect, struct ff_effect *old) { struct ml_device *ml = dev->ff->private; struct ml_effect_state *state = &ml->states[effect->id]; guard(spinlock_irq)(&dev->event_lock); if (test_bit(FF_EFFECT_STARTED, &state->flags)) { __clear_bit(FF_EFFECT_PLAYING, &state->flags); state->play_at = jiffies + msecs_to_jiffies(state->effect->replay.delay); state->stop_at = state->play_at + msecs_to_jiffies(state->effect->replay.length); state->adj_at = state->play_at; ml_schedule_timer(ml); } return 0; } static void ml_ff_destroy(struct ff_device *ff) { struct ml_device *ml = ff->private; /* * Even though we stop all playing effects when tearing down * an input device (via input_device_flush() that calls into * input_ff_flush() that stops and erases all effects), we * do not actually stop the timer, and therefore we should * do it here. */ timer_delete_sync(&ml->timer); kfree(ml->private); } /** * input_ff_create_memless() - create memoryless force-feedback device * @dev: input device supporting force-feedback * @data: driver-specific data to be passed into @play_effect * @play_effect: driver-specific method for playing FF effect */ int input_ff_create_memless(struct input_dev *dev, void *data, int (*play_effect)(struct input_dev *, void *, struct ff_effect *)) { struct ff_device *ff; int error; int i; struct ml_device *ml __free(kfree) = kzalloc(sizeof(*ml), GFP_KERNEL); if (!ml) return -ENOMEM; ml->dev = dev; ml->private = data; ml->play_effect = play_effect; ml->gain = 0xffff; timer_setup(&ml->timer, ml_effect_timer, 0); set_bit(FF_GAIN, dev->ffbit); error = input_ff_create(dev, FF_MEMLESS_EFFECTS); if (error) return error; ff = dev->ff; ff->upload = ml_ff_upload; ff->playback = ml_ff_playback; ff->set_gain = ml_ff_set_gain; ff->destroy = ml_ff_destroy; /* we can emulate periodic effects with RUMBLE */ if (test_bit(FF_RUMBLE, ff->ffbit)) { set_bit(FF_PERIODIC, dev->ffbit); set_bit(FF_SINE, dev->ffbit); set_bit(FF_TRIANGLE, dev->ffbit); set_bit(FF_SQUARE, dev->ffbit); } for (i = 0; i < FF_MEMLESS_EFFECTS; i++) ml->states[i].effect = &ff->effects[i]; ff->private = no_free_ptr(ml); return 0; } EXPORT_SYMBOL_GPL(input_ff_create_memless); |
| 3 2 3 1 1 1 1 1 1 1 21 21 22 5 1 1 1 1 4 3 2 22 2 1 1 1 3 3 1 1 1 1 2 3 4 1 1 1 1 1 1 1 1 1 1 1 6 2 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 | // SPDX-License-Identifier: GPL-2.0 /* * RTC subsystem, dev interface * * Copyright (C) 2005 Tower Technologies * Author: Alessandro Zummo <a.zummo@towertech.it> * * based on arch/arm/common/rtctime.c */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/compat.h> #include <linux/module.h> #include <linux/rtc.h> #include <linux/sched/signal.h> #include "rtc-core.h" static dev_t rtc_devt; #define RTC_DEV_MAX 16 /* 16 RTCs should be enough for everyone... */ static int rtc_dev_open(struct inode *inode, struct file *file) { struct rtc_device *rtc = container_of(inode->i_cdev, struct rtc_device, char_dev); if (test_and_set_bit_lock(RTC_DEV_BUSY, &rtc->flags)) return -EBUSY; file->private_data = rtc; spin_lock_irq(&rtc->irq_lock); rtc->irq_data = 0; spin_unlock_irq(&rtc->irq_lock); return 0; } #ifdef CONFIG_RTC_INTF_DEV_UIE_EMUL /* * Routine to poll RTC seconds field for change as often as possible, * after first RTC_UIE use timer to reduce polling */ static void rtc_uie_task(struct work_struct *work) { struct rtc_device *rtc = container_of(work, struct rtc_device, uie_task); struct rtc_time tm; int num = 0; int err; err = rtc_read_time(rtc, &tm); spin_lock_irq(&rtc->irq_lock); if (rtc->stop_uie_polling || err) { rtc->uie_task_active = 0; } else if (rtc->oldsecs != tm.tm_sec) { num = (tm.tm_sec + 60 - rtc->oldsecs) % 60; rtc->oldsecs = tm.tm_sec; rtc->uie_timer.expires = jiffies + HZ - (HZ / 10); rtc->uie_timer_active = 1; rtc->uie_task_active = 0; add_timer(&rtc->uie_timer); } else if (schedule_work(&rtc->uie_task) == 0) { rtc->uie_task_active = 0; } spin_unlock_irq(&rtc->irq_lock); if (num) rtc_handle_legacy_irq(rtc, num, RTC_UF); } static void rtc_uie_timer(struct timer_list *t) { struct rtc_device *rtc = timer_container_of(rtc, t, uie_timer); unsigned long flags; spin_lock_irqsave(&rtc->irq_lock, flags); rtc->uie_timer_active = 0; rtc->uie_task_active = 1; if ((schedule_work(&rtc->uie_task) == 0)) rtc->uie_task_active = 0; spin_unlock_irqrestore(&rtc->irq_lock, flags); } static int clear_uie(struct rtc_device *rtc) { spin_lock_irq(&rtc->irq_lock); if (rtc->uie_irq_active) { rtc->stop_uie_polling = 1; if (rtc->uie_timer_active) { spin_unlock_irq(&rtc->irq_lock); timer_delete_sync(&rtc->uie_timer); spin_lock_irq(&rtc->irq_lock); rtc->uie_timer_active = 0; } if (rtc->uie_task_active) { spin_unlock_irq(&rtc->irq_lock); flush_work(&rtc->uie_task); spin_lock_irq(&rtc->irq_lock); } rtc->uie_irq_active = 0; } spin_unlock_irq(&rtc->irq_lock); return 0; } static int set_uie(struct rtc_device *rtc) { struct rtc_time tm; int err; err = rtc_read_time(rtc, &tm); if (err) return err; spin_lock_irq(&rtc->irq_lock); if (!rtc->uie_irq_active) { rtc->uie_irq_active = 1; rtc->stop_uie_polling = 0; rtc->oldsecs = tm.tm_sec; rtc->uie_task_active = 1; if (schedule_work(&rtc->uie_task) == 0) rtc->uie_task_active = 0; } rtc->irq_data = 0; spin_unlock_irq(&rtc->irq_lock); return 0; } int rtc_dev_update_irq_enable_emul(struct rtc_device *rtc, unsigned int enabled) { if (enabled) return set_uie(rtc); else return clear_uie(rtc); } EXPORT_SYMBOL(rtc_dev_update_irq_enable_emul); #endif /* CONFIG_RTC_INTF_DEV_UIE_EMUL */ static ssize_t rtc_dev_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { struct rtc_device *rtc = file->private_data; DECLARE_WAITQUEUE(wait, current); unsigned long data; ssize_t ret; if (count != sizeof(unsigned int) && count < sizeof(unsigned long)) return -EINVAL; add_wait_queue(&rtc->irq_queue, &wait); do { __set_current_state(TASK_INTERRUPTIBLE); spin_lock_irq(&rtc->irq_lock); data = rtc->irq_data; rtc->irq_data = 0; spin_unlock_irq(&rtc->irq_lock); if (data != 0) { ret = 0; break; } if (file->f_flags & O_NONBLOCK) { ret = -EAGAIN; break; } if (signal_pending(current)) { ret = -ERESTARTSYS; break; } schedule(); } while (1); set_current_state(TASK_RUNNING); remove_wait_queue(&rtc->irq_queue, &wait); if (ret == 0) { if (sizeof(int) != sizeof(long) && count == sizeof(unsigned int)) ret = put_user(data, (unsigned int __user *)buf) ?: sizeof(unsigned int); else ret = put_user(data, (unsigned long __user *)buf) ?: sizeof(unsigned long); } return ret; } static __poll_t rtc_dev_poll(struct file *file, poll_table *wait) { struct rtc_device *rtc = file->private_data; unsigned long data; poll_wait(file, &rtc->irq_queue, wait); data = rtc->irq_data; return (data != 0) ? (EPOLLIN | EPOLLRDNORM) : 0; } static long rtc_dev_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { int err = 0; struct rtc_device *rtc = file->private_data; const struct rtc_class_ops *ops = rtc->ops; struct rtc_time tm; struct rtc_wkalrm alarm; struct rtc_param param; void __user *uarg = (void __user *)arg; err = mutex_lock_interruptible(&rtc->ops_lock); if (err) return err; /* check that the calling task has appropriate permissions * for certain ioctls. doing this check here is useful * to avoid duplicate code in each driver. */ switch (cmd) { case RTC_EPOCH_SET: case RTC_SET_TIME: case RTC_PARAM_SET: if (!capable(CAP_SYS_TIME)) err = -EACCES; break; case RTC_IRQP_SET: if (arg > rtc->max_user_freq && !capable(CAP_SYS_RESOURCE)) err = -EACCES; break; case RTC_PIE_ON: if (rtc->irq_freq > rtc->max_user_freq && !capable(CAP_SYS_RESOURCE)) err = -EACCES; break; } if (err) goto done; /* * Drivers *SHOULD NOT* provide ioctl implementations * for these requests. Instead, provide methods to * support the following code, so that the RTC's main * features are accessible without using ioctls. * * RTC and alarm times will be in UTC, by preference, * but dual-booting with MS-Windows implies RTCs must * use the local wall clock time. */ switch (cmd) { case RTC_ALM_READ: mutex_unlock(&rtc->ops_lock); err = rtc_read_alarm(rtc, &alarm); if (err < 0) return err; if (copy_to_user(uarg, &alarm.time, sizeof(tm))) err = -EFAULT; return err; case RTC_ALM_SET: mutex_unlock(&rtc->ops_lock); if (copy_from_user(&alarm.time, uarg, sizeof(tm))) return -EFAULT; alarm.enabled = 0; alarm.pending = 0; alarm.time.tm_wday = -1; alarm.time.tm_yday = -1; alarm.time.tm_isdst = -1; /* RTC_ALM_SET alarms may be up to 24 hours in the future. * Rather than expecting every RTC to implement "don't care" * for day/month/year fields, just force the alarm to have * the right values for those fields. * * RTC_WKALM_SET should be used instead. Not only does it * eliminate the need for a separate RTC_AIE_ON call, it * doesn't have the "alarm 23:59:59 in the future" race. * * NOTE: some legacy code may have used invalid fields as * wildcards, exposing hardware "periodic alarm" capabilities. * Not supported here. */ { time64_t now, then; err = rtc_read_time(rtc, &tm); if (err < 0) return err; now = rtc_tm_to_time64(&tm); alarm.time.tm_mday = tm.tm_mday; alarm.time.tm_mon = tm.tm_mon; alarm.time.tm_year = tm.tm_year; err = rtc_valid_tm(&alarm.time); if (err < 0) return err; then = rtc_tm_to_time64(&alarm.time); /* alarm may need to wrap into tomorrow */ if (then < now) { rtc_time64_to_tm(now + 24 * 60 * 60, &tm); alarm.time.tm_mday = tm.tm_mday; alarm.time.tm_mon = tm.tm_mon; alarm.time.tm_year = tm.tm_year; } } return rtc_set_alarm(rtc, &alarm); case RTC_RD_TIME: mutex_unlock(&rtc->ops_lock); err = rtc_read_time(rtc, &tm); if (err < 0) return err; if (copy_to_user(uarg, &tm, sizeof(tm))) err = -EFAULT; return err; case RTC_SET_TIME: mutex_unlock(&rtc->ops_lock); if (copy_from_user(&tm, uarg, sizeof(tm))) return -EFAULT; return rtc_set_time(rtc, &tm); case RTC_PIE_ON: err = rtc_irq_set_state(rtc, 1); break; case RTC_PIE_OFF: err = rtc_irq_set_state(rtc, 0); break; case RTC_AIE_ON: mutex_unlock(&rtc->ops_lock); return rtc_alarm_irq_enable(rtc, 1); case RTC_AIE_OFF: mutex_unlock(&rtc->ops_lock); return rtc_alarm_irq_enable(rtc, 0); case RTC_UIE_ON: mutex_unlock(&rtc->ops_lock); return rtc_update_irq_enable(rtc, 1); case RTC_UIE_OFF: mutex_unlock(&rtc->ops_lock); return rtc_update_irq_enable(rtc, 0); case RTC_IRQP_SET: err = rtc_irq_set_freq(rtc, arg); break; case RTC_IRQP_READ: err = put_user(rtc->irq_freq, (unsigned long __user *)uarg); break; case RTC_WKALM_SET: mutex_unlock(&rtc->ops_lock); if (copy_from_user(&alarm, uarg, sizeof(alarm))) return -EFAULT; return rtc_set_alarm(rtc, &alarm); case RTC_WKALM_RD: mutex_unlock(&rtc->ops_lock); err = rtc_read_alarm(rtc, &alarm); if (err < 0) return err; if (copy_to_user(uarg, &alarm, sizeof(alarm))) err = -EFAULT; return err; case RTC_PARAM_GET: if (copy_from_user(¶m, uarg, sizeof(param))) { mutex_unlock(&rtc->ops_lock); return -EFAULT; } switch(param.param) { case RTC_PARAM_FEATURES: if (param.index != 0) err = -EINVAL; param.uvalue = rtc->features[0]; break; case RTC_PARAM_CORRECTION: { long offset; mutex_unlock(&rtc->ops_lock); if (param.index != 0) return -EINVAL; err = rtc_read_offset(rtc, &offset); mutex_lock(&rtc->ops_lock); if (err == 0) param.svalue = offset; break; } default: if (rtc->ops->param_get) err = rtc->ops->param_get(rtc->dev.parent, ¶m); else err = -EINVAL; } if (!err) if (copy_to_user(uarg, ¶m, sizeof(param))) err = -EFAULT; break; case RTC_PARAM_SET: if (copy_from_user(¶m, uarg, sizeof(param))) { mutex_unlock(&rtc->ops_lock); return -EFAULT; } switch(param.param) { case RTC_PARAM_FEATURES: err = -EINVAL; break; case RTC_PARAM_CORRECTION: mutex_unlock(&rtc->ops_lock); if (param.index != 0) return -EINVAL; return rtc_set_offset(rtc, param.svalue); default: if (rtc->ops->param_set) err = rtc->ops->param_set(rtc->dev.parent, ¶m); else err = -EINVAL; } break; default: /* Finally try the driver's ioctl interface */ if (ops->ioctl) { err = ops->ioctl(rtc->dev.parent, cmd, arg); if (err == -ENOIOCTLCMD) err = -ENOTTY; } else { err = -ENOTTY; } break; } done: mutex_unlock(&rtc->ops_lock); return err; } #ifdef CONFIG_COMPAT #define RTC_IRQP_SET32 _IOW('p', 0x0c, __u32) #define RTC_IRQP_READ32 _IOR('p', 0x0b, __u32) #define RTC_EPOCH_SET32 _IOW('p', 0x0e, __u32) static long rtc_dev_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct rtc_device *rtc = file->private_data; void __user *uarg = compat_ptr(arg); switch (cmd) { case RTC_IRQP_READ32: return put_user(rtc->irq_freq, (__u32 __user *)uarg); case RTC_IRQP_SET32: /* arg is a plain integer, not pointer */ return rtc_dev_ioctl(file, RTC_IRQP_SET, arg); case RTC_EPOCH_SET32: /* arg is a plain integer, not pointer */ return rtc_dev_ioctl(file, RTC_EPOCH_SET, arg); } return rtc_dev_ioctl(file, cmd, (unsigned long)uarg); } #endif static int rtc_dev_fasync(int fd, struct file *file, int on) { struct rtc_device *rtc = file->private_data; return fasync_helper(fd, file, on, &rtc->async_queue); } static int rtc_dev_release(struct inode *inode, struct file *file) { struct rtc_device *rtc = file->private_data; /* We shut down the repeating IRQs that userspace enabled, * since nothing is listening to them. * - Update (UIE) ... currently only managed through ioctls * - Periodic (PIE) ... also used through rtc_*() interface calls * * Leave the alarm alone; it may be set to trigger a system wakeup * later, or be used by kernel code, and is a one-shot event anyway. */ /* Keep ioctl until all drivers are converted */ rtc_dev_ioctl(file, RTC_UIE_OFF, 0); rtc_update_irq_enable(rtc, 0); rtc_irq_set_state(rtc, 0); clear_bit_unlock(RTC_DEV_BUSY, &rtc->flags); return 0; } static const struct file_operations rtc_dev_fops = { .owner = THIS_MODULE, .read = rtc_dev_read, .poll = rtc_dev_poll, .unlocked_ioctl = rtc_dev_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = rtc_dev_compat_ioctl, #endif .open = rtc_dev_open, .release = rtc_dev_release, .fasync = rtc_dev_fasync, }; /* insertion/removal hooks */ void rtc_dev_prepare(struct rtc_device *rtc) { if (!rtc_devt) return; if (rtc->id >= RTC_DEV_MAX) { dev_dbg(&rtc->dev, "too many RTC devices\n"); return; } rtc->dev.devt = MKDEV(MAJOR(rtc_devt), rtc->id); #ifdef CONFIG_RTC_INTF_DEV_UIE_EMUL INIT_WORK(&rtc->uie_task, rtc_uie_task); timer_setup(&rtc->uie_timer, rtc_uie_timer, 0); #endif cdev_init(&rtc->char_dev, &rtc_dev_fops); rtc->char_dev.owner = rtc->owner; } void __init rtc_dev_init(void) { int err; err = alloc_chrdev_region(&rtc_devt, 0, RTC_DEV_MAX, "rtc"); if (err < 0) pr_err("failed to allocate char dev region\n"); } |
| 7 7 7 7 7 7 7 7 7 7 7 7 4 10 9 5 5 4 8 8 8 8 7 3 7 7 7 3 3 1 3 3 3 3 3 5 5 5 4 4 1 1 4 13 13 13 4 3 4 4 2 2 2 2 2 2 2 6 5 6 5 4 1 1 4 4 4 4 4 3 5 3 3 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 4 24 31 31 18 13 12 31 3 3 3 2 1 1 2 2 1 3 3 2 4 4 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 6 5 5 5 7 1 6 6 5 1 5 4 4 2 7 12 7 1 7 7 6 12 10 1 1 1 1 1 1 2 1 1 2 2 2 2 2 2 2 1 1 1 1 3 1 3 6 1 5 5 1 4 4 2 2 4 4 4 6 4 4 4 2 1 3 3 2 3 1 3 3 3 1 1 6 6 6 1 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 51 53 1 3 4 6 13 1 2 4 2 2 2 2 2 53 3 19 17 18 18 1 1 1 1 43 1 1 1 43 15 13 10 11 12 12 12 13 13 14 14 24 24 24 23 24 4 2 2 2 1 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Routines for driver control interface * Copyright (c) by Jaroslav Kysela <perex@perex.cz> */ #include <linux/threads.h> #include <linux/interrupt.h> #include <linux/module.h> #include <linux/moduleparam.h> #include <linux/slab.h> #include <linux/vmalloc.h> #include <linux/time.h> #include <linux/mm.h> #include <linux/math64.h> #include <linux/sched/signal.h> #include <sound/core.h> #include <sound/minors.h> #include <sound/info.h> #include <sound/control.h> // Max allocation size for user controls. static int max_user_ctl_alloc_size = 8 * 1024 * 1024; module_param_named(max_user_ctl_alloc_size, max_user_ctl_alloc_size, int, 0444); MODULE_PARM_DESC(max_user_ctl_alloc_size, "Max allocation size for user controls"); #define MAX_CONTROL_COUNT 1028 struct snd_kctl_ioctl { struct list_head list; /* list of all ioctls */ snd_kctl_ioctl_func_t fioctl; }; static DECLARE_RWSEM(snd_ioctl_rwsem); static DECLARE_RWSEM(snd_ctl_layer_rwsem); static LIST_HEAD(snd_control_ioctls); #ifdef CONFIG_COMPAT static LIST_HEAD(snd_control_compat_ioctls); #endif static struct snd_ctl_layer_ops *snd_ctl_layer; static int snd_ctl_remove_locked(struct snd_card *card, struct snd_kcontrol *kcontrol); static int snd_ctl_open(struct inode *inode, struct file *file) { struct snd_card *card; struct snd_ctl_file *ctl; int i, err; err = stream_open(inode, file); if (err < 0) return err; card = snd_lookup_minor_data(iminor(inode), SNDRV_DEVICE_TYPE_CONTROL); if (!card) { err = -ENODEV; goto __error1; } err = snd_card_file_add(card, file); if (err < 0) { err = -ENODEV; goto __error1; } if (!try_module_get(card->module)) { err = -EFAULT; goto __error2; } ctl = kzalloc(sizeof(*ctl), GFP_KERNEL); if (ctl == NULL) { err = -ENOMEM; goto __error; } INIT_LIST_HEAD(&ctl->events); init_waitqueue_head(&ctl->change_sleep); spin_lock_init(&ctl->read_lock); ctl->card = card; for (i = 0; i < SND_CTL_SUBDEV_ITEMS; i++) ctl->preferred_subdevice[i] = -1; ctl->pid = get_pid(task_pid(current)); file->private_data = ctl; scoped_guard(write_lock_irqsave, &card->controls_rwlock) list_add_tail(&ctl->list, &card->ctl_files); snd_card_unref(card); return 0; __error: module_put(card->module); __error2: snd_card_file_remove(card, file); __error1: if (card) snd_card_unref(card); return err; } static void snd_ctl_empty_read_queue(struct snd_ctl_file * ctl) { struct snd_kctl_event *cread; guard(spinlock_irqsave)(&ctl->read_lock); while (!list_empty(&ctl->events)) { cread = snd_kctl_event(ctl->events.next); list_del(&cread->list); kfree(cread); } } static int snd_ctl_release(struct inode *inode, struct file *file) { struct snd_card *card; struct snd_ctl_file *ctl; struct snd_kcontrol *control; unsigned int idx; ctl = file->private_data; file->private_data = NULL; card = ctl->card; scoped_guard(write_lock_irqsave, &card->controls_rwlock) list_del(&ctl->list); scoped_guard(rwsem_write, &card->controls_rwsem) { list_for_each_entry(control, &card->controls, list) for (idx = 0; idx < control->count; idx++) if (control->vd[idx].owner == ctl) control->vd[idx].owner = NULL; } snd_fasync_free(ctl->fasync); snd_ctl_empty_read_queue(ctl); put_pid(ctl->pid); kfree(ctl); module_put(card->module); snd_card_file_remove(card, file); return 0; } /** * snd_ctl_notify - Send notification to user-space for a control change * @card: the card to send notification * @mask: the event mask, SNDRV_CTL_EVENT_* * @id: the ctl element id to send notification * * This function adds an event record with the given id and mask, appends * to the list and wakes up the user-space for notification. This can be * called in the atomic context. */ void snd_ctl_notify(struct snd_card *card, unsigned int mask, struct snd_ctl_elem_id *id) { struct snd_ctl_file *ctl; struct snd_kctl_event *ev; if (snd_BUG_ON(!card || !id)) return; if (card->shutdown) return; guard(read_lock_irqsave)(&card->controls_rwlock); #if IS_ENABLED(CONFIG_SND_MIXER_OSS) card->mixer_oss_change_count++; #endif list_for_each_entry(ctl, &card->ctl_files, list) { if (!ctl->subscribed) continue; scoped_guard(spinlock, &ctl->read_lock) { list_for_each_entry(ev, &ctl->events, list) { if (ev->id.numid == id->numid) { ev->mask |= mask; goto _found; } } ev = kzalloc(sizeof(*ev), GFP_ATOMIC); if (ev) { ev->id = *id; ev->mask = mask; list_add_tail(&ev->list, &ctl->events); } else { dev_err(card->dev, "No memory available to allocate event\n"); } _found: wake_up(&ctl->change_sleep); } snd_kill_fasync(ctl->fasync, SIGIO, POLL_IN); } } EXPORT_SYMBOL(snd_ctl_notify); /** * snd_ctl_notify_one - Send notification to user-space for a control change * @card: the card to send notification * @mask: the event mask, SNDRV_CTL_EVENT_* * @kctl: the pointer with the control instance * @ioff: the additional offset to the control index * * This function calls snd_ctl_notify() and does additional jobs * like LED state changes. */ void snd_ctl_notify_one(struct snd_card *card, unsigned int mask, struct snd_kcontrol *kctl, unsigned int ioff) { struct snd_ctl_elem_id id = kctl->id; struct snd_ctl_layer_ops *lops; id.index += ioff; id.numid += ioff; snd_ctl_notify(card, mask, &id); guard(rwsem_read)(&snd_ctl_layer_rwsem); for (lops = snd_ctl_layer; lops; lops = lops->next) lops->lnotify(card, mask, kctl, ioff); } EXPORT_SYMBOL(snd_ctl_notify_one); /** * snd_ctl_new - create a new control instance with some elements * @kctl: the pointer to store new control instance * @count: the number of elements in this control * @access: the default access flags for elements in this control * @file: given when locking these elements * * Allocates a memory object for a new control instance. The instance has * elements as many as the given number (@count). Each element has given * access permissions (@access). Each element is locked when @file is given. * * Return: 0 on success, error code on failure */ static int snd_ctl_new(struct snd_kcontrol **kctl, unsigned int count, unsigned int access, struct snd_ctl_file *file) { unsigned int idx; if (count == 0 || count > MAX_CONTROL_COUNT) return -EINVAL; *kctl = kzalloc(struct_size(*kctl, vd, count), GFP_KERNEL); if (!*kctl) return -ENOMEM; (*kctl)->count = count; for (idx = 0; idx < count; idx++) { (*kctl)->vd[idx].access = access; (*kctl)->vd[idx].owner = file; } return 0; } /** * snd_ctl_new1 - create a control instance from the template * @ncontrol: the initialization record * @private_data: the private data to set * * Allocates a new struct snd_kcontrol instance and initialize from the given * template. When the access field of ncontrol is 0, it's assumed as * READWRITE access. When the count field is 0, it's assumes as one. * * Return: The pointer of the newly generated instance, or %NULL on failure. */ struct snd_kcontrol *snd_ctl_new1(const struct snd_kcontrol_new *ncontrol, void *private_data) { struct snd_kcontrol *kctl; unsigned int count; unsigned int access; int err; if (snd_BUG_ON(!ncontrol || !ncontrol->info)) return NULL; count = ncontrol->count; if (count == 0) count = 1; access = ncontrol->access; if (access == 0) access = SNDRV_CTL_ELEM_ACCESS_READWRITE; access &= (SNDRV_CTL_ELEM_ACCESS_READWRITE | SNDRV_CTL_ELEM_ACCESS_VOLATILE | SNDRV_CTL_ELEM_ACCESS_INACTIVE | SNDRV_CTL_ELEM_ACCESS_TLV_READWRITE | SNDRV_CTL_ELEM_ACCESS_TLV_COMMAND | SNDRV_CTL_ELEM_ACCESS_TLV_CALLBACK | SNDRV_CTL_ELEM_ACCESS_LED_MASK | SNDRV_CTL_ELEM_ACCESS_SKIP_CHECK); err = snd_ctl_new(&kctl, count, access, NULL); if (err < 0) return NULL; /* The 'numid' member is decided when calling snd_ctl_add(). */ kctl->id.iface = ncontrol->iface; kctl->id.device = ncontrol->device; kctl->id.subdevice = ncontrol->subdevice; if (ncontrol->name) { strscpy(kctl->id.name, ncontrol->name, sizeof(kctl->id.name)); if (strcmp(ncontrol->name, kctl->id.name) != 0) pr_warn("ALSA: Control name '%s' truncated to '%s'\n", ncontrol->name, kctl->id.name); } kctl->id.index = ncontrol->index; kctl->info = ncontrol->info; kctl->get = ncontrol->get; kctl->put = ncontrol->put; kctl->tlv.p = ncontrol->tlv.p; kctl->private_value = ncontrol->private_value; kctl->private_data = private_data; return kctl; } EXPORT_SYMBOL(snd_ctl_new1); /** * snd_ctl_free_one - release the control instance * @kcontrol: the control instance * * Releases the control instance created via snd_ctl_new() * or snd_ctl_new1(). * Don't call this after the control was added to the card. */ void snd_ctl_free_one(struct snd_kcontrol *kcontrol) { if (kcontrol) { if (kcontrol->private_free) kcontrol->private_free(kcontrol); kfree(kcontrol); } } EXPORT_SYMBOL(snd_ctl_free_one); static bool snd_ctl_remove_numid_conflict(struct snd_card *card, unsigned int count) { struct snd_kcontrol *kctl; /* Make sure that the ids assigned to the control do not wrap around */ if (card->last_numid >= UINT_MAX - count) card->last_numid = 0; list_for_each_entry(kctl, &card->controls, list) { if (kctl->id.numid < card->last_numid + 1 + count && kctl->id.numid + kctl->count > card->last_numid + 1) { card->last_numid = kctl->id.numid + kctl->count - 1; return true; } } return false; } static int snd_ctl_find_hole(struct snd_card *card, unsigned int count) { unsigned int iter = 100000; while (snd_ctl_remove_numid_conflict(card, count)) { if (--iter == 0) { /* this situation is very unlikely */ dev_err(card->dev, "unable to allocate new control numid\n"); return -ENOMEM; } } return 0; } /* check whether the given id is contained in the given kctl */ static bool elem_id_matches(const struct snd_kcontrol *kctl, const struct snd_ctl_elem_id *id) { return kctl->id.iface == id->iface && kctl->id.device == id->device && kctl->id.subdevice == id->subdevice && !strncmp(kctl->id.name, id->name, sizeof(kctl->id.name)) && kctl->id.index <= id->index && kctl->id.index + kctl->count > id->index; } #ifdef CONFIG_SND_CTL_FAST_LOOKUP /* Compute a hash key for the corresponding ctl id * It's for the name lookup, hence the numid is excluded. * The hash key is bound in LONG_MAX to be used for Xarray key. */ #define MULTIPLIER 37 static unsigned long get_ctl_id_hash(const struct snd_ctl_elem_id *id) { int i; unsigned long h; h = id->iface; h = MULTIPLIER * h + id->device; h = MULTIPLIER * h + id->subdevice; for (i = 0; i < SNDRV_CTL_ELEM_ID_NAME_MAXLEN && id->name[i]; i++) h = MULTIPLIER * h + id->name[i]; h = MULTIPLIER * h + id->index; h &= LONG_MAX; return h; } /* add hash entries to numid and ctl xarray tables */ static void add_hash_entries(struct snd_card *card, struct snd_kcontrol *kcontrol) { struct snd_ctl_elem_id id = kcontrol->id; int i; xa_store_range(&card->ctl_numids, kcontrol->id.numid, kcontrol->id.numid + kcontrol->count - 1, kcontrol, GFP_KERNEL); for (i = 0; i < kcontrol->count; i++) { id.index = kcontrol->id.index + i; if (xa_insert(&card->ctl_hash, get_ctl_id_hash(&id), kcontrol, GFP_KERNEL)) { /* skip hash for this entry, noting we had collision */ card->ctl_hash_collision = true; dev_dbg(card->dev, "ctl_hash collision %d:%s:%d\n", id.iface, id.name, id.index); } } } /* remove hash entries that have been added */ static void remove_hash_entries(struct snd_card *card, struct snd_kcontrol *kcontrol) { struct snd_ctl_elem_id id = kcontrol->id; struct snd_kcontrol *matched; unsigned long h; int i; for (i = 0; i < kcontrol->count; i++) { xa_erase(&card->ctl_numids, id.numid); h = get_ctl_id_hash(&id); matched = xa_load(&card->ctl_hash, h); if (matched && (matched == kcontrol || elem_id_matches(matched, &id))) xa_erase(&card->ctl_hash, h); id.index++; id.numid++; } } #else /* CONFIG_SND_CTL_FAST_LOOKUP */ static inline void add_hash_entries(struct snd_card *card, struct snd_kcontrol *kcontrol) { } static inline void remove_hash_entries(struct snd_card *card, struct snd_kcontrol *kcontrol) { } #endif /* CONFIG_SND_CTL_FAST_LOOKUP */ enum snd_ctl_add_mode { CTL_ADD_EXCLUSIVE, CTL_REPLACE, CTL_ADD_ON_REPLACE, }; /* add/replace a new kcontrol object; call with card->controls_rwsem locked */ static int __snd_ctl_add_replace(struct snd_card *card, struct snd_kcontrol *kcontrol, enum snd_ctl_add_mode mode) { struct snd_ctl_elem_id id; unsigned int idx; struct snd_kcontrol *old; int err; lockdep_assert_held_write(&card->controls_rwsem); id = kcontrol->id; if (id.index > UINT_MAX - kcontrol->count) return -EINVAL; old = snd_ctl_find_id(card, &id); if (!old) { if (mode == CTL_REPLACE) return -EINVAL; } else { if (mode == CTL_ADD_EXCLUSIVE) { dev_err(card->dev, "control %i:%i:%i:%s:%i is already present\n", id.iface, id.device, id.subdevice, id.name, id.index); return -EBUSY; } err = snd_ctl_remove_locked(card, old); if (err < 0) return err; } if (snd_ctl_find_hole(card, kcontrol->count) < 0) return -ENOMEM; scoped_guard(write_lock_irq, &card->controls_rwlock) { list_add_tail(&kcontrol->list, &card->controls); card->controls_count += kcontrol->count; kcontrol->id.numid = card->last_numid + 1; card->last_numid += kcontrol->count; } add_hash_entries(card, kcontrol); for (idx = 0; idx < kcontrol->count; idx++) snd_ctl_notify_one(card, SNDRV_CTL_EVENT_MASK_ADD, kcontrol, idx); return 0; } static int snd_ctl_add_replace(struct snd_card *card, struct snd_kcontrol *kcontrol, enum snd_ctl_add_mode mode) { int err = -EINVAL; if (! kcontrol) return err; if (snd_BUG_ON(!card || !kcontrol->info)) goto error; scoped_guard(rwsem_write, &card->controls_rwsem) err = __snd_ctl_add_replace(card, kcontrol, mode); if (err < 0) goto error; return 0; error: snd_ctl_free_one(kcontrol); return err; } /** * snd_ctl_add - add the control instance to the card * @card: the card instance * @kcontrol: the control instance to add * * Adds the control instance created via snd_ctl_new() or * snd_ctl_new1() to the given card. Assigns also an unique * numid used for fast search. * * It frees automatically the control which cannot be added. * * Return: Zero if successful, or a negative error code on failure. * */ int snd_ctl_add(struct snd_card *card, struct snd_kcontrol *kcontrol) { return snd_ctl_add_replace(card, kcontrol, CTL_ADD_EXCLUSIVE); } EXPORT_SYMBOL(snd_ctl_add); /** * snd_ctl_replace - replace the control instance of the card * @card: the card instance * @kcontrol: the control instance to replace * @add_on_replace: add the control if not already added * * Replaces the given control. If the given control does not exist * and the add_on_replace flag is set, the control is added. If the * control exists, it is destroyed first. * * It frees automatically the control which cannot be added or replaced. * * Return: Zero if successful, or a negative error code on failure. */ int snd_ctl_replace(struct snd_card *card, struct snd_kcontrol *kcontrol, bool add_on_replace) { return snd_ctl_add_replace(card, kcontrol, add_on_replace ? CTL_ADD_ON_REPLACE : CTL_REPLACE); } EXPORT_SYMBOL(snd_ctl_replace); static int __snd_ctl_remove(struct snd_card *card, struct snd_kcontrol *kcontrol, bool remove_hash) { unsigned int idx; lockdep_assert_held_write(&card->controls_rwsem); if (snd_BUG_ON(!card || !kcontrol)) return -EINVAL; if (remove_hash) remove_hash_entries(card, kcontrol); scoped_guard(write_lock_irq, &card->controls_rwlock) { list_del(&kcontrol->list); card->controls_count -= kcontrol->count; } for (idx = 0; idx < kcontrol->count; idx++) snd_ctl_notify_one(card, SNDRV_CTL_EVENT_MASK_REMOVE, kcontrol, idx); snd_ctl_free_one(kcontrol); return 0; } static inline int snd_ctl_remove_locked(struct snd_card *card, struct snd_kcontrol *kcontrol) { return __snd_ctl_remove(card, kcontrol, true); } /** * snd_ctl_remove - remove the control from the card and release it * @card: the card instance * @kcontrol: the control instance to remove * * Removes the control from the card and then releases the instance. * You don't need to call snd_ctl_free_one(). * Passing NULL to @kcontrol argument is allowed as noop. * * Return: 0 if successful, or a negative error code on failure. * * Note that this function takes card->controls_rwsem lock internally. */ int snd_ctl_remove(struct snd_card *card, struct snd_kcontrol *kcontrol) { if (!kcontrol) return 0; guard(rwsem_write)(&card->controls_rwsem); return snd_ctl_remove_locked(card, kcontrol); } EXPORT_SYMBOL(snd_ctl_remove); /** * snd_ctl_remove_id - remove the control of the given id and release it * @card: the card instance * @id: the control id to remove * * Finds the control instance with the given id, removes it from the * card list and releases it. * * Return: 0 if successful, or a negative error code on failure. */ int snd_ctl_remove_id(struct snd_card *card, struct snd_ctl_elem_id *id) { struct snd_kcontrol *kctl; guard(rwsem_write)(&card->controls_rwsem); kctl = snd_ctl_find_id(card, id); if (kctl == NULL) return -ENOENT; return snd_ctl_remove_locked(card, kctl); } EXPORT_SYMBOL(snd_ctl_remove_id); /** * snd_ctl_remove_user_ctl - remove and release the unlocked user control * @file: active control handle * @id: the control id to remove * * Finds the control instance with the given id, removes it from the * card list and releases it. * * Return: 0 if successful, or a negative error code on failure. */ static int snd_ctl_remove_user_ctl(struct snd_ctl_file * file, struct snd_ctl_elem_id *id) { struct snd_card *card = file->card; struct snd_kcontrol *kctl; int idx; guard(rwsem_write)(&card->controls_rwsem); kctl = snd_ctl_find_id(card, id); if (kctl == NULL) return -ENOENT; if (!(kctl->vd[0].access & SNDRV_CTL_ELEM_ACCESS_USER)) return -EINVAL; for (idx = 0; idx < kctl->count; idx++) if (kctl->vd[idx].owner != NULL && kctl->vd[idx].owner != file) return -EBUSY; return snd_ctl_remove_locked(card, kctl); } /** * snd_ctl_activate_id - activate/inactivate the control of the given id * @card: the card instance * @id: the control id to activate/inactivate * @active: non-zero to activate * * Finds the control instance with the given id, and activate or * inactivate the control together with notification, if changed. * The given ID data is filled with full information. * * Return: 0 if unchanged, 1 if changed, or a negative error code on failure. */ int snd_ctl_activate_id(struct snd_card *card, struct snd_ctl_elem_id *id, int active) { struct snd_kcontrol *kctl; struct snd_kcontrol_volatile *vd; unsigned int index_offset; int ret; down_write(&card->controls_rwsem); kctl = snd_ctl_find_id(card, id); if (kctl == NULL) { ret = -ENOENT; goto unlock; } index_offset = snd_ctl_get_ioff(kctl, id); vd = &kctl->vd[index_offset]; ret = 0; if (active) { if (!(vd->access & SNDRV_CTL_ELEM_ACCESS_INACTIVE)) goto unlock; vd->access &= ~SNDRV_CTL_ELEM_ACCESS_INACTIVE; } else { if (vd->access & SNDRV_CTL_ELEM_ACCESS_INACTIVE) goto unlock; vd->access |= SNDRV_CTL_ELEM_ACCESS_INACTIVE; } snd_ctl_build_ioff(id, kctl, index_offset); downgrade_write(&card->controls_rwsem); snd_ctl_notify_one(card, SNDRV_CTL_EVENT_MASK_INFO, kctl, index_offset); up_read(&card->controls_rwsem); return 1; unlock: up_write(&card->controls_rwsem); return ret; } EXPORT_SYMBOL_GPL(snd_ctl_activate_id); /** * snd_ctl_rename_id - replace the id of a control on the card * @card: the card instance * @src_id: the old id * @dst_id: the new id * * Finds the control with the old id from the card, and replaces the * id with the new one. * * The function tries to keep the already assigned numid while replacing * the rest. * * Note that this function should be used only in the card initialization * phase. Calling after the card instantiation may cause issues with * user-space expecting persistent numids. * * Return: Zero if successful, or a negative error code on failure. */ int snd_ctl_rename_id(struct snd_card *card, struct snd_ctl_elem_id *src_id, struct snd_ctl_elem_id *dst_id) { struct snd_kcontrol *kctl; int saved_numid; guard(rwsem_write)(&card->controls_rwsem); kctl = snd_ctl_find_id(card, src_id); if (kctl == NULL) return -ENOENT; saved_numid = kctl->id.numid; remove_hash_entries(card, kctl); kctl->id = *dst_id; kctl->id.numid = saved_numid; add_hash_entries(card, kctl); return 0; } EXPORT_SYMBOL(snd_ctl_rename_id); /** * snd_ctl_rename - rename the control on the card * @card: the card instance * @kctl: the control to rename * @name: the new name * * Renames the specified control on the card to the new name. * * Note that this function takes card->controls_rwsem lock internally. */ void snd_ctl_rename(struct snd_card *card, struct snd_kcontrol *kctl, const char *name) { guard(rwsem_write)(&card->controls_rwsem); remove_hash_entries(card, kctl); if (strscpy(kctl->id.name, name, sizeof(kctl->id.name)) < 0) pr_warn("ALSA: Renamed control new name '%s' truncated to '%s'\n", name, kctl->id.name); add_hash_entries(card, kctl); } EXPORT_SYMBOL(snd_ctl_rename); #ifndef CONFIG_SND_CTL_FAST_LOOKUP static struct snd_kcontrol * snd_ctl_find_numid_slow(struct snd_card *card, unsigned int numid) { struct snd_kcontrol *kctl; guard(read_lock_irqsave)(&card->controls_rwlock); list_for_each_entry(kctl, &card->controls, list) { if (kctl->id.numid <= numid && kctl->id.numid + kctl->count > numid) return kctl; } return NULL; } #endif /* !CONFIG_SND_CTL_FAST_LOOKUP */ /** * snd_ctl_find_numid - find the control instance with the given number-id * @card: the card instance * @numid: the number-id to search * * Finds the control instance with the given number-id from the card. * * Return: The pointer of the instance if found, or %NULL if not. * * Note that this function takes card->controls_rwlock lock internally. */ struct snd_kcontrol *snd_ctl_find_numid(struct snd_card *card, unsigned int numid) { if (snd_BUG_ON(!card || !numid)) return NULL; #ifdef CONFIG_SND_CTL_FAST_LOOKUP return xa_load(&card->ctl_numids, numid); #else return snd_ctl_find_numid_slow(card, numid); #endif } EXPORT_SYMBOL(snd_ctl_find_numid); /** * snd_ctl_find_id - find the control instance with the given id * @card: the card instance * @id: the id to search * * Finds the control instance with the given id from the card. * * Return: The pointer of the instance if found, or %NULL if not. * * Note that this function takes card->controls_rwlock lock internally. */ struct snd_kcontrol *snd_ctl_find_id(struct snd_card *card, const struct snd_ctl_elem_id *id) { struct snd_kcontrol *kctl; if (snd_BUG_ON(!card || !id)) return NULL; if (id->numid != 0) return snd_ctl_find_numid(card, id->numid); #ifdef CONFIG_SND_CTL_FAST_LOOKUP kctl = xa_load(&card->ctl_hash, get_ctl_id_hash(id)); if (kctl && elem_id_matches(kctl, id)) return kctl; if (!card->ctl_hash_collision) return NULL; /* we can rely on only hash table */ #endif /* no matching in hash table - try all as the last resort */ guard(read_lock_irqsave)(&card->controls_rwlock); list_for_each_entry(kctl, &card->controls, list) if (elem_id_matches(kctl, id)) return kctl; return NULL; } EXPORT_SYMBOL(snd_ctl_find_id); static int snd_ctl_card_info(struct snd_card *card, struct snd_ctl_file * ctl, unsigned int cmd, void __user *arg) { struct snd_ctl_card_info *info __free(kfree) = NULL; info = kzalloc(sizeof(*info), GFP_KERNEL); if (! info) return -ENOMEM; scoped_guard(rwsem_read, &snd_ioctl_rwsem) { info->card = card->number; strscpy(info->id, card->id, sizeof(info->id)); strscpy(info->driver, card->driver, sizeof(info->driver)); strscpy(info->name, card->shortname, sizeof(info->name)); strscpy(info->longname, card->longname, sizeof(info->longname)); strscpy(info->mixername, card->mixername, sizeof(info->mixername)); strscpy(info->components, card->components, sizeof(info->components)); } if (copy_to_user(arg, info, sizeof(struct snd_ctl_card_info))) return -EFAULT; return 0; } static int snd_ctl_elem_list(struct snd_card *card, struct snd_ctl_elem_list *list) { struct snd_kcontrol *kctl; struct snd_ctl_elem_id id; unsigned int offset, space, jidx; offset = list->offset; space = list->space; guard(rwsem_read)(&card->controls_rwsem); list->count = card->controls_count; list->used = 0; if (!space) return 0; list_for_each_entry(kctl, &card->controls, list) { if (offset >= kctl->count) { offset -= kctl->count; continue; } for (jidx = offset; jidx < kctl->count; jidx++) { snd_ctl_build_ioff(&id, kctl, jidx); if (copy_to_user(list->pids + list->used, &id, sizeof(id))) return -EFAULT; list->used++; if (!--space) return 0; } offset = 0; } return 0; } static int snd_ctl_elem_list_user(struct snd_card *card, struct snd_ctl_elem_list __user *_list) { struct snd_ctl_elem_list list; int err; if (copy_from_user(&list, _list, sizeof(list))) return -EFAULT; err = snd_ctl_elem_list(card, &list); if (err) return err; if (copy_to_user(_list, &list, sizeof(list))) return -EFAULT; return 0; } /* Check whether the given kctl info is valid */ static int snd_ctl_check_elem_info(struct snd_card *card, const struct snd_ctl_elem_info *info) { static const unsigned int max_value_counts[] = { [SNDRV_CTL_ELEM_TYPE_BOOLEAN] = 128, [SNDRV_CTL_ELEM_TYPE_INTEGER] = 128, [SNDRV_CTL_ELEM_TYPE_ENUMERATED] = 128, [SNDRV_CTL_ELEM_TYPE_BYTES] = 512, [SNDRV_CTL_ELEM_TYPE_IEC958] = 1, [SNDRV_CTL_ELEM_TYPE_INTEGER64] = 64, }; if (info->type < SNDRV_CTL_ELEM_TYPE_BOOLEAN || info->type > SNDRV_CTL_ELEM_TYPE_INTEGER64) { if (card) dev_err(card->dev, "control %i:%i:%i:%s:%i: invalid type %d\n", info->id.iface, info->id.device, info->id.subdevice, info->id.name, info->id.index, info->type); return -EINVAL; } if (info->type == SNDRV_CTL_ELEM_TYPE_ENUMERATED && info->value.enumerated.items == 0) { if (card) dev_err(card->dev, "control %i:%i:%i:%s:%i: zero enum items\n", info->id.iface, info->id.device, info->id.subdevice, info->id.name, info->id.index); return -EINVAL; } if (info->count > max_value_counts[info->type]) { if (card) dev_err(card->dev, "control %i:%i:%i:%s:%i: invalid count %d\n", info->id.iface, info->id.device, info->id.subdevice, info->id.name, info->id.index, info->count); return -EINVAL; } return 0; } /* The capacity of struct snd_ctl_elem_value.value.*/ static const unsigned int value_sizes[] = { [SNDRV_CTL_ELEM_TYPE_BOOLEAN] = sizeof(long), [SNDRV_CTL_ELEM_TYPE_INTEGER] = sizeof(long), [SNDRV_CTL_ELEM_TYPE_ENUMERATED] = sizeof(unsigned int), [SNDRV_CTL_ELEM_TYPE_BYTES] = sizeof(unsigned char), [SNDRV_CTL_ELEM_TYPE_IEC958] = sizeof(struct snd_aes_iec958), [SNDRV_CTL_ELEM_TYPE_INTEGER64] = sizeof(long long), }; /* fill the remaining snd_ctl_elem_value data with the given pattern */ static void fill_remaining_elem_value(struct snd_ctl_elem_value *control, struct snd_ctl_elem_info *info, u32 pattern) { size_t offset = value_sizes[info->type] * info->count; offset = DIV_ROUND_UP(offset, sizeof(u32)); memset32((u32 *)control->value.bytes.data + offset, pattern, sizeof(control->value) / sizeof(u32) - offset); } /* check whether the given integer ctl value is valid */ static int sanity_check_int_value(struct snd_card *card, const struct snd_ctl_elem_value *control, const struct snd_ctl_elem_info *info, int i, bool print_error) { long long lval, lmin, lmax, lstep; u64 rem; switch (info->type) { default: case SNDRV_CTL_ELEM_TYPE_BOOLEAN: lval = control->value.integer.value[i]; lmin = 0; lmax = 1; lstep = 0; break; case SNDRV_CTL_ELEM_TYPE_INTEGER: lval = control->value.integer.value[i]; lmin = info->value.integer.min; lmax = info->value.integer.max; lstep = info->value.integer.step; break; case SNDRV_CTL_ELEM_TYPE_INTEGER64: lval = control->value.integer64.value[i]; lmin = info->value.integer64.min; lmax = info->value.integer64.max; lstep = info->value.integer64.step; break; case SNDRV_CTL_ELEM_TYPE_ENUMERATED: lval = control->value.enumerated.item[i]; lmin = 0; lmax = info->value.enumerated.items - 1; lstep = 0; break; } if (lval < lmin || lval > lmax) { if (print_error) dev_err(card->dev, "control %i:%i:%i:%s:%i: value out of range %lld (%lld/%lld) at count %i\n", control->id.iface, control->id.device, control->id.subdevice, control->id.name, control->id.index, lval, lmin, lmax, i); return -EINVAL; } if (lstep) { div64_u64_rem(lval, lstep, &rem); if (rem) { if (print_error) dev_err(card->dev, "control %i:%i:%i:%s:%i: unaligned value %lld (step %lld) at count %i\n", control->id.iface, control->id.device, control->id.subdevice, control->id.name, control->id.index, lval, lstep, i); return -EINVAL; } } return 0; } /* check whether the all input values are valid for the given elem value */ static int sanity_check_input_values(struct snd_card *card, const struct snd_ctl_elem_value *control, const struct snd_ctl_elem_info *info, bool print_error) { int i, ret; switch (info->type) { case SNDRV_CTL_ELEM_TYPE_BOOLEAN: case SNDRV_CTL_ELEM_TYPE_INTEGER: case SNDRV_CTL_ELEM_TYPE_INTEGER64: case SNDRV_CTL_ELEM_TYPE_ENUMERATED: for (i = 0; i < info->count; i++) { ret = sanity_check_int_value(card, control, info, i, print_error); if (ret < 0) return ret; } break; default: break; } return 0; } /* perform sanity checks to the given snd_ctl_elem_value object */ static int sanity_check_elem_value(struct snd_card *card, const struct snd_ctl_elem_value *control, const struct snd_ctl_elem_info *info, u32 pattern) { size_t offset; int ret; u32 *p; ret = sanity_check_input_values(card, control, info, true); if (ret < 0) return ret; /* check whether the remaining area kept untouched */ offset = value_sizes[info->type] * info->count; offset = DIV_ROUND_UP(offset, sizeof(u32)); p = (u32 *)control->value.bytes.data + offset; for (; offset < sizeof(control->value) / sizeof(u32); offset++, p++) { if (*p != pattern) { ret = -EINVAL; break; } *p = 0; /* clear the checked area */ } return ret; } static int __snd_ctl_elem_info(struct snd_card *card, struct snd_kcontrol *kctl, struct snd_ctl_elem_info *info, struct snd_ctl_file *ctl) { struct snd_kcontrol_volatile *vd; unsigned int index_offset; int result; #ifdef CONFIG_SND_DEBUG info->access = 0; #endif result = kctl->info(kctl, info); if (result >= 0) { snd_BUG_ON(info->access); index_offset = snd_ctl_get_ioff(kctl, &info->id); vd = &kctl->vd[index_offset]; snd_ctl_build_ioff(&info->id, kctl, index_offset); info->access = vd->access; if (vd->owner) { info->access |= SNDRV_CTL_ELEM_ACCESS_LOCK; if (vd->owner == ctl) info->access |= SNDRV_CTL_ELEM_ACCESS_OWNER; info->owner = pid_vnr(vd->owner->pid); } else { info->owner = -1; } if (!snd_ctl_skip_validation(info) && snd_ctl_check_elem_info(card, info) < 0) result = -EINVAL; } return result; } static int snd_ctl_elem_info(struct snd_ctl_file *ctl, struct snd_ctl_elem_info *info) { struct snd_card *card = ctl->card; struct snd_kcontrol *kctl; guard(rwsem_read)(&card->controls_rwsem); kctl = snd_ctl_find_id(card, &info->id); if (!kctl) return -ENOENT; return __snd_ctl_elem_info(card, kctl, info, ctl); } static int snd_ctl_elem_info_user(struct snd_ctl_file *ctl, struct snd_ctl_elem_info __user *_info) { struct snd_card *card = ctl->card; struct snd_ctl_elem_info info; int result; if (copy_from_user(&info, _info, sizeof(info))) return -EFAULT; result = snd_power_ref_and_wait(card); if (result) return result; result = snd_ctl_elem_info(ctl, &info); snd_power_unref(card); if (result < 0) return result; /* drop internal access flags */ info.access &= ~(SNDRV_CTL_ELEM_ACCESS_SKIP_CHECK| SNDRV_CTL_ELEM_ACCESS_LED_MASK); if (copy_to_user(_info, &info, sizeof(info))) return -EFAULT; return result; } static int snd_ctl_elem_read(struct snd_card *card, struct snd_ctl_elem_value *control) { struct snd_kcontrol *kctl; struct snd_kcontrol_volatile *vd; unsigned int index_offset; struct snd_ctl_elem_info info; const u32 pattern = 0xdeadbeef; int ret; guard(rwsem_read)(&card->controls_rwsem); kctl = snd_ctl_find_id(card, &control->id); if (!kctl) return -ENOENT; index_offset = snd_ctl_get_ioff(kctl, &control->id); vd = &kctl->vd[index_offset]; if (!(vd->access & SNDRV_CTL_ELEM_ACCESS_READ) || !kctl->get) return -EPERM; snd_ctl_build_ioff(&control->id, kctl, index_offset); #ifdef CONFIG_SND_CTL_DEBUG /* info is needed only for validation */ memset(&info, 0, sizeof(info)); info.id = control->id; ret = __snd_ctl_elem_info(card, kctl, &info, NULL); if (ret < 0) return ret; #endif if (!snd_ctl_skip_validation(&info)) fill_remaining_elem_value(control, &info, pattern); ret = kctl->get(kctl, control); if (ret < 0) return ret; if (!snd_ctl_skip_validation(&info) && sanity_check_elem_value(card, control, &info, pattern) < 0) { dev_err(card->dev, "control %i:%i:%i:%s:%i: access overflow\n", control->id.iface, control->id.device, control->id.subdevice, control->id.name, control->id.index); return -EINVAL; } return 0; } static int snd_ctl_elem_read_user(struct snd_card *card, struct snd_ctl_elem_value __user *_control) { struct snd_ctl_elem_value *control __free(kfree) = NULL; int result; control = memdup_user(_control, sizeof(*control)); if (IS_ERR(control)) return PTR_ERR(control); result = snd_power_ref_and_wait(card); if (result) return result; result = snd_ctl_elem_read(card, control); snd_power_unref(card); if (result < 0) return result; if (copy_to_user(_control, control, sizeof(*control))) return -EFAULT; return result; } static int snd_ctl_elem_write(struct snd_card *card, struct snd_ctl_file *file, struct snd_ctl_elem_value *control) { struct snd_kcontrol *kctl; struct snd_kcontrol_volatile *vd; unsigned int index_offset; int result = 0; down_write(&card->controls_rwsem); kctl = snd_ctl_find_id(card, &control->id); if (kctl == NULL) { up_write(&card->controls_rwsem); return -ENOENT; } index_offset = snd_ctl_get_ioff(kctl, &control->id); vd = &kctl->vd[index_offset]; if (!(vd->access & SNDRV_CTL_ELEM_ACCESS_WRITE) || kctl->put == NULL || (file && vd->owner && vd->owner != file)) { up_write(&card->controls_rwsem); return -EPERM; } snd_ctl_build_ioff(&control->id, kctl, index_offset); /* validate input values */ if (IS_ENABLED(CONFIG_SND_CTL_INPUT_VALIDATION)) { struct snd_ctl_elem_info info; memset(&info, 0, sizeof(info)); info.id = control->id; result = __snd_ctl_elem_info(card, kctl, &info, NULL); if (!result) result = sanity_check_input_values(card, control, &info, false); } if (!result) result = kctl->put(kctl, control); if (result < 0) { up_write(&card->controls_rwsem); return result; } if (result > 0) { downgrade_write(&card->controls_rwsem); snd_ctl_notify_one(card, SNDRV_CTL_EVENT_MASK_VALUE, kctl, index_offset); up_read(&card->controls_rwsem); } else { up_write(&card->controls_rwsem); } return 0; } static int snd_ctl_elem_write_user(struct snd_ctl_file *file, struct snd_ctl_elem_value __user *_control) { struct snd_ctl_elem_value *control __free(kfree) = NULL; struct snd_card *card; int result; control = memdup_user(_control, sizeof(*control)); if (IS_ERR(control)) return PTR_ERR(control); card = file->card; result = snd_power_ref_and_wait(card); if (result < 0) return result; result = snd_ctl_elem_write(card, file, control); snd_power_unref(card); if (result < 0) return result; if (copy_to_user(_control, control, sizeof(*control))) return -EFAULT; return result; } static int snd_ctl_elem_lock(struct snd_ctl_file *file, struct snd_ctl_elem_id __user *_id) { struct snd_card *card = file->card; struct snd_ctl_elem_id id; struct snd_kcontrol *kctl; struct snd_kcontrol_volatile *vd; if (copy_from_user(&id, _id, sizeof(id))) return -EFAULT; guard(rwsem_write)(&card->controls_rwsem); kctl = snd_ctl_find_id(card, &id); if (!kctl) return -ENOENT; vd = &kctl->vd[snd_ctl_get_ioff(kctl, &id)]; if (vd->owner) return -EBUSY; vd->owner = file; return 0; } static int snd_ctl_elem_unlock(struct snd_ctl_file *file, struct snd_ctl_elem_id __user *_id) { struct snd_card *card = file->card; struct snd_ctl_elem_id id; struct snd_kcontrol *kctl; struct snd_kcontrol_volatile *vd; if (copy_from_user(&id, _id, sizeof(id))) return -EFAULT; guard(rwsem_write)(&card->controls_rwsem); kctl = snd_ctl_find_id(card, &id); if (!kctl) return -ENOENT; vd = &kctl->vd[snd_ctl_get_ioff(kctl, &id)]; if (!vd->owner) return -EINVAL; if (vd->owner != file) return -EPERM; vd->owner = NULL; return 0; } struct user_element { struct snd_ctl_elem_info info; struct snd_card *card; char *elem_data; /* element data */ unsigned long elem_data_size; /* size of element data in bytes */ void *tlv_data; /* TLV data */ unsigned long tlv_data_size; /* TLV data size */ void *priv_data; /* private data (like strings for enumerated type) */ }; // check whether the addition (in bytes) of user ctl element may overflow the limit. static bool check_user_elem_overflow(struct snd_card *card, ssize_t add) { return (ssize_t)card->user_ctl_alloc_size + add > max_user_ctl_alloc_size; } static int snd_ctl_elem_user_info(struct snd_kcontrol *kcontrol, struct snd_ctl_elem_info *uinfo) { struct user_element *ue = snd_kcontrol_chip(kcontrol); unsigned int offset; offset = snd_ctl_get_ioff(kcontrol, &uinfo->id); *uinfo = ue->info; snd_ctl_build_ioff(&uinfo->id, kcontrol, offset); return 0; } static int snd_ctl_elem_user_enum_info(struct snd_kcontrol *kcontrol, struct snd_ctl_elem_info *uinfo) { struct user_element *ue = snd_kcontrol_chip(kcontrol); const char *names; unsigned int item; unsigned int offset; item = uinfo->value.enumerated.item; offset = snd_ctl_get_ioff(kcontrol, &uinfo->id); *uinfo = ue->info; snd_ctl_build_ioff(&uinfo->id, kcontrol, offset); item = min(item, uinfo->value.enumerated.items - 1); uinfo->value.enumerated.item = item; names = ue->priv_data; for (; item > 0; --item) names += strlen(names) + 1; strscpy(uinfo->value.enumerated.name, names); return 0; } static int snd_ctl_elem_user_get(struct snd_kcontrol *kcontrol, struct snd_ctl_elem_value *ucontrol) { struct user_element *ue = snd_kcontrol_chip(kcontrol); unsigned int size = ue->elem_data_size; char *src = ue->elem_data + snd_ctl_get_ioff(kcontrol, &ucontrol->id) * size; memcpy(&ucontrol->value, src, size); return 0; } static int snd_ctl_elem_user_put(struct snd_kcontrol *kcontrol, struct snd_ctl_elem_value *ucontrol) { int err, change; struct user_element *ue = snd_kcontrol_chip(kcontrol); unsigned int size = ue->elem_data_size; char *dst = ue->elem_data + snd_ctl_get_ioff(kcontrol, &ucontrol->id) * size; err = sanity_check_input_values(ue->card, ucontrol, &ue->info, false); if (err < 0) return err; change = memcmp(&ucontrol->value, dst, size) != 0; if (change) memcpy(dst, &ucontrol->value, size); return change; } /* called in controls_rwsem write lock */ static int replace_user_tlv(struct snd_kcontrol *kctl, unsigned int __user *buf, unsigned int size) { struct user_element *ue = snd_kcontrol_chip(kctl); unsigned int *container; unsigned int mask = 0; int i; int change; lockdep_assert_held_write(&ue->card->controls_rwsem); if (size > 1024 * 128) /* sane value */ return -EINVAL; // does the TLV size change cause overflow? if (check_user_elem_overflow(ue->card, (ssize_t)(size - ue->tlv_data_size))) return -ENOMEM; container = vmemdup_user(buf, size); if (IS_ERR(container)) return PTR_ERR(container); change = ue->tlv_data_size != size; if (!change) change = memcmp(ue->tlv_data, container, size) != 0; if (!change) { kvfree(container); return 0; } if (ue->tlv_data == NULL) { /* Now TLV data is available. */ for (i = 0; i < kctl->count; ++i) kctl->vd[i].access |= SNDRV_CTL_ELEM_ACCESS_TLV_READ; mask = SNDRV_CTL_EVENT_MASK_INFO; } else { ue->card->user_ctl_alloc_size -= ue->tlv_data_size; ue->tlv_data_size = 0; kvfree(ue->tlv_data); } ue->tlv_data = container; ue->tlv_data_size = size; // decremented at private_free. ue->card->user_ctl_alloc_size += size; mask |= SNDRV_CTL_EVENT_MASK_TLV; for (i = 0; i < kctl->count; ++i) snd_ctl_notify_one(ue->card, mask, kctl, i); return change; } static int read_user_tlv(struct snd_kcontrol *kctl, unsigned int __user *buf, unsigned int size) { struct user_element *ue = snd_kcontrol_chip(kctl); if (ue->tlv_data_size == 0 || ue->tlv_data == NULL) return -ENXIO; if (size < ue->tlv_data_size) return -ENOSPC; if (copy_to_user(buf, ue->tlv_data, ue->tlv_data_size)) return -EFAULT; return 0; } static int snd_ctl_elem_user_tlv(struct snd_kcontrol *kctl, int op_flag, unsigned int size, unsigned int __user *buf) { if (op_flag == SNDRV_CTL_TLV_OP_WRITE) return replace_user_tlv(kctl, buf, size); else return read_user_tlv(kctl, buf, size); } /* called in controls_rwsem write lock */ static int snd_ctl_elem_init_enum_names(struct user_element *ue) { char *names, *p; size_t buf_len, name_len; unsigned int i; const uintptr_t user_ptrval = ue->info.value.enumerated.names_ptr; lockdep_assert_held_write(&ue->card->controls_rwsem); buf_len = ue->info.value.enumerated.names_length; if (buf_len > 64 * 1024) return -EINVAL; if (check_user_elem_overflow(ue->card, buf_len)) return -ENOMEM; names = vmemdup_user((const void __user *)user_ptrval, buf_len); if (IS_ERR(names)) return PTR_ERR(names); /* check that there are enough valid names */ p = names; for (i = 0; i < ue->info.value.enumerated.items; ++i) { name_len = strnlen(p, buf_len); if (name_len == 0 || name_len >= 64 || name_len == buf_len) { kvfree(names); return -EINVAL; } p += name_len + 1; buf_len -= name_len + 1; } ue->priv_data = names; ue->info.value.enumerated.names_ptr = 0; // increment the allocation size; decremented again at private_free. ue->card->user_ctl_alloc_size += ue->info.value.enumerated.names_length; return 0; } static size_t compute_user_elem_size(size_t size, unsigned int count) { return sizeof(struct user_element) + size * count; } static void snd_ctl_elem_user_free(struct snd_kcontrol *kcontrol) { struct user_element *ue = snd_kcontrol_chip(kcontrol); // decrement the allocation size. ue->card->user_ctl_alloc_size -= compute_user_elem_size(ue->elem_data_size, kcontrol->count); ue->card->user_ctl_alloc_size -= ue->tlv_data_size; if (ue->priv_data) ue->card->user_ctl_alloc_size -= ue->info.value.enumerated.names_length; kvfree(ue->tlv_data); kvfree(ue->priv_data); kfree(ue); } static int snd_ctl_elem_add(struct snd_ctl_file *file, struct snd_ctl_elem_info *info, int replace) { struct snd_card *card = file->card; struct snd_kcontrol *kctl; unsigned int count; unsigned int access; long private_size; size_t alloc_size; struct user_element *ue; unsigned int offset; int err; if (!*info->id.name) return -EINVAL; if (strnlen(info->id.name, sizeof(info->id.name)) >= sizeof(info->id.name)) return -EINVAL; /* Delete a control to replace them if needed. */ if (replace) { info->id.numid = 0; err = snd_ctl_remove_user_ctl(file, &info->id); if (err) return err; } /* Check the number of elements for this userspace control. */ count = info->owner; if (count == 0) count = 1; if (count > MAX_CONTROL_COUNT) return -EINVAL; /* Arrange access permissions if needed. */ access = info->access; if (access == 0) access = SNDRV_CTL_ELEM_ACCESS_READWRITE; access &= (SNDRV_CTL_ELEM_ACCESS_READWRITE | SNDRV_CTL_ELEM_ACCESS_INACTIVE | SNDRV_CTL_ELEM_ACCESS_TLV_WRITE); /* In initial state, nothing is available as TLV container. */ if (access & SNDRV_CTL_ELEM_ACCESS_TLV_WRITE) access |= SNDRV_CTL_ELEM_ACCESS_TLV_CALLBACK; access |= SNDRV_CTL_ELEM_ACCESS_USER; /* * Check information and calculate the size of data specific to * this userspace control. */ /* pass NULL to card for suppressing error messages */ err = snd_ctl_check_elem_info(NULL, info); if (err < 0) return err; /* user-space control doesn't allow zero-size data */ if (info->count < 1) return -EINVAL; private_size = value_sizes[info->type] * info->count; alloc_size = compute_user_elem_size(private_size, count); guard(rwsem_write)(&card->controls_rwsem); if (check_user_elem_overflow(card, alloc_size)) return -ENOMEM; /* * Keep memory object for this userspace control. After passing this * code block, the instance should be freed by snd_ctl_free_one(). * * Note that these elements in this control are locked. */ err = snd_ctl_new(&kctl, count, access, file); if (err < 0) return err; memcpy(&kctl->id, &info->id, sizeof(kctl->id)); ue = kzalloc(alloc_size, GFP_KERNEL); if (!ue) { kfree(kctl); return -ENOMEM; } kctl->private_data = ue; kctl->private_free = snd_ctl_elem_user_free; // increment the allocated size; decremented again at private_free. card->user_ctl_alloc_size += alloc_size; /* Set private data for this userspace control. */ ue->card = card; ue->info = *info; ue->info.access = 0; ue->elem_data = (char *)ue + sizeof(*ue); ue->elem_data_size = private_size; if (ue->info.type == SNDRV_CTL_ELEM_TYPE_ENUMERATED) { err = snd_ctl_elem_init_enum_names(ue); if (err < 0) { snd_ctl_free_one(kctl); return err; } } /* Set callback functions. */ if (info->type == SNDRV_CTL_ELEM_TYPE_ENUMERATED) kctl->info = snd_ctl_elem_user_enum_info; else kctl->info = snd_ctl_elem_user_info; if (access & SNDRV_CTL_ELEM_ACCESS_READ) kctl->get = snd_ctl_elem_user_get; if (access & SNDRV_CTL_ELEM_ACCESS_WRITE) kctl->put = snd_ctl_elem_user_put; if (access & SNDRV_CTL_ELEM_ACCESS_TLV_WRITE) kctl->tlv.c = snd_ctl_elem_user_tlv; /* This function manage to free the instance on failure. */ err = __snd_ctl_add_replace(card, kctl, CTL_ADD_EXCLUSIVE); if (err < 0) { snd_ctl_free_one(kctl); return err; } offset = snd_ctl_get_ioff(kctl, &info->id); snd_ctl_build_ioff(&info->id, kctl, offset); /* * Here we cannot fill any field for the number of elements added by * this operation because there're no specific fields. The usage of * 'owner' field for this purpose may cause any bugs to userspace * applications because the field originally means PID of a process * which locks the element. */ return 0; } static int snd_ctl_elem_add_user(struct snd_ctl_file *file, struct snd_ctl_elem_info __user *_info, int replace) { struct snd_ctl_elem_info info; int err; if (copy_from_user(&info, _info, sizeof(info))) return -EFAULT; err = snd_ctl_elem_add(file, &info, replace); if (err < 0) return err; if (copy_to_user(_info, &info, sizeof(info))) { snd_ctl_remove_user_ctl(file, &info.id); return -EFAULT; } return 0; } static int snd_ctl_elem_remove(struct snd_ctl_file *file, struct snd_ctl_elem_id __user *_id) { struct snd_ctl_elem_id id; if (copy_from_user(&id, _id, sizeof(id))) return -EFAULT; return snd_ctl_remove_user_ctl(file, &id); } static int snd_ctl_subscribe_events(struct snd_ctl_file *file, int __user *ptr) { int subscribe; if (get_user(subscribe, ptr)) return -EFAULT; if (subscribe < 0) { subscribe = file->subscribed; if (put_user(subscribe, ptr)) return -EFAULT; return 0; } if (subscribe) { file->subscribed = 1; return 0; } else if (file->subscribed) { snd_ctl_empty_read_queue(file); file->subscribed = 0; } return 0; } static int call_tlv_handler(struct snd_ctl_file *file, int op_flag, struct snd_kcontrol *kctl, struct snd_ctl_elem_id *id, unsigned int __user *buf, unsigned int size) { static const struct { int op; int perm; } pairs[] = { {SNDRV_CTL_TLV_OP_READ, SNDRV_CTL_ELEM_ACCESS_TLV_READ}, {SNDRV_CTL_TLV_OP_WRITE, SNDRV_CTL_ELEM_ACCESS_TLV_WRITE}, {SNDRV_CTL_TLV_OP_CMD, SNDRV_CTL_ELEM_ACCESS_TLV_COMMAND}, }; struct snd_kcontrol_volatile *vd = &kctl->vd[snd_ctl_get_ioff(kctl, id)]; int i; /* Check support of the request for this element. */ for (i = 0; i < ARRAY_SIZE(pairs); ++i) { if (op_flag == pairs[i].op && (vd->access & pairs[i].perm)) break; } if (i == ARRAY_SIZE(pairs)) return -ENXIO; if (kctl->tlv.c == NULL) return -ENXIO; /* Write and command operations are not allowed for locked element. */ if (op_flag != SNDRV_CTL_TLV_OP_READ && vd->owner != NULL && vd->owner != file) return -EPERM; return kctl->tlv.c(kctl, op_flag, size, buf); } static int read_tlv_buf(struct snd_kcontrol *kctl, struct snd_ctl_elem_id *id, unsigned int __user *buf, unsigned int size) { struct snd_kcontrol_volatile *vd = &kctl->vd[snd_ctl_get_ioff(kctl, id)]; unsigned int len; if (!(vd->access & SNDRV_CTL_ELEM_ACCESS_TLV_READ)) return -ENXIO; if (kctl->tlv.p == NULL) return -ENXIO; len = sizeof(unsigned int) * 2 + kctl->tlv.p[1]; if (size < len) return -ENOMEM; if (copy_to_user(buf, kctl->tlv.p, len)) return -EFAULT; return 0; } static int snd_ctl_tlv_ioctl(struct snd_ctl_file *file, struct snd_ctl_tlv __user *buf, int op_flag) { struct snd_ctl_tlv header; unsigned int __user *container; unsigned int container_size; struct snd_kcontrol *kctl; struct snd_ctl_elem_id id; struct snd_kcontrol_volatile *vd; lockdep_assert_held(&file->card->controls_rwsem); if (copy_from_user(&header, buf, sizeof(header))) return -EFAULT; /* In design of control core, numerical ID starts at 1. */ if (header.numid == 0) return -EINVAL; /* At least, container should include type and length fields. */ if (header.length < sizeof(unsigned int) * 2) return -EINVAL; container_size = header.length; container = buf->tlv; kctl = snd_ctl_find_numid(file->card, header.numid); if (kctl == NULL) return -ENOENT; /* Calculate index of the element in this set. */ id = kctl->id; snd_ctl_build_ioff(&id, kctl, header.numid - id.numid); vd = &kctl->vd[snd_ctl_get_ioff(kctl, &id)]; if (vd->access & SNDRV_CTL_ELEM_ACCESS_TLV_CALLBACK) { return call_tlv_handler(file, op_flag, kctl, &id, container, container_size); } else { if (op_flag == SNDRV_CTL_TLV_OP_READ) { return read_tlv_buf(kctl, &id, container, container_size); } } /* Not supported. */ return -ENXIO; } static long snd_ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct snd_ctl_file *ctl; struct snd_card *card; struct snd_kctl_ioctl *p; void __user *argp = (void __user *)arg; int __user *ip = argp; int err; ctl = file->private_data; card = ctl->card; if (snd_BUG_ON(!card)) return -ENXIO; switch (cmd) { case SNDRV_CTL_IOCTL_PVERSION: return put_user(SNDRV_CTL_VERSION, ip) ? -EFAULT : 0; case SNDRV_CTL_IOCTL_CARD_INFO: return snd_ctl_card_info(card, ctl, cmd, argp); case SNDRV_CTL_IOCTL_ELEM_LIST: return snd_ctl_elem_list_user(card, argp); case SNDRV_CTL_IOCTL_ELEM_INFO: return snd_ctl_elem_info_user(ctl, argp); case SNDRV_CTL_IOCTL_ELEM_READ: return snd_ctl_elem_read_user(card, argp); case SNDRV_CTL_IOCTL_ELEM_WRITE: return snd_ctl_elem_write_user(ctl, argp); case SNDRV_CTL_IOCTL_ELEM_LOCK: return snd_ctl_elem_lock(ctl, argp); case SNDRV_CTL_IOCTL_ELEM_UNLOCK: return snd_ctl_elem_unlock(ctl, argp); case SNDRV_CTL_IOCTL_ELEM_ADD: return snd_ctl_elem_add_user(ctl, argp, 0); case SNDRV_CTL_IOCTL_ELEM_REPLACE: return snd_ctl_elem_add_user(ctl, argp, 1); case SNDRV_CTL_IOCTL_ELEM_REMOVE: return snd_ctl_elem_remove(ctl, argp); case SNDRV_CTL_IOCTL_SUBSCRIBE_EVENTS: return snd_ctl_subscribe_events(ctl, ip); case SNDRV_CTL_IOCTL_TLV_READ: err = snd_power_ref_and_wait(card); if (err < 0) return err; scoped_guard(rwsem_read, &card->controls_rwsem) err = snd_ctl_tlv_ioctl(ctl, argp, SNDRV_CTL_TLV_OP_READ); snd_power_unref(card); return err; case SNDRV_CTL_IOCTL_TLV_WRITE: err = snd_power_ref_and_wait(card); if (err < 0) return err; scoped_guard(rwsem_write, &card->controls_rwsem) err = snd_ctl_tlv_ioctl(ctl, argp, SNDRV_CTL_TLV_OP_WRITE); snd_power_unref(card); return err; case SNDRV_CTL_IOCTL_TLV_COMMAND: err = snd_power_ref_and_wait(card); if (err < 0) return err; scoped_guard(rwsem_write, &card->controls_rwsem) err = snd_ctl_tlv_ioctl(ctl, argp, SNDRV_CTL_TLV_OP_CMD); snd_power_unref(card); return err; case SNDRV_CTL_IOCTL_POWER: return -ENOPROTOOPT; case SNDRV_CTL_IOCTL_POWER_STATE: return put_user(SNDRV_CTL_POWER_D0, ip) ? -EFAULT : 0; } guard(rwsem_read)(&snd_ioctl_rwsem); list_for_each_entry(p, &snd_control_ioctls, list) { err = p->fioctl(card, ctl, cmd, arg); if (err != -ENOIOCTLCMD) return err; } dev_dbg(card->dev, "unknown ioctl = 0x%x\n", cmd); return -ENOTTY; } static ssize_t snd_ctl_read(struct file *file, char __user *buffer, size_t count, loff_t * offset) { struct snd_ctl_file *ctl; int err = 0; ssize_t result = 0; ctl = file->private_data; if (snd_BUG_ON(!ctl || !ctl->card)) return -ENXIO; if (!ctl->subscribed) return -EBADFD; if (count < sizeof(struct snd_ctl_event)) return -EINVAL; spin_lock_irq(&ctl->read_lock); while (count >= sizeof(struct snd_ctl_event)) { struct snd_ctl_event ev; struct snd_kctl_event *kev; while (list_empty(&ctl->events)) { wait_queue_entry_t wait; if ((file->f_flags & O_NONBLOCK) != 0 || result > 0) { err = -EAGAIN; goto __end_lock; } init_waitqueue_entry(&wait, current); add_wait_queue(&ctl->change_sleep, &wait); set_current_state(TASK_INTERRUPTIBLE); spin_unlock_irq(&ctl->read_lock); schedule(); remove_wait_queue(&ctl->change_sleep, &wait); if (ctl->card->shutdown) return -ENODEV; if (signal_pending(current)) return -ERESTARTSYS; spin_lock_irq(&ctl->read_lock); } kev = snd_kctl_event(ctl->events.next); ev.type = SNDRV_CTL_EVENT_ELEM; ev.data.elem.mask = kev->mask; ev.data.elem.id = kev->id; list_del(&kev->list); spin_unlock_irq(&ctl->read_lock); kfree(kev); if (copy_to_user(buffer, &ev, sizeof(struct snd_ctl_event))) { err = -EFAULT; goto __end; } spin_lock_irq(&ctl->read_lock); buffer += sizeof(struct snd_ctl_event); count -= sizeof(struct snd_ctl_event); result += sizeof(struct snd_ctl_event); } __end_lock: spin_unlock_irq(&ctl->read_lock); __end: return result > 0 ? result : err; } static __poll_t snd_ctl_poll(struct file *file, poll_table * wait) { __poll_t mask; struct snd_ctl_file *ctl; ctl = file->private_data; if (!ctl->subscribed) return 0; poll_wait(file, &ctl->change_sleep, wait); mask = 0; if (!list_empty(&ctl->events)) mask |= EPOLLIN | EPOLLRDNORM; return mask; } /* * register the device-specific control-ioctls. * called from each device manager like pcm.c, hwdep.c, etc. */ static int _snd_ctl_register_ioctl(snd_kctl_ioctl_func_t fcn, struct list_head *lists) { struct snd_kctl_ioctl *pn; pn = kzalloc(sizeof(struct snd_kctl_ioctl), GFP_KERNEL); if (pn == NULL) return -ENOMEM; pn->fioctl = fcn; guard(rwsem_write)(&snd_ioctl_rwsem); list_add_tail(&pn->list, lists); return 0; } /** * snd_ctl_register_ioctl - register the device-specific control-ioctls * @fcn: ioctl callback function * * called from each device manager like pcm.c, hwdep.c, etc. * * Return: zero if successful, or a negative error code */ int snd_ctl_register_ioctl(snd_kctl_ioctl_func_t fcn) { return _snd_ctl_register_ioctl(fcn, &snd_control_ioctls); } EXPORT_SYMBOL(snd_ctl_register_ioctl); #ifdef CONFIG_COMPAT /** * snd_ctl_register_ioctl_compat - register the device-specific 32bit compat * control-ioctls * @fcn: ioctl callback function * * Return: zero if successful, or a negative error code */ int snd_ctl_register_ioctl_compat(snd_kctl_ioctl_func_t fcn) { return _snd_ctl_register_ioctl(fcn, &snd_control_compat_ioctls); } EXPORT_SYMBOL(snd_ctl_register_ioctl_compat); #endif /* * de-register the device-specific control-ioctls. */ static int _snd_ctl_unregister_ioctl(snd_kctl_ioctl_func_t fcn, struct list_head *lists) { struct snd_kctl_ioctl *p; if (snd_BUG_ON(!fcn)) return -EINVAL; guard(rwsem_write)(&snd_ioctl_rwsem); list_for_each_entry(p, lists, list) { if (p->fioctl == fcn) { list_del(&p->list); kfree(p); return 0; } } snd_BUG(); return -EINVAL; } /** * snd_ctl_unregister_ioctl - de-register the device-specific control-ioctls * @fcn: ioctl callback function to unregister * * Return: zero if successful, or a negative error code */ int snd_ctl_unregister_ioctl(snd_kctl_ioctl_func_t fcn) { return _snd_ctl_unregister_ioctl(fcn, &snd_control_ioctls); } EXPORT_SYMBOL(snd_ctl_unregister_ioctl); #ifdef CONFIG_COMPAT /** * snd_ctl_unregister_ioctl_compat - de-register the device-specific compat * 32bit control-ioctls * @fcn: ioctl callback function to unregister * * Return: zero if successful, or a negative error code */ int snd_ctl_unregister_ioctl_compat(snd_kctl_ioctl_func_t fcn) { return _snd_ctl_unregister_ioctl(fcn, &snd_control_compat_ioctls); } EXPORT_SYMBOL(snd_ctl_unregister_ioctl_compat); #endif static int snd_ctl_fasync(int fd, struct file * file, int on) { struct snd_ctl_file *ctl; ctl = file->private_data; return snd_fasync_helper(fd, file, on, &ctl->fasync); } /* return the preferred subdevice number if already assigned; * otherwise return -1 */ int snd_ctl_get_preferred_subdevice(struct snd_card *card, int type) { struct snd_ctl_file *kctl; int subdevice = -1; guard(read_lock_irqsave)(&card->controls_rwlock); list_for_each_entry(kctl, &card->ctl_files, list) { if (kctl->pid == task_pid(current)) { subdevice = kctl->preferred_subdevice[type]; if (subdevice != -1) break; } } return subdevice; } EXPORT_SYMBOL_GPL(snd_ctl_get_preferred_subdevice); /* * ioctl32 compat */ #ifdef CONFIG_COMPAT #include "control_compat.c" #else #define snd_ctl_ioctl_compat NULL #endif /* * control layers (audio LED etc.) */ /** * snd_ctl_request_layer - request to use the layer * @module_name: Name of the kernel module (NULL == build-in) * * Return: zero if successful, or an error code when the module cannot be loaded */ int snd_ctl_request_layer(const char *module_name) { struct snd_ctl_layer_ops *lops; if (module_name == NULL) return 0; scoped_guard(rwsem_read, &snd_ctl_layer_rwsem) { for (lops = snd_ctl_layer; lops; lops = lops->next) if (strcmp(lops->module_name, module_name) == 0) return 0; } return request_module(module_name); } EXPORT_SYMBOL_GPL(snd_ctl_request_layer); /** * snd_ctl_register_layer - register new control layer * @lops: operation structure * * The new layer can track all control elements and do additional * operations on top (like audio LED handling). */ void snd_ctl_register_layer(struct snd_ctl_layer_ops *lops) { struct snd_card *card; int card_number; scoped_guard(rwsem_write, &snd_ctl_layer_rwsem) { lops->next = snd_ctl_layer; snd_ctl_layer = lops; } for (card_number = 0; card_number < SNDRV_CARDS; card_number++) { card = snd_card_ref(card_number); if (card) { scoped_guard(rwsem_read, &card->controls_rwsem) lops->lregister(card); snd_card_unref(card); } } } EXPORT_SYMBOL_GPL(snd_ctl_register_layer); /** * snd_ctl_disconnect_layer - disconnect control layer * @lops: operation structure * * It is expected that the information about tracked cards * is freed before this call (the disconnect callback is * not called here). */ void snd_ctl_disconnect_layer(struct snd_ctl_layer_ops *lops) { struct snd_ctl_layer_ops *lops2, *prev_lops2; guard(rwsem_write)(&snd_ctl_layer_rwsem); for (lops2 = snd_ctl_layer, prev_lops2 = NULL; lops2; lops2 = lops2->next) { if (lops2 == lops) { if (!prev_lops2) snd_ctl_layer = lops->next; else prev_lops2->next = lops->next; break; } prev_lops2 = lops2; } } EXPORT_SYMBOL_GPL(snd_ctl_disconnect_layer); /* * INIT PART */ static const struct file_operations snd_ctl_f_ops = { .owner = THIS_MODULE, .read = snd_ctl_read, .open = snd_ctl_open, .release = snd_ctl_release, .poll = snd_ctl_poll, .unlocked_ioctl = snd_ctl_ioctl, .compat_ioctl = snd_ctl_ioctl_compat, .fasync = snd_ctl_fasync, }; /* call lops under rwsems; called from snd_ctl_dev_*() below() */ #define call_snd_ctl_lops(_card, _op) \ do { \ struct snd_ctl_layer_ops *lops; \ guard(rwsem_read)(&(_card)->controls_rwsem); \ guard(rwsem_read)(&snd_ctl_layer_rwsem); \ for (lops = snd_ctl_layer; lops; lops = lops->next) \ lops->_op(_card); \ } while (0) /* * registration of the control device */ static int snd_ctl_dev_register(struct snd_device *device) { struct snd_card *card = device->device_data; int err; err = snd_register_device(SNDRV_DEVICE_TYPE_CONTROL, card, -1, &snd_ctl_f_ops, card, card->ctl_dev); if (err < 0) return err; call_snd_ctl_lops(card, lregister); return 0; } /* * disconnection of the control device */ static int snd_ctl_dev_disconnect(struct snd_device *device) { struct snd_card *card = device->device_data; struct snd_ctl_file *ctl; scoped_guard(read_lock_irqsave, &card->controls_rwlock) { list_for_each_entry(ctl, &card->ctl_files, list) { wake_up(&ctl->change_sleep); snd_kill_fasync(ctl->fasync, SIGIO, POLL_ERR); } } call_snd_ctl_lops(card, ldisconnect); return snd_unregister_device(card->ctl_dev); } /* * free all controls */ static int snd_ctl_dev_free(struct snd_device *device) { struct snd_card *card = device->device_data; struct snd_kcontrol *control; scoped_guard(rwsem_write, &card->controls_rwsem) { while (!list_empty(&card->controls)) { control = snd_kcontrol(card->controls.next); __snd_ctl_remove(card, control, false); } #ifdef CONFIG_SND_CTL_FAST_LOOKUP xa_destroy(&card->ctl_numids); xa_destroy(&card->ctl_hash); #endif } put_device(card->ctl_dev); return 0; } /* * create control core: * called from init.c */ int snd_ctl_create(struct snd_card *card) { static const struct snd_device_ops ops = { .dev_free = snd_ctl_dev_free, .dev_register = snd_ctl_dev_register, .dev_disconnect = snd_ctl_dev_disconnect, }; int err; if (snd_BUG_ON(!card)) return -ENXIO; if (snd_BUG_ON(card->number < 0 || card->number >= SNDRV_CARDS)) return -ENXIO; err = snd_device_alloc(&card->ctl_dev, card); if (err < 0) return err; dev_set_name(card->ctl_dev, "controlC%d", card->number); err = snd_device_new(card, SNDRV_DEV_CONTROL, card, &ops); if (err < 0) put_device(card->ctl_dev); return err; } /* * Frequently used control callbacks/helpers */ /** * snd_ctl_boolean_mono_info - Helper function for a standard boolean info * callback with a mono channel * @kcontrol: the kcontrol instance * @uinfo: info to store * * This is a function that can be used as info callback for a standard * boolean control with a single mono channel. * * Return: Zero (always successful) */ int snd_ctl_boolean_mono_info(struct snd_kcontrol *kcontrol, struct snd_ctl_elem_info *uinfo) { uinfo->type = SNDRV_CTL_ELEM_TYPE_BOOLEAN; uinfo->count = 1; uinfo->value.integer.min = 0; uinfo->value.integer.max = 1; return 0; } EXPORT_SYMBOL(snd_ctl_boolean_mono_info); /** * snd_ctl_boolean_stereo_info - Helper function for a standard boolean info * callback with stereo two channels * @kcontrol: the kcontrol instance * @uinfo: info to store * * This is a function that can be used as info callback for a standard * boolean control with stereo two channels. * * Return: Zero (always successful) */ int snd_ctl_boolean_stereo_info(struct snd_kcontrol *kcontrol, struct snd_ctl_elem_info *uinfo) { uinfo->type = SNDRV_CTL_ELEM_TYPE_BOOLEAN; uinfo->count = 2; uinfo->value.integer.min = 0; uinfo->value.integer.max = 1; return 0; } EXPORT_SYMBOL(snd_ctl_boolean_stereo_info); /** * snd_ctl_enum_info - fills the info structure for an enumerated control * @info: the structure to be filled * @channels: the number of the control's channels; often one * @items: the number of control values; also the size of @names * @names: an array containing the names of all control values * * Sets all required fields in @info to their appropriate values. * If the control's accessibility is not the default (readable and writable), * the caller has to fill @info->access. * * Return: Zero (always successful) */ int snd_ctl_enum_info(struct snd_ctl_elem_info *info, unsigned int channels, unsigned int items, const char *const names[]) { info->type = SNDRV_CTL_ELEM_TYPE_ENUMERATED; info->count = channels; info->value.enumerated.items = items; if (!items) return 0; if (info->value.enumerated.item >= items) info->value.enumerated.item = items - 1; WARN(strlen(names[info->value.enumerated.item]) >= sizeof(info->value.enumerated.name), "ALSA: too long item name '%s'\n", names[info->value.enumerated.item]); strscpy(info->value.enumerated.name, names[info->value.enumerated.item], sizeof(info->value.enumerated.name)); return 0; } EXPORT_SYMBOL(snd_ctl_enum_info); |
| 3 2 2 1 1 1 2 1 1 5 2 2 2 2 1 1 1 1 1 6 9 9 9 9 9 6 4 3 3 3 6 2 2 1 1 1 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 6 6 8 16 25 2 2 1 23 24 4 1 23 1 1 22 25 11 11 9 9 3 1 1 1 1 1 13 3 2 11 11 11 10 9 10 1 9 10 3 10 12 21 1 2 1 19 3 3 1 2 2 18 3 2 2 17 1 1 16 1 17 2 17 17 1 1 1 16 2 2 14 16 2 3 13 4 3 2 13 1 3 2 13 3 4 2 12 1 3 21 4 8 8 8 6 6 4 6 8 16 16 15 13 8 1 1 1 1 1 1 1 5 5 5 1 1 1 1 10 10 10 10 10 10 8 8 1 1 10 10 10 10 10 10 10 10 10 3 10 134 134 135 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 | // SPDX-License-Identifier: GPL-2.0-only /* * GENEVE: Generic Network Virtualization Encapsulation * * Copyright (c) 2015 Red Hat, Inc. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/ethtool.h> #include <linux/kernel.h> #include <linux/module.h> #include <linux/etherdevice.h> #include <linux/hash.h> #include <net/ipv6_stubs.h> #include <net/dst_metadata.h> #include <net/gro_cells.h> #include <net/rtnetlink.h> #include <net/geneve.h> #include <net/gro.h> #include <net/netdev_lock.h> #include <net/protocol.h> #define GENEVE_NETDEV_VER "0.6" #define GENEVE_N_VID (1u << 24) #define GENEVE_VID_MASK (GENEVE_N_VID - 1) #define VNI_HASH_BITS 10 #define VNI_HASH_SIZE (1<<VNI_HASH_BITS) static bool log_ecn_error = true; module_param(log_ecn_error, bool, 0644); MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN"); #define GENEVE_VER 0 #define GENEVE_BASE_HLEN (sizeof(struct udphdr) + sizeof(struct genevehdr)) #define GENEVE_IPV4_HLEN (ETH_HLEN + sizeof(struct iphdr) + GENEVE_BASE_HLEN) #define GENEVE_IPV6_HLEN (ETH_HLEN + sizeof(struct ipv6hdr) + GENEVE_BASE_HLEN) /* per-network namespace private data for this module */ struct geneve_net { struct list_head geneve_list; /* sock_list is protected by rtnl lock */ struct list_head sock_list; }; static unsigned int geneve_net_id; struct geneve_dev_node { struct hlist_node hlist; struct geneve_dev *geneve; }; struct geneve_config { bool collect_md; bool use_udp6_rx_checksums; bool ttl_inherit; enum ifla_geneve_df df; bool inner_proto_inherit; u16 port_min; u16 port_max; /* Must be last --ends in a flexible-array member. */ struct ip_tunnel_info info; }; /* Pseudo network device */ struct geneve_dev { struct geneve_dev_node hlist4; /* vni hash table for IPv4 socket */ #if IS_ENABLED(CONFIG_IPV6) struct geneve_dev_node hlist6; /* vni hash table for IPv6 socket */ #endif struct net *net; /* netns for packet i/o */ struct net_device *dev; /* netdev for geneve tunnel */ struct geneve_sock __rcu *sock4; /* IPv4 socket used for geneve tunnel */ #if IS_ENABLED(CONFIG_IPV6) struct geneve_sock __rcu *sock6; /* IPv6 socket used for geneve tunnel */ #endif struct list_head next; /* geneve's per namespace list */ struct gro_cells gro_cells; struct geneve_config cfg; }; struct geneve_sock { bool collect_md; struct list_head list; struct socket *sock; struct rcu_head rcu; int refcnt; struct hlist_head vni_list[VNI_HASH_SIZE]; }; static inline __u32 geneve_net_vni_hash(u8 vni[3]) { __u32 vnid; vnid = (vni[0] << 16) | (vni[1] << 8) | vni[2]; return hash_32(vnid, VNI_HASH_BITS); } static __be64 vni_to_tunnel_id(const __u8 *vni) { #ifdef __BIG_ENDIAN return (vni[0] << 16) | (vni[1] << 8) | vni[2]; #else return (__force __be64)(((__force u64)vni[0] << 40) | ((__force u64)vni[1] << 48) | ((__force u64)vni[2] << 56)); #endif } /* Convert 64 bit tunnel ID to 24 bit VNI. */ static void tunnel_id_to_vni(__be64 tun_id, __u8 *vni) { #ifdef __BIG_ENDIAN vni[0] = (__force __u8)(tun_id >> 16); vni[1] = (__force __u8)(tun_id >> 8); vni[2] = (__force __u8)tun_id; #else vni[0] = (__force __u8)((__force u64)tun_id >> 40); vni[1] = (__force __u8)((__force u64)tun_id >> 48); vni[2] = (__force __u8)((__force u64)tun_id >> 56); #endif } static bool eq_tun_id_and_vni(u8 *tun_id, u8 *vni) { return !memcmp(vni, &tun_id[5], 3); } static sa_family_t geneve_get_sk_family(struct geneve_sock *gs) { return gs->sock->sk->sk_family; } static struct geneve_dev *geneve_lookup(struct geneve_sock *gs, __be32 addr, u8 vni[]) { struct hlist_head *vni_list_head; struct geneve_dev_node *node; __u32 hash; /* Find the device for this VNI */ hash = geneve_net_vni_hash(vni); vni_list_head = &gs->vni_list[hash]; hlist_for_each_entry_rcu(node, vni_list_head, hlist) { if (eq_tun_id_and_vni((u8 *)&node->geneve->cfg.info.key.tun_id, vni) && addr == node->geneve->cfg.info.key.u.ipv4.dst) return node->geneve; } return NULL; } #if IS_ENABLED(CONFIG_IPV6) static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs, struct in6_addr addr6, u8 vni[]) { struct hlist_head *vni_list_head; struct geneve_dev_node *node; __u32 hash; /* Find the device for this VNI */ hash = geneve_net_vni_hash(vni); vni_list_head = &gs->vni_list[hash]; hlist_for_each_entry_rcu(node, vni_list_head, hlist) { if (eq_tun_id_and_vni((u8 *)&node->geneve->cfg.info.key.tun_id, vni) && ipv6_addr_equal(&addr6, &node->geneve->cfg.info.key.u.ipv6.dst)) return node->geneve; } return NULL; } #endif static inline struct genevehdr *geneve_hdr(const struct sk_buff *skb) { return (struct genevehdr *)(udp_hdr(skb) + 1); } static struct geneve_dev *geneve_lookup_skb(struct geneve_sock *gs, struct sk_buff *skb) { static u8 zero_vni[3]; u8 *vni; if (geneve_get_sk_family(gs) == AF_INET) { struct iphdr *iph; __be32 addr; iph = ip_hdr(skb); /* outer IP header... */ if (gs->collect_md) { vni = zero_vni; addr = 0; } else { vni = geneve_hdr(skb)->vni; addr = iph->saddr; } return geneve_lookup(gs, addr, vni); #if IS_ENABLED(CONFIG_IPV6) } else if (geneve_get_sk_family(gs) == AF_INET6) { static struct in6_addr zero_addr6; struct ipv6hdr *ip6h; struct in6_addr addr6; ip6h = ipv6_hdr(skb); /* outer IPv6 header... */ if (gs->collect_md) { vni = zero_vni; addr6 = zero_addr6; } else { vni = geneve_hdr(skb)->vni; addr6 = ip6h->saddr; } return geneve6_lookup(gs, addr6, vni); #endif } return NULL; } /* geneve receive/decap routine */ static void geneve_rx(struct geneve_dev *geneve, struct geneve_sock *gs, struct sk_buff *skb) { struct genevehdr *gnvh = geneve_hdr(skb); struct metadata_dst *tun_dst = NULL; unsigned int len; int nh, err = 0; void *oiph; if (ip_tunnel_collect_metadata() || gs->collect_md) { IP_TUNNEL_DECLARE_FLAGS(flags) = { }; __set_bit(IP_TUNNEL_KEY_BIT, flags); __assign_bit(IP_TUNNEL_OAM_BIT, flags, gnvh->oam); __assign_bit(IP_TUNNEL_CRIT_OPT_BIT, flags, gnvh->critical); tun_dst = udp_tun_rx_dst(skb, geneve_get_sk_family(gs), flags, vni_to_tunnel_id(gnvh->vni), gnvh->opt_len * 4); if (!tun_dst) { dev_dstats_rx_dropped(geneve->dev); goto drop; } /* Update tunnel dst according to Geneve options. */ ip_tunnel_flags_zero(flags); __set_bit(IP_TUNNEL_GENEVE_OPT_BIT, flags); ip_tunnel_info_opts_set(&tun_dst->u.tun_info, gnvh->options, gnvh->opt_len * 4, flags); } else { /* Drop packets w/ critical options, * since we don't support any... */ if (gnvh->critical) { DEV_STATS_INC(geneve->dev, rx_frame_errors); DEV_STATS_INC(geneve->dev, rx_errors); goto drop; } } if (tun_dst) skb_dst_set(skb, &tun_dst->dst); if (gnvh->proto_type == htons(ETH_P_TEB)) { skb_reset_mac_header(skb); skb->protocol = eth_type_trans(skb, geneve->dev); skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); /* Ignore packet loops (and multicast echo) */ if (ether_addr_equal(eth_hdr(skb)->h_source, geneve->dev->dev_addr)) { DEV_STATS_INC(geneve->dev, rx_errors); goto drop; } } else { skb_reset_mac_header(skb); skb->dev = geneve->dev; skb->pkt_type = PACKET_HOST; } /* Save offset of outer header relative to skb->head, * because we are going to reset the network header to the inner header * and might change skb->head. */ nh = skb_network_header(skb) - skb->head; skb_reset_network_header(skb); if (!pskb_inet_may_pull(skb)) { DEV_STATS_INC(geneve->dev, rx_length_errors); DEV_STATS_INC(geneve->dev, rx_errors); goto drop; } /* Get the outer header. */ oiph = skb->head + nh; if (geneve_get_sk_family(gs) == AF_INET) err = IP_ECN_decapsulate(oiph, skb); #if IS_ENABLED(CONFIG_IPV6) else err = IP6_ECN_decapsulate(oiph, skb); #endif if (unlikely(err)) { if (log_ecn_error) { if (geneve_get_sk_family(gs) == AF_INET) net_info_ratelimited("non-ECT from %pI4 " "with TOS=%#x\n", &((struct iphdr *)oiph)->saddr, ((struct iphdr *)oiph)->tos); #if IS_ENABLED(CONFIG_IPV6) else net_info_ratelimited("non-ECT from %pI6\n", &((struct ipv6hdr *)oiph)->saddr); #endif } if (err > 1) { DEV_STATS_INC(geneve->dev, rx_frame_errors); DEV_STATS_INC(geneve->dev, rx_errors); goto drop; } } len = skb->len; err = gro_cells_receive(&geneve->gro_cells, skb); if (likely(err == NET_RX_SUCCESS)) dev_dstats_rx_add(geneve->dev, len); return; drop: /* Consume bad packet */ kfree_skb(skb); } /* Setup stats when device is created */ static int geneve_init(struct net_device *dev) { struct geneve_dev *geneve = netdev_priv(dev); int err; err = gro_cells_init(&geneve->gro_cells, dev); if (err) return err; err = dst_cache_init(&geneve->cfg.info.dst_cache, GFP_KERNEL); if (err) { gro_cells_destroy(&geneve->gro_cells); return err; } netdev_lockdep_set_classes(dev); return 0; } static void geneve_uninit(struct net_device *dev) { struct geneve_dev *geneve = netdev_priv(dev); dst_cache_destroy(&geneve->cfg.info.dst_cache); gro_cells_destroy(&geneve->gro_cells); } /* Callback from net/ipv4/udp.c to receive packets */ static int geneve_udp_encap_recv(struct sock *sk, struct sk_buff *skb) { struct genevehdr *geneveh; struct geneve_dev *geneve; struct geneve_sock *gs; __be16 inner_proto; int opts_len; /* Need UDP and Geneve header to be present */ if (unlikely(!pskb_may_pull(skb, GENEVE_BASE_HLEN))) goto drop; /* Return packets with reserved bits set */ geneveh = geneve_hdr(skb); if (unlikely(geneveh->ver != GENEVE_VER)) goto drop; gs = rcu_dereference_sk_user_data(sk); if (!gs) goto drop; geneve = geneve_lookup_skb(gs, skb); if (!geneve) goto drop; inner_proto = geneveh->proto_type; if (unlikely((!geneve->cfg.inner_proto_inherit && inner_proto != htons(ETH_P_TEB)))) { dev_dstats_rx_dropped(geneve->dev); goto drop; } opts_len = geneveh->opt_len * 4; if (iptunnel_pull_header(skb, GENEVE_BASE_HLEN + opts_len, inner_proto, !net_eq(geneve->net, dev_net(geneve->dev)))) { dev_dstats_rx_dropped(geneve->dev); goto drop; } geneve_rx(geneve, gs, skb); return 0; drop: /* Consume bad packet */ kfree_skb(skb); return 0; } /* Callback from net/ipv{4,6}/udp.c to check that we have a tunnel for errors */ static int geneve_udp_encap_err_lookup(struct sock *sk, struct sk_buff *skb) { struct genevehdr *geneveh; struct geneve_sock *gs; u8 zero_vni[3] = { 0 }; u8 *vni = zero_vni; if (!pskb_may_pull(skb, skb_transport_offset(skb) + GENEVE_BASE_HLEN)) return -EINVAL; geneveh = geneve_hdr(skb); if (geneveh->ver != GENEVE_VER) return -EINVAL; if (geneveh->proto_type != htons(ETH_P_TEB)) return -EINVAL; gs = rcu_dereference_sk_user_data(sk); if (!gs) return -ENOENT; if (geneve_get_sk_family(gs) == AF_INET) { struct iphdr *iph = ip_hdr(skb); __be32 addr4 = 0; if (!gs->collect_md) { vni = geneve_hdr(skb)->vni; addr4 = iph->daddr; } return geneve_lookup(gs, addr4, vni) ? 0 : -ENOENT; } #if IS_ENABLED(CONFIG_IPV6) if (geneve_get_sk_family(gs) == AF_INET6) { struct ipv6hdr *ip6h = ipv6_hdr(skb); struct in6_addr addr6; memset(&addr6, 0, sizeof(struct in6_addr)); if (!gs->collect_md) { vni = geneve_hdr(skb)->vni; addr6 = ip6h->daddr; } return geneve6_lookup(gs, addr6, vni) ? 0 : -ENOENT; } #endif return -EPFNOSUPPORT; } static struct socket *geneve_create_sock(struct net *net, bool ipv6, __be16 port, bool ipv6_rx_csum) { struct socket *sock; struct udp_port_cfg udp_conf; int err; memset(&udp_conf, 0, sizeof(udp_conf)); if (ipv6) { udp_conf.family = AF_INET6; udp_conf.ipv6_v6only = 1; udp_conf.use_udp6_rx_checksums = ipv6_rx_csum; } else { udp_conf.family = AF_INET; udp_conf.local_ip.s_addr = htonl(INADDR_ANY); } udp_conf.local_udp_port = port; /* Open UDP socket */ err = udp_sock_create(net, &udp_conf, &sock); if (err < 0) return ERR_PTR(err); udp_allow_gso(sock->sk); return sock; } static int geneve_hlen(struct genevehdr *gh) { return sizeof(*gh) + gh->opt_len * 4; } static struct sk_buff *geneve_gro_receive(struct sock *sk, struct list_head *head, struct sk_buff *skb) { struct sk_buff *pp = NULL; struct sk_buff *p; struct genevehdr *gh, *gh2; unsigned int hlen, gh_len, off_gnv; const struct packet_offload *ptype; __be16 type; int flush = 1; off_gnv = skb_gro_offset(skb); hlen = off_gnv + sizeof(*gh); gh = skb_gro_header(skb, hlen, off_gnv); if (unlikely(!gh)) goto out; if (gh->ver != GENEVE_VER || gh->oam) goto out; gh_len = geneve_hlen(gh); hlen = off_gnv + gh_len; if (!skb_gro_may_pull(skb, hlen)) { gh = skb_gro_header_slow(skb, hlen, off_gnv); if (unlikely(!gh)) goto out; } list_for_each_entry(p, head, list) { if (!NAPI_GRO_CB(p)->same_flow) continue; gh2 = (struct genevehdr *)(p->data + off_gnv); if (gh->opt_len != gh2->opt_len || memcmp(gh, gh2, gh_len)) { NAPI_GRO_CB(p)->same_flow = 0; continue; } } skb_gro_pull(skb, gh_len); skb_gro_postpull_rcsum(skb, gh, gh_len); type = gh->proto_type; if (likely(type == htons(ETH_P_TEB))) return call_gro_receive(eth_gro_receive, head, skb); ptype = gro_find_receive_by_type(type); if (!ptype) goto out; pp = call_gro_receive(ptype->callbacks.gro_receive, head, skb); flush = 0; out: skb_gro_flush_final(skb, pp, flush); return pp; } static int geneve_gro_complete(struct sock *sk, struct sk_buff *skb, int nhoff) { struct genevehdr *gh; struct packet_offload *ptype; __be16 type; int gh_len; int err = -ENOSYS; gh = (struct genevehdr *)(skb->data + nhoff); gh_len = geneve_hlen(gh); type = gh->proto_type; /* since skb->encapsulation is set, eth_gro_complete() sets the inner mac header */ if (likely(type == htons(ETH_P_TEB))) return eth_gro_complete(skb, nhoff + gh_len); ptype = gro_find_complete_by_type(type); if (ptype) err = ptype->callbacks.gro_complete(skb, nhoff + gh_len); skb_set_inner_mac_header(skb, nhoff + gh_len); return err; } /* Create new listen socket if needed */ static struct geneve_sock *geneve_socket_create(struct net *net, __be16 port, bool ipv6, bool ipv6_rx_csum) { struct geneve_net *gn = net_generic(net, geneve_net_id); struct geneve_sock *gs; struct socket *sock; struct udp_tunnel_sock_cfg tunnel_cfg; int h; gs = kzalloc(sizeof(*gs), GFP_KERNEL); if (!gs) return ERR_PTR(-ENOMEM); sock = geneve_create_sock(net, ipv6, port, ipv6_rx_csum); if (IS_ERR(sock)) { kfree(gs); return ERR_CAST(sock); } gs->sock = sock; gs->refcnt = 1; for (h = 0; h < VNI_HASH_SIZE; ++h) INIT_HLIST_HEAD(&gs->vni_list[h]); /* Initialize the geneve udp offloads structure */ udp_tunnel_notify_add_rx_port(gs->sock, UDP_TUNNEL_TYPE_GENEVE); /* Mark socket as an encapsulation socket */ memset(&tunnel_cfg, 0, sizeof(tunnel_cfg)); tunnel_cfg.sk_user_data = gs; tunnel_cfg.encap_type = 1; tunnel_cfg.gro_receive = geneve_gro_receive; tunnel_cfg.gro_complete = geneve_gro_complete; tunnel_cfg.encap_rcv = geneve_udp_encap_recv; tunnel_cfg.encap_err_lookup = geneve_udp_encap_err_lookup; tunnel_cfg.encap_destroy = NULL; setup_udp_tunnel_sock(net, sock, &tunnel_cfg); list_add(&gs->list, &gn->sock_list); return gs; } static void __geneve_sock_release(struct geneve_sock *gs) { if (!gs || --gs->refcnt) return; list_del(&gs->list); udp_tunnel_notify_del_rx_port(gs->sock, UDP_TUNNEL_TYPE_GENEVE); udp_tunnel_sock_release(gs->sock); kfree_rcu(gs, rcu); } static void geneve_sock_release(struct geneve_dev *geneve) { struct geneve_sock *gs4 = rtnl_dereference(geneve->sock4); #if IS_ENABLED(CONFIG_IPV6) struct geneve_sock *gs6 = rtnl_dereference(geneve->sock6); rcu_assign_pointer(geneve->sock6, NULL); #endif rcu_assign_pointer(geneve->sock4, NULL); synchronize_net(); __geneve_sock_release(gs4); #if IS_ENABLED(CONFIG_IPV6) __geneve_sock_release(gs6); #endif } static struct geneve_sock *geneve_find_sock(struct geneve_net *gn, sa_family_t family, __be16 dst_port) { struct geneve_sock *gs; list_for_each_entry(gs, &gn->sock_list, list) { if (inet_sk(gs->sock->sk)->inet_sport == dst_port && geneve_get_sk_family(gs) == family) { return gs; } } return NULL; } static int geneve_sock_add(struct geneve_dev *geneve, bool ipv6) { struct net *net = geneve->net; struct geneve_net *gn = net_generic(net, geneve_net_id); struct geneve_dev_node *node; struct geneve_sock *gs; __u8 vni[3]; __u32 hash; gs = geneve_find_sock(gn, ipv6 ? AF_INET6 : AF_INET, geneve->cfg.info.key.tp_dst); if (gs) { gs->refcnt++; goto out; } gs = geneve_socket_create(net, geneve->cfg.info.key.tp_dst, ipv6, geneve->cfg.use_udp6_rx_checksums); if (IS_ERR(gs)) return PTR_ERR(gs); out: gs->collect_md = geneve->cfg.collect_md; #if IS_ENABLED(CONFIG_IPV6) if (ipv6) { rcu_assign_pointer(geneve->sock6, gs); node = &geneve->hlist6; } else #endif { rcu_assign_pointer(geneve->sock4, gs); node = &geneve->hlist4; } node->geneve = geneve; tunnel_id_to_vni(geneve->cfg.info.key.tun_id, vni); hash = geneve_net_vni_hash(vni); hlist_add_head_rcu(&node->hlist, &gs->vni_list[hash]); return 0; } static int geneve_open(struct net_device *dev) { struct geneve_dev *geneve = netdev_priv(dev); bool metadata = geneve->cfg.collect_md; bool ipv4, ipv6; int ret = 0; ipv6 = geneve->cfg.info.mode & IP_TUNNEL_INFO_IPV6 || metadata; ipv4 = !ipv6 || metadata; #if IS_ENABLED(CONFIG_IPV6) if (ipv6) { ret = geneve_sock_add(geneve, true); if (ret < 0 && ret != -EAFNOSUPPORT) ipv4 = false; } #endif if (ipv4) ret = geneve_sock_add(geneve, false); if (ret < 0) geneve_sock_release(geneve); return ret; } static int geneve_stop(struct net_device *dev) { struct geneve_dev *geneve = netdev_priv(dev); hlist_del_init_rcu(&geneve->hlist4.hlist); #if IS_ENABLED(CONFIG_IPV6) hlist_del_init_rcu(&geneve->hlist6.hlist); #endif geneve_sock_release(geneve); return 0; } static void geneve_build_header(struct genevehdr *geneveh, const struct ip_tunnel_info *info, __be16 inner_proto) { geneveh->ver = GENEVE_VER; geneveh->opt_len = info->options_len / 4; geneveh->oam = test_bit(IP_TUNNEL_OAM_BIT, info->key.tun_flags); geneveh->critical = test_bit(IP_TUNNEL_CRIT_OPT_BIT, info->key.tun_flags); geneveh->rsvd1 = 0; tunnel_id_to_vni(info->key.tun_id, geneveh->vni); geneveh->proto_type = inner_proto; geneveh->rsvd2 = 0; if (test_bit(IP_TUNNEL_GENEVE_OPT_BIT, info->key.tun_flags)) ip_tunnel_info_opts_get(geneveh->options, info); } static int geneve_build_skb(struct dst_entry *dst, struct sk_buff *skb, const struct ip_tunnel_info *info, bool xnet, int ip_hdr_len, bool inner_proto_inherit) { bool udp_sum = test_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags); struct genevehdr *gnvh; __be16 inner_proto; int min_headroom; int err; skb_reset_mac_header(skb); skb_scrub_packet(skb, xnet); min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len + GENEVE_BASE_HLEN + info->options_len + ip_hdr_len; err = skb_cow_head(skb, min_headroom); if (unlikely(err)) goto free_dst; err = udp_tunnel_handle_offloads(skb, udp_sum); if (err) goto free_dst; gnvh = __skb_push(skb, sizeof(*gnvh) + info->options_len); inner_proto = inner_proto_inherit ? skb->protocol : htons(ETH_P_TEB); geneve_build_header(gnvh, info, inner_proto); skb_set_inner_protocol(skb, inner_proto); return 0; free_dst: dst_release(dst); return err; } static u8 geneve_get_dsfield(struct sk_buff *skb, struct net_device *dev, const struct ip_tunnel_info *info, bool *use_cache) { struct geneve_dev *geneve = netdev_priv(dev); u8 dsfield; dsfield = info->key.tos; if (dsfield == 1 && !geneve->cfg.collect_md) { dsfield = ip_tunnel_get_dsfield(ip_hdr(skb), skb); *use_cache = false; } return dsfield; } static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev, struct geneve_dev *geneve, const struct ip_tunnel_info *info) { bool inner_proto_inherit = geneve->cfg.inner_proto_inherit; bool xnet = !net_eq(geneve->net, dev_net(geneve->dev)); struct geneve_sock *gs4 = rcu_dereference(geneve->sock4); const struct ip_tunnel_key *key = &info->key; struct rtable *rt; bool use_cache; __u8 tos, ttl; __be16 df = 0; __be32 saddr; __be16 sport; int err; if (skb_vlan_inet_prepare(skb, inner_proto_inherit)) return -EINVAL; if (!gs4) return -EIO; use_cache = ip_tunnel_dst_cache_usable(skb, info); tos = geneve_get_dsfield(skb, dev, info, &use_cache); sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min, geneve->cfg.port_max, true); rt = udp_tunnel_dst_lookup(skb, dev, geneve->net, 0, &saddr, &info->key, sport, geneve->cfg.info.key.tp_dst, tos, use_cache ? (struct dst_cache *)&info->dst_cache : NULL); if (IS_ERR(rt)) return PTR_ERR(rt); err = skb_tunnel_check_pmtu(skb, &rt->dst, GENEVE_IPV4_HLEN + info->options_len, netif_is_any_bridge_port(dev)); if (err < 0) { dst_release(&rt->dst); return err; } else if (err) { struct ip_tunnel_info *info; info = skb_tunnel_info(skb); if (info) { struct ip_tunnel_info *unclone; unclone = skb_tunnel_info_unclone(skb); if (unlikely(!unclone)) { dst_release(&rt->dst); return -ENOMEM; } unclone->key.u.ipv4.dst = saddr; unclone->key.u.ipv4.src = info->key.u.ipv4.dst; } if (!pskb_may_pull(skb, ETH_HLEN)) { dst_release(&rt->dst); return -EINVAL; } skb->protocol = eth_type_trans(skb, geneve->dev); __netif_rx(skb); dst_release(&rt->dst); return -EMSGSIZE; } tos = ip_tunnel_ecn_encap(tos, ip_hdr(skb), skb); if (geneve->cfg.collect_md) { ttl = key->ttl; df = test_bit(IP_TUNNEL_DONT_FRAGMENT_BIT, key->tun_flags) ? htons(IP_DF) : 0; } else { if (geneve->cfg.ttl_inherit) ttl = ip_tunnel_get_ttl(ip_hdr(skb), skb); else ttl = key->ttl; ttl = ttl ? : ip4_dst_hoplimit(&rt->dst); if (geneve->cfg.df == GENEVE_DF_SET) { df = htons(IP_DF); } else if (geneve->cfg.df == GENEVE_DF_INHERIT) { struct ethhdr *eth = skb_eth_hdr(skb); if (ntohs(eth->h_proto) == ETH_P_IPV6) { df = htons(IP_DF); } else if (ntohs(eth->h_proto) == ETH_P_IP) { struct iphdr *iph = ip_hdr(skb); if (iph->frag_off & htons(IP_DF)) df = htons(IP_DF); } } } err = geneve_build_skb(&rt->dst, skb, info, xnet, sizeof(struct iphdr), inner_proto_inherit); if (unlikely(err)) return err; udp_tunnel_xmit_skb(rt, gs4->sock->sk, skb, saddr, info->key.u.ipv4.dst, tos, ttl, df, sport, geneve->cfg.info.key.tp_dst, !net_eq(geneve->net, dev_net(geneve->dev)), !test_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags), 0); return 0; } #if IS_ENABLED(CONFIG_IPV6) static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev, struct geneve_dev *geneve, const struct ip_tunnel_info *info) { bool inner_proto_inherit = geneve->cfg.inner_proto_inherit; bool xnet = !net_eq(geneve->net, dev_net(geneve->dev)); struct geneve_sock *gs6 = rcu_dereference(geneve->sock6); const struct ip_tunnel_key *key = &info->key; struct dst_entry *dst = NULL; struct in6_addr saddr; bool use_cache; __u8 prio, ttl; __be16 sport; int err; if (skb_vlan_inet_prepare(skb, inner_proto_inherit)) return -EINVAL; if (!gs6) return -EIO; use_cache = ip_tunnel_dst_cache_usable(skb, info); prio = geneve_get_dsfield(skb, dev, info, &use_cache); sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min, geneve->cfg.port_max, true); dst = udp_tunnel6_dst_lookup(skb, dev, geneve->net, gs6->sock, 0, &saddr, key, sport, geneve->cfg.info.key.tp_dst, prio, use_cache ? (struct dst_cache *)&info->dst_cache : NULL); if (IS_ERR(dst)) return PTR_ERR(dst); err = skb_tunnel_check_pmtu(skb, dst, GENEVE_IPV6_HLEN + info->options_len, netif_is_any_bridge_port(dev)); if (err < 0) { dst_release(dst); return err; } else if (err) { struct ip_tunnel_info *info = skb_tunnel_info(skb); if (info) { struct ip_tunnel_info *unclone; unclone = skb_tunnel_info_unclone(skb); if (unlikely(!unclone)) { dst_release(dst); return -ENOMEM; } unclone->key.u.ipv6.dst = saddr; unclone->key.u.ipv6.src = info->key.u.ipv6.dst; } if (!pskb_may_pull(skb, ETH_HLEN)) { dst_release(dst); return -EINVAL; } skb->protocol = eth_type_trans(skb, geneve->dev); __netif_rx(skb); dst_release(dst); return -EMSGSIZE; } prio = ip_tunnel_ecn_encap(prio, ip_hdr(skb), skb); if (geneve->cfg.collect_md) { ttl = key->ttl; } else { if (geneve->cfg.ttl_inherit) ttl = ip_tunnel_get_ttl(ip_hdr(skb), skb); else ttl = key->ttl; ttl = ttl ? : ip6_dst_hoplimit(dst); } err = geneve_build_skb(dst, skb, info, xnet, sizeof(struct ipv6hdr), inner_proto_inherit); if (unlikely(err)) return err; udp_tunnel6_xmit_skb(dst, gs6->sock->sk, skb, dev, &saddr, &key->u.ipv6.dst, prio, ttl, info->key.label, sport, geneve->cfg.info.key.tp_dst, !test_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags), 0); return 0; } #endif static netdev_tx_t geneve_xmit(struct sk_buff *skb, struct net_device *dev) { struct geneve_dev *geneve = netdev_priv(dev); struct ip_tunnel_info *info = NULL; int err; if (geneve->cfg.collect_md) { info = skb_tunnel_info(skb); if (unlikely(!info || !(info->mode & IP_TUNNEL_INFO_TX))) { netdev_dbg(dev, "no tunnel metadata\n"); dev_kfree_skb(skb); dev_dstats_tx_dropped(dev); return NETDEV_TX_OK; } } else { info = &geneve->cfg.info; } rcu_read_lock(); #if IS_ENABLED(CONFIG_IPV6) if (info->mode & IP_TUNNEL_INFO_IPV6) err = geneve6_xmit_skb(skb, dev, geneve, info); else #endif err = geneve_xmit_skb(skb, dev, geneve, info); rcu_read_unlock(); if (likely(!err)) return NETDEV_TX_OK; if (err != -EMSGSIZE) dev_kfree_skb(skb); if (err == -ELOOP) DEV_STATS_INC(dev, collisions); else if (err == -ENETUNREACH) DEV_STATS_INC(dev, tx_carrier_errors); DEV_STATS_INC(dev, tx_errors); return NETDEV_TX_OK; } static int geneve_change_mtu(struct net_device *dev, int new_mtu) { if (new_mtu > dev->max_mtu) new_mtu = dev->max_mtu; else if (new_mtu < dev->min_mtu) new_mtu = dev->min_mtu; WRITE_ONCE(dev->mtu, new_mtu); return 0; } static int geneve_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb) { struct ip_tunnel_info *info = skb_tunnel_info(skb); struct geneve_dev *geneve = netdev_priv(dev); __be16 sport; if (ip_tunnel_info_af(info) == AF_INET) { struct rtable *rt; struct geneve_sock *gs4 = rcu_dereference(geneve->sock4); bool use_cache; __be32 saddr; u8 tos; if (!gs4) return -EIO; use_cache = ip_tunnel_dst_cache_usable(skb, info); tos = geneve_get_dsfield(skb, dev, info, &use_cache); sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min, geneve->cfg.port_max, true); rt = udp_tunnel_dst_lookup(skb, dev, geneve->net, 0, &saddr, &info->key, sport, geneve->cfg.info.key.tp_dst, tos, use_cache ? &info->dst_cache : NULL); if (IS_ERR(rt)) return PTR_ERR(rt); ip_rt_put(rt); info->key.u.ipv4.src = saddr; #if IS_ENABLED(CONFIG_IPV6) } else if (ip_tunnel_info_af(info) == AF_INET6) { struct dst_entry *dst; struct geneve_sock *gs6 = rcu_dereference(geneve->sock6); struct in6_addr saddr; bool use_cache; u8 prio; if (!gs6) return -EIO; use_cache = ip_tunnel_dst_cache_usable(skb, info); prio = geneve_get_dsfield(skb, dev, info, &use_cache); sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min, geneve->cfg.port_max, true); dst = udp_tunnel6_dst_lookup(skb, dev, geneve->net, gs6->sock, 0, &saddr, &info->key, sport, geneve->cfg.info.key.tp_dst, prio, use_cache ? &info->dst_cache : NULL); if (IS_ERR(dst)) return PTR_ERR(dst); dst_release(dst); info->key.u.ipv6.src = saddr; #endif } else { return -EINVAL; } info->key.tp_src = sport; info->key.tp_dst = geneve->cfg.info.key.tp_dst; return 0; } static const struct net_device_ops geneve_netdev_ops = { .ndo_init = geneve_init, .ndo_uninit = geneve_uninit, .ndo_open = geneve_open, .ndo_stop = geneve_stop, .ndo_start_xmit = geneve_xmit, .ndo_change_mtu = geneve_change_mtu, .ndo_validate_addr = eth_validate_addr, .ndo_set_mac_address = eth_mac_addr, .ndo_fill_metadata_dst = geneve_fill_metadata_dst, }; static void geneve_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *drvinfo) { strscpy(drvinfo->version, GENEVE_NETDEV_VER, sizeof(drvinfo->version)); strscpy(drvinfo->driver, "geneve", sizeof(drvinfo->driver)); } static const struct ethtool_ops geneve_ethtool_ops = { .get_drvinfo = geneve_get_drvinfo, .get_link = ethtool_op_get_link, }; /* Info for udev, that this is a virtual tunnel endpoint */ static const struct device_type geneve_type = { .name = "geneve", }; /* Calls the ndo_udp_tunnel_add of the caller in order to * supply the listening GENEVE udp ports. Callers are expected * to implement the ndo_udp_tunnel_add. */ static void geneve_offload_rx_ports(struct net_device *dev, bool push) { struct net *net = dev_net(dev); struct geneve_net *gn = net_generic(net, geneve_net_id); struct geneve_sock *gs; ASSERT_RTNL(); list_for_each_entry(gs, &gn->sock_list, list) { if (push) { udp_tunnel_push_rx_port(dev, gs->sock, UDP_TUNNEL_TYPE_GENEVE); } else { udp_tunnel_drop_rx_port(dev, gs->sock, UDP_TUNNEL_TYPE_GENEVE); } } } /* Initialize the device structure. */ static void geneve_setup(struct net_device *dev) { ether_setup(dev); dev->netdev_ops = &geneve_netdev_ops; dev->ethtool_ops = &geneve_ethtool_ops; dev->needs_free_netdev = true; SET_NETDEV_DEVTYPE(dev, &geneve_type); dev->features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_FRAGLIST; dev->features |= NETIF_F_RXCSUM; dev->features |= NETIF_F_GSO_SOFTWARE; dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_FRAGLIST; dev->hw_features |= NETIF_F_RXCSUM; dev->hw_features |= NETIF_F_GSO_SOFTWARE; dev->pcpu_stat_type = NETDEV_PCPU_STAT_DSTATS; /* MTU range: 68 - (something less than 65535) */ dev->min_mtu = ETH_MIN_MTU; /* The max_mtu calculation does not take account of GENEVE * options, to avoid excluding potentially valid * configurations. This will be further reduced by IPvX hdr size. */ dev->max_mtu = IP_MAX_MTU - GENEVE_BASE_HLEN - dev->hard_header_len; netif_keep_dst(dev); dev->priv_flags &= ~IFF_TX_SKB_SHARING; dev->priv_flags |= IFF_LIVE_ADDR_CHANGE | IFF_NO_QUEUE; dev->lltx = true; eth_hw_addr_random(dev); } static const struct nla_policy geneve_policy[IFLA_GENEVE_MAX + 1] = { [IFLA_GENEVE_UNSPEC] = { .strict_start_type = IFLA_GENEVE_INNER_PROTO_INHERIT }, [IFLA_GENEVE_ID] = { .type = NLA_U32 }, [IFLA_GENEVE_REMOTE] = { .len = sizeof_field(struct iphdr, daddr) }, [IFLA_GENEVE_REMOTE6] = { .len = sizeof(struct in6_addr) }, [IFLA_GENEVE_TTL] = { .type = NLA_U8 }, [IFLA_GENEVE_TOS] = { .type = NLA_U8 }, [IFLA_GENEVE_LABEL] = { .type = NLA_U32 }, [IFLA_GENEVE_PORT] = { .type = NLA_U16 }, [IFLA_GENEVE_COLLECT_METADATA] = { .type = NLA_FLAG }, [IFLA_GENEVE_UDP_CSUM] = { .type = NLA_U8 }, [IFLA_GENEVE_UDP_ZERO_CSUM6_TX] = { .type = NLA_U8 }, [IFLA_GENEVE_UDP_ZERO_CSUM6_RX] = { .type = NLA_U8 }, [IFLA_GENEVE_TTL_INHERIT] = { .type = NLA_U8 }, [IFLA_GENEVE_DF] = { .type = NLA_U8 }, [IFLA_GENEVE_INNER_PROTO_INHERIT] = { .type = NLA_FLAG }, [IFLA_GENEVE_PORT_RANGE] = NLA_POLICY_EXACT_LEN(sizeof(struct ifla_geneve_port_range)), }; static int geneve_validate(struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack) { if (tb[IFLA_ADDRESS]) { if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN) { NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_ADDRESS], "Provided link layer address is not Ethernet"); return -EINVAL; } if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS]))) { NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_ADDRESS], "Provided Ethernet address is not unicast"); return -EADDRNOTAVAIL; } } if (!data) { NL_SET_ERR_MSG(extack, "Not enough attributes provided to perform the operation"); return -EINVAL; } if (data[IFLA_GENEVE_ID]) { __u32 vni = nla_get_u32(data[IFLA_GENEVE_ID]); if (vni >= GENEVE_N_VID) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_ID], "Geneve ID must be lower than 16777216"); return -ERANGE; } } if (data[IFLA_GENEVE_DF]) { enum ifla_geneve_df df = nla_get_u8(data[IFLA_GENEVE_DF]); if (df < 0 || df > GENEVE_DF_MAX) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_DF], "Invalid DF attribute"); return -EINVAL; } } if (data[IFLA_GENEVE_PORT_RANGE]) { const struct ifla_geneve_port_range *p; p = nla_data(data[IFLA_GENEVE_PORT_RANGE]); if (ntohs(p->high) < ntohs(p->low)) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_PORT_RANGE], "Invalid source port range"); return -EINVAL; } } return 0; } static struct geneve_dev *geneve_find_dev(struct geneve_net *gn, const struct ip_tunnel_info *info, bool *tun_on_same_port, bool *tun_collect_md) { struct geneve_dev *geneve, *t = NULL; *tun_on_same_port = false; *tun_collect_md = false; list_for_each_entry(geneve, &gn->geneve_list, next) { if (info->key.tp_dst == geneve->cfg.info.key.tp_dst) { *tun_collect_md = geneve->cfg.collect_md; *tun_on_same_port = true; } if (info->key.tun_id == geneve->cfg.info.key.tun_id && info->key.tp_dst == geneve->cfg.info.key.tp_dst && !memcmp(&info->key.u, &geneve->cfg.info.key.u, sizeof(info->key.u))) t = geneve; } return t; } static bool is_tnl_info_zero(const struct ip_tunnel_info *info) { return !(info->key.tun_id || info->key.tos || !ip_tunnel_flags_empty(info->key.tun_flags) || info->key.ttl || info->key.label || info->key.tp_src || memchr_inv(&info->key.u, 0, sizeof(info->key.u))); } static bool geneve_dst_addr_equal(struct ip_tunnel_info *a, struct ip_tunnel_info *b) { if (ip_tunnel_info_af(a) == AF_INET) return a->key.u.ipv4.dst == b->key.u.ipv4.dst; else return ipv6_addr_equal(&a->key.u.ipv6.dst, &b->key.u.ipv6.dst); } static int geneve_configure(struct net *net, struct net_device *dev, struct netlink_ext_ack *extack, const struct geneve_config *cfg) { struct geneve_net *gn = net_generic(net, geneve_net_id); struct geneve_dev *t, *geneve = netdev_priv(dev); const struct ip_tunnel_info *info = &cfg->info; bool tun_collect_md, tun_on_same_port; int err, encap_len; if (cfg->collect_md && !is_tnl_info_zero(info)) { NL_SET_ERR_MSG(extack, "Device is externally controlled, so attributes (VNI, Port, and so on) must not be specified"); return -EINVAL; } geneve->net = net; geneve->dev = dev; t = geneve_find_dev(gn, info, &tun_on_same_port, &tun_collect_md); if (t) return -EBUSY; /* make enough headroom for basic scenario */ encap_len = GENEVE_BASE_HLEN + ETH_HLEN; if (!cfg->collect_md && ip_tunnel_info_af(info) == AF_INET) { encap_len += sizeof(struct iphdr); dev->max_mtu -= sizeof(struct iphdr); } else { encap_len += sizeof(struct ipv6hdr); dev->max_mtu -= sizeof(struct ipv6hdr); } dev->needed_headroom = encap_len + ETH_HLEN; if (cfg->collect_md) { if (tun_on_same_port) { NL_SET_ERR_MSG(extack, "There can be only one externally controlled device on a destination port"); return -EPERM; } } else { if (tun_collect_md) { NL_SET_ERR_MSG(extack, "There already exists an externally controlled device on this destination port"); return -EPERM; } } dst_cache_reset(&geneve->cfg.info.dst_cache); memcpy(&geneve->cfg, cfg, sizeof(*cfg)); if (geneve->cfg.inner_proto_inherit) { dev->header_ops = NULL; dev->type = ARPHRD_NONE; dev->hard_header_len = 0; dev->addr_len = 0; dev->flags = IFF_POINTOPOINT | IFF_NOARP; } err = register_netdevice(dev); if (err) return err; list_add(&geneve->next, &gn->geneve_list); return 0; } static void init_tnl_info(struct ip_tunnel_info *info, __u16 dst_port) { memset(info, 0, sizeof(*info)); info->key.tp_dst = htons(dst_port); } static int geneve_nl2info(struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack, struct geneve_config *cfg, bool changelink) { struct ip_tunnel_info *info = &cfg->info; int attrtype; if (data[IFLA_GENEVE_REMOTE] && data[IFLA_GENEVE_REMOTE6]) { NL_SET_ERR_MSG(extack, "Cannot specify both IPv4 and IPv6 Remote addresses"); return -EINVAL; } if (data[IFLA_GENEVE_REMOTE]) { if (changelink && (ip_tunnel_info_af(info) == AF_INET6)) { attrtype = IFLA_GENEVE_REMOTE; goto change_notsup; } info->key.u.ipv4.dst = nla_get_in_addr(data[IFLA_GENEVE_REMOTE]); if (ipv4_is_multicast(info->key.u.ipv4.dst)) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_REMOTE], "Remote IPv4 address cannot be Multicast"); return -EINVAL; } } if (data[IFLA_GENEVE_REMOTE6]) { #if IS_ENABLED(CONFIG_IPV6) if (changelink && (ip_tunnel_info_af(info) == AF_INET)) { attrtype = IFLA_GENEVE_REMOTE6; goto change_notsup; } info->mode = IP_TUNNEL_INFO_IPV6; info->key.u.ipv6.dst = nla_get_in6_addr(data[IFLA_GENEVE_REMOTE6]); if (ipv6_addr_type(&info->key.u.ipv6.dst) & IPV6_ADDR_LINKLOCAL) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_REMOTE6], "Remote IPv6 address cannot be link-local"); return -EINVAL; } if (ipv6_addr_is_multicast(&info->key.u.ipv6.dst)) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_REMOTE6], "Remote IPv6 address cannot be Multicast"); return -EINVAL; } __set_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags); cfg->use_udp6_rx_checksums = true; #else NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_REMOTE6], "IPv6 support not enabled in the kernel"); return -EPFNOSUPPORT; #endif } if (data[IFLA_GENEVE_ID]) { __u32 vni; __u8 tvni[3]; __be64 tunid; vni = nla_get_u32(data[IFLA_GENEVE_ID]); tvni[0] = (vni & 0x00ff0000) >> 16; tvni[1] = (vni & 0x0000ff00) >> 8; tvni[2] = vni & 0x000000ff; tunid = vni_to_tunnel_id(tvni); if (changelink && (tunid != info->key.tun_id)) { attrtype = IFLA_GENEVE_ID; goto change_notsup; } info->key.tun_id = tunid; } if (data[IFLA_GENEVE_TTL_INHERIT]) { if (nla_get_u8(data[IFLA_GENEVE_TTL_INHERIT])) cfg->ttl_inherit = true; else cfg->ttl_inherit = false; } else if (data[IFLA_GENEVE_TTL]) { info->key.ttl = nla_get_u8(data[IFLA_GENEVE_TTL]); cfg->ttl_inherit = false; } if (data[IFLA_GENEVE_TOS]) info->key.tos = nla_get_u8(data[IFLA_GENEVE_TOS]); if (data[IFLA_GENEVE_DF]) cfg->df = nla_get_u8(data[IFLA_GENEVE_DF]); if (data[IFLA_GENEVE_LABEL]) { info->key.label = nla_get_be32(data[IFLA_GENEVE_LABEL]) & IPV6_FLOWLABEL_MASK; if (info->key.label && (!(info->mode & IP_TUNNEL_INFO_IPV6))) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_LABEL], "Label attribute only applies for IPv6 Geneve devices"); return -EINVAL; } } if (data[IFLA_GENEVE_PORT]) { if (changelink) { attrtype = IFLA_GENEVE_PORT; goto change_notsup; } info->key.tp_dst = nla_get_be16(data[IFLA_GENEVE_PORT]); } if (data[IFLA_GENEVE_PORT_RANGE]) { const struct ifla_geneve_port_range *p; if (changelink) { attrtype = IFLA_GENEVE_PORT_RANGE; goto change_notsup; } p = nla_data(data[IFLA_GENEVE_PORT_RANGE]); cfg->port_min = ntohs(p->low); cfg->port_max = ntohs(p->high); } if (data[IFLA_GENEVE_COLLECT_METADATA]) { if (changelink) { attrtype = IFLA_GENEVE_COLLECT_METADATA; goto change_notsup; } cfg->collect_md = true; } if (data[IFLA_GENEVE_UDP_CSUM]) { if (changelink) { attrtype = IFLA_GENEVE_UDP_CSUM; goto change_notsup; } if (nla_get_u8(data[IFLA_GENEVE_UDP_CSUM])) __set_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags); } if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX]) { #if IS_ENABLED(CONFIG_IPV6) if (changelink) { attrtype = IFLA_GENEVE_UDP_ZERO_CSUM6_TX; goto change_notsup; } if (nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX])) __clear_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags); #else NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX], "IPv6 support not enabled in the kernel"); return -EPFNOSUPPORT; #endif } if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_RX]) { #if IS_ENABLED(CONFIG_IPV6) if (changelink) { attrtype = IFLA_GENEVE_UDP_ZERO_CSUM6_RX; goto change_notsup; } if (nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_RX])) cfg->use_udp6_rx_checksums = false; #else NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_UDP_ZERO_CSUM6_RX], "IPv6 support not enabled in the kernel"); return -EPFNOSUPPORT; #endif } if (data[IFLA_GENEVE_INNER_PROTO_INHERIT]) { if (changelink) { attrtype = IFLA_GENEVE_INNER_PROTO_INHERIT; goto change_notsup; } cfg->inner_proto_inherit = true; } return 0; change_notsup: NL_SET_ERR_MSG_ATTR(extack, data[attrtype], "Changing VNI, Port, endpoint IP address family, external, inner_proto_inherit, and UDP checksum attributes are not supported"); return -EOPNOTSUPP; } static void geneve_link_config(struct net_device *dev, struct ip_tunnel_info *info, struct nlattr *tb[]) { struct geneve_dev *geneve = netdev_priv(dev); int ldev_mtu = 0; if (tb[IFLA_MTU]) { geneve_change_mtu(dev, nla_get_u32(tb[IFLA_MTU])); return; } switch (ip_tunnel_info_af(info)) { case AF_INET: { struct flowi4 fl4 = { .daddr = info->key.u.ipv4.dst }; struct rtable *rt = ip_route_output_key(geneve->net, &fl4); if (!IS_ERR(rt) && rt->dst.dev) { ldev_mtu = rt->dst.dev->mtu - GENEVE_IPV4_HLEN; ip_rt_put(rt); } break; } #if IS_ENABLED(CONFIG_IPV6) case AF_INET6: { struct rt6_info *rt; if (!__in6_dev_get(dev)) break; rt = rt6_lookup(geneve->net, &info->key.u.ipv6.dst, NULL, 0, NULL, 0); if (rt && rt->dst.dev) ldev_mtu = rt->dst.dev->mtu - GENEVE_IPV6_HLEN; ip6_rt_put(rt); break; } #endif } if (ldev_mtu <= 0) return; geneve_change_mtu(dev, ldev_mtu - info->options_len); } static int geneve_newlink(struct net_device *dev, struct rtnl_newlink_params *params, struct netlink_ext_ack *extack) { struct net *link_net = rtnl_newlink_link_net(params); struct nlattr **data = params->data; struct nlattr **tb = params->tb; struct geneve_config cfg = { .df = GENEVE_DF_UNSET, .use_udp6_rx_checksums = false, .ttl_inherit = false, .collect_md = false, .port_min = 1, .port_max = USHRT_MAX, }; int err; init_tnl_info(&cfg.info, GENEVE_UDP_PORT); err = geneve_nl2info(tb, data, extack, &cfg, false); if (err) return err; err = geneve_configure(link_net, dev, extack, &cfg); if (err) return err; geneve_link_config(dev, &cfg.info, tb); return 0; } /* Quiesces the geneve device data path for both TX and RX. * * On transmit geneve checks for non-NULL geneve_sock before it proceeds. * So, if we set that socket to NULL under RCU and wait for synchronize_net() * to complete for the existing set of in-flight packets to be transmitted, * then we would have quiesced the transmit data path. All the future packets * will get dropped until we unquiesce the data path. * * On receive geneve dereference the geneve_sock stashed in the socket. So, * if we set that to NULL under RCU and wait for synchronize_net() to * complete, then we would have quiesced the receive data path. */ static void geneve_quiesce(struct geneve_dev *geneve, struct geneve_sock **gs4, struct geneve_sock **gs6) { *gs4 = rtnl_dereference(geneve->sock4); rcu_assign_pointer(geneve->sock4, NULL); if (*gs4) rcu_assign_sk_user_data((*gs4)->sock->sk, NULL); #if IS_ENABLED(CONFIG_IPV6) *gs6 = rtnl_dereference(geneve->sock6); rcu_assign_pointer(geneve->sock6, NULL); if (*gs6) rcu_assign_sk_user_data((*gs6)->sock->sk, NULL); #else *gs6 = NULL; #endif synchronize_net(); } /* Resumes the geneve device data path for both TX and RX. */ static void geneve_unquiesce(struct geneve_dev *geneve, struct geneve_sock *gs4, struct geneve_sock __maybe_unused *gs6) { rcu_assign_pointer(geneve->sock4, gs4); if (gs4) rcu_assign_sk_user_data(gs4->sock->sk, gs4); #if IS_ENABLED(CONFIG_IPV6) rcu_assign_pointer(geneve->sock6, gs6); if (gs6) rcu_assign_sk_user_data(gs6->sock->sk, gs6); #endif synchronize_net(); } static int geneve_changelink(struct net_device *dev, struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack) { struct geneve_dev *geneve = netdev_priv(dev); struct geneve_sock *gs4, *gs6; struct geneve_config cfg; int err; /* If the geneve device is configured for metadata (or externally * controlled, for example, OVS), then nothing can be changed. */ if (geneve->cfg.collect_md) return -EOPNOTSUPP; /* Start with the existing info. */ memcpy(&cfg, &geneve->cfg, sizeof(cfg)); err = geneve_nl2info(tb, data, extack, &cfg, true); if (err) return err; if (!geneve_dst_addr_equal(&geneve->cfg.info, &cfg.info)) { dst_cache_reset(&cfg.info.dst_cache); geneve_link_config(dev, &cfg.info, tb); } geneve_quiesce(geneve, &gs4, &gs6); memcpy(&geneve->cfg, &cfg, sizeof(cfg)); geneve_unquiesce(geneve, gs4, gs6); return 0; } static void geneve_dellink(struct net_device *dev, struct list_head *head) { struct geneve_dev *geneve = netdev_priv(dev); list_del(&geneve->next); unregister_netdevice_queue(dev, head); } static size_t geneve_get_size(const struct net_device *dev) { return nla_total_size(sizeof(__u32)) + /* IFLA_GENEVE_ID */ nla_total_size(sizeof(struct in6_addr)) + /* IFLA_GENEVE_REMOTE{6} */ nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_TTL */ nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_TOS */ nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_DF */ nla_total_size(sizeof(__be32)) + /* IFLA_GENEVE_LABEL */ nla_total_size(sizeof(__be16)) + /* IFLA_GENEVE_PORT */ nla_total_size(0) + /* IFLA_GENEVE_COLLECT_METADATA */ nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_CSUM */ nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_ZERO_CSUM6_TX */ nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_ZERO_CSUM6_RX */ nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_TTL_INHERIT */ nla_total_size(0) + /* IFLA_GENEVE_INNER_PROTO_INHERIT */ nla_total_size(sizeof(struct ifla_geneve_port_range)) + /* IFLA_GENEVE_PORT_RANGE */ 0; } static int geneve_fill_info(struct sk_buff *skb, const struct net_device *dev) { struct geneve_dev *geneve = netdev_priv(dev); struct ip_tunnel_info *info = &geneve->cfg.info; bool ttl_inherit = geneve->cfg.ttl_inherit; bool metadata = geneve->cfg.collect_md; struct ifla_geneve_port_range ports = { .low = htons(geneve->cfg.port_min), .high = htons(geneve->cfg.port_max), }; __u8 tmp_vni[3]; __u32 vni; tunnel_id_to_vni(info->key.tun_id, tmp_vni); vni = (tmp_vni[0] << 16) | (tmp_vni[1] << 8) | tmp_vni[2]; if (nla_put_u32(skb, IFLA_GENEVE_ID, vni)) goto nla_put_failure; if (!metadata && ip_tunnel_info_af(info) == AF_INET) { if (nla_put_in_addr(skb, IFLA_GENEVE_REMOTE, info->key.u.ipv4.dst)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM, test_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags))) goto nla_put_failure; #if IS_ENABLED(CONFIG_IPV6) } else if (!metadata) { if (nla_put_in6_addr(skb, IFLA_GENEVE_REMOTE6, &info->key.u.ipv6.dst)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX, !test_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags))) goto nla_put_failure; #endif } if (nla_put_u8(skb, IFLA_GENEVE_TTL, info->key.ttl) || nla_put_u8(skb, IFLA_GENEVE_TOS, info->key.tos) || nla_put_be32(skb, IFLA_GENEVE_LABEL, info->key.label)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_GENEVE_DF, geneve->cfg.df)) goto nla_put_failure; if (nla_put_be16(skb, IFLA_GENEVE_PORT, info->key.tp_dst)) goto nla_put_failure; if (metadata && nla_put_flag(skb, IFLA_GENEVE_COLLECT_METADATA)) goto nla_put_failure; #if IS_ENABLED(CONFIG_IPV6) if (nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX, !geneve->cfg.use_udp6_rx_checksums)) goto nla_put_failure; #endif if (nla_put_u8(skb, IFLA_GENEVE_TTL_INHERIT, ttl_inherit)) goto nla_put_failure; if (geneve->cfg.inner_proto_inherit && nla_put_flag(skb, IFLA_GENEVE_INNER_PROTO_INHERIT)) goto nla_put_failure; if (nla_put(skb, IFLA_GENEVE_PORT_RANGE, sizeof(ports), &ports)) goto nla_put_failure; return 0; nla_put_failure: return -EMSGSIZE; } static struct rtnl_link_ops geneve_link_ops __read_mostly = { .kind = "geneve", .maxtype = IFLA_GENEVE_MAX, .policy = geneve_policy, .priv_size = sizeof(struct geneve_dev), .setup = geneve_setup, .validate = geneve_validate, .newlink = geneve_newlink, .changelink = geneve_changelink, .dellink = geneve_dellink, .get_size = geneve_get_size, .fill_info = geneve_fill_info, }; struct net_device *geneve_dev_create_fb(struct net *net, const char *name, u8 name_assign_type, u16 dst_port) { struct nlattr *tb[IFLA_MAX + 1]; struct net_device *dev; LIST_HEAD(list_kill); int err; struct geneve_config cfg = { .df = GENEVE_DF_UNSET, .use_udp6_rx_checksums = true, .ttl_inherit = false, .collect_md = true, .port_min = 1, .port_max = USHRT_MAX, }; memset(tb, 0, sizeof(tb)); dev = rtnl_create_link(net, name, name_assign_type, &geneve_link_ops, tb, NULL); if (IS_ERR(dev)) return dev; init_tnl_info(&cfg.info, dst_port); err = geneve_configure(net, dev, NULL, &cfg); if (err) { free_netdev(dev); return ERR_PTR(err); } /* openvswitch users expect packet sizes to be unrestricted, * so set the largest MTU we can. */ err = geneve_change_mtu(dev, IP_MAX_MTU); if (err) goto err; err = rtnl_configure_link(dev, NULL, 0, NULL); if (err < 0) goto err; return dev; err: geneve_dellink(dev, &list_kill); unregister_netdevice_many(&list_kill); return ERR_PTR(err); } EXPORT_SYMBOL_GPL(geneve_dev_create_fb); static int geneve_netdevice_event(struct notifier_block *unused, unsigned long event, void *ptr) { struct net_device *dev = netdev_notifier_info_to_dev(ptr); if (event == NETDEV_UDP_TUNNEL_PUSH_INFO) geneve_offload_rx_ports(dev, true); else if (event == NETDEV_UDP_TUNNEL_DROP_INFO) geneve_offload_rx_ports(dev, false); return NOTIFY_DONE; } static struct notifier_block geneve_notifier_block __read_mostly = { .notifier_call = geneve_netdevice_event, }; static __net_init int geneve_init_net(struct net *net) { struct geneve_net *gn = net_generic(net, geneve_net_id); INIT_LIST_HEAD(&gn->geneve_list); INIT_LIST_HEAD(&gn->sock_list); return 0; } static void __net_exit geneve_exit_rtnl_net(struct net *net, struct list_head *dev_to_kill) { struct geneve_net *gn = net_generic(net, geneve_net_id); struct geneve_dev *geneve, *next; list_for_each_entry_safe(geneve, next, &gn->geneve_list, next) geneve_dellink(geneve->dev, dev_to_kill); } static void __net_exit geneve_exit_net(struct net *net) { const struct geneve_net *gn = net_generic(net, geneve_net_id); WARN_ON_ONCE(!list_empty(&gn->sock_list)); } static struct pernet_operations geneve_net_ops = { .init = geneve_init_net, .exit_rtnl = geneve_exit_rtnl_net, .exit = geneve_exit_net, .id = &geneve_net_id, .size = sizeof(struct geneve_net), }; static int __init geneve_init_module(void) { int rc; rc = register_pernet_subsys(&geneve_net_ops); if (rc) goto out1; rc = register_netdevice_notifier(&geneve_notifier_block); if (rc) goto out2; rc = rtnl_link_register(&geneve_link_ops); if (rc) goto out3; return 0; out3: unregister_netdevice_notifier(&geneve_notifier_block); out2: unregister_pernet_subsys(&geneve_net_ops); out1: return rc; } late_initcall(geneve_init_module); static void __exit geneve_cleanup_module(void) { rtnl_link_unregister(&geneve_link_ops); unregister_netdevice_notifier(&geneve_notifier_block); unregister_pernet_subsys(&geneve_net_ops); } module_exit(geneve_cleanup_module); MODULE_LICENSE("GPL"); MODULE_VERSION(GENEVE_NETDEV_VER); MODULE_AUTHOR("John W. Linville <linville@tuxdriver.com>"); MODULE_DESCRIPTION("Interface driver for GENEVE encapsulated traffic"); MODULE_ALIAS_RTNL_LINK("geneve"); |
| 29 27 28 6 6 10 1 10 10 10 3 9 10 10 10 10 9 10 6 4 6 6 6 6 1 1 1 1 6 6 8 7 7 8 8 6 6 5 6 6 10 10 9 10 9 9 10 6 6 6 1 6 6 6 6 6 6 6 6 6 6 6 6 4 2 2 4 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 1 1 1 1 1 1 1 1 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 3 3 3 3 3 3 3 3 3 5 3 3 6 23 24 24 24 24 23 24 22 24 21 4 4 4 4 4 4 2 2 2 2 2 2 1 1 2 3 2 1 1 1 1 3 2 1 1 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 | /* * linux/drivers/video/fbcon.c -- Low level frame buffer based console driver * * Copyright (C) 1995 Geert Uytterhoeven * * * This file is based on the original Amiga console driver (amicon.c): * * Copyright (C) 1993 Hamish Macdonald * Greg Harp * Copyright (C) 1994 David Carter [carter@compsci.bristol.ac.uk] * * with work by William Rucklidge (wjr@cs.cornell.edu) * Geert Uytterhoeven * Jes Sorensen (jds@kom.auc.dk) * Martin Apel * * and on the original Atari console driver (atacon.c): * * Copyright (C) 1993 Bjoern Brauel * Roman Hodek * * with work by Guenther Kelleter * Martin Schaller * Andreas Schwab * * Hardware cursor support added by Emmanuel Marty (core@ggi-project.org) * Smart redraw scrolling, arbitrary font width support, 512char font support * and software scrollback added by * Jakub Jelinek (jj@ultra.linux.cz) * * Random hacking by Martin Mares <mj@ucw.cz> * * 2001 - Documented with DocBook * - Brad Douglas <brad@neruo.com> * * The low level operations for the various display memory organizations are * now in separate source files. * * Currently the following organizations are supported: * * o afb Amiga bitplanes * o cfb{2,4,8,16,24,32} Packed pixels * o ilbm Amiga interleaved bitplanes * o iplan2p[248] Atari interleaved bitplanes * o mfb Monochrome * o vga VGA characters/attributes * * To do: * * - Implement 16 plane mode (iplan2p16) * * * This file is subject to the terms and conditions of the GNU General Public * License. See the file COPYING in the main directory of this archive for * more details. */ #include <linux/export.h> #include <linux/module.h> #include <linux/types.h> #include <linux/fs.h> #include <linux/kernel.h> #include <linux/delay.h> /* MSch: for IRQ probe */ #include <linux/console.h> #include <linux/string.h> #include <linux/kd.h> #include <linux/panic.h> #include <linux/pci.h> #include <linux/printk.h> #include <linux/slab.h> #include <linux/fb.h> #include <linux/fbcon.h> #include <linux/vt_kern.h> #include <linux/selection.h> #include <linux/font.h> #include <linux/smp.h> #include <linux/init.h> #include <linux/interrupt.h> #include <linux/crc32.h> /* For counting font checksums */ #include <linux/uaccess.h> #include <linux/vga_switcheroo.h> #include <asm/irq.h> #include "fbcon.h" #include "fbcon_rotate.h" #include "fb_internal.h" /* * FIXME: Locking * * - fbcon state itself is protected by the console_lock, and the code does a * pretty good job at making sure that lock is held everywhere it's needed. * * - fbcon doesn't bother with fb_lock/unlock at all. This is buggy, since it * means concurrent access to the same fbdev from both fbcon and userspace * will blow up. To fix this all fbcon calls from fbmem.c need to be moved out * of fb_lock/unlock protected sections, since otherwise we'll recurse and * deadlock eventually. Aside: Due to these deadlock issues the fbdev code in * fbmem.c cannot use locking asserts, and there's lots of callers which get * the rules wrong, e.g. fbsysfs.c entirely missed fb_lock/unlock calls too. */ enum { FBCON_LOGO_CANSHOW = -1, /* the logo can be shown */ FBCON_LOGO_DRAW = -2, /* draw the logo to a console */ FBCON_LOGO_DONTSHOW = -3 /* do not show the logo */ }; static struct fbcon_display fb_display[MAX_NR_CONSOLES]; static struct fb_info *fbcon_registered_fb[FB_MAX]; static int fbcon_num_registered_fb; #define fbcon_for_each_registered_fb(i) \ for (i = 0; WARN_CONSOLE_UNLOCKED(), i < FB_MAX; i++) \ if (!fbcon_registered_fb[i]) {} else static signed char con2fb_map[MAX_NR_CONSOLES]; static signed char con2fb_map_boot[MAX_NR_CONSOLES]; static struct fb_info *fbcon_info_from_console(int console) { signed char fb; WARN_CONSOLE_UNLOCKED(); fb = con2fb_map[console]; if (fb < 0 || fb >= ARRAY_SIZE(fbcon_registered_fb)) return NULL; return fbcon_registered_fb[fb]; } static int logo_lines; /* logo_shown is an index to vc_cons when >= 0; otherwise follows FBCON_LOGO enums. */ static int logo_shown = FBCON_LOGO_CANSHOW; /* console mappings */ static unsigned int first_fb_vc; static unsigned int last_fb_vc = MAX_NR_CONSOLES - 1; static bool fbcon_is_default = true; static int primary_device = -1; static bool fbcon_has_console_bind; #ifdef CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY static int map_override; static inline void fbcon_map_override(void) { map_override = 1; } #else static inline void fbcon_map_override(void) { } #endif /* CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY */ #ifdef CONFIG_FRAMEBUFFER_CONSOLE_DEFERRED_TAKEOVER static bool deferred_takeover = true; #else #define deferred_takeover false #endif /* font data */ static char fontname[40]; /* current fb_info */ static int info_idx = -1; /* console rotation */ static int initial_rotation = -1; static int margin_color; static const struct consw fb_con; #define advance_row(p, delta) (unsigned short *)((unsigned long)(p) + (delta) * vc->vc_size_row) static bool fbcon_cursor_blink = true; #define divides(a, b) ((!(a) || (b)%(a)) ? 0 : 1) /* * Interface used by the world */ static void fbcon_clear_margins(struct vc_data *vc, int bottom_only); static void fbcon_set_palette(struct vc_data *vc, const unsigned char *table); /* * Internal routines */ static void fbcon_set_disp(struct fb_info *info, struct fb_var_screeninfo *var, int unit); static void fbcon_redraw_move(struct vc_data *vc, struct fbcon_display *p, int line, int count, int dy); static void fbcon_modechanged(struct fb_info *info); static void fbcon_set_all_vcs(struct fb_info *info); static struct device *fbcon_device; #ifdef CONFIG_FRAMEBUFFER_CONSOLE_ROTATION static inline void fbcon_set_rotation(struct fb_info *info) { struct fbcon_par *par = info->fbcon_par; if (!(info->flags & FBINFO_MISC_TILEBLITTING) && par->p->con_rotate < 4) par->rotate = par->p->con_rotate; else par->rotate = 0; } static void fbcon_rotate(struct fb_info *info, u32 rotate) { struct fbcon_par *par = info->fbcon_par; struct fb_info *fb_info; if (!par || par->currcon == -1) return; fb_info = fbcon_info_from_console(par->currcon); if (info == fb_info) { struct fbcon_display *p = &fb_display[par->currcon]; if (rotate < 4) p->con_rotate = rotate; else p->con_rotate = 0; fbcon_modechanged(info); } } static void fbcon_rotate_all(struct fb_info *info, u32 rotate) { struct fbcon_par *par = info->fbcon_par; struct vc_data *vc; struct fbcon_display *p; int i; if (!par || par->currcon < 0 || rotate > 3) return; for (i = first_fb_vc; i <= last_fb_vc; i++) { vc = vc_cons[i].d; if (!vc || vc->vc_mode != KD_TEXT || fbcon_info_from_console(i) != info) continue; p = &fb_display[vc->vc_num]; p->con_rotate = rotate; } fbcon_set_all_vcs(info); } #else static inline void fbcon_set_rotation(struct fb_info *info) { struct fbcon_par *par = info->fbcon_par; par->rotate = FB_ROTATE_UR; } static void fbcon_rotate(struct fb_info *info, u32 rotate) { return; } static void fbcon_rotate_all(struct fb_info *info, u32 rotate) { return; } #endif /* CONFIG_FRAMEBUFFER_CONSOLE_ROTATION */ static void fbcon_set_bitops(struct fbcon_par *par) { switch (par->rotate) { default: fallthrough; case FB_ROTATE_UR: fbcon_set_bitops_ur(par); break; case FB_ROTATE_CW: fbcon_set_bitops_cw(par); break; case FB_ROTATE_UD: fbcon_set_bitops_ud(par); break; case FB_ROTATE_CCW: fbcon_set_bitops_ccw(par); break; } } static int fbcon_get_rotate(struct fb_info *info) { struct fbcon_par *par = info->fbcon_par; return (par) ? par->rotate : 0; } static bool fbcon_skip_panic(struct fb_info *info) { return (info->skip_panic && unlikely(panic_in_progress())); } static inline bool fbcon_is_active(struct vc_data *vc, struct fb_info *info) { struct fbcon_par *par = info->fbcon_par; return info->state == FBINFO_STATE_RUNNING && vc->vc_mode == KD_TEXT && !par->graphics && !fbcon_skip_panic(info); } static int get_color(struct vc_data *vc, struct fb_info *info, u16 c, bool is_fg) { int depth = fb_get_color_depth(&info->var, &info->fix); int color = 0; if (console_blanked) { unsigned short charmask = vc->vc_hi_font_mask ? 0x1ff : 0xff; c = vc->vc_video_erase_char & charmask; } if (depth != 1) color = (is_fg) ? attr_fgcol((vc->vc_hi_font_mask) ? 9 : 8, c) : attr_bgcol((vc->vc_hi_font_mask) ? 13 : 12, c); switch (depth) { case 1: { int col = mono_col(info); /* 0 or 1 */ int fg = (info->fix.visual != FB_VISUAL_MONO01) ? col : 0; int bg = (info->fix.visual != FB_VISUAL_MONO01) ? 0 : col; if (console_blanked) fg = bg; color = (is_fg) ? fg : bg; break; } case 2: /* * Scale down 16-colors to 4 colors. Default 4-color palette * is grayscale. However, simply dividing the values by 4 * will not work, as colors 1, 2 and 3 will be scaled-down * to zero rendering them invisible. So empirically convert * colors to a sane 4-level grayscale. */ switch (color) { case 0: color = 0; /* black */ break; case 1 ... 6: color = 2; /* white */ break; case 7 ... 8: color = 1; /* gray */ break; default: color = 3; /* intense white */ break; } break; case 3: /* * Last 8 entries of default 16-color palette is a more intense * version of the first 8 (i.e., same chrominance, different * luminance). */ color &= 7; break; } return color; } static int get_fg_color(struct vc_data *vc, struct fb_info *info, u16 c) { return get_color(vc, info, c, true); } static int get_bg_color(struct vc_data *vc, struct fb_info *info, u16 c) { return get_color(vc, info, c, false); } static void fb_flashcursor(struct work_struct *work) { struct fbcon_par *par = container_of(work, struct fbcon_par, cursor_work.work); struct fb_info *info; struct vc_data *vc = NULL; int c; bool enable; int ret; /* FIXME: we should sort out the unbind locking instead */ /* instead we just fail to flash the cursor if we can't get * the lock instead of blocking fbcon deinit */ ret = console_trylock(); if (ret == 0) return; /* protected by console_lock */ info = par->info; if (par->currcon != -1) vc = vc_cons[par->currcon].d; if (!vc || !con_is_visible(vc) || fbcon_info_from_console(vc->vc_num) != info || vc->vc_deccm != 1) { console_unlock(); return; } c = scr_readw((u16 *) vc->vc_pos); enable = par->cursor_flash && !par->cursor_state.enable; par->bitops->cursor(vc, info, enable, get_fg_color(vc, info, c), get_bg_color(vc, info, c)); console_unlock(); queue_delayed_work(system_power_efficient_wq, &par->cursor_work, par->cur_blink_jiffies); } static void fbcon_add_cursor_work(struct fb_info *info) { struct fbcon_par *par = info->fbcon_par; if (fbcon_cursor_blink) queue_delayed_work(system_power_efficient_wq, &par->cursor_work, par->cur_blink_jiffies); } static void fbcon_del_cursor_work(struct fb_info *info) { struct fbcon_par *par = info->fbcon_par; cancel_delayed_work_sync(&par->cursor_work); } #ifndef MODULE static int __init fb_console_setup(char *this_opt) { char *options; int i, j; if (!this_opt || !*this_opt) return 1; while ((options = strsep(&this_opt, ",")) != NULL) { if (!strncmp(options, "font:", 5)) { strscpy(fontname, options + 5, sizeof(fontname)); continue; } if (!strncmp(options, "scrollback:", 11)) { pr_warn("Ignoring scrollback size option\n"); continue; } if (!strncmp(options, "map:", 4)) { options += 4; if (*options) { for (i = 0, j = 0; i < MAX_NR_CONSOLES; i++) { if (!options[j]) j = 0; con2fb_map_boot[i] = (options[j++]-'0') % FB_MAX; } fbcon_map_override(); } continue; } if (!strncmp(options, "vc:", 3)) { options += 3; if (*options) first_fb_vc = simple_strtoul(options, &options, 10) - 1; if (first_fb_vc >= MAX_NR_CONSOLES) first_fb_vc = 0; if (*options++ == '-') last_fb_vc = simple_strtoul(options, &options, 10) - 1; if (last_fb_vc < first_fb_vc || last_fb_vc >= MAX_NR_CONSOLES) last_fb_vc = MAX_NR_CONSOLES - 1; fbcon_is_default = false; continue; } if (!strncmp(options, "rotate:", 7)) { options += 7; if (*options) initial_rotation = simple_strtoul(options, &options, 0); if (initial_rotation > 3) initial_rotation = 0; continue; } if (!strncmp(options, "margin:", 7)) { options += 7; if (*options) margin_color = simple_strtoul(options, &options, 0); continue; } #ifdef CONFIG_FRAMEBUFFER_CONSOLE_DEFERRED_TAKEOVER if (!strcmp(options, "nodefer")) { deferred_takeover = false; continue; } #endif #ifdef CONFIG_LOGO if (!strncmp(options, "logo-pos:", 9)) { options += 9; if (!strcmp(options, "center")) fb_center_logo = true; continue; } if (!strncmp(options, "logo-count:", 11)) { options += 11; if (*options) fb_logo_count = simple_strtol(options, &options, 0); continue; } #endif } return 1; } __setup("fbcon=", fb_console_setup); #endif static int search_fb_in_map(int idx) { int i, retval = 0; for (i = first_fb_vc; i <= last_fb_vc; i++) { if (con2fb_map[i] == idx) { retval = 1; break; } } return retval; } static int search_for_mapped_con(void) { int i, retval = 0; for (i = first_fb_vc; i <= last_fb_vc; i++) { if (con2fb_map[i] != -1) { retval = 1; break; } } return retval; } static int do_fbcon_takeover(int show_logo) { int err, i; if (!fbcon_num_registered_fb) return -ENODEV; if (!show_logo) logo_shown = FBCON_LOGO_DONTSHOW; for (i = first_fb_vc; i <= last_fb_vc; i++) con2fb_map[i] = info_idx; err = do_take_over_console(&fb_con, first_fb_vc, last_fb_vc, fbcon_is_default); if (err) { for (i = first_fb_vc; i <= last_fb_vc; i++) con2fb_map[i] = -1; info_idx = -1; } else { fbcon_has_console_bind = true; } return err; } #ifdef MODULE static void fbcon_prepare_logo(struct vc_data *vc, struct fb_info *info, int cols, int rows, int new_cols, int new_rows) { logo_shown = FBCON_LOGO_DONTSHOW; } #else static void fbcon_prepare_logo(struct vc_data *vc, struct fb_info *info, int cols, int rows, int new_cols, int new_rows) { /* Need to make room for the logo */ struct fbcon_par *par = info->fbcon_par; int cnt, erase = vc->vc_video_erase_char, step; unsigned short *save = NULL, *r, *q; int logo_height; if (info->fbops->owner) { logo_shown = FBCON_LOGO_DONTSHOW; return; } /* * remove underline attribute from erase character * if black and white framebuffer. */ if (fb_get_color_depth(&info->var, &info->fix) == 1) erase &= ~0x400; logo_height = fb_prepare_logo(info, par->rotate); logo_lines = DIV_ROUND_UP(logo_height, vc->vc_font.height); q = (unsigned short *) (vc->vc_origin + vc->vc_size_row * rows); step = logo_lines * cols; for (r = q - logo_lines * cols; r < q; r++) if (scr_readw(r) != vc->vc_video_erase_char) break; if (r != q && new_rows >= rows + logo_lines) { save = kmalloc(array3_size(logo_lines, new_cols, 2), GFP_KERNEL); if (save) { int i = min(cols, new_cols); scr_memsetw(save, erase, array3_size(logo_lines, new_cols, 2)); r = q - step; for (cnt = 0; cnt < logo_lines; cnt++, r += i) scr_memcpyw(save + cnt * new_cols, r, 2 * i); r = q; } } if (r == q) { /* We can scroll screen down */ r = q - step - cols; for (cnt = rows - logo_lines; cnt > 0; cnt--) { scr_memcpyw(r + step, r, vc->vc_size_row); r -= cols; } if (!save) { int lines; if (vc->state.y + logo_lines >= rows) lines = rows - vc->state.y - 1; else lines = logo_lines; vc->state.y += lines; vc->vc_pos += lines * vc->vc_size_row; } } scr_memsetw((unsigned short *) vc->vc_origin, erase, vc->vc_size_row * logo_lines); if (con_is_visible(vc) && vc->vc_mode == KD_TEXT) { fbcon_clear_margins(vc, 0); update_screen(vc); } if (save) { q = (unsigned short *) (vc->vc_origin + vc->vc_size_row * rows); scr_memcpyw(q, save, array3_size(logo_lines, new_cols, 2)); vc->state.y += logo_lines; vc->vc_pos += logo_lines * vc->vc_size_row; kfree(save); } if (logo_shown == FBCON_LOGO_DONTSHOW) return; if (logo_lines > vc->vc_bottom) { logo_shown = FBCON_LOGO_CANSHOW; pr_info("fbcon: disable boot-logo (boot-logo bigger than screen).\n"); } else { logo_shown = FBCON_LOGO_DRAW; vc->vc_top = logo_lines; } } #endif /* MODULE */ #ifdef CONFIG_FB_TILEBLITTING static void set_blitting_type(struct vc_data *vc, struct fb_info *info) { struct fbcon_par *par = info->fbcon_par; par->p = &fb_display[vc->vc_num]; if ((info->flags & FBINFO_MISC_TILEBLITTING)) fbcon_set_tileops(vc, info); else { fbcon_set_rotation(info); fbcon_set_bitops(par); } } static int fbcon_invalid_charcount(struct fb_info *info, unsigned charcount) { int err = 0; if (info->flags & FBINFO_MISC_TILEBLITTING && info->tileops->fb_get_tilemax(info) < charcount) err = 1; return err; } #else static void set_blitting_type(struct vc_data *vc, struct fb_info *info) { struct fbcon_par *par = info->fbcon_par; info->flags &= ~FBINFO_MISC_TILEBLITTING; par->p = &fb_display[vc->vc_num]; fbcon_set_rotation(info); fbcon_set_bitops(par); } static int fbcon_invalid_charcount(struct fb_info *info, unsigned charcount) { return 0; } #endif /* CONFIG_MISC_TILEBLITTING */ static void fbcon_release(struct fb_info *info) { lock_fb_info(info); if (info->fbops->fb_release) info->fbops->fb_release(info, 0); unlock_fb_info(info); module_put(info->fbops->owner); if (info->fbcon_par) { struct fbcon_par *par = info->fbcon_par; fbcon_del_cursor_work(info); kfree(par->cursor_state.mask); kfree(par->cursor_data); kfree(par->cursor_src); kfree(par->fontbuffer); kfree(info->fbcon_par); info->fbcon_par = NULL; } } static int fbcon_open(struct fb_info *info) { struct fbcon_par *par; if (!try_module_get(info->fbops->owner)) return -ENODEV; lock_fb_info(info); if (info->fbops->fb_open && info->fbops->fb_open(info, 0)) { unlock_fb_info(info); module_put(info->fbops->owner); return -ENODEV; } unlock_fb_info(info); par = kzalloc(sizeof(*par), GFP_KERNEL); if (!par) { fbcon_release(info); return -ENOMEM; } INIT_DELAYED_WORK(&par->cursor_work, fb_flashcursor); par->info = info; info->fbcon_par = par; par->cur_blink_jiffies = HZ / 5; return 0; } static int con2fb_acquire_newinfo(struct vc_data *vc, struct fb_info *info, int unit) { int err; err = fbcon_open(info); if (err) return err; if (vc) set_blitting_type(vc, info); return err; } static void con2fb_release_oldinfo(struct vc_data *vc, struct fb_info *oldinfo, struct fb_info *newinfo) { int ret; fbcon_release(oldinfo); /* If oldinfo and newinfo are driving the same hardware, the fb_release() method of oldinfo may attempt to restore the hardware state. This will leave the newinfo in an undefined state. Thus, a call to fb_set_par() may be needed for the newinfo. */ if (newinfo && newinfo->fbops->fb_set_par) { ret = newinfo->fbops->fb_set_par(newinfo); if (ret) printk(KERN_ERR "con2fb_release_oldinfo: " "detected unhandled fb_set_par error, " "error code %d\n", ret); } } static void con2fb_init_display(struct vc_data *vc, struct fb_info *info, int unit, int show_logo) { struct fbcon_par *par = info->fbcon_par; int ret; par->currcon = fg_console; if (info->fbops->fb_set_par && !par->initialized) { ret = info->fbops->fb_set_par(info); if (ret) printk(KERN_ERR "con2fb_init_display: detected " "unhandled fb_set_par error, " "error code %d\n", ret); } par->initialized = true; par->graphics = 0; fbcon_set_disp(info, &info->var, unit); if (show_logo) { struct vc_data *fg_vc = vc_cons[fg_console].d; struct fb_info *fg_info = fbcon_info_from_console(fg_console); fbcon_prepare_logo(fg_vc, fg_info, fg_vc->vc_cols, fg_vc->vc_rows, fg_vc->vc_cols, fg_vc->vc_rows); } if (fg_console != unit) update_screen(vc_cons[fg_console].d); } /** * set_con2fb_map - map console to frame buffer device * @unit: virtual console number to map * @newidx: frame buffer index to map virtual console to * @user: user request * * Maps a virtual console @unit to a frame buffer device * @newidx. * * This should be called with the console lock held. */ static int set_con2fb_map(int unit, int newidx, int user) { struct vc_data *vc = vc_cons[unit].d; int oldidx = con2fb_map[unit]; struct fb_info *info = fbcon_registered_fb[newidx]; struct fb_info *oldinfo = NULL; int err = 0, show_logo; WARN_CONSOLE_UNLOCKED(); if (oldidx == newidx) return 0; if (!info) return -EINVAL; if (!search_for_mapped_con() || !con_is_bound(&fb_con)) { info_idx = newidx; return do_fbcon_takeover(0); } if (oldidx != -1) oldinfo = fbcon_registered_fb[oldidx]; if (!search_fb_in_map(newidx)) { err = con2fb_acquire_newinfo(vc, info, unit); if (err) return err; fbcon_add_cursor_work(info); } else if (vc) { set_blitting_type(vc, info); } con2fb_map[unit] = newidx; /* * If old fb is not mapped to any of the consoles, * fbcon should release it. */ if (oldinfo && !search_fb_in_map(oldidx)) con2fb_release_oldinfo(vc, oldinfo, info); show_logo = (fg_console == 0 && !user && logo_shown != FBCON_LOGO_DONTSHOW); con2fb_map_boot[unit] = newidx; con2fb_init_display(vc, info, unit, show_logo); if (!search_fb_in_map(info_idx)) info_idx = newidx; return err; } /* * Low Level Operations */ /* NOTE: fbcon cannot be __init: it may be called from do_take_over_console later */ static int var_to_display(struct fbcon_display *disp, struct fb_var_screeninfo *var, struct fb_info *info) { disp->xres_virtual = var->xres_virtual; disp->yres_virtual = var->yres_virtual; disp->bits_per_pixel = var->bits_per_pixel; disp->grayscale = var->grayscale; disp->nonstd = var->nonstd; disp->accel_flags = var->accel_flags; disp->height = var->height; disp->width = var->width; disp->red = var->red; disp->green = var->green; disp->blue = var->blue; disp->transp = var->transp; disp->rotate = var->rotate; disp->mode = fb_match_mode(var, &info->modelist); if (disp->mode == NULL) /* This should not happen */ return -EINVAL; return 0; } static void display_to_var(struct fb_var_screeninfo *var, struct fbcon_display *disp) { fb_videomode_to_var(var, disp->mode); var->xres_virtual = disp->xres_virtual; var->yres_virtual = disp->yres_virtual; var->bits_per_pixel = disp->bits_per_pixel; var->grayscale = disp->grayscale; var->nonstd = disp->nonstd; var->accel_flags = disp->accel_flags; var->height = disp->height; var->width = disp->width; var->red = disp->red; var->green = disp->green; var->blue = disp->blue; var->transp = disp->transp; var->rotate = disp->rotate; } static const char *fbcon_startup(void) { static const char display_desc[] = "frame buffer device"; struct fbcon_display *p = &fb_display[fg_console]; struct vc_data *vc = vc_cons[fg_console].d; const struct font_desc *font = NULL; struct fb_info *info = NULL; struct fbcon_par *par; int rows, cols; /* * If fbcon_num_registered_fb is zero, this is a call for the dummy part. * The frame buffer devices weren't initialized yet. */ if (!fbcon_num_registered_fb || info_idx == -1) return display_desc; /* * Instead of blindly using fbcon_registered_fb[0], we use info_idx, set by * fbcon_fb_registered(); */ info = fbcon_registered_fb[info_idx]; if (!info) return NULL; if (fbcon_open(info)) return NULL; par = info->fbcon_par; par->currcon = -1; par->graphics = 1; par->cur_rotate = -1; p->con_rotate = initial_rotation; if (p->con_rotate == -1) p->con_rotate = info->fbcon_rotate_hint; if (p->con_rotate == -1) p->con_rotate = FB_ROTATE_UR; set_blitting_type(vc, info); /* Setup default font */ if (!p->fontdata) { if (!fontname[0] || !(font = find_font(fontname))) font = get_default_font(info->var.xres, info->var.yres, info->pixmap.blit_x, info->pixmap.blit_y); vc->vc_font.width = font->width; vc->vc_font.height = font->height; vc->vc_font.data = (void *)(p->fontdata = font->data); vc->vc_font.charcount = font->charcount; } cols = FBCON_SWAP(par->rotate, info->var.xres, info->var.yres); rows = FBCON_SWAP(par->rotate, info->var.yres, info->var.xres); cols /= vc->vc_font.width; rows /= vc->vc_font.height; vc_resize(vc, cols, rows); pr_debug("mode: %s\n", info->fix.id); pr_debug("visual: %d\n", info->fix.visual); pr_debug("res: %dx%d-%d\n", info->var.xres, info->var.yres, info->var.bits_per_pixel); fbcon_add_cursor_work(info); return display_desc; } static void fbcon_init(struct vc_data *vc, bool init) { struct fb_info *info; struct fbcon_par *par; struct vc_data **default_mode = vc->vc_display_fg; struct vc_data *svc = *default_mode; struct fbcon_display *t, *p = &fb_display[vc->vc_num]; int logo = 1, new_rows, new_cols, rows, cols; int ret; if (WARN_ON(info_idx == -1)) return; if (con2fb_map[vc->vc_num] == -1) con2fb_map[vc->vc_num] = info_idx; info = fbcon_info_from_console(vc->vc_num); if (logo_shown < 0 && console_loglevel <= CONSOLE_LOGLEVEL_QUIET) logo_shown = FBCON_LOGO_DONTSHOW; if (vc != svc || logo_shown == FBCON_LOGO_DONTSHOW || (info->fix.type == FB_TYPE_TEXT)) logo = 0; if (var_to_display(p, &info->var, info)) return; if (!info->fbcon_par) con2fb_acquire_newinfo(vc, info, vc->vc_num); /* If we are not the first console on this fb, copy the font from that console */ t = &fb_display[fg_console]; if (!p->fontdata) { if (t->fontdata) { struct vc_data *fvc = vc_cons[fg_console].d; vc->vc_font.data = (void *)(p->fontdata = fvc->vc_font.data); vc->vc_font.width = fvc->vc_font.width; vc->vc_font.height = fvc->vc_font.height; vc->vc_font.charcount = fvc->vc_font.charcount; p->userfont = t->userfont; if (p->userfont) REFCOUNT(p->fontdata)++; } else { const struct font_desc *font = NULL; if (!fontname[0] || !(font = find_font(fontname))) font = get_default_font(info->var.xres, info->var.yres, info->pixmap.blit_x, info->pixmap.blit_y); vc->vc_font.width = font->width; vc->vc_font.height = font->height; vc->vc_font.data = (void *)(p->fontdata = font->data); vc->vc_font.charcount = font->charcount; } } vc->vc_can_do_color = (fb_get_color_depth(&info->var, &info->fix)!=1); vc->vc_complement_mask = vc->vc_can_do_color ? 0x7700 : 0x0800; if (vc->vc_font.charcount == 256) { vc->vc_hi_font_mask = 0; } else { vc->vc_hi_font_mask = 0x100; if (vc->vc_can_do_color) vc->vc_complement_mask <<= 1; } if (!*svc->uni_pagedict_loc) con_set_default_unimap(svc); if (!*vc->uni_pagedict_loc) con_copy_unimap(vc, svc); par = info->fbcon_par; par->cur_blink_jiffies = msecs_to_jiffies(vc->vc_cur_blink_ms); p->con_rotate = initial_rotation; if (p->con_rotate == -1) p->con_rotate = info->fbcon_rotate_hint; if (p->con_rotate == -1) p->con_rotate = FB_ROTATE_UR; set_blitting_type(vc, info); cols = vc->vc_cols; rows = vc->vc_rows; new_cols = FBCON_SWAP(par->rotate, info->var.xres, info->var.yres); new_rows = FBCON_SWAP(par->rotate, info->var.yres, info->var.xres); new_cols /= vc->vc_font.width; new_rows /= vc->vc_font.height; /* * We must always set the mode. The mode of the previous console * driver could be in the same resolution but we are using different * hardware so we have to initialize the hardware. * * We need to do it in fbcon_init() to prevent screen corruption. */ if (con_is_visible(vc) && vc->vc_mode == KD_TEXT) { if (info->fbops->fb_set_par && !par->initialized) { ret = info->fbops->fb_set_par(info); if (ret) printk(KERN_ERR "fbcon_init: detected " "unhandled fb_set_par error, " "error code %d\n", ret); } par->initialized = true; } par->graphics = 0; #ifdef CONFIG_FRAMEBUFFER_CONSOLE_LEGACY_ACCELERATION if ((info->flags & FBINFO_HWACCEL_COPYAREA) && !(info->flags & FBINFO_HWACCEL_DISABLED)) p->scrollmode = SCROLL_MOVE; else /* default to something safe */ p->scrollmode = SCROLL_REDRAW; #endif /* * ++guenther: console.c:vc_allocate() relies on initializing * vc_{cols,rows}, but we must not set those if we are only * resizing the console. */ if (init) { vc->vc_cols = new_cols; vc->vc_rows = new_rows; } else vc_resize(vc, new_cols, new_rows); if (logo) fbcon_prepare_logo(vc, info, cols, rows, new_cols, new_rows); if (par->bitops->rotate_font && par->bitops->rotate_font(info, vc)) { par->rotate = FB_ROTATE_UR; set_blitting_type(vc, info); } par->p = &fb_display[fg_console]; } static void fbcon_free_font(struct fbcon_display *p) { if (p->userfont && p->fontdata && (--REFCOUNT(p->fontdata) == 0)) kfree(p->fontdata - FONT_EXTRA_WORDS * sizeof(int)); p->fontdata = NULL; p->userfont = 0; } static void set_vc_hi_font(struct vc_data *vc, bool set); static void fbcon_release_all(void) { struct fb_info *info; int i, j, mapped; fbcon_for_each_registered_fb(i) { mapped = 0; info = fbcon_registered_fb[i]; for (j = first_fb_vc; j <= last_fb_vc; j++) { if (con2fb_map[j] == i) { mapped = 1; con2fb_map[j] = -1; } } if (mapped) fbcon_release(info); } } static void fbcon_deinit(struct vc_data *vc) { struct fbcon_display *p = &fb_display[vc->vc_num]; struct fb_info *info; struct fbcon_par *par; int idx; fbcon_free_font(p); idx = con2fb_map[vc->vc_num]; if (idx == -1) goto finished; info = fbcon_registered_fb[idx]; if (!info) goto finished; par = info->fbcon_par; if (!par) goto finished; if (con_is_visible(vc)) fbcon_del_cursor_work(info); par->initialized = false; finished: fbcon_free_font(p); vc->vc_font.data = NULL; if (vc->vc_hi_font_mask && vc->vc_screenbuf) set_vc_hi_font(vc, false); if (!con_is_bound(&fb_con)) fbcon_release_all(); if (vc->vc_num == logo_shown) logo_shown = FBCON_LOGO_CANSHOW; return; } /* ====================================================================== */ /* fbcon_XXX routines - interface used by the world * * This system is now divided into two levels because of complications * caused by hardware scrolling. Top level functions: * * fbcon_bmove(), fbcon_clear(), fbcon_putc(), fbcon_clear_margins() * * handles y values in range [0, scr_height-1] that correspond to real * screen positions. y_wrap shift means that first line of bitmap may be * anywhere on this display. These functions convert lineoffsets to * bitmap offsets and deal with the wrap-around case by splitting blits. * * fbcon_bmove_physical_8() -- These functions fast implementations * fbcon_clear_physical_8() -- of original fbcon_XXX fns. * fbcon_putc_physical_8() -- (font width != 8) may be added later * * WARNING: * * At the moment fbcon_putc() cannot blit across vertical wrap boundary * Implies should only really hardware scroll in rows. Only reason for * restriction is simplicity & efficiency at the moment. */ static void __fbcon_clear(struct vc_data *vc, unsigned int sy, unsigned int sx, unsigned int height, unsigned int width) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_par *par = info->fbcon_par; int fg, bg; struct fbcon_display *p = &fb_display[vc->vc_num]; u_int y_break; if (!fbcon_is_active(vc, info)) return; if (!height || !width) return; if (sy < vc->vc_top && vc->vc_top == logo_lines) { vc->vc_top = 0; /* * If the font dimensions are not an integral of the display * dimensions then the par->clear below won't end up clearing * the margins. Call clear_margins here in case the logo * bitmap stretched into the margin area. */ fbcon_clear_margins(vc, 0); } fg = get_color(vc, info, vc->vc_video_erase_char, 1); bg = get_color(vc, info, vc->vc_video_erase_char, 0); /* Split blits that cross physical y_wrap boundary */ y_break = p->vrows - p->yscroll; if (sy < y_break && sy + height - 1 >= y_break) { u_int b = y_break - sy; par->bitops->clear(vc, info, real_y(p, sy), sx, b, width, fg, bg); par->bitops->clear(vc, info, real_y(p, sy + b), sx, height - b, width, fg, bg); } else par->bitops->clear(vc, info, real_y(p, sy), sx, height, width, fg, bg); } static void fbcon_clear(struct vc_data *vc, unsigned int sy, unsigned int sx, unsigned int width) { __fbcon_clear(vc, sy, sx, 1, width); } static void fbcon_putcs(struct vc_data *vc, const u16 *s, unsigned int count, unsigned int ypos, unsigned int xpos) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_display *p = &fb_display[vc->vc_num]; struct fbcon_par *par = info->fbcon_par; if (fbcon_is_active(vc, info)) par->bitops->putcs(vc, info, s, count, real_y(p, ypos), xpos, get_fg_color(vc, info, scr_readw(s)), get_bg_color(vc, info, scr_readw(s))); } static void fbcon_clear_margins(struct vc_data *vc, int bottom_only) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_par *par = info->fbcon_par; if (fbcon_is_active(vc, info)) par->bitops->clear_margins(vc, info, margin_color, bottom_only); } static void fbcon_cursor(struct vc_data *vc, bool enable) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_par *par = info->fbcon_par; int c = scr_readw((u16 *) vc->vc_pos); par->cur_blink_jiffies = msecs_to_jiffies(vc->vc_cur_blink_ms); if (!fbcon_is_active(vc, info) || vc->vc_deccm != 1) return; if (vc->vc_cursor_type & CUR_SW) fbcon_del_cursor_work(info); else fbcon_add_cursor_work(info); par->cursor_flash = enable; if (!par->bitops->cursor) return; par->bitops->cursor(vc, info, enable, get_fg_color(vc, info, c), get_bg_color(vc, info, c)); } static int scrollback_phys_max = 0; static int scrollback_max = 0; static int scrollback_current = 0; static void fbcon_set_disp(struct fb_info *info, struct fb_var_screeninfo *var, int unit) { struct fbcon_display *p, *t; struct vc_data **default_mode, *vc; struct vc_data *svc; struct fbcon_par *par = info->fbcon_par; int rows, cols; unsigned long ret = 0; p = &fb_display[unit]; if (var_to_display(p, var, info)) return; vc = vc_cons[unit].d; if (!vc) return; default_mode = vc->vc_display_fg; svc = *default_mode; t = &fb_display[svc->vc_num]; if (!vc->vc_font.data) { vc->vc_font.data = (void *)(p->fontdata = t->fontdata); vc->vc_font.width = (*default_mode)->vc_font.width; vc->vc_font.height = (*default_mode)->vc_font.height; vc->vc_font.charcount = (*default_mode)->vc_font.charcount; p->userfont = t->userfont; if (p->userfont) REFCOUNT(p->fontdata)++; } var->activate = FB_ACTIVATE_NOW; info->var.activate = var->activate; var->yoffset = info->var.yoffset; var->xoffset = info->var.xoffset; fb_set_var(info, var); par->var = info->var; vc->vc_can_do_color = (fb_get_color_depth(&info->var, &info->fix)!=1); vc->vc_complement_mask = vc->vc_can_do_color ? 0x7700 : 0x0800; if (vc->vc_font.charcount == 256) { vc->vc_hi_font_mask = 0; } else { vc->vc_hi_font_mask = 0x100; if (vc->vc_can_do_color) vc->vc_complement_mask <<= 1; } if (!*svc->uni_pagedict_loc) con_set_default_unimap(svc); if (!*vc->uni_pagedict_loc) con_copy_unimap(vc, svc); cols = FBCON_SWAP(par->rotate, info->var.xres, info->var.yres); rows = FBCON_SWAP(par->rotate, info->var.yres, info->var.xres); cols /= vc->vc_font.width; rows /= vc->vc_font.height; ret = vc_resize(vc, cols, rows); if (con_is_visible(vc) && !ret) update_screen(vc); } static __inline__ void ywrap_up(struct vc_data *vc, int count) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_par *par = info->fbcon_par; struct fbcon_display *p = &fb_display[vc->vc_num]; p->yscroll += count; if (p->yscroll >= p->vrows) /* Deal with wrap */ p->yscroll -= p->vrows; par->var.xoffset = 0; par->var.yoffset = p->yscroll * vc->vc_font.height; par->var.vmode |= FB_VMODE_YWRAP; par->bitops->update_start(info); scrollback_max += count; if (scrollback_max > scrollback_phys_max) scrollback_max = scrollback_phys_max; scrollback_current = 0; } static __inline__ void ywrap_down(struct vc_data *vc, int count) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_par *par = info->fbcon_par; struct fbcon_display *p = &fb_display[vc->vc_num]; p->yscroll -= count; if (p->yscroll < 0) /* Deal with wrap */ p->yscroll += p->vrows; par->var.xoffset = 0; par->var.yoffset = p->yscroll * vc->vc_font.height; par->var.vmode |= FB_VMODE_YWRAP; par->bitops->update_start(info); scrollback_max -= count; if (scrollback_max < 0) scrollback_max = 0; scrollback_current = 0; } static __inline__ void ypan_up(struct vc_data *vc, int count) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_display *p = &fb_display[vc->vc_num]; struct fbcon_par *par = info->fbcon_par; p->yscroll += count; if (p->yscroll > p->vrows - vc->vc_rows) { par->bitops->bmove(vc, info, p->vrows - vc->vc_rows, 0, 0, 0, vc->vc_rows, vc->vc_cols); p->yscroll -= p->vrows - vc->vc_rows; } par->var.xoffset = 0; par->var.yoffset = p->yscroll * vc->vc_font.height; par->var.vmode &= ~FB_VMODE_YWRAP; par->bitops->update_start(info); fbcon_clear_margins(vc, 1); scrollback_max += count; if (scrollback_max > scrollback_phys_max) scrollback_max = scrollback_phys_max; scrollback_current = 0; } static __inline__ void ypan_up_redraw(struct vc_data *vc, int t, int count) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_par *par = info->fbcon_par; struct fbcon_display *p = &fb_display[vc->vc_num]; p->yscroll += count; if (p->yscroll > p->vrows - vc->vc_rows) { p->yscroll -= p->vrows - vc->vc_rows; fbcon_redraw_move(vc, p, t + count, vc->vc_rows - count, t); } par->var.xoffset = 0; par->var.yoffset = p->yscroll * vc->vc_font.height; par->var.vmode &= ~FB_VMODE_YWRAP; par->bitops->update_start(info); fbcon_clear_margins(vc, 1); scrollback_max += count; if (scrollback_max > scrollback_phys_max) scrollback_max = scrollback_phys_max; scrollback_current = 0; } static __inline__ void ypan_down(struct vc_data *vc, int count) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_display *p = &fb_display[vc->vc_num]; struct fbcon_par *par = info->fbcon_par; p->yscroll -= count; if (p->yscroll < 0) { par->bitops->bmove(vc, info, 0, 0, p->vrows - vc->vc_rows, 0, vc->vc_rows, vc->vc_cols); p->yscroll += p->vrows - vc->vc_rows; } par->var.xoffset = 0; par->var.yoffset = p->yscroll * vc->vc_font.height; par->var.vmode &= ~FB_VMODE_YWRAP; par->bitops->update_start(info); fbcon_clear_margins(vc, 1); scrollback_max -= count; if (scrollback_max < 0) scrollback_max = 0; scrollback_current = 0; } static __inline__ void ypan_down_redraw(struct vc_data *vc, int t, int count) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_par *par = info->fbcon_par; struct fbcon_display *p = &fb_display[vc->vc_num]; p->yscroll -= count; if (p->yscroll < 0) { p->yscroll += p->vrows - vc->vc_rows; fbcon_redraw_move(vc, p, t, vc->vc_rows - count, t + count); } par->var.xoffset = 0; par->var.yoffset = p->yscroll * vc->vc_font.height; par->var.vmode &= ~FB_VMODE_YWRAP; par->bitops->update_start(info); fbcon_clear_margins(vc, 1); scrollback_max -= count; if (scrollback_max < 0) scrollback_max = 0; scrollback_current = 0; } static void fbcon_redraw_move(struct vc_data *vc, struct fbcon_display *p, int line, int count, int dy) { unsigned short *s = (unsigned short *) (vc->vc_origin + vc->vc_size_row * line); while (count--) { unsigned short *start = s; unsigned short *le = advance_row(s, 1); unsigned short c; int x = 0; unsigned short attr = 1; do { c = scr_readw(s); if (attr != (c & 0xff00)) { attr = c & 0xff00; if (s > start) { fbcon_putcs(vc, start, s - start, dy, x); x += s - start; start = s; } } console_conditional_schedule(); s++; } while (s < le); if (s > start) fbcon_putcs(vc, start, s - start, dy, x); console_conditional_schedule(); dy++; } } static void fbcon_redraw_blit(struct vc_data *vc, struct fb_info *info, struct fbcon_display *p, int line, int count, int ycount) { int offset = ycount * vc->vc_cols; unsigned short *d = (unsigned short *) (vc->vc_origin + vc->vc_size_row * line); unsigned short *s = d + offset; struct fbcon_par *par = info->fbcon_par; while (count--) { unsigned short *start = s; unsigned short *le = advance_row(s, 1); unsigned short c; int x = 0; do { c = scr_readw(s); if (c == scr_readw(d)) { if (s > start) { par->bitops->bmove(vc, info, line + ycount, x, line, x, 1, s - start); x += s - start + 1; start = s + 1; } else { x++; start++; } } scr_writew(c, d); console_conditional_schedule(); s++; d++; } while (s < le); if (s > start) par->bitops->bmove(vc, info, line + ycount, x, line, x, 1, s - start); console_conditional_schedule(); if (ycount > 0) line++; else { line--; /* NOTE: We subtract two lines from these pointers */ s -= vc->vc_size_row; d -= vc->vc_size_row; } } } static void fbcon_redraw(struct vc_data *vc, int line, int count, int offset) { unsigned short *d = (unsigned short *) (vc->vc_origin + vc->vc_size_row * line); unsigned short *s = d + offset; while (count--) { unsigned short *start = s; unsigned short *le = advance_row(s, 1); unsigned short c; int x = 0; unsigned short attr = 1; do { c = scr_readw(s); if (attr != (c & 0xff00)) { attr = c & 0xff00; if (s > start) { fbcon_putcs(vc, start, s - start, line, x); x += s - start; start = s; } } if (c == scr_readw(d)) { if (s > start) { fbcon_putcs(vc, start, s - start, line, x); x += s - start + 1; start = s + 1; } else { x++; start++; } } scr_writew(c, d); console_conditional_schedule(); s++; d++; } while (s < le); if (s > start) fbcon_putcs(vc, start, s - start, line, x); console_conditional_schedule(); if (offset > 0) line++; else { line--; /* NOTE: We subtract two lines from these pointers */ s -= vc->vc_size_row; d -= vc->vc_size_row; } } } static void fbcon_bmove_rec(struct vc_data *vc, struct fbcon_display *p, int sy, int sx, int dy, int dx, int height, int width, u_int y_break) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_par *par = info->fbcon_par; u_int b; if (sy < y_break && sy + height > y_break) { b = y_break - sy; if (dy < sy) { /* Avoid trashing self */ fbcon_bmove_rec(vc, p, sy, sx, dy, dx, b, width, y_break); fbcon_bmove_rec(vc, p, sy + b, sx, dy + b, dx, height - b, width, y_break); } else { fbcon_bmove_rec(vc, p, sy + b, sx, dy + b, dx, height - b, width, y_break); fbcon_bmove_rec(vc, p, sy, sx, dy, dx, b, width, y_break); } return; } if (dy < y_break && dy + height > y_break) { b = y_break - dy; if (dy < sy) { /* Avoid trashing self */ fbcon_bmove_rec(vc, p, sy, sx, dy, dx, b, width, y_break); fbcon_bmove_rec(vc, p, sy + b, sx, dy + b, dx, height - b, width, y_break); } else { fbcon_bmove_rec(vc, p, sy + b, sx, dy + b, dx, height - b, width, y_break); fbcon_bmove_rec(vc, p, sy, sx, dy, dx, b, width, y_break); } return; } par->bitops->bmove(vc, info, real_y(p, sy), sx, real_y(p, dy), dx, height, width); } static void fbcon_bmove(struct vc_data *vc, int sy, int sx, int dy, int dx, int height, int width) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_display *p = &fb_display[vc->vc_num]; if (!fbcon_is_active(vc, info)) return; if (!width || !height) return; /* Split blits that cross physical y_wrap case. * Pathological case involves 4 blits, better to use recursive * code rather than unrolled case * * Recursive invocations don't need to erase the cursor over and * over again, so we use fbcon_bmove_rec() */ fbcon_bmove_rec(vc, p, sy, sx, dy, dx, height, width, p->vrows - p->yscroll); } static bool fbcon_scroll(struct vc_data *vc, unsigned int t, unsigned int b, enum con_scroll dir, unsigned int count) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_display *p = &fb_display[vc->vc_num]; int scroll_partial = info->flags & FBINFO_PARTIAL_PAN_OK; if (!fbcon_is_active(vc, info)) return true; fbcon_cursor(vc, false); /* * ++Geert: Only use ywrap/ypan if the console is in text mode * ++Andrew: Only use ypan on hardware text mode when scrolling the * whole screen (prevents flicker). */ switch (dir) { case SM_UP: if (count > vc->vc_rows) /* Maximum realistic size */ count = vc->vc_rows; switch (fb_scrollmode(p)) { case SCROLL_MOVE: fbcon_redraw_blit(vc, info, p, t, b - t - count, count); __fbcon_clear(vc, b - count, 0, count, vc->vc_cols); scr_memsetw((unsigned short *) (vc->vc_origin + vc->vc_size_row * (b - count)), vc->vc_video_erase_char, vc->vc_size_row * count); return true; case SCROLL_WRAP_MOVE: if (b - t - count > 3 * vc->vc_rows >> 2) { if (t > 0) fbcon_bmove(vc, 0, 0, count, 0, t, vc->vc_cols); ywrap_up(vc, count); if (vc->vc_rows - b > 0) fbcon_bmove(vc, b - count, 0, b, 0, vc->vc_rows - b, vc->vc_cols); } else if (info->flags & FBINFO_READS_FAST) fbcon_bmove(vc, t + count, 0, t, 0, b - t - count, vc->vc_cols); else goto redraw_up; __fbcon_clear(vc, b - count, 0, count, vc->vc_cols); break; case SCROLL_PAN_REDRAW: if ((p->yscroll + count <= 2 * (p->vrows - vc->vc_rows)) && ((!scroll_partial && (b - t == vc->vc_rows)) || (scroll_partial && (b - t - count > 3 * vc->vc_rows >> 2)))) { if (t > 0) fbcon_redraw_move(vc, p, 0, t, count); ypan_up_redraw(vc, t, count); if (vc->vc_rows - b > 0) fbcon_redraw_move(vc, p, b, vc->vc_rows - b, b); } else fbcon_redraw_move(vc, p, t + count, b - t - count, t); __fbcon_clear(vc, b - count, 0, count, vc->vc_cols); break; case SCROLL_PAN_MOVE: if ((p->yscroll + count <= 2 * (p->vrows - vc->vc_rows)) && ((!scroll_partial && (b - t == vc->vc_rows)) || (scroll_partial && (b - t - count > 3 * vc->vc_rows >> 2)))) { if (t > 0) fbcon_bmove(vc, 0, 0, count, 0, t, vc->vc_cols); ypan_up(vc, count); if (vc->vc_rows - b > 0) fbcon_bmove(vc, b - count, 0, b, 0, vc->vc_rows - b, vc->vc_cols); } else if (info->flags & FBINFO_READS_FAST) fbcon_bmove(vc, t + count, 0, t, 0, b - t - count, vc->vc_cols); else goto redraw_up; __fbcon_clear(vc, b - count, 0, count, vc->vc_cols); break; case SCROLL_REDRAW: redraw_up: fbcon_redraw(vc, t, b - t - count, count * vc->vc_cols); __fbcon_clear(vc, b - count, 0, count, vc->vc_cols); scr_memsetw((unsigned short *) (vc->vc_origin + vc->vc_size_row * (b - count)), vc->vc_video_erase_char, vc->vc_size_row * count); return true; } break; case SM_DOWN: if (count > vc->vc_rows) /* Maximum realistic size */ count = vc->vc_rows; switch (fb_scrollmode(p)) { case SCROLL_MOVE: fbcon_redraw_blit(vc, info, p, b - 1, b - t - count, -count); __fbcon_clear(vc, t, 0, count, vc->vc_cols); scr_memsetw((unsigned short *) (vc->vc_origin + vc->vc_size_row * t), vc->vc_video_erase_char, vc->vc_size_row * count); return true; case SCROLL_WRAP_MOVE: if (b - t - count > 3 * vc->vc_rows >> 2) { if (vc->vc_rows - b > 0) fbcon_bmove(vc, b, 0, b - count, 0, vc->vc_rows - b, vc->vc_cols); ywrap_down(vc, count); if (t > 0) fbcon_bmove(vc, count, 0, 0, 0, t, vc->vc_cols); } else if (info->flags & FBINFO_READS_FAST) fbcon_bmove(vc, t, 0, t + count, 0, b - t - count, vc->vc_cols); else goto redraw_down; __fbcon_clear(vc, t, 0, count, vc->vc_cols); break; case SCROLL_PAN_MOVE: if ((count - p->yscroll <= p->vrows - vc->vc_rows) && ((!scroll_partial && (b - t == vc->vc_rows)) || (scroll_partial && (b - t - count > 3 * vc->vc_rows >> 2)))) { if (vc->vc_rows - b > 0) fbcon_bmove(vc, b, 0, b - count, 0, vc->vc_rows - b, vc->vc_cols); ypan_down(vc, count); if (t > 0) fbcon_bmove(vc, count, 0, 0, 0, t, vc->vc_cols); } else if (info->flags & FBINFO_READS_FAST) fbcon_bmove(vc, t, 0, t + count, 0, b - t - count, vc->vc_cols); else goto redraw_down; __fbcon_clear(vc, t, 0, count, vc->vc_cols); break; case SCROLL_PAN_REDRAW: if ((count - p->yscroll <= p->vrows - vc->vc_rows) && ((!scroll_partial && (b - t == vc->vc_rows)) || (scroll_partial && (b - t - count > 3 * vc->vc_rows >> 2)))) { if (vc->vc_rows - b > 0) fbcon_redraw_move(vc, p, b, vc->vc_rows - b, b - count); ypan_down_redraw(vc, t, count); if (t > 0) fbcon_redraw_move(vc, p, count, t, 0); } else fbcon_redraw_move(vc, p, t, b - t - count, t + count); __fbcon_clear(vc, t, 0, count, vc->vc_cols); break; case SCROLL_REDRAW: redraw_down: fbcon_redraw(vc, b - 1, b - t - count, -count * vc->vc_cols); __fbcon_clear(vc, t, 0, count, vc->vc_cols); scr_memsetw((unsigned short *) (vc->vc_origin + vc->vc_size_row * t), vc->vc_video_erase_char, vc->vc_size_row * count); return true; } } return false; } static void updatescrollmode_accel(struct fbcon_display *p, struct fb_info *info, struct vc_data *vc) { #ifdef CONFIG_FRAMEBUFFER_CONSOLE_LEGACY_ACCELERATION struct fbcon_par *par = info->fbcon_par; int cap = info->flags; u16 t = 0; int ypan = FBCON_SWAP(par->rotate, info->fix.ypanstep, info->fix.xpanstep); int ywrap = FBCON_SWAP(par->rotate, info->fix.ywrapstep, t); int yres = FBCON_SWAP(par->rotate, info->var.yres, info->var.xres); int vyres = FBCON_SWAP(par->rotate, info->var.yres_virtual, info->var.xres_virtual); int good_pan = (cap & FBINFO_HWACCEL_YPAN) && divides(ypan, vc->vc_font.height) && vyres > yres; int good_wrap = (cap & FBINFO_HWACCEL_YWRAP) && divides(ywrap, vc->vc_font.height) && divides(vc->vc_font.height, vyres) && divides(vc->vc_font.height, yres); int reading_fast = cap & FBINFO_READS_FAST; int fast_copyarea = (cap & FBINFO_HWACCEL_COPYAREA) && !(cap & FBINFO_HWACCEL_DISABLED); int fast_imageblit = (cap & FBINFO_HWACCEL_IMAGEBLIT) && !(cap & FBINFO_HWACCEL_DISABLED); if (good_wrap || good_pan) { if (reading_fast || fast_copyarea) p->scrollmode = good_wrap ? SCROLL_WRAP_MOVE : SCROLL_PAN_MOVE; else p->scrollmode = good_wrap ? SCROLL_REDRAW : SCROLL_PAN_REDRAW; } else { if (reading_fast || (fast_copyarea && !fast_imageblit)) p->scrollmode = SCROLL_MOVE; else p->scrollmode = SCROLL_REDRAW; } #endif } static void updatescrollmode(struct fbcon_display *p, struct fb_info *info, struct vc_data *vc) { struct fbcon_par *par = info->fbcon_par; int fh = vc->vc_font.height; int yres = FBCON_SWAP(par->rotate, info->var.yres, info->var.xres); int vyres = FBCON_SWAP(par->rotate, info->var.yres_virtual, info->var.xres_virtual); p->vrows = vyres/fh; if (yres > (fh * (vc->vc_rows + 1))) p->vrows -= (yres - (fh * vc->vc_rows)) / fh; if ((yres % fh) && (vyres % fh < yres % fh)) p->vrows--; /* update scrollmode in case hardware acceleration is used */ updatescrollmode_accel(p, info, vc); } #define PITCH(w) (((w) + 7) >> 3) #define CALC_FONTSZ(h, p, c) ((h) * (p) * (c)) /* size = height * pitch * charcount */ static int fbcon_resize(struct vc_data *vc, unsigned int width, unsigned int height, bool from_user) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_par *par = info->fbcon_par; struct fbcon_display *p = &fb_display[vc->vc_num]; struct fb_var_screeninfo var = info->var; int x_diff, y_diff, virt_w, virt_h, virt_fw, virt_fh; if (p->userfont && FNTSIZE(vc->vc_font.data)) { int size; int pitch = PITCH(vc->vc_font.width); /* * If user font, ensure that a possible change to user font * height or width will not allow a font data out-of-bounds access. * NOTE: must use original charcount in calculation as font * charcount can change and cannot be used to determine the * font data allocated size. */ if (pitch <= 0) return -EINVAL; size = CALC_FONTSZ(vc->vc_font.height, pitch, vc->vc_font.charcount); if (size > FNTSIZE(vc->vc_font.data)) return -EINVAL; } virt_w = FBCON_SWAP(par->rotate, width, height); virt_h = FBCON_SWAP(par->rotate, height, width); virt_fw = FBCON_SWAP(par->rotate, vc->vc_font.width, vc->vc_font.height); virt_fh = FBCON_SWAP(par->rotate, vc->vc_font.height, vc->vc_font.width); var.xres = virt_w * virt_fw; var.yres = virt_h * virt_fh; x_diff = info->var.xres - var.xres; y_diff = info->var.yres - var.yres; if (x_diff < 0 || x_diff > virt_fw || y_diff < 0 || y_diff > virt_fh) { const struct fb_videomode *mode; pr_debug("attempting resize %ix%i\n", var.xres, var.yres); mode = fb_find_best_mode(&var, &info->modelist); if (mode == NULL) return -EINVAL; display_to_var(&var, p); fb_videomode_to_var(&var, mode); if (virt_w > var.xres/virt_fw || virt_h > var.yres/virt_fh) return -EINVAL; pr_debug("resize now %ix%i\n", var.xres, var.yres); if (con_is_visible(vc) && vc->vc_mode == KD_TEXT) { var.activate = FB_ACTIVATE_NOW | FB_ACTIVATE_FORCE; fb_set_var(info, &var); } var_to_display(p, &info->var, info); par->var = info->var; } updatescrollmode(p, info, vc); return 0; } static bool fbcon_switch(struct vc_data *vc) { struct fb_info *info, *old_info = NULL; struct fbcon_par *par; struct fbcon_display *p = &fb_display[vc->vc_num]; struct fb_var_screeninfo var; int i, ret, prev_console; info = fbcon_info_from_console(vc->vc_num); par = info->fbcon_par; if (logo_shown >= 0) { struct vc_data *conp2 = vc_cons[logo_shown].d; if (conp2->vc_top == logo_lines && conp2->vc_bottom == conp2->vc_rows) conp2->vc_top = 0; logo_shown = FBCON_LOGO_CANSHOW; } prev_console = par->currcon; if (prev_console != -1) old_info = fbcon_info_from_console(prev_console); /* * FIXME: If we have multiple fbdev's loaded, we need to * update all info->currcon. Perhaps, we can place this * in a centralized structure, but this might break some * drivers. * * info->currcon = vc->vc_num; */ fbcon_for_each_registered_fb(i) { if (fbcon_registered_fb[i]->fbcon_par) { struct fbcon_par *par = fbcon_registered_fb[i]->fbcon_par; par->currcon = vc->vc_num; } } memset(&var, 0, sizeof(struct fb_var_screeninfo)); display_to_var(&var, p); var.activate = FB_ACTIVATE_NOW; /* * make sure we don't unnecessarily trip the memcmp() * in fb_set_var() */ info->var.activate = var.activate; var.vmode |= info->var.vmode & ~FB_VMODE_MASK; fb_set_var(info, &var); par->var = info->var; if (old_info != NULL && (old_info != info || info->flags & FBINFO_MISC_ALWAYS_SETPAR)) { if (info->fbops->fb_set_par) { ret = info->fbops->fb_set_par(info); if (ret) printk(KERN_ERR "fbcon_switch: detected " "unhandled fb_set_par error, " "error code %d\n", ret); } if (old_info != info) fbcon_del_cursor_work(old_info); } if (!fbcon_is_active(vc, info) || par->blank_state != FB_BLANK_UNBLANK) fbcon_del_cursor_work(info); else fbcon_add_cursor_work(info); set_blitting_type(vc, info); par->cursor_reset = 1; if (par->bitops->rotate_font && par->bitops->rotate_font(info, vc)) { par->rotate = FB_ROTATE_UR; set_blitting_type(vc, info); } vc->vc_can_do_color = (fb_get_color_depth(&info->var, &info->fix)!=1); vc->vc_complement_mask = vc->vc_can_do_color ? 0x7700 : 0x0800; if (vc->vc_font.charcount > 256) vc->vc_complement_mask <<= 1; updatescrollmode(p, info, vc); switch (fb_scrollmode(p)) { case SCROLL_WRAP_MOVE: scrollback_phys_max = p->vrows - vc->vc_rows; break; case SCROLL_PAN_MOVE: case SCROLL_PAN_REDRAW: scrollback_phys_max = p->vrows - 2 * vc->vc_rows; if (scrollback_phys_max < 0) scrollback_phys_max = 0; break; default: scrollback_phys_max = 0; break; } scrollback_max = 0; scrollback_current = 0; if (fbcon_is_active(vc, info)) { par->var.xoffset = par->var.yoffset = p->yscroll = 0; par->bitops->update_start(info); } fbcon_set_palette(vc, color_table); fbcon_clear_margins(vc, 0); if (logo_shown == FBCON_LOGO_DRAW) { logo_shown = fg_console; fb_show_logo(info, par->rotate); update_region(vc, vc->vc_origin + vc->vc_size_row * vc->vc_top, vc->vc_size_row * (vc->vc_bottom - vc->vc_top) / 2); return false; } return true; } static void fbcon_generic_blank(struct vc_data *vc, struct fb_info *info, int blank) { if (blank) { unsigned short charmask = vc->vc_hi_font_mask ? 0x1ff : 0xff; unsigned short oldc; oldc = vc->vc_video_erase_char; vc->vc_video_erase_char &= charmask; __fbcon_clear(vc, 0, 0, vc->vc_rows, vc->vc_cols); vc->vc_video_erase_char = oldc; } } static bool fbcon_blank(struct vc_data *vc, enum vesa_blank_mode blank, bool mode_switch) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_par *par = info->fbcon_par; if (mode_switch) { struct fb_var_screeninfo var = info->var; par->graphics = 1; if (!blank) { var.activate = FB_ACTIVATE_NOW | FB_ACTIVATE_FORCE | FB_ACTIVATE_KD_TEXT; fb_set_var(info, &var); par->graphics = 0; par->var = info->var; } } if (fbcon_is_active(vc, info)) { if (par->blank_state != blank) { par->blank_state = blank; fbcon_cursor(vc, !blank); par->cursor_flash = (!blank); if (fb_blank(info, blank)) fbcon_generic_blank(vc, info, blank); } if (!blank) update_screen(vc); } if (mode_switch || !fbcon_is_active(vc, info) || par->blank_state != FB_BLANK_UNBLANK) fbcon_del_cursor_work(info); else fbcon_add_cursor_work(info); return false; } static void fbcon_debug_enter(struct vc_data *vc) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_par *par = info->fbcon_par; par->save_graphics = par->graphics; par->graphics = 0; if (info->fbops->fb_debug_enter) info->fbops->fb_debug_enter(info); fbcon_set_palette(vc, color_table); } static void fbcon_debug_leave(struct vc_data *vc) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_par *par = info->fbcon_par; par->graphics = par->save_graphics; if (info->fbops->fb_debug_leave) info->fbops->fb_debug_leave(info); } static int fbcon_get_font(struct vc_data *vc, struct console_font *font, unsigned int vpitch) { u8 *fontdata = vc->vc_font.data; u8 *data = font->data; int i, j; font->width = vc->vc_font.width; font->height = vc->vc_font.height; if (font->height > vpitch) return -ENOSPC; font->charcount = vc->vc_hi_font_mask ? 512 : 256; if (!font->data) return 0; if (font->width <= 8) { j = vc->vc_font.height; if (font->charcount * j > FNTSIZE(fontdata)) return -EINVAL; for (i = 0; i < font->charcount; i++) { memcpy(data, fontdata, j); memset(data + j, 0, vpitch - j); data += vpitch; fontdata += j; } } else if (font->width <= 16) { j = vc->vc_font.height * 2; if (font->charcount * j > FNTSIZE(fontdata)) return -EINVAL; for (i = 0; i < font->charcount; i++) { memcpy(data, fontdata, j); memset(data + j, 0, 2*vpitch - j); data += 2*vpitch; fontdata += j; } } else if (font->width <= 24) { if (font->charcount * (vc->vc_font.height * sizeof(u32)) > FNTSIZE(fontdata)) return -EINVAL; for (i = 0; i < font->charcount; i++) { for (j = 0; j < vc->vc_font.height; j++) { *data++ = fontdata[0]; *data++ = fontdata[1]; *data++ = fontdata[2]; fontdata += sizeof(u32); } memset(data, 0, 3 * (vpitch - j)); data += 3 * (vpitch - j); } } else { j = vc->vc_font.height * 4; if (font->charcount * j > FNTSIZE(fontdata)) return -EINVAL; for (i = 0; i < font->charcount; i++) { memcpy(data, fontdata, j); memset(data + j, 0, 4 * vpitch - j); data += 4 * vpitch; fontdata += j; } } return 0; } /* set/clear vc_hi_font_mask and update vc attrs accordingly */ static void set_vc_hi_font(struct vc_data *vc, bool set) { if (!set) { vc->vc_hi_font_mask = 0; if (vc->vc_can_do_color) { vc->vc_complement_mask >>= 1; vc->vc_s_complement_mask >>= 1; } /* ++Edmund: reorder the attribute bits */ if (vc->vc_can_do_color) { unsigned short *cp = (unsigned short *) vc->vc_origin; int count = vc->vc_screenbuf_size / 2; unsigned short c; for (; count > 0; count--, cp++) { c = scr_readw(cp); scr_writew(((c & 0xfe00) >> 1) | (c & 0xff), cp); } c = vc->vc_video_erase_char; vc->vc_video_erase_char = ((c & 0xfe00) >> 1) | (c & 0xff); vc->vc_attr >>= 1; } } else { vc->vc_hi_font_mask = 0x100; if (vc->vc_can_do_color) { vc->vc_complement_mask <<= 1; vc->vc_s_complement_mask <<= 1; } /* ++Edmund: reorder the attribute bits */ { unsigned short *cp = (unsigned short *) vc->vc_origin; int count = vc->vc_screenbuf_size / 2; unsigned short c; for (; count > 0; count--, cp++) { unsigned short newc; c = scr_readw(cp); if (vc->vc_can_do_color) newc = ((c & 0xff00) << 1) | (c & 0xff); else newc = c & ~0x100; scr_writew(newc, cp); } c = vc->vc_video_erase_char; if (vc->vc_can_do_color) { vc->vc_video_erase_char = ((c & 0xff00) << 1) | (c & 0xff); vc->vc_attr <<= 1; } else vc->vc_video_erase_char = c & ~0x100; } } } static int fbcon_do_set_font(struct vc_data *vc, int w, int h, int charcount, const u8 * data, int userfont) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_par *par = info->fbcon_par; struct fbcon_display *p = &fb_display[vc->vc_num]; int resize, ret, old_userfont, old_width, old_height, old_charcount; u8 *old_data = vc->vc_font.data; resize = (w != vc->vc_font.width) || (h != vc->vc_font.height); vc->vc_font.data = (void *)(p->fontdata = data); old_userfont = p->userfont; if ((p->userfont = userfont)) REFCOUNT(data)++; old_width = vc->vc_font.width; old_height = vc->vc_font.height; old_charcount = vc->vc_font.charcount; vc->vc_font.width = w; vc->vc_font.height = h; vc->vc_font.charcount = charcount; if (vc->vc_hi_font_mask && charcount == 256) set_vc_hi_font(vc, false); else if (!vc->vc_hi_font_mask && charcount == 512) set_vc_hi_font(vc, true); if (resize) { int cols, rows; cols = FBCON_SWAP(par->rotate, info->var.xres, info->var.yres); rows = FBCON_SWAP(par->rotate, info->var.yres, info->var.xres); cols /= w; rows /= h; ret = vc_resize(vc, cols, rows); if (ret) goto err_out; } else if (con_is_visible(vc) && vc->vc_mode == KD_TEXT) { fbcon_clear_margins(vc, 0); update_screen(vc); } if (old_userfont && (--REFCOUNT(old_data) == 0)) kfree(old_data - FONT_EXTRA_WORDS * sizeof(int)); return 0; err_out: p->fontdata = old_data; vc->vc_font.data = old_data; if (userfont) { p->userfont = old_userfont; if (--REFCOUNT(data) == 0) kfree(data - FONT_EXTRA_WORDS * sizeof(int)); } vc->vc_font.width = old_width; vc->vc_font.height = old_height; vc->vc_font.charcount = old_charcount; return ret; } /* * User asked to set font; we are guaranteed that charcount does not exceed 512 * but lets not assume that, since charcount of 512 is small for unicode support. */ static int fbcon_set_font(struct vc_data *vc, const struct console_font *font, unsigned int vpitch, unsigned int flags) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); unsigned charcount = font->charcount; int w = font->width; int h = font->height; int size, alloc_size; int i, csum; u8 *new_data, *data = font->data; int pitch = PITCH(font->width); /* Is there a reason why fbconsole couldn't handle any charcount >256? * If not this check should be changed to charcount < 256 */ if (charcount != 256 && charcount != 512) return -EINVAL; /* font bigger than screen resolution ? */ if (w > FBCON_SWAP(info->var.rotate, info->var.xres, info->var.yres) || h > FBCON_SWAP(info->var.rotate, info->var.yres, info->var.xres)) return -EINVAL; if (font->width > FB_MAX_BLIT_WIDTH || font->height > FB_MAX_BLIT_HEIGHT) return -EINVAL; /* Make sure drawing engine can handle the font */ if (!test_bit(font->width - 1, info->pixmap.blit_x) || !test_bit(font->height - 1, info->pixmap.blit_y)) return -EINVAL; /* Make sure driver can handle the font length */ if (fbcon_invalid_charcount(info, charcount)) return -EINVAL; /* Check for integer overflow in font size calculation */ if (check_mul_overflow(h, pitch, &size) || check_mul_overflow(size, charcount, &size)) return -EINVAL; /* Check for overflow in allocation size calculation */ if (check_add_overflow(FONT_EXTRA_WORDS * sizeof(int), size, &alloc_size)) return -EINVAL; new_data = kmalloc(alloc_size, GFP_USER); if (!new_data) return -ENOMEM; memset(new_data, 0, FONT_EXTRA_WORDS * sizeof(int)); new_data += FONT_EXTRA_WORDS * sizeof(int); FNTSIZE(new_data) = size; REFCOUNT(new_data) = 0; /* usage counter */ for (i=0; i< charcount; i++) { memcpy(new_data + i*h*pitch, data + i*vpitch*pitch, h*pitch); } /* Since linux has a nice crc32 function use it for counting font * checksums. */ csum = crc32(0, new_data, size); FNTSUM(new_data) = csum; /* Check if the same font is on some other console already */ for (i = first_fb_vc; i <= last_fb_vc; i++) { struct vc_data *tmp = vc_cons[i].d; if (fb_display[i].userfont && fb_display[i].fontdata && FNTSUM(fb_display[i].fontdata) == csum && FNTSIZE(fb_display[i].fontdata) == size && tmp->vc_font.width == w && !memcmp(fb_display[i].fontdata, new_data, size)) { kfree(new_data - FONT_EXTRA_WORDS * sizeof(int)); new_data = (u8 *)fb_display[i].fontdata; break; } } return fbcon_do_set_font(vc, font->width, font->height, charcount, new_data, 1); } static int fbcon_set_def_font(struct vc_data *vc, struct console_font *font, const char *name) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); const struct font_desc *f; if (!name) f = get_default_font(info->var.xres, info->var.yres, info->pixmap.blit_x, info->pixmap.blit_y); else if (!(f = find_font(name))) return -ENOENT; font->width = f->width; font->height = f->height; return fbcon_do_set_font(vc, f->width, f->height, f->charcount, f->data, 0); } static u16 palette_red[16]; static u16 palette_green[16]; static u16 palette_blue[16]; static struct fb_cmap palette_cmap = { 0, 16, palette_red, palette_green, palette_blue, NULL }; static void fbcon_set_palette(struct vc_data *vc, const unsigned char *table) { struct fb_info *info = fbcon_info_from_console(vc->vc_num); int i, j, k, depth; u8 val; if (!fbcon_is_active(vc, info)) return; if (!con_is_visible(vc)) return; depth = fb_get_color_depth(&info->var, &info->fix); if (depth > 3) { for (i = j = 0; i < 16; i++) { k = table[i]; val = vc->vc_palette[j++]; palette_red[k] = (val << 8) | val; val = vc->vc_palette[j++]; palette_green[k] = (val << 8) | val; val = vc->vc_palette[j++]; palette_blue[k] = (val << 8) | val; } palette_cmap.len = 16; palette_cmap.start = 0; /* * If framebuffer is capable of less than 16 colors, * use default palette of fbcon. */ } else fb_copy_cmap(fb_default_cmap(1 << depth), &palette_cmap); fb_set_cmap(&palette_cmap, info); } /* As we might be inside of softback, we may work with non-contiguous buffer, that's why we have to use a separate routine. */ static void fbcon_invert_region(struct vc_data *vc, u16 * p, int cnt) { while (cnt--) { u16 a = scr_readw(p); if (!vc->vc_can_do_color) a ^= 0x0800; else if (vc->vc_hi_font_mask == 0x100) a = ((a) & 0x11ff) | (((a) & 0xe000) >> 4) | (((a) & 0x0e00) << 4); else a = ((a) & 0x88ff) | (((a) & 0x7000) >> 4) | (((a) & 0x0700) << 4); scr_writew(a, p++); } } void fbcon_suspended(struct fb_info *info) { struct vc_data *vc = NULL; struct fbcon_par *par = info->fbcon_par; if (!par || par->currcon < 0) return; vc = vc_cons[par->currcon].d; /* Clear cursor, restore saved data */ fbcon_cursor(vc, false); } void fbcon_resumed(struct fb_info *info) { struct vc_data *vc; struct fbcon_par *par = info->fbcon_par; if (!par || par->currcon < 0) return; vc = vc_cons[par->currcon].d; update_screen(vc); } static void fbcon_modechanged(struct fb_info *info) { struct fbcon_par *par = info->fbcon_par; struct vc_data *vc; struct fbcon_display *p; int rows, cols; if (!par || par->currcon < 0) return; vc = vc_cons[par->currcon].d; if (vc->vc_mode != KD_TEXT || fbcon_info_from_console(par->currcon) != info) return; p = &fb_display[vc->vc_num]; set_blitting_type(vc, info); if (con_is_visible(vc)) { var_to_display(p, &info->var, info); cols = FBCON_SWAP(par->rotate, info->var.xres, info->var.yres); rows = FBCON_SWAP(par->rotate, info->var.yres, info->var.xres); cols /= vc->vc_font.width; rows /= vc->vc_font.height; vc_resize(vc, cols, rows); updatescrollmode(p, info, vc); scrollback_max = 0; scrollback_current = 0; if (fbcon_is_active(vc, info)) { par->var.xoffset = par->var.yoffset = p->yscroll = 0; par->bitops->update_start(info); } fbcon_set_palette(vc, color_table); update_screen(vc); } } static void fbcon_set_all_vcs(struct fb_info *info) { struct fbcon_par *par = info->fbcon_par; struct vc_data *vc; struct fbcon_display *p; int i, rows, cols, fg = -1; if (!par || par->currcon < 0) return; for (i = first_fb_vc; i <= last_fb_vc; i++) { vc = vc_cons[i].d; if (!vc || vc->vc_mode != KD_TEXT || fbcon_info_from_console(i) != info) continue; if (con_is_visible(vc)) { fg = i; continue; } p = &fb_display[vc->vc_num]; set_blitting_type(vc, info); var_to_display(p, &info->var, info); cols = FBCON_SWAP(par->rotate, info->var.xres, info->var.yres); rows = FBCON_SWAP(par->rotate, info->var.yres, info->var.xres); cols /= vc->vc_font.width; rows /= vc->vc_font.height; vc_resize(vc, cols, rows); } if (fg != -1) fbcon_modechanged(info); } void fbcon_update_vcs(struct fb_info *info, bool all) { if (all) fbcon_set_all_vcs(info); else fbcon_modechanged(info); } EXPORT_SYMBOL(fbcon_update_vcs); /* let fbcon check if it supports a new screen resolution */ int fbcon_modechange_possible(struct fb_info *info, struct fb_var_screeninfo *var) { struct fbcon_par *par = info->fbcon_par; struct vc_data *vc; unsigned int i; WARN_CONSOLE_UNLOCKED(); if (!par) return 0; /* prevent setting a screen size which is smaller than font size */ for (i = first_fb_vc; i <= last_fb_vc; i++) { vc = vc_cons[i].d; if (!vc || vc->vc_mode != KD_TEXT || fbcon_info_from_console(i) != info) continue; if (vc->vc_font.width > FBCON_SWAP(var->rotate, var->xres, var->yres) || vc->vc_font.height > FBCON_SWAP(var->rotate, var->yres, var->xres)) return -EINVAL; } return 0; } EXPORT_SYMBOL_GPL(fbcon_modechange_possible); int fbcon_mode_deleted(struct fb_info *info, struct fb_videomode *mode) { struct fb_info *fb_info; struct fbcon_display *p; int i, j, found = 0; /* before deletion, ensure that mode is not in use */ for (i = first_fb_vc; i <= last_fb_vc; i++) { j = con2fb_map[i]; if (j == -1) continue; fb_info = fbcon_registered_fb[j]; if (fb_info != info) continue; p = &fb_display[i]; if (!p || !p->mode) continue; if (fb_mode_is_equal(p->mode, mode)) { found = 1; break; } } return found; } static void fbcon_delete_mode(struct fb_videomode *m) { struct fbcon_display *p; for (int i = first_fb_vc; i <= last_fb_vc; i++) { p = &fb_display[i]; if (p->mode == m) p->mode = NULL; } } void fbcon_delete_modelist(struct list_head *head) { struct fb_modelist *modelist; list_for_each_entry(modelist, head, list) fbcon_delete_mode(&modelist->mode); } #ifdef CONFIG_VT_HW_CONSOLE_BINDING static void fbcon_unbind(void) { int ret; ret = do_unbind_con_driver(&fb_con, first_fb_vc, last_fb_vc, fbcon_is_default); if (!ret) fbcon_has_console_bind = false; } #else static inline void fbcon_unbind(void) {} #endif /* CONFIG_VT_HW_CONSOLE_BINDING */ void fbcon_fb_unbind(struct fb_info *info) { int i, new_idx = -1; int idx = info->node; console_lock(); if (!fbcon_has_console_bind) { console_unlock(); return; } for (i = first_fb_vc; i <= last_fb_vc; i++) { if (con2fb_map[i] != idx && con2fb_map[i] != -1) { new_idx = con2fb_map[i]; break; } } if (new_idx != -1) { for (i = first_fb_vc; i <= last_fb_vc; i++) { if (con2fb_map[i] == idx) set_con2fb_map(i, new_idx, 0); } } else { struct fb_info *info = fbcon_registered_fb[idx]; /* This is sort of like set_con2fb_map, except it maps * the consoles to no device and then releases the * oldinfo to free memory and cancel the cursor blink * timer. I can imagine this just becoming part of * set_con2fb_map where new_idx is -1 */ for (i = first_fb_vc; i <= last_fb_vc; i++) { if (con2fb_map[i] == idx) { con2fb_map[i] = -1; if (!search_fb_in_map(idx)) { con2fb_release_oldinfo(vc_cons[i].d, info, NULL); } } } fbcon_unbind(); } console_unlock(); } void fbcon_fb_unregistered(struct fb_info *info) { int i, idx; console_lock(); if (info->device && dev_is_pci(info->device)) vga_switcheroo_client_fb_set(to_pci_dev(info->device), NULL); fbcon_registered_fb[info->node] = NULL; fbcon_num_registered_fb--; if (deferred_takeover) { console_unlock(); return; } idx = info->node; for (i = first_fb_vc; i <= last_fb_vc; i++) { if (con2fb_map[i] == idx) con2fb_map[i] = -1; } if (idx == info_idx) { info_idx = -1; fbcon_for_each_registered_fb(i) { info_idx = i; break; } } if (info_idx != -1) { for (i = first_fb_vc; i <= last_fb_vc; i++) { if (con2fb_map[i] == -1) con2fb_map[i] = info_idx; } } if (primary_device == idx) primary_device = -1; if (!fbcon_num_registered_fb) do_unregister_con_driver(&fb_con); console_unlock(); } void fbcon_remap_all(struct fb_info *info) { int i, idx = info->node; console_lock(); if (deferred_takeover) { for (i = first_fb_vc; i <= last_fb_vc; i++) con2fb_map_boot[i] = idx; fbcon_map_override(); console_unlock(); return; } for (i = first_fb_vc; i <= last_fb_vc; i++) set_con2fb_map(i, idx, 0); if (con_is_bound(&fb_con)) { printk(KERN_INFO "fbcon: Remapping primary device, " "fb%i, to tty %i-%i\n", idx, first_fb_vc + 1, last_fb_vc + 1); info_idx = idx; } console_unlock(); } #ifdef CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY static void fbcon_select_primary(struct fb_info *info) { if (!map_override && primary_device == -1 && video_is_primary_device(info->device)) { int i; printk(KERN_INFO "fbcon: %s (fb%i) is primary device\n", info->fix.id, info->node); primary_device = info->node; for (i = first_fb_vc; i <= last_fb_vc; i++) con2fb_map_boot[i] = primary_device; if (con_is_bound(&fb_con)) { printk(KERN_INFO "fbcon: Remapping primary device, " "fb%i, to tty %i-%i\n", info->node, first_fb_vc + 1, last_fb_vc + 1); info_idx = primary_device; } } } #else static inline void fbcon_select_primary(struct fb_info *info) { return; } #endif /* CONFIG_FRAMEBUFFER_DETECT_PRIMARY */ static bool lockless_register_fb; module_param_named_unsafe(lockless_register_fb, lockless_register_fb, bool, 0400); MODULE_PARM_DESC(lockless_register_fb, "Lockless framebuffer registration for debugging [default=off]"); /* called with console_lock held */ static int do_fb_registered(struct fb_info *info) { int ret = 0, i, idx; WARN_CONSOLE_UNLOCKED(); fbcon_registered_fb[info->node] = info; fbcon_num_registered_fb++; idx = info->node; fbcon_select_primary(info); if (deferred_takeover) { pr_info("fbcon: Deferring console take-over\n"); return 0; } if (info_idx == -1) { for (i = first_fb_vc; i <= last_fb_vc; i++) { if (con2fb_map_boot[i] == idx) { info_idx = idx; break; } } if (info_idx != -1) ret = do_fbcon_takeover(1); } else { for (i = first_fb_vc; i <= last_fb_vc; i++) { if (con2fb_map_boot[i] == idx) set_con2fb_map(i, idx, 0); } } /* Set the fb info for vga_switcheroo clients. Does nothing otherwise. */ if (info->device && dev_is_pci(info->device)) vga_switcheroo_client_fb_set(to_pci_dev(info->device), info); return ret; } int fbcon_fb_registered(struct fb_info *info) { int ret; if (!lockless_register_fb) console_lock(); else atomic_inc(&ignore_console_lock_warning); ret = do_fb_registered(info); if (!lockless_register_fb) console_unlock(); else atomic_dec(&ignore_console_lock_warning); return ret; } void fbcon_fb_blanked(struct fb_info *info, int blank) { struct fbcon_par *par = info->fbcon_par; struct vc_data *vc; if (!par || par->currcon < 0) return; vc = vc_cons[par->currcon].d; if (vc->vc_mode != KD_TEXT || fbcon_info_from_console(par->currcon) != info) return; if (con_is_visible(vc)) { if (blank) do_blank_screen(0); else do_unblank_screen(0); } par->blank_state = blank; } void fbcon_new_modelist(struct fb_info *info) { int i; struct vc_data *vc; struct fb_var_screeninfo var; const struct fb_videomode *mode; for (i = first_fb_vc; i <= last_fb_vc; i++) { if (fbcon_info_from_console(i) != info) continue; if (!fb_display[i].mode) continue; vc = vc_cons[i].d; display_to_var(&var, &fb_display[i]); mode = fb_find_nearest_mode(fb_display[i].mode, &info->modelist); fb_videomode_to_var(&var, mode); fbcon_set_disp(info, &var, vc->vc_num); } } void fbcon_get_requirement(struct fb_info *info, struct fb_blit_caps *caps) { struct vc_data *vc; if (caps->flags) { int i, charcnt; for (i = first_fb_vc; i <= last_fb_vc; i++) { vc = vc_cons[i].d; if (vc && vc->vc_mode == KD_TEXT && info->node == con2fb_map[i]) { set_bit(vc->vc_font.width - 1, caps->x); set_bit(vc->vc_font.height - 1, caps->y); charcnt = vc->vc_font.charcount; if (caps->len < charcnt) caps->len = charcnt; } } } else { vc = vc_cons[fg_console].d; if (vc && vc->vc_mode == KD_TEXT && info->node == con2fb_map[fg_console]) { bitmap_zero(caps->x, FB_MAX_BLIT_WIDTH); set_bit(vc->vc_font.width - 1, caps->x); bitmap_zero(caps->y, FB_MAX_BLIT_HEIGHT); set_bit(vc->vc_font.height - 1, caps->y); caps->len = vc->vc_font.charcount; } } } int fbcon_set_con2fb_map_ioctl(void __user *argp) { struct fb_con2fbmap con2fb; int ret; if (copy_from_user(&con2fb, argp, sizeof(con2fb))) return -EFAULT; if (con2fb.console < 1 || con2fb.console > MAX_NR_CONSOLES) return -EINVAL; if (con2fb.framebuffer >= FB_MAX) return -EINVAL; if (!fbcon_registered_fb[con2fb.framebuffer]) request_module("fb%d", con2fb.framebuffer); if (!fbcon_registered_fb[con2fb.framebuffer]) { return -EINVAL; } console_lock(); ret = set_con2fb_map(con2fb.console - 1, con2fb.framebuffer, 1); console_unlock(); return ret; } int fbcon_get_con2fb_map_ioctl(void __user *argp) { struct fb_con2fbmap con2fb; if (copy_from_user(&con2fb, argp, sizeof(con2fb))) return -EFAULT; if (con2fb.console < 1 || con2fb.console > MAX_NR_CONSOLES) return -EINVAL; console_lock(); con2fb.framebuffer = con2fb_map[con2fb.console - 1]; console_unlock(); return copy_to_user(argp, &con2fb, sizeof(con2fb)) ? -EFAULT : 0; } /* * The console `switch' structure for the frame buffer based console */ static const struct consw fb_con = { .owner = THIS_MODULE, .con_startup = fbcon_startup, .con_init = fbcon_init, .con_deinit = fbcon_deinit, .con_clear = fbcon_clear, .con_putcs = fbcon_putcs, .con_cursor = fbcon_cursor, .con_scroll = fbcon_scroll, .con_switch = fbcon_switch, .con_blank = fbcon_blank, .con_font_set = fbcon_set_font, .con_font_get = fbcon_get_font, .con_font_default = fbcon_set_def_font, .con_set_palette = fbcon_set_palette, .con_invert_region = fbcon_invert_region, .con_resize = fbcon_resize, .con_debug_enter = fbcon_debug_enter, .con_debug_leave = fbcon_debug_leave, }; static ssize_t rotate_store(struct device *device, struct device_attribute *attr, const char *buf, size_t count) { struct fb_info *info; int rotate, idx; char **last = NULL; console_lock(); idx = con2fb_map[fg_console]; if (idx == -1 || fbcon_registered_fb[idx] == NULL) goto err; info = fbcon_registered_fb[idx]; rotate = simple_strtoul(buf, last, 0); fbcon_rotate(info, rotate); err: console_unlock(); return count; } static ssize_t rotate_all_store(struct device *device, struct device_attribute *attr,const char *buf, size_t count) { struct fb_info *info; int rotate, idx; char **last = NULL; console_lock(); idx = con2fb_map[fg_console]; if (idx == -1 || fbcon_registered_fb[idx] == NULL) goto err; info = fbcon_registered_fb[idx]; rotate = simple_strtoul(buf, last, 0); fbcon_rotate_all(info, rotate); err: console_unlock(); return count; } static ssize_t rotate_show(struct device *device, struct device_attribute *attr,char *buf) { struct fb_info *info; int rotate = 0, idx; console_lock(); idx = con2fb_map[fg_console]; if (idx == -1 || fbcon_registered_fb[idx] == NULL) goto err; info = fbcon_registered_fb[idx]; rotate = fbcon_get_rotate(info); err: console_unlock(); return sysfs_emit(buf, "%d\n", rotate); } static ssize_t cursor_blink_show(struct device *device, struct device_attribute *attr, char *buf) { struct fb_info *info; struct fbcon_par *par; int idx, blink = -1; console_lock(); idx = con2fb_map[fg_console]; if (idx == -1 || fbcon_registered_fb[idx] == NULL) goto err; info = fbcon_registered_fb[idx]; par = info->fbcon_par; if (!par) goto err; blink = delayed_work_pending(&par->cursor_work); err: console_unlock(); return sysfs_emit(buf, "%d\n", blink); } static ssize_t cursor_blink_store(struct device *device, struct device_attribute *attr, const char *buf, size_t count) { struct fb_info *info; char **last = NULL; bool blink; int idx; console_lock(); idx = con2fb_map[fg_console]; if (idx == -1 || fbcon_registered_fb[idx] == NULL) goto err; info = fbcon_registered_fb[idx]; if (!info->fbcon_par) goto err; blink = simple_strtoul(buf, last, 0); if (blink) { fbcon_cursor_blink = true; fbcon_add_cursor_work(info); } else { fbcon_cursor_blink = false; fbcon_del_cursor_work(info); } err: console_unlock(); return count; } static DEVICE_ATTR_RW(cursor_blink); static DEVICE_ATTR_RW(rotate); static DEVICE_ATTR_WO(rotate_all); static struct attribute *fbcon_device_attrs[] = { &dev_attr_cursor_blink.attr, &dev_attr_rotate.attr, &dev_attr_rotate_all.attr, NULL }; ATTRIBUTE_GROUPS(fbcon_device); #ifdef CONFIG_FRAMEBUFFER_CONSOLE_DEFERRED_TAKEOVER static void fbcon_register_existing_fbs(struct work_struct *work) { int i; console_lock(); deferred_takeover = false; logo_shown = FBCON_LOGO_DONTSHOW; fbcon_for_each_registered_fb(i) do_fb_registered(fbcon_registered_fb[i]); console_unlock(); } static struct notifier_block fbcon_output_nb; static DECLARE_WORK(fbcon_deferred_takeover_work, fbcon_register_existing_fbs); static int fbcon_output_notifier(struct notifier_block *nb, unsigned long action, void *data) { WARN_CONSOLE_UNLOCKED(); pr_info("fbcon: Taking over console\n"); dummycon_unregister_output_notifier(&fbcon_output_nb); /* We may get called in atomic context */ schedule_work(&fbcon_deferred_takeover_work); return NOTIFY_OK; } #endif static void fbcon_start(void) { WARN_CONSOLE_UNLOCKED(); #ifdef CONFIG_FRAMEBUFFER_CONSOLE_DEFERRED_TAKEOVER if (conswitchp != &dummy_con) deferred_takeover = false; if (deferred_takeover) { fbcon_output_nb.notifier_call = fbcon_output_notifier; dummycon_register_output_notifier(&fbcon_output_nb); return; } #endif } void __init fb_console_init(void) { int i; console_lock(); fbcon_device = device_create_with_groups(fb_class, NULL, MKDEV(0, 0), NULL, fbcon_device_groups, "fbcon"); if (IS_ERR(fbcon_device)) { printk(KERN_WARNING "Unable to create device " "for fbcon; errno = %ld\n", PTR_ERR(fbcon_device)); fbcon_device = NULL; } for (i = 0; i < MAX_NR_CONSOLES; i++) con2fb_map[i] = -1; fbcon_start(); console_unlock(); } #ifdef MODULE void __exit fb_console_exit(void) { #ifdef CONFIG_FRAMEBUFFER_CONSOLE_DEFERRED_TAKEOVER console_lock(); if (deferred_takeover) dummycon_unregister_output_notifier(&fbcon_output_nb); console_unlock(); cancel_work_sync(&fbcon_deferred_takeover_work); #endif console_lock(); device_destroy(fb_class, MKDEV(0, 0)); do_unregister_con_driver(&fb_con); console_unlock(); } #endif |
| 18 15 12 3 3 2 2 4 1 1 5 11 11 2 6 6 6 6 6 6 6 6 3 1 2 3 3 6 2 3 3 13 3 2 1 2 2 1 1 1 2 2 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Copyright (C) 2001 Jens Axboe <axboe@suse.de> */ #ifndef __LINUX_BIO_H #define __LINUX_BIO_H #include <linux/mempool.h> /* struct bio, bio_vec and BIO_* flags are defined in blk_types.h */ #include <linux/blk_types.h> #include <linux/uio.h> #define BIO_MAX_VECS 256U #define BIO_MAX_INLINE_VECS UIO_MAXIOV struct queue_limits; static inline unsigned int bio_max_segs(unsigned int nr_segs) { return min(nr_segs, BIO_MAX_VECS); } #define bio_iter_iovec(bio, iter) \ bvec_iter_bvec((bio)->bi_io_vec, (iter)) #define bio_iter_page(bio, iter) \ bvec_iter_page((bio)->bi_io_vec, (iter)) #define bio_iter_len(bio, iter) \ bvec_iter_len((bio)->bi_io_vec, (iter)) #define bio_iter_offset(bio, iter) \ bvec_iter_offset((bio)->bi_io_vec, (iter)) #define bio_page(bio) bio_iter_page((bio), (bio)->bi_iter) #define bio_offset(bio) bio_iter_offset((bio), (bio)->bi_iter) #define bio_iovec(bio) bio_iter_iovec((bio), (bio)->bi_iter) #define bvec_iter_sectors(iter) ((iter).bi_size >> 9) #define bvec_iter_end_sector(iter) ((iter).bi_sector + bvec_iter_sectors((iter))) #define bio_sectors(bio) bvec_iter_sectors((bio)->bi_iter) #define bio_end_sector(bio) bvec_iter_end_sector((bio)->bi_iter) /* * Return the data direction, READ or WRITE. */ #define bio_data_dir(bio) \ (op_is_write(bio_op(bio)) ? WRITE : READ) static inline bool bio_flagged(const struct bio *bio, unsigned int bit) { return bio->bi_flags & (1U << bit); } static inline void bio_set_flag(struct bio *bio, unsigned int bit) { bio->bi_flags |= (1U << bit); } static inline void bio_clear_flag(struct bio *bio, unsigned int bit) { bio->bi_flags &= ~(1U << bit); } /* * Check whether this bio carries any data or not. A NULL bio is allowed. */ static inline bool bio_has_data(struct bio *bio) { if (bio && bio->bi_iter.bi_size && bio_op(bio) != REQ_OP_DISCARD && bio_op(bio) != REQ_OP_SECURE_ERASE && bio_op(bio) != REQ_OP_WRITE_ZEROES) return true; return false; } static inline bool bio_no_advance_iter(const struct bio *bio) { return bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_SECURE_ERASE || bio_op(bio) == REQ_OP_WRITE_ZEROES; } static inline void *bio_data(struct bio *bio) { if (bio_has_data(bio)) return page_address(bio_page(bio)) + bio_offset(bio); return NULL; } static inline bool bio_next_segment(const struct bio *bio, struct bvec_iter_all *iter) { if (iter->idx >= bio->bi_vcnt) return false; bvec_advance(&bio->bi_io_vec[iter->idx], iter); return true; } /* * drivers should _never_ use the all version - the bio may have been split * before it got to the driver and the driver won't own all of it */ #define bio_for_each_segment_all(bvl, bio, iter) \ for (bvl = bvec_init_iter_all(&iter); bio_next_segment((bio), &iter); ) static inline void bio_advance_iter(const struct bio *bio, struct bvec_iter *iter, unsigned int bytes) { iter->bi_sector += bytes >> 9; if (bio_no_advance_iter(bio)) iter->bi_size -= bytes; else bvec_iter_advance(bio->bi_io_vec, iter, bytes); /* TODO: It is reasonable to complete bio with error here. */ } /* @bytes should be less or equal to bvec[i->bi_idx].bv_len */ static inline void bio_advance_iter_single(const struct bio *bio, struct bvec_iter *iter, unsigned int bytes) { iter->bi_sector += bytes >> 9; if (bio_no_advance_iter(bio)) iter->bi_size -= bytes; else bvec_iter_advance_single(bio->bi_io_vec, iter, bytes); } void __bio_advance(struct bio *, unsigned bytes); /** * bio_advance - increment/complete a bio by some number of bytes * @bio: bio to advance * @nbytes: number of bytes to complete * * This updates bi_sector, bi_size and bi_idx; if the number of bytes to * complete doesn't align with a bvec boundary, then bv_len and bv_offset will * be updated on the last bvec as well. * * @bio will then represent the remaining, uncompleted portion of the io. */ static inline void bio_advance(struct bio *bio, unsigned int nbytes) { if (nbytes == bio->bi_iter.bi_size) { bio->bi_iter.bi_size = 0; return; } __bio_advance(bio, nbytes); } #define __bio_for_each_segment(bvl, bio, iter, start) \ for (iter = (start); \ (iter).bi_size && \ ((bvl = bio_iter_iovec((bio), (iter))), 1); \ bio_advance_iter_single((bio), &(iter), (bvl).bv_len)) #define bio_for_each_segment(bvl, bio, iter) \ __bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter) #define __bio_for_each_bvec(bvl, bio, iter, start) \ for (iter = (start); \ (iter).bi_size && \ ((bvl = mp_bvec_iter_bvec((bio)->bi_io_vec, (iter))), 1); \ bio_advance_iter_single((bio), &(iter), (bvl).bv_len)) /* iterate over multi-page bvec */ #define bio_for_each_bvec(bvl, bio, iter) \ __bio_for_each_bvec(bvl, bio, iter, (bio)->bi_iter) /* * Iterate over all multi-page bvecs. Drivers shouldn't use this version for the * same reasons as bio_for_each_segment_all(). */ #define bio_for_each_bvec_all(bvl, bio, i) \ for (i = 0, bvl = bio_first_bvec_all(bio); \ i < (bio)->bi_vcnt; i++, bvl++) #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len) static inline unsigned bio_segments(struct bio *bio) { unsigned segs = 0; struct bio_vec bv; struct bvec_iter iter; /* * We special case discard/write same/write zeroes, because they * interpret bi_size differently: */ switch (bio_op(bio)) { case REQ_OP_DISCARD: case REQ_OP_SECURE_ERASE: case REQ_OP_WRITE_ZEROES: return 0; default: break; } bio_for_each_segment(bv, bio, iter) segs++; return segs; } /* * get a reference to a bio, so it won't disappear. the intended use is * something like: * * bio_get(bio); * submit_bio(rw, bio); * if (bio->bi_flags ...) * do_something * bio_put(bio); * * without the bio_get(), it could potentially complete I/O before submit_bio * returns. and then bio would be freed memory when if (bio->bi_flags ...) * runs */ static inline void bio_get(struct bio *bio) { bio->bi_flags |= (1 << BIO_REFFED); smp_mb__before_atomic(); atomic_inc(&bio->__bi_cnt); } static inline void bio_cnt_set(struct bio *bio, unsigned int count) { if (count != 1) { bio->bi_flags |= (1 << BIO_REFFED); smp_mb(); } atomic_set(&bio->__bi_cnt, count); } static inline struct bio_vec *bio_first_bvec_all(struct bio *bio) { WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)); return bio->bi_io_vec; } static inline struct page *bio_first_page_all(struct bio *bio) { return bio_first_bvec_all(bio)->bv_page; } static inline struct folio *bio_first_folio_all(struct bio *bio) { return page_folio(bio_first_page_all(bio)); } static inline struct bio_vec *bio_last_bvec_all(struct bio *bio) { WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)); return &bio->bi_io_vec[bio->bi_vcnt - 1]; } /** * struct folio_iter - State for iterating all folios in a bio. * @folio: The current folio we're iterating. NULL after the last folio. * @offset: The byte offset within the current folio. * @length: The number of bytes in this iteration (will not cross folio * boundary). */ struct folio_iter { struct folio *folio; size_t offset; size_t length; /* private: for use by the iterator */ struct folio *_next; size_t _seg_count; int _i; }; static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio, int i) { struct bio_vec *bvec = bio_first_bvec_all(bio) + i; if (unlikely(i >= bio->bi_vcnt)) { fi->folio = NULL; return; } fi->folio = page_folio(bvec->bv_page); fi->offset = bvec->bv_offset + PAGE_SIZE * folio_page_idx(fi->folio, bvec->bv_page); fi->_seg_count = bvec->bv_len; fi->length = min(folio_size(fi->folio) - fi->offset, fi->_seg_count); fi->_next = folio_next(fi->folio); fi->_i = i; } static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio) { fi->_seg_count -= fi->length; if (fi->_seg_count) { fi->folio = fi->_next; fi->offset = 0; fi->length = min(folio_size(fi->folio), fi->_seg_count); fi->_next = folio_next(fi->folio); } else { bio_first_folio(fi, bio, fi->_i + 1); } } /** * bio_for_each_folio_all - Iterate over each folio in a bio. * @fi: struct folio_iter which is updated for each folio. * @bio: struct bio to iterate over. */ #define bio_for_each_folio_all(fi, bio) \ for (bio_first_folio(&fi, bio, 0); fi.folio; bio_next_folio(&fi, bio)) void bio_trim(struct bio *bio, sector_t offset, sector_t size); extern struct bio *bio_split(struct bio *bio, int sectors, gfp_t gfp, struct bio_set *bs); int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, unsigned *segs, unsigned max_bytes, unsigned len_align); u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next, u8 gaps_bit); /** * bio_next_split - get next @sectors from a bio, splitting if necessary * @bio: bio to split * @sectors: number of sectors to split from the front of @bio * @gfp: gfp mask * @bs: bio set to allocate from * * Return: a bio representing the next @sectors of @bio - if the bio is smaller * than @sectors, returns the original bio unchanged. */ static inline struct bio *bio_next_split(struct bio *bio, int sectors, gfp_t gfp, struct bio_set *bs) { if (sectors >= bio_sectors(bio)) return bio; return bio_split(bio, sectors, gfp, bs); } enum { BIOSET_NEED_BVECS = BIT(0), BIOSET_NEED_RESCUER = BIT(1), BIOSET_PERCPU_CACHE = BIT(2), }; extern int bioset_init(struct bio_set *, unsigned int, unsigned int, int flags); extern void bioset_exit(struct bio_set *); extern int biovec_init_pool(mempool_t *pool, int pool_entries); struct bio *bio_alloc_bioset(struct block_device *bdev, unsigned short nr_vecs, blk_opf_t opf, gfp_t gfp_mask, struct bio_set *bs); struct bio *bio_kmalloc(unsigned short nr_vecs, gfp_t gfp_mask); extern void bio_put(struct bio *); struct bio *bio_alloc_clone(struct block_device *bdev, struct bio *bio_src, gfp_t gfp, struct bio_set *bs); int bio_init_clone(struct block_device *bdev, struct bio *bio, struct bio *bio_src, gfp_t gfp); extern struct bio_set fs_bio_set; static inline struct bio *bio_alloc(struct block_device *bdev, unsigned short nr_vecs, blk_opf_t opf, gfp_t gfp_mask) { return bio_alloc_bioset(bdev, nr_vecs, opf, gfp_mask, &fs_bio_set); } void submit_bio(struct bio *bio); extern void bio_endio(struct bio *); static inline void bio_io_error(struct bio *bio) { bio->bi_status = BLK_STS_IOERR; bio_endio(bio); } static inline void bio_wouldblock_error(struct bio *bio) { bio_set_flag(bio, BIO_QUIET); bio->bi_status = BLK_STS_AGAIN; bio_endio(bio); } /* * Calculate number of bvec segments that should be allocated to fit data * pointed by @iter. If @iter is backed by bvec it's going to be reused * instead of allocating a new one. */ static inline int bio_iov_vecs_to_alloc(struct iov_iter *iter, int max_segs) { if (iov_iter_is_bvec(iter)) return 0; return iov_iter_npages(iter, max_segs); } struct request_queue; void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table, unsigned short max_vecs, blk_opf_t opf); static inline void bio_init_inline(struct bio *bio, struct block_device *bdev, unsigned short max_vecs, blk_opf_t opf) { bio_init(bio, bdev, bio_inline_vecs(bio), max_vecs, opf); } extern void bio_uninit(struct bio *); void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf); void bio_chain(struct bio *, struct bio *); int __must_check bio_add_page(struct bio *bio, struct page *page, unsigned len, unsigned off); bool __must_check bio_add_folio(struct bio *bio, struct folio *folio, size_t len, size_t off); void __bio_add_page(struct bio *bio, struct page *page, unsigned int len, unsigned int off); void bio_add_folio_nofail(struct bio *bio, struct folio *folio, size_t len, size_t off); void bio_add_virt_nofail(struct bio *bio, void *vaddr, unsigned len); /** * bio_add_max_vecs - number of bio_vecs needed to add data to a bio * @kaddr: kernel virtual address to add * @len: length in bytes to add * * Calculate how many bio_vecs need to be allocated to add the kernel virtual * address range in [@kaddr:@len] in the worse case. */ static inline unsigned int bio_add_max_vecs(void *kaddr, unsigned int len) { if (is_vmalloc_addr(kaddr)) return DIV_ROUND_UP(offset_in_page(kaddr) + len, PAGE_SIZE); return 1; } unsigned int bio_add_vmalloc_chunk(struct bio *bio, void *vaddr, unsigned len); bool bio_add_vmalloc(struct bio *bio, void *vaddr, unsigned int len); int submit_bio_wait(struct bio *bio); int bdev_rw_virt(struct block_device *bdev, sector_t sector, void *data, size_t len, enum req_op op); int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter, unsigned len_align_mask); void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter); void __bio_release_pages(struct bio *bio, bool mark_dirty); extern void bio_set_pages_dirty(struct bio *bio); extern void bio_check_pages_dirty(struct bio *bio); extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter, struct bio *src, struct bvec_iter *src_iter); extern void bio_copy_data(struct bio *dst, struct bio *src); extern void bio_free_pages(struct bio *bio); void guard_bio_eod(struct bio *bio); void zero_fill_bio_iter(struct bio *bio, struct bvec_iter iter); static inline void zero_fill_bio(struct bio *bio) { zero_fill_bio_iter(bio, bio->bi_iter); } static inline void bio_release_pages(struct bio *bio, bool mark_dirty) { if (bio_flagged(bio, BIO_PAGE_PINNED)) __bio_release_pages(bio, mark_dirty); } #define bio_dev(bio) \ disk_devt((bio)->bi_bdev->bd_disk) #ifdef CONFIG_BLK_CGROUP void bio_associate_blkg(struct bio *bio); void bio_associate_blkg_from_css(struct bio *bio, struct cgroup_subsys_state *css); void bio_clone_blkg_association(struct bio *dst, struct bio *src); void blkcg_punt_bio_submit(struct bio *bio); #else /* CONFIG_BLK_CGROUP */ static inline void bio_associate_blkg(struct bio *bio) { } static inline void bio_associate_blkg_from_css(struct bio *bio, struct cgroup_subsys_state *css) { } static inline void bio_clone_blkg_association(struct bio *dst, struct bio *src) { } static inline void blkcg_punt_bio_submit(struct bio *bio) { submit_bio(bio); } #endif /* CONFIG_BLK_CGROUP */ static inline void bio_set_dev(struct bio *bio, struct block_device *bdev) { bio_clear_flag(bio, BIO_REMAPPED); if (bio->bi_bdev != bdev) bio_clear_flag(bio, BIO_BPS_THROTTLED); bio->bi_bdev = bdev; bio_associate_blkg(bio); } /* * BIO list management for use by remapping drivers (e.g. DM or MD) and loop. * * A bio_list anchors a singly-linked list of bios chained through the bi_next * member of the bio. The bio_list also caches the last list member to allow * fast access to the tail. */ struct bio_list { struct bio *head; struct bio *tail; }; static inline int bio_list_empty(const struct bio_list *bl) { return bl->head == NULL; } static inline void bio_list_init(struct bio_list *bl) { bl->head = bl->tail = NULL; } #define BIO_EMPTY_LIST { NULL, NULL } #define bio_list_for_each(bio, bl) \ for (bio = (bl)->head; bio; bio = bio->bi_next) static inline unsigned bio_list_size(const struct bio_list *bl) { unsigned sz = 0; struct bio *bio; bio_list_for_each(bio, bl) sz++; return sz; } static inline void bio_list_add(struct bio_list *bl, struct bio *bio) { bio->bi_next = NULL; if (bl->tail) bl->tail->bi_next = bio; else bl->head = bio; bl->tail = bio; } static inline void bio_list_add_head(struct bio_list *bl, struct bio *bio) { bio->bi_next = bl->head; bl->head = bio; if (!bl->tail) bl->tail = bio; } static inline void bio_list_merge(struct bio_list *bl, struct bio_list *bl2) { if (!bl2->head) return; if (bl->tail) bl->tail->bi_next = bl2->head; else bl->head = bl2->head; bl->tail = bl2->tail; } static inline void bio_list_merge_init(struct bio_list *bl, struct bio_list *bl2) { bio_list_merge(bl, bl2); bio_list_init(bl2); } static inline void bio_list_merge_head(struct bio_list *bl, struct bio_list *bl2) { if (!bl2->head) return; if (bl->head) bl2->tail->bi_next = bl->head; else bl->tail = bl2->tail; bl->head = bl2->head; } static inline struct bio *bio_list_peek(struct bio_list *bl) { return bl->head; } static inline struct bio *bio_list_pop(struct bio_list *bl) { struct bio *bio = bl->head; if (bio) { bl->head = bl->head->bi_next; if (!bl->head) bl->tail = NULL; bio->bi_next = NULL; } return bio; } static inline struct bio *bio_list_get(struct bio_list *bl) { struct bio *bio = bl->head; bl->head = bl->tail = NULL; return bio; } /* * Increment chain count for the bio. Make sure the CHAIN flag update * is visible before the raised count. */ static inline void bio_inc_remaining(struct bio *bio) { bio_set_flag(bio, BIO_CHAIN); smp_mb__before_atomic(); atomic_inc(&bio->__bi_remaining); } /* * bio_set is used to allow other portions of the IO system to * allocate their own private memory pools for bio and iovec structures. * These memory pools in turn all allocate from the bio_slab * and the bvec_slabs[]. */ #define BIO_POOL_SIZE 2 struct bio_set { struct kmem_cache *bio_slab; unsigned int front_pad; /* * per-cpu bio alloc cache */ struct bio_alloc_cache __percpu *cache; mempool_t bio_pool; mempool_t bvec_pool; unsigned int back_pad; /* * Deadlock avoidance for stacking block drivers: see comments in * bio_alloc_bioset() for details */ spinlock_t rescue_lock; struct bio_list rescue_list; struct work_struct rescue_work; struct workqueue_struct *rescue_workqueue; /* * Hot un-plug notifier for the per-cpu cache, if used */ struct hlist_node cpuhp_dead; }; static inline bool bioset_initialized(struct bio_set *bs) { return bs->bio_slab != NULL; } /* * Mark a bio as polled. Note that for async polled IO, the caller must * expect -EWOULDBLOCK if we cannot allocate a request (or other resources). * We cannot block waiting for requests on polled IO, as those completions * must be found by the caller. This is different than IRQ driven IO, where * it's safe to wait for IO to complete. */ static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb) { bio->bi_opf |= REQ_POLLED; if (kiocb->ki_flags & IOCB_NOWAIT) bio->bi_opf |= REQ_NOWAIT; } static inline void bio_clear_polled(struct bio *bio) { bio->bi_opf &= ~REQ_POLLED; } /** * bio_is_zone_append - is this a zone append bio? * @bio: bio to check * * Check if @bio is a zone append operation. Core block layer code and end_io * handlers must use this instead of an open coded REQ_OP_ZONE_APPEND check * because the block layer can rewrite REQ_OP_ZONE_APPEND to REQ_OP_WRITE if * it is not natively supported. */ static inline bool bio_is_zone_append(struct bio *bio) { if (!IS_ENABLED(CONFIG_BLK_DEV_ZONED)) return false; return bio_op(bio) == REQ_OP_ZONE_APPEND || bio_flagged(bio, BIO_EMULATES_ZONE_APPEND); } struct bio *blk_next_bio(struct bio *bio, struct block_device *bdev, unsigned int nr_pages, blk_opf_t opf, gfp_t gfp); struct bio *bio_chain_and_submit(struct bio *prev, struct bio *new); struct bio *blk_alloc_discard_bio(struct block_device *bdev, sector_t *sector, sector_t *nr_sects, gfp_t gfp_mask); #endif /* __LINUX_BIO_H */ |
| 3119 13 1081 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 | /* * include/linux/topology.h * * Written by: Matthew Dobson, IBM Corporation * * Copyright (C) 2002, IBM Corp. * * All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, but * WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or * NON INFRINGEMENT. See the GNU General Public License for more * details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. * * Send feedback to <colpatch@us.ibm.com> */ #ifndef _LINUX_TOPOLOGY_H #define _LINUX_TOPOLOGY_H #include <linux/arch_topology.h> #include <linux/cpumask.h> #include <linux/nodemask.h> #include <linux/bitops.h> #include <linux/mmzone.h> #include <linux/smp.h> #include <linux/percpu.h> #include <asm/topology.h> #ifndef nr_cpus_node #define nr_cpus_node(node) cpumask_weight(cpumask_of_node(node)) #endif int arch_update_cpu_topology(void); /* Conform to ACPI 2.0 SLIT distance definitions */ #define LOCAL_DISTANCE 10 #define REMOTE_DISTANCE 20 #define DISTANCE_BITS 8 #ifndef node_distance #define node_distance(from,to) ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE) #endif #ifndef RECLAIM_DISTANCE /* * If the distance between nodes in a system is larger than RECLAIM_DISTANCE * (in whatever arch specific measurement units returned by node_distance()) * and node_reclaim_mode is enabled then the VM will only call node_reclaim() * on nodes within this distance. */ #define RECLAIM_DISTANCE 30 #endif /* * The following tunable allows platforms to override the default node * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are * sufficiently fast that the default value actually hurts * performance. * * AMD EPYC machines use this because even though the 2-hop distance * is 32 (3.2x slower than a local memory access) performance actually * *improves* if allowed to reclaim memory and load balance tasks * between NUMA nodes 2-hops apart. */ extern int __read_mostly node_reclaim_distance; #ifndef PENALTY_FOR_NODE_WITH_CPUS #define PENALTY_FOR_NODE_WITH_CPUS (1) #endif #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID DECLARE_PER_CPU(int, numa_node); #ifndef numa_node_id /* Returns the number of the current Node. */ static inline int numa_node_id(void) { return raw_cpu_read(numa_node); } #endif #ifndef cpu_to_node static inline int cpu_to_node(int cpu) { return per_cpu(numa_node, cpu); } #endif #ifndef set_numa_node static inline void set_numa_node(int node) { this_cpu_write(numa_node, node); } #endif #ifndef set_cpu_numa_node static inline void set_cpu_numa_node(int cpu, int node) { per_cpu(numa_node, cpu) = node; } #endif #else /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */ /* Returns the number of the current Node. */ #ifndef numa_node_id static inline int numa_node_id(void) { return cpu_to_node(raw_smp_processor_id()); } #endif #endif /* [!]CONFIG_USE_PERCPU_NUMA_NODE_ID */ #ifdef CONFIG_HAVE_MEMORYLESS_NODES /* * N.B., Do NOT reference the '_numa_mem_' per cpu variable directly. * It will not be defined when CONFIG_HAVE_MEMORYLESS_NODES is not defined. * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem(). */ DECLARE_PER_CPU(int, _numa_mem_); #ifndef set_numa_mem static inline void set_numa_mem(int node) { this_cpu_write(_numa_mem_, node); } #endif #ifndef numa_mem_id /* Returns the number of the nearest Node with memory */ static inline int numa_mem_id(void) { return raw_cpu_read(_numa_mem_); } #endif #ifndef cpu_to_mem static inline int cpu_to_mem(int cpu) { return per_cpu(_numa_mem_, cpu); } #endif #ifndef set_cpu_numa_mem static inline void set_cpu_numa_mem(int cpu, int node) { per_cpu(_numa_mem_, cpu) = node; } #endif #else /* !CONFIG_HAVE_MEMORYLESS_NODES */ #ifndef numa_mem_id /* Returns the number of the nearest Node with memory */ static inline int numa_mem_id(void) { return numa_node_id(); } #endif #ifndef cpu_to_mem static inline int cpu_to_mem(int cpu) { return cpu_to_node(cpu); } #endif #endif /* [!]CONFIG_HAVE_MEMORYLESS_NODES */ #if defined(topology_die_id) && defined(topology_die_cpumask) #define TOPOLOGY_DIE_SYSFS #endif #if defined(topology_cluster_id) && defined(topology_cluster_cpumask) #define TOPOLOGY_CLUSTER_SYSFS #endif #if defined(topology_book_id) && defined(topology_book_cpumask) #define TOPOLOGY_BOOK_SYSFS #endif #if defined(topology_drawer_id) && defined(topology_drawer_cpumask) #define TOPOLOGY_DRAWER_SYSFS #endif #ifndef topology_physical_package_id #define topology_physical_package_id(cpu) ((void)(cpu), -1) #endif #ifndef topology_die_id #define topology_die_id(cpu) ((void)(cpu), -1) #endif #ifndef topology_cluster_id #define topology_cluster_id(cpu) ((void)(cpu), -1) #endif #ifndef topology_core_id #define topology_core_id(cpu) ((void)(cpu), 0) #endif #ifndef topology_book_id #define topology_book_id(cpu) ((void)(cpu), -1) #endif #ifndef topology_drawer_id #define topology_drawer_id(cpu) ((void)(cpu), -1) #endif #ifndef topology_ppin #define topology_ppin(cpu) ((void)(cpu), 0ull) #endif #ifndef topology_sibling_cpumask #define topology_sibling_cpumask(cpu) cpumask_of(cpu) #endif #ifndef topology_core_cpumask #define topology_core_cpumask(cpu) cpumask_of(cpu) #endif #ifndef topology_cluster_cpumask #define topology_cluster_cpumask(cpu) cpumask_of(cpu) #endif #ifndef topology_die_cpumask #define topology_die_cpumask(cpu) cpumask_of(cpu) #endif #ifndef topology_book_cpumask #define topology_book_cpumask(cpu) cpumask_of(cpu) #endif #ifndef topology_drawer_cpumask #define topology_drawer_cpumask(cpu) cpumask_of(cpu) #endif #if defined(CONFIG_SCHED_SMT) && !defined(cpu_smt_mask) static inline const struct cpumask *cpu_smt_mask(int cpu) { return topology_sibling_cpumask(cpu); } #endif #ifndef topology_is_primary_thread static inline bool topology_is_primary_thread(unsigned int cpu) { /* * When disabling SMT, the primary thread of the SMT will remain * enabled/active. Architectures that have a special primary thread * (e.g. x86) need to override this function. Otherwise the first * thread in the SMT can be made the primary thread. * * The sibling cpumask of an offline CPU always contains the CPU * itself on architectures using the implementation of * CONFIG_GENERIC_ARCH_TOPOLOGY for building their topology. * Other architectures not using CONFIG_GENERIC_ARCH_TOPOLOGY for * building their topology have to check whether to use this default * implementation or to override it. */ return cpu == cpumask_first(topology_sibling_cpumask(cpu)); } #define topology_is_primary_thread topology_is_primary_thread #endif static inline const struct cpumask *cpu_node_mask(int cpu) { return cpumask_of_node(cpu_to_node(cpu)); } #ifdef CONFIG_NUMA int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node); extern const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops); #else static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node) { return cpumask_nth_and(cpu, cpus, cpu_online_mask); } static inline const struct cpumask * sched_numa_hop_mask(unsigned int node, unsigned int hops) { return ERR_PTR(-EOPNOTSUPP); } #endif /* CONFIG_NUMA */ /** * for_each_node_numadist() - iterate over nodes in increasing distance * order, starting from a given node * @node: the iteration variable and the starting node. * @unvisited: a nodemask to keep track of the unvisited nodes. * * This macro iterates over NUMA node IDs in increasing distance from the * starting @node and yields MAX_NUMNODES when all the nodes have been * visited. * * Note that by the time the loop completes, the @unvisited nodemask will * be fully cleared, unless the loop exits early. * * The difference between for_each_node() and for_each_node_numadist() is * that the former allows to iterate over nodes in numerical order, whereas * the latter iterates over nodes in increasing order of distance. * * This complexity of this iterator is O(N^2), where N represents the * number of nodes, as each iteration involves scanning all nodes to * find the one with the shortest distance. * * Requires rcu_lock to be held. */ #define for_each_node_numadist(node, unvisited) \ for (int __start = (node), \ (node) = nearest_node_nodemask((__start), &(unvisited)); \ (node) < MAX_NUMNODES; \ node_clear((node), (unvisited)), \ (node) = nearest_node_nodemask((__start), &(unvisited))) /** * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA distance * from a given node. * @mask: the iteration variable. * @node: the NUMA node to start the search from. * * Requires rcu_lock to be held. * * Yields cpu_online_mask for @node == NUMA_NO_NODE. */ #define for_each_numa_hop_mask(mask, node) \ for (unsigned int __hops = 0; \ mask = (node != NUMA_NO_NODE || __hops) ? \ sched_numa_hop_mask(node, __hops) : \ cpu_online_mask, \ !IS_ERR_OR_NULL(mask); \ __hops++) DECLARE_PER_CPU(unsigned long, cpu_scale); static inline unsigned long topology_get_cpu_scale(int cpu) { return per_cpu(cpu_scale, cpu); } void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity); #endif /* _LINUX_TOPOLOGY_H */ |
| 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 | /* SPDX-License-Identifier: GPL-2.0 */ #undef TRACE_SYSTEM #define TRACE_SYSTEM mctp #if !defined(_TRACE_MCTP_H) || defined(TRACE_HEADER_MULTI_READ) #define _TRACE_MCTP_H #include <linux/tracepoint.h> #ifndef __TRACE_MCTP_ENUMS #define __TRACE_MCTP_ENUMS enum { MCTP_TRACE_KEY_TIMEOUT, MCTP_TRACE_KEY_REPLIED, MCTP_TRACE_KEY_INVALIDATED, MCTP_TRACE_KEY_CLOSED, MCTP_TRACE_KEY_DROPPED, }; #endif /* __TRACE_MCTP_ENUMS */ TRACE_DEFINE_ENUM(MCTP_TRACE_KEY_TIMEOUT); TRACE_DEFINE_ENUM(MCTP_TRACE_KEY_REPLIED); TRACE_DEFINE_ENUM(MCTP_TRACE_KEY_INVALIDATED); TRACE_DEFINE_ENUM(MCTP_TRACE_KEY_CLOSED); TRACE_DEFINE_ENUM(MCTP_TRACE_KEY_DROPPED); TRACE_EVENT(mctp_key_acquire, TP_PROTO(const struct mctp_sk_key *key), TP_ARGS(key), TP_STRUCT__entry( __field(__u8, paddr) __field(__u8, laddr) __field(__u8, tag) ), TP_fast_assign( __entry->paddr = key->peer_addr; __entry->laddr = key->local_addr; __entry->tag = key->tag; ), TP_printk("local %d, peer %d, tag %1x", __entry->laddr, __entry->paddr, __entry->tag ) ); TRACE_EVENT(mctp_key_release, TP_PROTO(const struct mctp_sk_key *key, int reason), TP_ARGS(key, reason), TP_STRUCT__entry( __field(__u8, paddr) __field(__u8, laddr) __field(__u8, tag) __field(int, reason) ), TP_fast_assign( __entry->paddr = key->peer_addr; __entry->laddr = key->local_addr; __entry->tag = key->tag; __entry->reason = reason; ), TP_printk("local %d, peer %d, tag %1x %s", __entry->laddr, __entry->paddr, __entry->tag, __print_symbolic(__entry->reason, { MCTP_TRACE_KEY_TIMEOUT, "timeout" }, { MCTP_TRACE_KEY_REPLIED, "replied" }, { MCTP_TRACE_KEY_INVALIDATED, "invalidated" }, { MCTP_TRACE_KEY_CLOSED, "closed" }, { MCTP_TRACE_KEY_DROPPED, "dropped" }) ) ); #endif #include <trace/define_trace.h> |
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Management Component Transport Protocol (MCTP) * * Copyright (c) 2021 Code Construct * Copyright (c) 2021 Google */ #ifndef __NET_MCTP_H #define __NET_MCTP_H #include <linux/bits.h> #include <linux/mctp.h> #include <linux/netdevice.h> #include <net/net_namespace.h> #include <net/sock.h> /* MCTP packet definitions */ struct mctp_hdr { u8 ver; u8 dest; u8 src; u8 flags_seq_tag; }; #define MCTP_VER_MIN 1 #define MCTP_VER_MAX 1 /* Definitions for flags_seq_tag field */ #define MCTP_HDR_FLAG_SOM BIT(7) #define MCTP_HDR_FLAG_EOM BIT(6) #define MCTP_HDR_FLAG_TO BIT(3) #define MCTP_HDR_FLAGS GENMASK(5, 3) #define MCTP_HDR_SEQ_SHIFT 4 #define MCTP_HDR_SEQ_MASK GENMASK(1, 0) #define MCTP_HDR_TAG_SHIFT 0 #define MCTP_HDR_TAG_MASK GENMASK(2, 0) #define MCTP_INITIAL_DEFAULT_NET 1 static inline bool mctp_address_unicast(mctp_eid_t eid) { return eid >= 8 && eid < 255; } static inline bool mctp_address_broadcast(mctp_eid_t eid) { return eid == 255; } static inline bool mctp_address_null(mctp_eid_t eid) { return eid == 0; } static inline bool mctp_address_matches(mctp_eid_t match, mctp_eid_t eid) { return match == eid || match == MCTP_ADDR_ANY; } static inline struct mctp_hdr *mctp_hdr(struct sk_buff *skb) { return (struct mctp_hdr *)skb_network_header(skb); } /* socket implementation */ struct mctp_sock { struct sock sk; /* bind() params */ unsigned int bind_net; mctp_eid_t bind_local_addr; mctp_eid_t bind_peer_addr; unsigned int bind_peer_net; bool bind_peer_set; __u8 bind_type; /* sendmsg()/recvmsg() uses struct sockaddr_mctp_ext */ bool addr_ext; /* list of mctp_sk_key, for incoming tag lookup. updates protected * by sk->net->keys_lock */ struct hlist_head keys; /* mechanism for expiring allocated keys; will release an allocated * tag, and any netdev state for a request/response pairing */ struct timer_list key_expiry; }; /* Key for matching incoming packets to sockets or reassembly contexts. * Packets are matched on (peer EID, local EID, tag). * * Lifetime / locking requirements: * * - individual key data (ie, the struct itself) is protected by key->lock; * changes must be made with that lock held. * * - the lookup fields: peer_addr, local_addr and tag are set before the * key is added to lookup lists, and never updated. * * - A ref to the key must be held (throuh key->refs) if a pointer to the * key is to be accessed after key->lock is released. * * - a mctp_sk_key contains a reference to a struct sock; this is valid * for the life of the key. On sock destruction (through unhash), the key is * removed from lists (see below), and marked invalid. * * - these mctp_sk_keys appear on two lists: * 1) the struct mctp_sock->keys list * 2) the struct netns_mctp->keys list * * presences on these lists requires a (single) refcount to be held; both * lists are updated as a single operation. * * Updates and lookups in either list are performed under the * netns_mctp->keys lock. Lookup functions will need to lock the key and * take a reference before unlocking the keys_lock. Consequently, the list's * keys_lock *cannot* be acquired with the individual key->lock held. * * - a key may have a sk_buff attached as part of an in-progress message * reassembly (->reasm_head). The reasm data is protected by the individual * key->lock. * * - there are two destruction paths for a mctp_sk_key: * * - through socket unhash (see mctp_sk_unhash). This performs the list * removal under keys_lock. * * - where a key is established to receive a reply message: after receiving * the (complete) reply, or during reassembly errors. Here, we clean up * the reassembly context (marking reasm_dead, to prevent another from * starting), and remove the socket from the netns & socket lists. * * - through an expiry timeout, on a per-socket timer */ struct mctp_sk_key { unsigned int net; mctp_eid_t peer_addr; mctp_eid_t local_addr; /* MCTP_ADDR_ANY for local owned tags */ __u8 tag; /* incoming tag match; invert TO for local */ /* we hold a ref to sk when set */ struct sock *sk; /* routing lookup list */ struct hlist_node hlist; /* per-socket list */ struct hlist_node sklist; /* lock protects against concurrent updates to the reassembly and * expiry data below. */ spinlock_t lock; /* Keys are referenced during the output path, which may sleep */ refcount_t refs; /* incoming fragment reassembly context */ struct sk_buff *reasm_head; struct sk_buff **reasm_tailp; bool reasm_dead; u8 last_seq; /* key validity */ bool valid; /* expiry timeout; valid (above) cleared on expiry */ unsigned long expiry; /* free to use for device flow state tracking. Initialised to * zero on initial key creation */ unsigned long dev_flow_state; struct mctp_dev *dev; /* a tag allocated with SIOCMCTPALLOCTAG ioctl will not expire * automatically on timeout or response, instead SIOCMCTPDROPTAG * is used. */ bool manual_alloc; }; struct mctp_skb_cb { unsigned int magic; unsigned int net; /* fields below provide extended addressing for ingress to recvmsg() */ int ifindex; unsigned char halen; unsigned char haddr[MAX_ADDR_LEN]; }; /* skb control-block accessors with a little extra debugging for initial * development. * * TODO: remove checks & mctp_skb_cb->magic; replace callers of __mctp_cb * with mctp_cb(). * * __mctp_cb() is only for the initial ingress code; we should see ->magic set * at all times after this. */ static inline struct mctp_skb_cb *__mctp_cb(struct sk_buff *skb) { struct mctp_skb_cb *cb = (void *)skb->cb; cb->magic = 0x4d435450; return cb; } static inline struct mctp_skb_cb *mctp_cb(struct sk_buff *skb) { struct mctp_skb_cb *cb = (void *)skb->cb; BUILD_BUG_ON(sizeof(struct mctp_skb_cb) > sizeof(skb->cb)); WARN_ON(cb->magic != 0x4d435450); return cb; } /* If CONFIG_MCTP_FLOWS, we may add one of these as a SKB extension, * indicating the flow to the device driver. */ struct mctp_flow { struct mctp_sk_key *key; }; struct mctp_dst; /* Route definition. * * These are held in the pernet->mctp.routes list, with RCU protection for * removed routes. We hold a reference to the netdev; routes need to be * dropped on NETDEV_UNREGISTER events. * * Updates to the route table are performed under rtnl; all reads under RCU, * so routes cannot be referenced over a RCU grace period. */ struct mctp_route { mctp_eid_t min, max; unsigned char type; unsigned int mtu; enum { MCTP_ROUTE_DIRECT, MCTP_ROUTE_GATEWAY, } dst_type; union { struct mctp_dev *dev; struct mctp_fq_addr gateway; }; int (*output)(struct mctp_dst *dst, struct sk_buff *skb); struct list_head list; refcount_t refs; struct rcu_head rcu; }; /* Route lookup result: dst. Represents the results of a routing decision, * but is only held over the individual routing operation. * * Will typically be stored on the caller stack, and must be released after * usage. */ struct mctp_dst { struct mctp_dev *dev; unsigned int mtu; mctp_eid_t nexthop; /* set for direct addressing */ unsigned char halen; unsigned char haddr[MAX_ADDR_LEN]; int (*output)(struct mctp_dst *dst, struct sk_buff *skb); }; int mctp_dst_from_extaddr(struct mctp_dst *dst, struct net *net, int ifindex, unsigned char halen, const unsigned char *haddr); /* route interfaces */ int mctp_route_lookup(struct net *net, unsigned int dnet, mctp_eid_t daddr, struct mctp_dst *dst); void mctp_dst_release(struct mctp_dst *dst); /* always takes ownership of skb */ int mctp_local_output(struct sock *sk, struct mctp_dst *dst, struct sk_buff *skb, mctp_eid_t daddr, u8 req_tag); void mctp_key_unref(struct mctp_sk_key *key); struct mctp_sk_key *mctp_alloc_local_tag(struct mctp_sock *msk, unsigned int netid, mctp_eid_t local, mctp_eid_t peer, bool manual, u8 *tagp); /* routing <--> device interface */ unsigned int mctp_default_net(struct net *net); int mctp_default_net_set(struct net *net, unsigned int index); int mctp_route_add_local(struct mctp_dev *mdev, mctp_eid_t addr); int mctp_route_remove_local(struct mctp_dev *mdev, mctp_eid_t addr); void mctp_route_remove_dev(struct mctp_dev *mdev); /* neighbour definitions */ enum mctp_neigh_source { MCTP_NEIGH_STATIC, MCTP_NEIGH_DISCOVER, }; struct mctp_neigh { struct mctp_dev *dev; mctp_eid_t eid; enum mctp_neigh_source source; unsigned char ha[MAX_ADDR_LEN]; struct list_head list; struct rcu_head rcu; }; int mctp_neigh_init(void); void mctp_neigh_exit(void); // ret_hwaddr may be NULL, otherwise must have space for MAX_ADDR_LEN int mctp_neigh_lookup(struct mctp_dev *dev, mctp_eid_t eid, void *ret_hwaddr); void mctp_neigh_remove_dev(struct mctp_dev *mdev); int mctp_routes_init(void); void mctp_routes_exit(void); int mctp_device_init(void); void mctp_device_exit(void); /* MCTP IDs and Codes from DMTF specification * "DSP0239 Management Component Transport Protocol (MCTP) IDs and Codes" * https://www.dmtf.org/sites/default/files/standards/documents/DSP0239_1.11.1.pdf */ enum mctp_phys_binding { MCTP_PHYS_BINDING_UNSPEC = 0x00, MCTP_PHYS_BINDING_SMBUS = 0x01, MCTP_PHYS_BINDING_PCIE_VDM = 0x02, MCTP_PHYS_BINDING_USB = 0x03, MCTP_PHYS_BINDING_KCS = 0x04, MCTP_PHYS_BINDING_SERIAL = 0x05, MCTP_PHYS_BINDING_I3C = 0x06, MCTP_PHYS_BINDING_MMBI = 0x07, MCTP_PHYS_BINDING_PCC = 0x08, MCTP_PHYS_BINDING_UCIE = 0x09, MCTP_PHYS_BINDING_VENDOR = 0xFF, }; #endif /* __NET_MCTP_H */ |
| 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 | // SPDX-License-Identifier: GPL-2.0-only /* * Copyright (C) 2023 Thomas Weißschuh <linux@weissschuh.net> */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/completion.h> #include <linux/device.h> #include <linux/hwmon.h> #include <linux/module.h> #include <linux/mutex.h> #include <linux/types.h> #include <linux/usb.h> #define DRIVER_NAME "powerz" #define POWERZ_EP_CMD_OUT 0x01 #define POWERZ_EP_DATA_IN 0x81 struct powerz_sensor_data { u8 _unknown_1[8]; __le32 V_bus; __le32 I_bus; __le32 V_bus_avg; __le32 I_bus_avg; u8 _unknown_2[8]; u8 temp[2]; __le16 V_cc1; __le16 V_cc2; __le16 V_dp; __le16 V_dm; __le16 V_dd; u8 _unknown_3[4]; } __packed; struct powerz_priv { char transfer_buffer[64]; /* first member to satisfy DMA alignment */ struct mutex mutex; struct completion completion; struct urb *urb; int status; }; static const struct hwmon_channel_info *const powerz_info[] = { HWMON_CHANNEL_INFO(in, HWMON_I_INPUT | HWMON_I_LABEL | HWMON_I_AVERAGE, HWMON_I_INPUT | HWMON_I_LABEL, HWMON_I_INPUT | HWMON_I_LABEL, HWMON_I_INPUT | HWMON_I_LABEL, HWMON_I_INPUT | HWMON_I_LABEL, HWMON_I_INPUT | HWMON_I_LABEL), HWMON_CHANNEL_INFO(curr, HWMON_C_INPUT | HWMON_C_LABEL | HWMON_C_AVERAGE), HWMON_CHANNEL_INFO(temp, HWMON_T_INPUT | HWMON_T_LABEL), NULL }; static int powerz_read_string(struct device *dev, enum hwmon_sensor_types type, u32 attr, int channel, const char **str) { if (type == hwmon_curr && attr == hwmon_curr_label) { *str = "IBUS"; } else if (type == hwmon_in && attr == hwmon_in_label) { if (channel == 0) *str = "VBUS"; else if (channel == 1) *str = "VCC1"; else if (channel == 2) *str = "VCC2"; else if (channel == 3) *str = "VDP"; else if (channel == 4) *str = "VDM"; else if (channel == 5) *str = "VDD"; else return -EOPNOTSUPP; } else if (type == hwmon_temp && attr == hwmon_temp_label) { *str = "TEMP"; } else { return -EOPNOTSUPP; } return 0; } static void powerz_usb_data_complete(struct urb *urb) { struct powerz_priv *priv = urb->context; complete(&priv->completion); } static void powerz_usb_cmd_complete(struct urb *urb) { struct powerz_priv *priv = urb->context; usb_fill_bulk_urb(urb, urb->dev, usb_rcvbulkpipe(urb->dev, POWERZ_EP_DATA_IN), priv->transfer_buffer, sizeof(priv->transfer_buffer), powerz_usb_data_complete, priv); priv->status = usb_submit_urb(urb, GFP_ATOMIC); if (priv->status) complete(&priv->completion); } static int powerz_read_data(struct usb_device *udev, struct powerz_priv *priv) { int ret; priv->status = -ETIMEDOUT; reinit_completion(&priv->completion); priv->transfer_buffer[0] = 0x0c; priv->transfer_buffer[1] = 0x00; priv->transfer_buffer[2] = 0x02; priv->transfer_buffer[3] = 0x00; usb_fill_bulk_urb(priv->urb, udev, usb_sndbulkpipe(udev, POWERZ_EP_CMD_OUT), priv->transfer_buffer, 4, powerz_usb_cmd_complete, priv); ret = usb_submit_urb(priv->urb, GFP_KERNEL); if (ret) return ret; if (!wait_for_completion_interruptible_timeout (&priv->completion, msecs_to_jiffies(5))) { usb_kill_urb(priv->urb); return -EIO; } if (priv->urb->actual_length < sizeof(struct powerz_sensor_data)) return -EIO; return priv->status; } static int powerz_read(struct device *dev, enum hwmon_sensor_types type, u32 attr, int channel, long *val) { struct usb_interface *intf = to_usb_interface(dev->parent); struct usb_device *udev = interface_to_usbdev(intf); struct powerz_priv *priv = usb_get_intfdata(intf); struct powerz_sensor_data *data; int ret; if (!priv) return -EIO; /* disconnected */ mutex_lock(&priv->mutex); ret = powerz_read_data(udev, priv); if (ret) goto out; data = (struct powerz_sensor_data *)priv->transfer_buffer; if (type == hwmon_curr) { if (attr == hwmon_curr_input) *val = ((s32)le32_to_cpu(data->I_bus)) / 1000; else if (attr == hwmon_curr_average) *val = ((s32)le32_to_cpu(data->I_bus_avg)) / 1000; else ret = -EOPNOTSUPP; } else if (type == hwmon_in) { if (attr == hwmon_in_input) { if (channel == 0) *val = le32_to_cpu(data->V_bus) / 1000; else if (channel == 1) *val = le16_to_cpu(data->V_cc1) / 10; else if (channel == 2) *val = le16_to_cpu(data->V_cc2) / 10; else if (channel == 3) *val = le16_to_cpu(data->V_dp) / 10; else if (channel == 4) *val = le16_to_cpu(data->V_dm) / 10; else if (channel == 5) *val = le16_to_cpu(data->V_dd) / 10; else ret = -EOPNOTSUPP; } else if (attr == hwmon_in_average && channel == 0) { *val = le32_to_cpu(data->V_bus_avg) / 1000; } else { ret = -EOPNOTSUPP; } } else if (type == hwmon_temp && attr == hwmon_temp_input) { *val = data->temp[1] * 2000 + data->temp[0] * 1000 / 128; } else { ret = -EOPNOTSUPP; } out: mutex_unlock(&priv->mutex); return ret; } static const struct hwmon_ops powerz_hwmon_ops = { .visible = 0444, .read = powerz_read, .read_string = powerz_read_string, }; static const struct hwmon_chip_info powerz_chip_info = { .ops = &powerz_hwmon_ops, .info = powerz_info, }; static int powerz_probe(struct usb_interface *intf, const struct usb_device_id *id) { struct powerz_priv *priv; struct device *hwmon_dev; struct device *parent; parent = &intf->dev; priv = devm_kzalloc(parent, sizeof(*priv), GFP_KERNEL); if (!priv) return -ENOMEM; priv->urb = usb_alloc_urb(0, GFP_KERNEL); if (!priv->urb) return -ENOMEM; mutex_init(&priv->mutex); init_completion(&priv->completion); hwmon_dev = devm_hwmon_device_register_with_info(parent, DRIVER_NAME, priv, &powerz_chip_info, NULL); if (IS_ERR(hwmon_dev)) { usb_free_urb(priv->urb); return PTR_ERR(hwmon_dev); } usb_set_intfdata(intf, priv); return 0; } static void powerz_disconnect(struct usb_interface *intf) { struct powerz_priv *priv = usb_get_intfdata(intf); mutex_lock(&priv->mutex); usb_kill_urb(priv->urb); usb_free_urb(priv->urb); mutex_unlock(&priv->mutex); } static const struct usb_device_id powerz_id_table[] = { { USB_DEVICE_INTERFACE_NUMBER(0x5FC9, 0x0061, 0x00) }, /* ChargerLAB POWER-Z KM002C */ { USB_DEVICE_INTERFACE_NUMBER(0x5FC9, 0x0063, 0x00) }, /* ChargerLAB POWER-Z KM003C */ { } }; MODULE_DEVICE_TABLE(usb, powerz_id_table); static struct usb_driver powerz_driver = { .name = DRIVER_NAME, .id_table = powerz_id_table, .probe = powerz_probe, .disconnect = powerz_disconnect, }; module_usb_driver(powerz_driver); MODULE_LICENSE("GPL"); MODULE_AUTHOR("Thomas Weißschuh <linux@weissschuh.net>"); MODULE_DESCRIPTION("ChargerLAB POWER-Z USB-C tester"); |
| 249 109 249 1 247 5 247 247 245 246 247 246 246 247 246 246 245 247 247 248 247 247 251 247 245 252 252 251 252 249 246 246 5 246 248 248 244 252 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 | // SPDX-License-Identifier: GPL-2.0-only /* * Copyright (C) 2008 IBM Corporation * Author: Mimi Zohar <zohar@us.ibm.com> * * ima_policy.c * - initialize default measure policy rules */ #include <linux/init.h> #include <linux/list.h> #include <linux/kernel_read_file.h> #include <linux/fs.h> #include <linux/security.h> #include <linux/magic.h> #include <linux/parser.h> #include <linux/slab.h> #include <linux/rculist.h> #include <linux/seq_file.h> #include <linux/ima.h> #include "ima.h" /* flags definitions */ #define IMA_FUNC 0x0001 #define IMA_MASK 0x0002 #define IMA_FSMAGIC 0x0004 #define IMA_UID 0x0008 #define IMA_FOWNER 0x0010 #define IMA_FSUUID 0x0020 #define IMA_INMASK 0x0040 #define IMA_EUID 0x0080 #define IMA_PCR 0x0100 #define IMA_FSNAME 0x0200 #define IMA_KEYRINGS 0x0400 #define IMA_LABEL 0x0800 #define IMA_VALIDATE_ALGOS 0x1000 #define IMA_GID 0x2000 #define IMA_EGID 0x4000 #define IMA_FGROUP 0x8000 #define IMA_FS_SUBTYPE 0x10000 #define UNKNOWN 0 #define MEASURE 0x0001 /* same as IMA_MEASURE */ #define DONT_MEASURE 0x0002 #define APPRAISE 0x0004 /* same as IMA_APPRAISE */ #define DONT_APPRAISE 0x0008 #define AUDIT 0x0040 #define DONT_AUDIT 0x0080 #define HASH 0x0100 #define DONT_HASH 0x0200 #define INVALID_PCR(a) (((a) < 0) || \ (a) >= (sizeof_field(struct ima_iint_cache, measured_pcrs) * 8)) int ima_policy_flag; static int temp_ima_appraise; static int build_ima_appraise __ro_after_init; atomic_t ima_setxattr_allowed_hash_algorithms; #define MAX_LSM_RULES 6 enum lsm_rule_types { LSM_OBJ_USER, LSM_OBJ_ROLE, LSM_OBJ_TYPE, LSM_SUBJ_USER, LSM_SUBJ_ROLE, LSM_SUBJ_TYPE }; enum policy_types { ORIGINAL_TCB = 1, DEFAULT_TCB }; enum policy_rule_list { IMA_DEFAULT_POLICY = 1, IMA_CUSTOM_POLICY }; struct ima_rule_opt_list { size_t count; char *items[] __counted_by(count); }; /* * These comparators are needed nowhere outside of ima so just define them here. * This pattern should hopefully never be needed outside of ima. */ static inline bool vfsuid_gt_kuid(vfsuid_t vfsuid, kuid_t kuid) { return __vfsuid_val(vfsuid) > __kuid_val(kuid); } static inline bool vfsgid_gt_kgid(vfsgid_t vfsgid, kgid_t kgid) { return __vfsgid_val(vfsgid) > __kgid_val(kgid); } static inline bool vfsuid_lt_kuid(vfsuid_t vfsuid, kuid_t kuid) { return __vfsuid_val(vfsuid) < __kuid_val(kuid); } static inline bool vfsgid_lt_kgid(vfsgid_t vfsgid, kgid_t kgid) { return __vfsgid_val(vfsgid) < __kgid_val(kgid); } struct ima_rule_entry { struct list_head list; int action; unsigned int flags; enum ima_hooks func; int mask; unsigned long fsmagic; uuid_t fsuuid; kuid_t uid; kgid_t gid; kuid_t fowner; kgid_t fgroup; bool (*uid_op)(kuid_t cred_uid, kuid_t rule_uid); /* Handlers for operators */ bool (*gid_op)(kgid_t cred_gid, kgid_t rule_gid); bool (*fowner_op)(vfsuid_t vfsuid, kuid_t rule_uid); /* vfsuid_eq_kuid(), vfsuid_gt_kuid(), vfsuid_lt_kuid() */ bool (*fgroup_op)(vfsgid_t vfsgid, kgid_t rule_gid); /* vfsgid_eq_kgid(), vfsgid_gt_kgid(), vfsgid_lt_kgid() */ int pcr; unsigned int allowed_algos; /* bitfield of allowed hash algorithms */ struct { void *rule; /* LSM file metadata specific */ char *args_p; /* audit value */ int type; /* audit type */ } lsm[MAX_LSM_RULES]; char *fsname; char *fs_subtype; struct ima_rule_opt_list *keyrings; /* Measure keys added to these keyrings */ struct ima_rule_opt_list *label; /* Measure data grouped under this label */ struct ima_template_desc *template; }; /* * sanity check in case the kernels gains more hash algorithms that can * fit in an unsigned int */ static_assert( 8 * sizeof(unsigned int) >= HASH_ALGO__LAST, "The bitfield allowed_algos in ima_rule_entry is too small to contain all the supported hash algorithms, consider using a bigger type"); /* * Without LSM specific knowledge, the default policy can only be * written in terms of .action, .func, .mask, .fsmagic, .uid, .gid, * .fowner, and .fgroup */ /* * The minimum rule set to allow for full TCB coverage. Measures all files * opened or mmap for exec and everything read by root. Dangerous because * normal users can easily run the machine out of memory simply building * and running executables. */ static struct ima_rule_entry dont_measure_rules[] __ro_after_init = { {.action = DONT_MEASURE, .fsmagic = PROC_SUPER_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_MEASURE, .fsmagic = SYSFS_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_MEASURE, .fsmagic = DEBUGFS_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_MEASURE, .fsmagic = TMPFS_MAGIC, .func = FILE_CHECK, .flags = IMA_FSMAGIC | IMA_FUNC}, {.action = DONT_MEASURE, .fsmagic = DEVPTS_SUPER_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_MEASURE, .fsmagic = BINFMTFS_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_MEASURE, .fsmagic = SECURITYFS_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_MEASURE, .fsmagic = SELINUX_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_MEASURE, .fsmagic = SMACK_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_MEASURE, .fsmagic = CGROUP_SUPER_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_MEASURE, .fsmagic = CGROUP2_SUPER_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_MEASURE, .fsmagic = NSFS_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_MEASURE, .fsmagic = EFIVARFS_MAGIC, .flags = IMA_FSMAGIC} }; static struct ima_rule_entry original_measurement_rules[] __ro_after_init = { {.action = MEASURE, .func = MMAP_CHECK, .mask = MAY_EXEC, .flags = IMA_FUNC | IMA_MASK}, {.action = MEASURE, .func = BPRM_CHECK, .mask = MAY_EXEC, .flags = IMA_FUNC | IMA_MASK}, {.action = MEASURE, .func = FILE_CHECK, .mask = MAY_READ, .uid = GLOBAL_ROOT_UID, .uid_op = &uid_eq, .flags = IMA_FUNC | IMA_MASK | IMA_UID}, {.action = MEASURE, .func = MODULE_CHECK, .flags = IMA_FUNC}, {.action = MEASURE, .func = FIRMWARE_CHECK, .flags = IMA_FUNC}, }; static struct ima_rule_entry default_measurement_rules[] __ro_after_init = { {.action = MEASURE, .func = MMAP_CHECK, .mask = MAY_EXEC, .flags = IMA_FUNC | IMA_MASK}, {.action = MEASURE, .func = BPRM_CHECK, .mask = MAY_EXEC, .flags = IMA_FUNC | IMA_MASK}, {.action = MEASURE, .func = FILE_CHECK, .mask = MAY_READ, .uid = GLOBAL_ROOT_UID, .uid_op = &uid_eq, .flags = IMA_FUNC | IMA_INMASK | IMA_EUID}, {.action = MEASURE, .func = FILE_CHECK, .mask = MAY_READ, .uid = GLOBAL_ROOT_UID, .uid_op = &uid_eq, .flags = IMA_FUNC | IMA_INMASK | IMA_UID}, {.action = MEASURE, .func = MODULE_CHECK, .flags = IMA_FUNC}, {.action = MEASURE, .func = FIRMWARE_CHECK, .flags = IMA_FUNC}, {.action = MEASURE, .func = POLICY_CHECK, .flags = IMA_FUNC}, }; static struct ima_rule_entry default_appraise_rules[] __ro_after_init = { {.action = DONT_APPRAISE, .fsmagic = PROC_SUPER_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_APPRAISE, .fsmagic = SYSFS_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_APPRAISE, .fsmagic = DEBUGFS_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_APPRAISE, .fsmagic = TMPFS_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_APPRAISE, .fsmagic = RAMFS_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_APPRAISE, .fsmagic = DEVPTS_SUPER_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_APPRAISE, .fsmagic = BINFMTFS_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_APPRAISE, .fsmagic = SECURITYFS_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_APPRAISE, .fsmagic = SELINUX_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_APPRAISE, .fsmagic = SMACK_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_APPRAISE, .fsmagic = NSFS_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_APPRAISE, .fsmagic = EFIVARFS_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_APPRAISE, .fsmagic = CGROUP_SUPER_MAGIC, .flags = IMA_FSMAGIC}, {.action = DONT_APPRAISE, .fsmagic = CGROUP2_SUPER_MAGIC, .flags = IMA_FSMAGIC}, #ifdef CONFIG_IMA_WRITE_POLICY {.action = APPRAISE, .func = POLICY_CHECK, .flags = IMA_FUNC | IMA_DIGSIG_REQUIRED}, #endif #ifndef CONFIG_IMA_APPRAISE_SIGNED_INIT {.action = APPRAISE, .fowner = GLOBAL_ROOT_UID, .fowner_op = &vfsuid_eq_kuid, .flags = IMA_FOWNER}, #else /* force signature */ {.action = APPRAISE, .fowner = GLOBAL_ROOT_UID, .fowner_op = &vfsuid_eq_kuid, .flags = IMA_FOWNER | IMA_DIGSIG_REQUIRED}, #endif }; static struct ima_rule_entry build_appraise_rules[] __ro_after_init = { #ifdef CONFIG_IMA_APPRAISE_REQUIRE_MODULE_SIGS {.action = APPRAISE, .func = MODULE_CHECK, .flags = IMA_FUNC | IMA_DIGSIG_REQUIRED}, #endif #ifdef CONFIG_IMA_APPRAISE_REQUIRE_FIRMWARE_SIGS {.action = APPRAISE, .func = FIRMWARE_CHECK, .flags = IMA_FUNC | IMA_DIGSIG_REQUIRED}, #endif #ifdef CONFIG_IMA_APPRAISE_REQUIRE_KEXEC_SIGS {.action = APPRAISE, .func = KEXEC_KERNEL_CHECK, .flags = IMA_FUNC | IMA_DIGSIG_REQUIRED}, #endif #ifdef CONFIG_IMA_APPRAISE_REQUIRE_POLICY_SIGS {.action = APPRAISE, .func = POLICY_CHECK, .flags = IMA_FUNC | IMA_DIGSIG_REQUIRED}, #endif }; static struct ima_rule_entry secure_boot_rules[] __ro_after_init = { {.action = APPRAISE, .func = MODULE_CHECK, .flags = IMA_FUNC | IMA_DIGSIG_REQUIRED | IMA_MODSIG_ALLOWED | IMA_CHECK_BLACKLIST}, {.action = APPRAISE, .func = FIRMWARE_CHECK, .flags = IMA_FUNC | IMA_DIGSIG_REQUIRED}, {.action = APPRAISE, .func = KEXEC_KERNEL_CHECK, .flags = IMA_FUNC | IMA_DIGSIG_REQUIRED}, {.action = APPRAISE, .func = POLICY_CHECK, .flags = IMA_FUNC | IMA_DIGSIG_REQUIRED}, }; static struct ima_rule_entry critical_data_rules[] __ro_after_init = { {.action = MEASURE, .func = CRITICAL_DATA, .flags = IMA_FUNC}, }; /* An array of architecture specific rules */ static struct ima_rule_entry *arch_policy_entry __ro_after_init; static LIST_HEAD(ima_default_rules); static LIST_HEAD(ima_policy_rules); static LIST_HEAD(ima_temp_rules); static struct list_head __rcu *ima_rules = (struct list_head __rcu *)(&ima_default_rules); static int ima_policy __initdata; static int __init default_measure_policy_setup(char *str) { if (ima_policy) return 1; ima_policy = ORIGINAL_TCB; return 1; } __setup("ima_tcb", default_measure_policy_setup); static bool ima_use_appraise_tcb __initdata; static bool ima_use_secure_boot __initdata; static bool ima_use_critical_data __initdata; static bool ima_fail_unverifiable_sigs __ro_after_init; static int __init policy_setup(char *str) { char *p; while ((p = strsep(&str, " |\n")) != NULL) { if (*p == ' ') continue; if ((strcmp(p, "tcb") == 0) && !ima_policy) ima_policy = DEFAULT_TCB; else if (strcmp(p, "appraise_tcb") == 0) ima_use_appraise_tcb = true; else if (strcmp(p, "secure_boot") == 0) ima_use_secure_boot = true; else if (strcmp(p, "critical_data") == 0) ima_use_critical_data = true; else if (strcmp(p, "fail_securely") == 0) ima_fail_unverifiable_sigs = true; else pr_err("policy \"%s\" not found", p); } return 1; } __setup("ima_policy=", policy_setup); static int __init default_appraise_policy_setup(char *str) { ima_use_appraise_tcb = true; return 1; } __setup("ima_appraise_tcb", default_appraise_policy_setup); static struct ima_rule_opt_list *ima_alloc_rule_opt_list(const substring_t *src) { struct ima_rule_opt_list *opt_list; size_t count = 0; char *src_copy; char *cur, *next; size_t i; src_copy = match_strdup(src); if (!src_copy) return ERR_PTR(-ENOMEM); next = src_copy; while ((cur = strsep(&next, "|"))) { /* Don't accept an empty list item */ if (!(*cur)) { kfree(src_copy); return ERR_PTR(-EINVAL); } count++; } /* Don't accept an empty list */ if (!count) { kfree(src_copy); return ERR_PTR(-EINVAL); } opt_list = kzalloc(struct_size(opt_list, items, count), GFP_KERNEL); if (!opt_list) { kfree(src_copy); return ERR_PTR(-ENOMEM); } opt_list->count = count; /* * strsep() has already replaced all instances of '|' with '\0', * leaving a byte sequence of NUL-terminated strings. Reference each * string with the array of items. * * IMPORTANT: Ownership of the allocated buffer is transferred from * src_copy to the first element in the items array. To free the * buffer, kfree() must only be called on the first element of the * array. */ for (i = 0, cur = src_copy; i < count; i++) { opt_list->items[i] = cur; cur = strchr(cur, '\0') + 1; } return opt_list; } static void ima_free_rule_opt_list(struct ima_rule_opt_list *opt_list) { if (!opt_list) return; if (opt_list->count) { kfree(opt_list->items[0]); opt_list->count = 0; } kfree(opt_list); } static void ima_lsm_free_rule(struct ima_rule_entry *entry) { int i; for (i = 0; i < MAX_LSM_RULES; i++) { ima_filter_rule_free(entry->lsm[i].rule); kfree(entry->lsm[i].args_p); } } static void ima_free_rule(struct ima_rule_entry *entry) { if (!entry) return; /* * entry->template->fields may be allocated in ima_parse_rule() but that * reference is owned by the corresponding ima_template_desc element in * the defined_templates list and cannot be freed here */ kfree(entry->fsname); kfree(entry->fs_subtype); ima_free_rule_opt_list(entry->keyrings); ima_lsm_free_rule(entry); kfree(entry); } static struct ima_rule_entry *ima_lsm_copy_rule(struct ima_rule_entry *entry, gfp_t gfp) { struct ima_rule_entry *nentry; int i; /* * Immutable elements are copied over as pointers and data; only * lsm rules can change */ nentry = kmemdup(entry, sizeof(*nentry), gfp); if (!nentry) return NULL; memset(nentry->lsm, 0, sizeof_field(struct ima_rule_entry, lsm)); for (i = 0; i < MAX_LSM_RULES; i++) { if (!entry->lsm[i].args_p) continue; nentry->lsm[i].type = entry->lsm[i].type; nentry->lsm[i].args_p = entry->lsm[i].args_p; ima_filter_rule_init(nentry->lsm[i].type, Audit_equal, nentry->lsm[i].args_p, &nentry->lsm[i].rule, gfp); if (!nentry->lsm[i].rule) pr_warn("rule for LSM \'%s\' is undefined\n", nentry->lsm[i].args_p); } return nentry; } static int ima_lsm_update_rule(struct ima_rule_entry *entry) { int i; struct ima_rule_entry *nentry; nentry = ima_lsm_copy_rule(entry, GFP_KERNEL); if (!nentry) return -ENOMEM; list_replace_rcu(&entry->list, &nentry->list); synchronize_rcu(); /* * ima_lsm_copy_rule() shallow copied all references, except for the * LSM references, from entry to nentry so we only want to free the LSM * references and the entry itself. All other memory references will now * be owned by nentry. */ for (i = 0; i < MAX_LSM_RULES; i++) ima_filter_rule_free(entry->lsm[i].rule); kfree(entry); return 0; } static bool ima_rule_contains_lsm_cond(struct ima_rule_entry *entry) { int i; for (i = 0; i < MAX_LSM_RULES; i++) if (entry->lsm[i].args_p) return true; return false; } /* * The LSM policy can be reloaded, leaving the IMA LSM based rules referring * to the old, stale LSM policy. Update the IMA LSM based rules to reflect * the reloaded LSM policy. */ static void ima_lsm_update_rules(void) { struct ima_rule_entry *entry, *e; int result; list_for_each_entry_safe(entry, e, &ima_policy_rules, list) { if (!ima_rule_contains_lsm_cond(entry)) continue; result = ima_lsm_update_rule(entry); if (result) { pr_err("lsm rule update error %d\n", result); return; } } } int ima_lsm_policy_change(struct notifier_block *nb, unsigned long event, void *lsm_data) { if (event != LSM_POLICY_CHANGE) return NOTIFY_DONE; ima_lsm_update_rules(); return NOTIFY_OK; } /** * ima_match_rule_data - determine whether func_data matches the policy rule * @rule: a pointer to a rule * @func_data: data to match against the measure rule data * @cred: a pointer to a credentials structure for user validation * * Returns true if func_data matches one in the rule, false otherwise. */ static bool ima_match_rule_data(struct ima_rule_entry *rule, const char *func_data, const struct cred *cred) { const struct ima_rule_opt_list *opt_list = NULL; bool matched = false; size_t i; if ((rule->flags & IMA_UID) && !rule->uid_op(cred->uid, rule->uid)) return false; switch (rule->func) { case KEY_CHECK: if (!rule->keyrings) return true; opt_list = rule->keyrings; break; case CRITICAL_DATA: if (!rule->label) return true; opt_list = rule->label; break; default: return false; } if (!func_data) return false; for (i = 0; i < opt_list->count; i++) { if (!strcmp(opt_list->items[i], func_data)) { matched = true; break; } } return matched; } /** * ima_match_rules - determine whether an inode matches the policy rule. * @rule: a pointer to a rule * @idmap: idmap of the mount the inode was found from * @inode: a pointer to an inode * @cred: a pointer to a credentials structure for user validation * @prop: LSM properties of the task to be validated * @func: LIM hook identifier * @mask: requested action (MAY_READ | MAY_WRITE | MAY_APPEND | MAY_EXEC) * @func_data: func specific data, may be NULL * * Returns true on rule match, false on failure. */ static bool ima_match_rules(struct ima_rule_entry *rule, struct mnt_idmap *idmap, struct inode *inode, const struct cred *cred, struct lsm_prop *prop, enum ima_hooks func, int mask, const char *func_data) { int i; bool result = false; struct ima_rule_entry *lsm_rule = rule; bool rule_reinitialized = false; if ((rule->flags & IMA_FUNC) && (rule->func != func && func != POST_SETATTR)) return false; switch (func) { case KEY_CHECK: case CRITICAL_DATA: return ((rule->func == func) && ima_match_rule_data(rule, func_data, cred)); default: break; } if ((rule->flags & IMA_MASK) && (rule->mask != mask && func != POST_SETATTR)) return false; if ((rule->flags & IMA_INMASK) && (!(rule->mask & mask) && func != POST_SETATTR)) return false; if ((rule->flags & IMA_FSMAGIC) && rule->fsmagic != inode->i_sb->s_magic) return false; if ((rule->flags & IMA_FSNAME) && strcmp(rule->fsname, inode->i_sb->s_type->name)) return false; if (rule->flags & IMA_FS_SUBTYPE) { if (!inode->i_sb->s_subtype) return false; if (strcmp(rule->fs_subtype, inode->i_sb->s_subtype)) return false; } if ((rule->flags & IMA_FSUUID) && !uuid_equal(&rule->fsuuid, &inode->i_sb->s_uuid)) return false; if ((rule->flags & IMA_UID) && !rule->uid_op(cred->uid, rule->uid)) return false; if (rule->flags & IMA_EUID) { if (has_capability_noaudit(current, CAP_SETUID)) { if (!rule->uid_op(cred->euid, rule->uid) && !rule->uid_op(cred->suid, rule->uid) && !rule->uid_op(cred->uid, rule->uid)) return false; } else if (!rule->uid_op(cred->euid, rule->uid)) return false; } if ((rule->flags & IMA_GID) && !rule->gid_op(cred->gid, rule->gid)) return false; if (rule->flags & IMA_EGID) { if (has_capability_noaudit(current, CAP_SETGID)) { if (!rule->gid_op(cred->egid, rule->gid) && !rule->gid_op(cred->sgid, rule->gid) && !rule->gid_op(cred->gid, rule->gid)) return false; } else if (!rule->gid_op(cred->egid, rule->gid)) return false; } if ((rule->flags & IMA_FOWNER) && !rule->fowner_op(i_uid_into_vfsuid(idmap, inode), rule->fowner)) return false; if ((rule->flags & IMA_FGROUP) && !rule->fgroup_op(i_gid_into_vfsgid(idmap, inode), rule->fgroup)) return false; for (i = 0; i < MAX_LSM_RULES; i++) { int rc = 0; struct lsm_prop inode_prop = { }; if (!lsm_rule->lsm[i].rule) { if (!lsm_rule->lsm[i].args_p) continue; else return false; } retry: switch (i) { case LSM_OBJ_USER: case LSM_OBJ_ROLE: case LSM_OBJ_TYPE: security_inode_getlsmprop(inode, &inode_prop); rc = ima_filter_rule_match(&inode_prop, lsm_rule->lsm[i].type, Audit_equal, lsm_rule->lsm[i].rule); break; case LSM_SUBJ_USER: case LSM_SUBJ_ROLE: case LSM_SUBJ_TYPE: rc = ima_filter_rule_match(prop, lsm_rule->lsm[i].type, Audit_equal, lsm_rule->lsm[i].rule); break; default: break; } if (rc == -ESTALE && !rule_reinitialized) { lsm_rule = ima_lsm_copy_rule(rule, GFP_ATOMIC); if (lsm_rule) { rule_reinitialized = true; goto retry; } } if (rc <= 0) { result = false; goto out; } } result = true; out: if (rule_reinitialized) { for (i = 0; i < MAX_LSM_RULES; i++) ima_filter_rule_free(lsm_rule->lsm[i].rule); kfree(lsm_rule); } return result; } /* * In addition to knowing that we need to appraise the file in general, * we need to differentiate between calling hooks, for hook specific rules. */ static int get_subaction(struct ima_rule_entry *rule, enum ima_hooks func) { if (!(rule->flags & IMA_FUNC)) return IMA_FILE_APPRAISE; switch (func) { case MMAP_CHECK: case MMAP_CHECK_REQPROT: return IMA_MMAP_APPRAISE; case BPRM_CHECK: return IMA_BPRM_APPRAISE; case CREDS_CHECK: return IMA_CREDS_APPRAISE; case FILE_CHECK: case POST_SETATTR: return IMA_FILE_APPRAISE; case MODULE_CHECK ... MAX_CHECK - 1: default: return IMA_READ_APPRAISE; } } /** * ima_match_policy - decision based on LSM and other conditions * @idmap: idmap of the mount the inode was found from * @inode: pointer to an inode for which the policy decision is being made * @cred: pointer to a credentials structure for which the policy decision is * being made * @prop: LSM properties of the task to be validated * @func: IMA hook identifier * @mask: requested action (MAY_READ | MAY_WRITE | MAY_APPEND | MAY_EXEC) * @flags: IMA actions to consider (e.g. IMA_MEASURE | IMA_APPRAISE) * @pcr: set the pcr to extend * @template_desc: the template that should be used for this rule * @func_data: func specific data, may be NULL * @allowed_algos: allowlist of hash algorithms for the IMA xattr * * Measure decision based on func/mask/fsmagic and LSM(subj/obj/type) * conditions. * * Since the IMA policy may be updated multiple times we need to lock the * list when walking it. Reads are many orders of magnitude more numerous * than writes so ima_match_policy() is classical RCU candidate. */ int ima_match_policy(struct mnt_idmap *idmap, struct inode *inode, const struct cred *cred, struct lsm_prop *prop, enum ima_hooks func, int mask, int flags, int *pcr, struct ima_template_desc **template_desc, const char *func_data, unsigned int *allowed_algos) { struct ima_rule_entry *entry; int action = 0, actmask = flags | (flags << 1); struct list_head *ima_rules_tmp; if (template_desc && !*template_desc) *template_desc = ima_template_desc_current(); rcu_read_lock(); ima_rules_tmp = rcu_dereference(ima_rules); list_for_each_entry_rcu(entry, ima_rules_tmp, list) { if (!(entry->action & actmask)) continue; if (!ima_match_rules(entry, idmap, inode, cred, prop, func, mask, func_data)) continue; action |= entry->flags & IMA_NONACTION_FLAGS; action |= entry->action & IMA_DO_MASK; if (entry->action & IMA_APPRAISE) { action |= get_subaction(entry, func); action &= ~IMA_HASH; if (ima_fail_unverifiable_sigs) action |= IMA_FAIL_UNVERIFIABLE_SIGS; if (allowed_algos && entry->flags & IMA_VALIDATE_ALGOS) *allowed_algos = entry->allowed_algos; } if (entry->action & IMA_DO_MASK) actmask &= ~(entry->action | entry->action << 1); else actmask &= ~(entry->action | entry->action >> 1); if ((pcr) && (entry->flags & IMA_PCR)) *pcr = entry->pcr; if (template_desc && entry->template) *template_desc = entry->template; if (!actmask) break; } rcu_read_unlock(); return action; } /** * ima_update_policy_flags() - Update global IMA variables * * Update ima_policy_flag and ima_setxattr_allowed_hash_algorithms * based on the currently loaded policy. * * With ima_policy_flag, the decision to short circuit out of a function * or not call the function in the first place can be made earlier. * * With ima_setxattr_allowed_hash_algorithms, the policy can restrict the * set of hash algorithms accepted when updating the security.ima xattr of * a file. * * Context: called after a policy update and at system initialization. */ void ima_update_policy_flags(void) { struct ima_rule_entry *entry; int new_policy_flag = 0; struct list_head *ima_rules_tmp; rcu_read_lock(); ima_rules_tmp = rcu_dereference(ima_rules); list_for_each_entry_rcu(entry, ima_rules_tmp, list) { /* * SETXATTR_CHECK rules do not implement a full policy check * because rule checking would probably have an important * performance impact on setxattr(). As a consequence, only one * SETXATTR_CHECK can be active at a given time. * Because we want to preserve that property, we set out to use * atomic_cmpxchg. Either: * - the atomic was non-zero: a setxattr hash policy is * already enforced, we do nothing * - the atomic was zero: no setxattr policy was set, enable * the setxattr hash policy */ if (entry->func == SETXATTR_CHECK) { atomic_cmpxchg(&ima_setxattr_allowed_hash_algorithms, 0, entry->allowed_algos); /* SETXATTR_CHECK doesn't impact ima_policy_flag */ continue; } if (entry->action & IMA_DO_MASK) new_policy_flag |= entry->action; } rcu_read_unlock(); ima_appraise |= (build_ima_appraise | temp_ima_appraise); if (!ima_appraise) new_policy_flag &= ~IMA_APPRAISE; ima_policy_flag = new_policy_flag; } static int ima_appraise_flag(enum ima_hooks func) { if (func == MODULE_CHECK) return IMA_APPRAISE_MODULES; else if (func == FIRMWARE_CHECK) return IMA_APPRAISE_FIRMWARE; else if (func == POLICY_CHECK) return IMA_APPRAISE_POLICY; else if (func == KEXEC_KERNEL_CHECK) return IMA_APPRAISE_KEXEC; return 0; } static void add_rules(struct ima_rule_entry *entries, int count, enum policy_rule_list policy_rule) { int i = 0; for (i = 0; i < count; i++) { struct ima_rule_entry *entry; if (policy_rule & IMA_DEFAULT_POLICY) list_add_tail(&entries[i].list, &ima_default_rules); if (policy_rule & IMA_CUSTOM_POLICY) { entry = kmemdup(&entries[i], sizeof(*entry), GFP_KERNEL); if (!entry) continue; list_add_tail(&entry->list, &ima_policy_rules); } if (entries[i].action == APPRAISE) { if (entries != build_appraise_rules) temp_ima_appraise |= ima_appraise_flag(entries[i].func); else build_ima_appraise |= ima_appraise_flag(entries[i].func); } } } static int ima_parse_rule(char *rule, struct ima_rule_entry *entry); static int __init ima_init_arch_policy(void) { const char * const *arch_rules; const char * const *rules; int arch_entries = 0; int i = 0; arch_rules = arch_get_ima_policy(); if (!arch_rules) return arch_entries; /* Get number of rules */ for (rules = arch_rules; *rules != NULL; rules++) arch_entries++; arch_policy_entry = kcalloc(arch_entries + 1, sizeof(*arch_policy_entry), GFP_KERNEL); if (!arch_policy_entry) return 0; /* Convert each policy string rules to struct ima_rule_entry format */ for (rules = arch_rules, i = 0; *rules != NULL; rules++) { char rule[255]; int result; result = strscpy(rule, *rules, sizeof(rule)); INIT_LIST_HEAD(&arch_policy_entry[i].list); result = ima_parse_rule(rule, &arch_policy_entry[i]); if (result) { pr_warn("Skipping unknown architecture policy rule: %s\n", rule); memset(&arch_policy_entry[i], 0, sizeof(*arch_policy_entry)); continue; } i++; } return i; } /** * ima_init_policy - initialize the default measure rules. * * ima_rules points to either the ima_default_rules or the new ima_policy_rules. */ void __init ima_init_policy(void) { int build_appraise_entries, arch_entries; /* if !ima_policy, we load NO default rules */ if (ima_policy) add_rules(dont_measure_rules, ARRAY_SIZE(dont_measure_rules), IMA_DEFAULT_POLICY); switch (ima_policy) { case ORIGINAL_TCB: add_rules(original_measurement_rules, ARRAY_SIZE(original_measurement_rules), IMA_DEFAULT_POLICY); break; case DEFAULT_TCB: add_rules(default_measurement_rules, ARRAY_SIZE(default_measurement_rules), IMA_DEFAULT_POLICY); break; default: break; } /* * Based on runtime secure boot flags, insert arch specific measurement * and appraise rules requiring file signatures for both the initial * and custom policies, prior to other appraise rules. * (Highest priority) */ arch_entries = ima_init_arch_policy(); if (!arch_entries) pr_info("No architecture policies found\n"); else add_rules(arch_policy_entry, arch_entries, IMA_DEFAULT_POLICY | IMA_CUSTOM_POLICY); /* * Insert the builtin "secure_boot" policy rules requiring file * signatures, prior to other appraise rules. */ if (ima_use_secure_boot) add_rules(secure_boot_rules, ARRAY_SIZE(secure_boot_rules), IMA_DEFAULT_POLICY); /* * Insert the build time appraise rules requiring file signatures * for both the initial and custom policies, prior to other appraise * rules. As the secure boot rules includes all of the build time * rules, include either one or the other set of rules, but not both. */ build_appraise_entries = ARRAY_SIZE(build_appraise_rules); if (build_appraise_entries) { if (ima_use_secure_boot) add_rules(build_appraise_rules, build_appraise_entries, IMA_CUSTOM_POLICY); else add_rules(build_appraise_rules, build_appraise_entries, IMA_DEFAULT_POLICY | IMA_CUSTOM_POLICY); } if (ima_use_appraise_tcb) add_rules(default_appraise_rules, ARRAY_SIZE(default_appraise_rules), IMA_DEFAULT_POLICY); if (ima_use_critical_data) add_rules(critical_data_rules, ARRAY_SIZE(critical_data_rules), IMA_DEFAULT_POLICY); atomic_set(&ima_setxattr_allowed_hash_algorithms, 0); ima_update_policy_flags(); } /* Make sure we have a valid policy, at least containing some rules. */ int ima_check_policy(void) { if (list_empty(&ima_temp_rules)) return -EINVAL; return 0; } /** * ima_update_policy - update default_rules with new measure rules * * Called on file .release to update the default rules with a complete new * policy. What we do here is to splice ima_policy_rules and ima_temp_rules so * they make a queue. The policy may be updated multiple times and this is the * RCU updater. * * Policy rules are never deleted so ima_policy_flag gets zeroed only once when * we switch from the default policy to user defined. */ void ima_update_policy(void) { struct list_head *policy = &ima_policy_rules; list_splice_tail_init_rcu(&ima_temp_rules, policy, synchronize_rcu); if (ima_rules != (struct list_head __rcu *)policy) { ima_policy_flag = 0; rcu_assign_pointer(ima_rules, policy); /* * IMA architecture specific policy rules are specified * as strings and converted to an array of ima_entry_rules * on boot. After loading a custom policy, free the * architecture specific rules stored as an array. */ kfree(arch_policy_entry); } ima_update_policy_flags(); /* Custom IMA policy has been loaded */ ima_process_queued_keys(); } /* Keep the enumeration in sync with the policy_tokens! */ enum policy_opt { Opt_measure, Opt_dont_measure, Opt_appraise, Opt_dont_appraise, Opt_audit, Opt_dont_audit, Opt_hash, Opt_dont_hash, Opt_obj_user, Opt_obj_role, Opt_obj_type, Opt_subj_user, Opt_subj_role, Opt_subj_type, Opt_func, Opt_mask, Opt_fsmagic, Opt_fsname, Opt_fs_subtype, Opt_fsuuid, Opt_uid_eq, Opt_euid_eq, Opt_gid_eq, Opt_egid_eq, Opt_fowner_eq, Opt_fgroup_eq, Opt_uid_gt, Opt_euid_gt, Opt_gid_gt, Opt_egid_gt, Opt_fowner_gt, Opt_fgroup_gt, Opt_uid_lt, Opt_euid_lt, Opt_gid_lt, Opt_egid_lt, Opt_fowner_lt, Opt_fgroup_lt, Opt_digest_type, Opt_appraise_type, Opt_appraise_flag, Opt_appraise_algos, Opt_permit_directio, Opt_pcr, Opt_template, Opt_keyrings, Opt_label, Opt_err }; static const match_table_t policy_tokens = { {Opt_measure, "measure"}, {Opt_dont_measure, "dont_measure"}, {Opt_appraise, "appraise"}, {Opt_dont_appraise, "dont_appraise"}, {Opt_audit, "audit"}, {Opt_dont_audit, "dont_audit"}, {Opt_hash, "hash"}, {Opt_dont_hash, "dont_hash"}, {Opt_obj_user, "obj_user=%s"}, {Opt_obj_role, "obj_role=%s"}, {Opt_obj_type, "obj_type=%s"}, {Opt_subj_user, "subj_user=%s"}, {Opt_subj_role, "subj_role=%s"}, {Opt_subj_type, "subj_type=%s"}, {Opt_func, "func=%s"}, {Opt_mask, "mask=%s"}, {Opt_fsmagic, "fsmagic=%s"}, {Opt_fsname, "fsname=%s"}, {Opt_fs_subtype, "fs_subtype=%s"}, {Opt_fsuuid, "fsuuid=%s"}, {Opt_uid_eq, "uid=%s"}, {Opt_euid_eq, "euid=%s"}, {Opt_gid_eq, "gid=%s"}, {Opt_egid_eq, "egid=%s"}, {Opt_fowner_eq, "fowner=%s"}, {Opt_fgroup_eq, "fgroup=%s"}, {Opt_uid_gt, "uid>%s"}, {Opt_euid_gt, "euid>%s"}, {Opt_gid_gt, "gid>%s"}, {Opt_egid_gt, "egid>%s"}, {Opt_fowner_gt, "fowner>%s"}, {Opt_fgroup_gt, "fgroup>%s"}, {Opt_uid_lt, "uid<%s"}, {Opt_euid_lt, "euid<%s"}, {Opt_gid_lt, "gid<%s"}, {Opt_egid_lt, "egid<%s"}, {Opt_fowner_lt, "fowner<%s"}, {Opt_fgroup_lt, "fgroup<%s"}, {Opt_digest_type, "digest_type=%s"}, {Opt_appraise_type, "appraise_type=%s"}, {Opt_appraise_flag, "appraise_flag=%s"}, {Opt_appraise_algos, "appraise_algos=%s"}, {Opt_permit_directio, "permit_directio"}, {Opt_pcr, "pcr=%s"}, {Opt_template, "template=%s"}, {Opt_keyrings, "keyrings=%s"}, {Opt_label, "label=%s"}, {Opt_err, NULL} }; static int ima_lsm_rule_init(struct ima_rule_entry *entry, substring_t *args, int lsm_rule, int audit_type) { int result; if (entry->lsm[lsm_rule].rule) return -EINVAL; entry->lsm[lsm_rule].args_p = match_strdup(args); if (!entry->lsm[lsm_rule].args_p) return -ENOMEM; entry->lsm[lsm_rule].type = audit_type; result = ima_filter_rule_init(entry->lsm[lsm_rule].type, Audit_equal, entry->lsm[lsm_rule].args_p, &entry->lsm[lsm_rule].rule, GFP_KERNEL); if (!entry->lsm[lsm_rule].rule) { pr_warn("rule for LSM \'%s\' is undefined\n", entry->lsm[lsm_rule].args_p); if (ima_rules == (struct list_head __rcu *)(&ima_default_rules)) { kfree(entry->lsm[lsm_rule].args_p); entry->lsm[lsm_rule].args_p = NULL; result = -EINVAL; } else result = 0; } return result; } static void ima_log_string_op(struct audit_buffer *ab, char *key, char *value, enum policy_opt rule_operator) { if (!ab) return; switch (rule_operator) { case Opt_uid_gt: case Opt_euid_gt: case Opt_gid_gt: case Opt_egid_gt: case Opt_fowner_gt: case Opt_fgroup_gt: audit_log_format(ab, "%s>", key); break; case Opt_uid_lt: case Opt_euid_lt: case Opt_gid_lt: case Opt_egid_lt: case Opt_fowner_lt: case Opt_fgroup_lt: audit_log_format(ab, "%s<", key); break; default: audit_log_format(ab, "%s=", key); } audit_log_format(ab, "%s ", value); } static void ima_log_string(struct audit_buffer *ab, char *key, char *value) { ima_log_string_op(ab, key, value, Opt_err); } /* * Validating the appended signature included in the measurement list requires * the file hash calculated without the appended signature (i.e., the 'd-modsig' * field). Therefore, notify the user if they have the 'modsig' field but not * the 'd-modsig' field in the template. */ static void check_template_modsig(const struct ima_template_desc *template) { #define MSG "template with 'modsig' field also needs 'd-modsig' field\n" bool has_modsig, has_dmodsig; static bool checked; int i; /* We only need to notify the user once. */ if (checked) return; has_modsig = has_dmodsig = false; for (i = 0; i < template->num_fields; i++) { if (!strcmp(template->fields[i]->field_id, "modsig")) has_modsig = true; else if (!strcmp(template->fields[i]->field_id, "d-modsig")) has_dmodsig = true; } if (has_modsig && !has_dmodsig) pr_notice(MSG); checked = true; #undef MSG } /* * Warn if the template does not contain the given field. */ static void check_template_field(const struct ima_template_desc *template, const char *field, const char *msg) { int i; for (i = 0; i < template->num_fields; i++) if (!strcmp(template->fields[i]->field_id, field)) return; pr_notice_once("%s", msg); } static bool ima_validate_rule(struct ima_rule_entry *entry) { /* Ensure that the action is set and is compatible with the flags */ if (entry->action == UNKNOWN) return false; if (entry->action != MEASURE && entry->flags & IMA_PCR) return false; if (entry->action != APPRAISE && entry->flags & (IMA_DIGSIG_REQUIRED | IMA_MODSIG_ALLOWED | IMA_CHECK_BLACKLIST | IMA_VALIDATE_ALGOS)) return false; /* * The IMA_FUNC bit must be set if and only if there's a valid hook * function specified, and vice versa. Enforcing this property allows * for the NONE case below to validate a rule without an explicit hook * function. */ if (((entry->flags & IMA_FUNC) && entry->func == NONE) || (!(entry->flags & IMA_FUNC) && entry->func != NONE)) return false; /* * Ensure that the hook function is compatible with the other * components of the rule */ switch (entry->func) { case NONE: case FILE_CHECK: case MMAP_CHECK: case MMAP_CHECK_REQPROT: case BPRM_CHECK: case CREDS_CHECK: case POST_SETATTR: case FIRMWARE_CHECK: case POLICY_CHECK: if (entry->flags & ~(IMA_FUNC | IMA_MASK | IMA_FSMAGIC | IMA_UID | IMA_FOWNER | IMA_FSUUID | IMA_INMASK | IMA_EUID | IMA_PCR | IMA_FSNAME | IMA_FS_SUBTYPE | IMA_GID | IMA_EGID | IMA_FGROUP | IMA_DIGSIG_REQUIRED | IMA_PERMIT_DIRECTIO | IMA_VALIDATE_ALGOS | IMA_CHECK_BLACKLIST | IMA_VERITY_REQUIRED)) return false; break; case MODULE_CHECK: case KEXEC_KERNEL_CHECK: case KEXEC_INITRAMFS_CHECK: if (entry->flags & ~(IMA_FUNC | IMA_MASK | IMA_FSMAGIC | IMA_UID | IMA_FOWNER | IMA_FSUUID | IMA_INMASK | IMA_EUID | IMA_PCR | IMA_FSNAME | IMA_FS_SUBTYPE | IMA_GID | IMA_EGID | IMA_FGROUP | IMA_DIGSIG_REQUIRED | IMA_PERMIT_DIRECTIO | IMA_MODSIG_ALLOWED | IMA_CHECK_BLACKLIST | IMA_VALIDATE_ALGOS)) return false; break; case KEXEC_CMDLINE: if (entry->action & ~(MEASURE | DONT_MEASURE)) return false; if (entry->flags & ~(IMA_FUNC | IMA_FSMAGIC | IMA_UID | IMA_FOWNER | IMA_FSUUID | IMA_EUID | IMA_PCR | IMA_FSNAME | IMA_FS_SUBTYPE | IMA_GID | IMA_EGID | IMA_FGROUP)) return false; break; case KEY_CHECK: if (entry->action & ~(MEASURE | DONT_MEASURE)) return false; if (entry->flags & ~(IMA_FUNC | IMA_UID | IMA_GID | IMA_PCR | IMA_KEYRINGS)) return false; if (ima_rule_contains_lsm_cond(entry)) return false; break; case CRITICAL_DATA: if (entry->action & ~(MEASURE | DONT_MEASURE)) return false; if (entry->flags & ~(IMA_FUNC | IMA_UID | IMA_GID | IMA_PCR | IMA_LABEL)) return false; if (ima_rule_contains_lsm_cond(entry)) return false; break; case SETXATTR_CHECK: /* any action other than APPRAISE is unsupported */ if (entry->action != APPRAISE) return false; /* SETXATTR_CHECK requires an appraise_algos parameter */ if (!(entry->flags & IMA_VALIDATE_ALGOS)) return false; /* * full policies are not supported, they would have too * much of a performance impact */ if (entry->flags & ~(IMA_FUNC | IMA_VALIDATE_ALGOS)) return false; break; default: return false; } /* Ensure that combinations of flags are compatible with each other */ if (entry->flags & IMA_CHECK_BLACKLIST && !(entry->flags & IMA_DIGSIG_REQUIRED)) return false; /* * Unlike for regular IMA 'appraise' policy rules where security.ima * xattr may contain either a file hash or signature, the security.ima * xattr for fsverity must contain a file signature (sigv3). Ensure * that 'appraise' rules for fsverity require file signatures by * checking the IMA_DIGSIG_REQUIRED flag is set. */ if (entry->action == APPRAISE && (entry->flags & IMA_VERITY_REQUIRED) && !(entry->flags & IMA_DIGSIG_REQUIRED)) return false; return true; } static unsigned int ima_parse_appraise_algos(char *arg) { unsigned int res = 0; int idx; char *token; while ((token = strsep(&arg, ",")) != NULL) { idx = match_string(hash_algo_name, HASH_ALGO__LAST, token); if (idx < 0) { pr_err("unknown hash algorithm \"%s\"", token); return 0; } if (!crypto_has_alg(hash_algo_name[idx], 0, 0)) { pr_err("unavailable hash algorithm \"%s\", check your kernel configuration", token); return 0; } /* Add the hash algorithm to the 'allowed' bitfield */ res |= (1U << idx); } return res; } static int ima_parse_rule(char *rule, struct ima_rule_entry *entry) { struct audit_buffer *ab; char *from; char *p; bool eid_token; /* either euid or egid */ struct ima_template_desc *template_desc; int result = 0; ab = integrity_audit_log_start(audit_context(), GFP_KERNEL, AUDIT_INTEGRITY_POLICY_RULE); entry->uid = INVALID_UID; entry->gid = INVALID_GID; entry->fowner = INVALID_UID; entry->fgroup = INVALID_GID; entry->uid_op = &uid_eq; entry->gid_op = &gid_eq; entry->fowner_op = &vfsuid_eq_kuid; entry->fgroup_op = &vfsgid_eq_kgid; entry->action = UNKNOWN; while ((p = strsep(&rule, " \t")) != NULL) { substring_t args[MAX_OPT_ARGS]; int token; unsigned long lnum; if (result < 0 || *p == '#') /* ignore suffixed comment */ break; if ((*p == '\0') || (*p == ' ') || (*p == '\t')) continue; token = match_token(p, policy_tokens, args); switch (token) { case Opt_measure: ima_log_string(ab, "action", "measure"); if (entry->action != UNKNOWN) result = -EINVAL; entry->action = MEASURE; break; case Opt_dont_measure: ima_log_string(ab, "action", "dont_measure"); if (entry->action != UNKNOWN) result = -EINVAL; entry->action = DONT_MEASURE; break; case Opt_appraise: ima_log_string(ab, "action", "appraise"); if (entry->action != UNKNOWN) result = -EINVAL; entry->action = APPRAISE; break; case Opt_dont_appraise: ima_log_string(ab, "action", "dont_appraise"); if (entry->action != UNKNOWN) result = -EINVAL; entry->action = DONT_APPRAISE; break; case Opt_audit: ima_log_string(ab, "action", "audit"); if (entry->action != UNKNOWN) result = -EINVAL; entry->action = AUDIT; break; case Opt_dont_audit: ima_log_string(ab, "action", "dont_audit"); if (entry->action != UNKNOWN) result = -EINVAL; entry->action = DONT_AUDIT; break; case Opt_hash: ima_log_string(ab, "action", "hash"); if (entry->action != UNKNOWN) result = -EINVAL; entry->action = HASH; break; case Opt_dont_hash: ima_log_string(ab, "action", "dont_hash"); if (entry->action != UNKNOWN) result = -EINVAL; entry->action = DONT_HASH; break; case Opt_func: ima_log_string(ab, "func", args[0].from); if (entry->func) result = -EINVAL; if (strcmp(args[0].from, "FILE_CHECK") == 0) entry->func = FILE_CHECK; /* PATH_CHECK is for backwards compat */ else if (strcmp(args[0].from, "PATH_CHECK") == 0) entry->func = FILE_CHECK; else if (strcmp(args[0].from, "MODULE_CHECK") == 0) entry->func = MODULE_CHECK; else if (strcmp(args[0].from, "FIRMWARE_CHECK") == 0) entry->func = FIRMWARE_CHECK; else if ((strcmp(args[0].from, "FILE_MMAP") == 0) || (strcmp(args[0].from, "MMAP_CHECK") == 0)) entry->func = MMAP_CHECK; else if ((strcmp(args[0].from, "MMAP_CHECK_REQPROT") == 0)) entry->func = MMAP_CHECK_REQPROT; else if (strcmp(args[0].from, "BPRM_CHECK") == 0) entry->func = BPRM_CHECK; else if (strcmp(args[0].from, "CREDS_CHECK") == 0) entry->func = CREDS_CHECK; else if (strcmp(args[0].from, "KEXEC_KERNEL_CHECK") == 0) entry->func = KEXEC_KERNEL_CHECK; else if (strcmp(args[0].from, "KEXEC_INITRAMFS_CHECK") == 0) entry->func = KEXEC_INITRAMFS_CHECK; else if (strcmp(args[0].from, "POLICY_CHECK") == 0) entry->func = POLICY_CHECK; else if (strcmp(args[0].from, "KEXEC_CMDLINE") == 0) entry->func = KEXEC_CMDLINE; else if (IS_ENABLED(CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS) && strcmp(args[0].from, "KEY_CHECK") == 0) entry->func = KEY_CHECK; else if (strcmp(args[0].from, "CRITICAL_DATA") == 0) entry->func = CRITICAL_DATA; else if (strcmp(args[0].from, "SETXATTR_CHECK") == 0) entry->func = SETXATTR_CHECK; else result = -EINVAL; if (!result) entry->flags |= IMA_FUNC; break; case Opt_mask: ima_log_string(ab, "mask", args[0].from); if (entry->mask) result = -EINVAL; from = args[0].from; if (*from == '^') from++; if ((strcmp(from, "MAY_EXEC")) == 0) entry->mask = MAY_EXEC; else if (strcmp(from, "MAY_WRITE") == 0) entry->mask = MAY_WRITE; else if (strcmp(from, "MAY_READ") == 0) entry->mask = MAY_READ; else if (strcmp(from, "MAY_APPEND") == 0) entry->mask = MAY_APPEND; else result = -EINVAL; if (!result) entry->flags |= (*args[0].from == '^') ? IMA_INMASK : IMA_MASK; break; case Opt_fsmagic: ima_log_string(ab, "fsmagic", args[0].from); if (entry->fsmagic) { result = -EINVAL; break; } result = kstrtoul(args[0].from, 16, &entry->fsmagic); if (!result) entry->flags |= IMA_FSMAGIC; break; case Opt_fsname: ima_log_string(ab, "fsname", args[0].from); entry->fsname = kstrdup(args[0].from, GFP_KERNEL); if (!entry->fsname) { result = -ENOMEM; break; } result = 0; entry->flags |= IMA_FSNAME; break; case Opt_fs_subtype: ima_log_string(ab, "fs_subtype", args[0].from); if (entry->fs_subtype) { result = -EINVAL; break; } entry->fs_subtype = kstrdup(args[0].from, GFP_KERNEL); if (!entry->fs_subtype) { result = -ENOMEM; break; } result = 0; entry->flags |= IMA_FS_SUBTYPE; break; case Opt_keyrings: ima_log_string(ab, "keyrings", args[0].from); if (!IS_ENABLED(CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS) || entry->keyrings) { result = -EINVAL; break; } entry->keyrings = ima_alloc_rule_opt_list(args); if (IS_ERR(entry->keyrings)) { result = PTR_ERR(entry->keyrings); entry->keyrings = NULL; break; } entry->flags |= IMA_KEYRINGS; break; case Opt_label: ima_log_string(ab, "label", args[0].from); if (entry->label) { result = -EINVAL; break; } entry->label = ima_alloc_rule_opt_list(args); if (IS_ERR(entry->label)) { result = PTR_ERR(entry->label); entry->label = NULL; break; } entry->flags |= IMA_LABEL; break; case Opt_fsuuid: ima_log_string(ab, "fsuuid", args[0].from); if (!uuid_is_null(&entry->fsuuid)) { result = -EINVAL; break; } result = uuid_parse(args[0].from, &entry->fsuuid); if (!result) entry->flags |= IMA_FSUUID; break; case Opt_uid_gt: case Opt_euid_gt: entry->uid_op = &uid_gt; fallthrough; case Opt_uid_lt: case Opt_euid_lt: if ((token == Opt_uid_lt) || (token == Opt_euid_lt)) entry->uid_op = &uid_lt; fallthrough; case Opt_uid_eq: case Opt_euid_eq: eid_token = (token == Opt_euid_eq) || (token == Opt_euid_gt) || (token == Opt_euid_lt); ima_log_string_op(ab, eid_token ? "euid" : "uid", args[0].from, token); if (uid_valid(entry->uid)) { result = -EINVAL; break; } result = kstrtoul(args[0].from, 10, &lnum); if (!result) { entry->uid = make_kuid(current_user_ns(), (uid_t) lnum); if (!uid_valid(entry->uid) || (uid_t)lnum != lnum) result = -EINVAL; else entry->flags |= eid_token ? IMA_EUID : IMA_UID; } break; case Opt_gid_gt: case Opt_egid_gt: entry->gid_op = &gid_gt; fallthrough; case Opt_gid_lt: case Opt_egid_lt: if ((token == Opt_gid_lt) || (token == Opt_egid_lt)) entry->gid_op = &gid_lt; fallthrough; case Opt_gid_eq: case Opt_egid_eq: eid_token = (token == Opt_egid_eq) || (token == Opt_egid_gt) || (token == Opt_egid_lt); ima_log_string_op(ab, eid_token ? "egid" : "gid", args[0].from, token); if (gid_valid(entry->gid)) { result = -EINVAL; break; } result = kstrtoul(args[0].from, 10, &lnum); if (!result) { entry->gid = make_kgid(current_user_ns(), (gid_t)lnum); if (!gid_valid(entry->gid) || (((gid_t)lnum) != lnum)) result = -EINVAL; else entry->flags |= eid_token ? IMA_EGID : IMA_GID; } break; case Opt_fowner_gt: entry->fowner_op = &vfsuid_gt_kuid; fallthrough; case Opt_fowner_lt: if (token == Opt_fowner_lt) entry->fowner_op = &vfsuid_lt_kuid; fallthrough; case Opt_fowner_eq: ima_log_string_op(ab, "fowner", args[0].from, token); if (uid_valid(entry->fowner)) { result = -EINVAL; break; } result = kstrtoul(args[0].from, 10, &lnum); if (!result) { entry->fowner = make_kuid(current_user_ns(), (uid_t)lnum); if (!uid_valid(entry->fowner) || (((uid_t)lnum) != lnum)) result = -EINVAL; else entry->flags |= IMA_FOWNER; } break; case Opt_fgroup_gt: entry->fgroup_op = &vfsgid_gt_kgid; fallthrough; case Opt_fgroup_lt: if (token == Opt_fgroup_lt) entry->fgroup_op = &vfsgid_lt_kgid; fallthrough; case Opt_fgroup_eq: ima_log_string_op(ab, "fgroup", args[0].from, token); if (gid_valid(entry->fgroup)) { result = -EINVAL; break; } result = kstrtoul(args[0].from, 10, &lnum); if (!result) { entry->fgroup = make_kgid(current_user_ns(), (gid_t)lnum); if (!gid_valid(entry->fgroup) || (((gid_t)lnum) != lnum)) result = -EINVAL; else entry->flags |= IMA_FGROUP; } break; case Opt_obj_user: ima_log_string(ab, "obj_user", args[0].from); result = ima_lsm_rule_init(entry, args, LSM_OBJ_USER, AUDIT_OBJ_USER); break; case Opt_obj_role: ima_log_string(ab, "obj_role", args[0].from); result = ima_lsm_rule_init(entry, args, LSM_OBJ_ROLE, AUDIT_OBJ_ROLE); break; case Opt_obj_type: ima_log_string(ab, "obj_type", args[0].from); result = ima_lsm_rule_init(entry, args, LSM_OBJ_TYPE, AUDIT_OBJ_TYPE); break; case Opt_subj_user: ima_log_string(ab, "subj_user", args[0].from); result = ima_lsm_rule_init(entry, args, LSM_SUBJ_USER, AUDIT_SUBJ_USER); break; case Opt_subj_role: ima_log_string(ab, "subj_role", args[0].from); result = ima_lsm_rule_init(entry, args, LSM_SUBJ_ROLE, AUDIT_SUBJ_ROLE); break; case Opt_subj_type: ima_log_string(ab, "subj_type", args[0].from); result = ima_lsm_rule_init(entry, args, LSM_SUBJ_TYPE, AUDIT_SUBJ_TYPE); break; case Opt_digest_type: ima_log_string(ab, "digest_type", args[0].from); if (entry->flags & IMA_DIGSIG_REQUIRED) result = -EINVAL; else if ((strcmp(args[0].from, "verity")) == 0) entry->flags |= IMA_VERITY_REQUIRED; else result = -EINVAL; break; case Opt_appraise_type: ima_log_string(ab, "appraise_type", args[0].from); if ((strcmp(args[0].from, "imasig")) == 0) { if (entry->flags & IMA_VERITY_REQUIRED) result = -EINVAL; else entry->flags |= IMA_DIGSIG_REQUIRED | IMA_CHECK_BLACKLIST; } else if (strcmp(args[0].from, "sigv3") == 0) { /* Only fsverity supports sigv3 for now */ if (entry->flags & IMA_VERITY_REQUIRED) entry->flags |= IMA_DIGSIG_REQUIRED | IMA_CHECK_BLACKLIST; else result = -EINVAL; } else if (IS_ENABLED(CONFIG_IMA_APPRAISE_MODSIG) && strcmp(args[0].from, "imasig|modsig") == 0) { if (entry->flags & IMA_VERITY_REQUIRED) result = -EINVAL; else entry->flags |= IMA_DIGSIG_REQUIRED | IMA_MODSIG_ALLOWED | IMA_CHECK_BLACKLIST; } else { result = -EINVAL; } break; case Opt_appraise_flag: ima_log_string(ab, "appraise_flag", args[0].from); break; case Opt_appraise_algos: ima_log_string(ab, "appraise_algos", args[0].from); if (entry->allowed_algos) { result = -EINVAL; break; } entry->allowed_algos = ima_parse_appraise_algos(args[0].from); /* invalid or empty list of algorithms */ if (!entry->allowed_algos) { result = -EINVAL; break; } entry->flags |= IMA_VALIDATE_ALGOS; break; case Opt_permit_directio: entry->flags |= IMA_PERMIT_DIRECTIO; break; case Opt_pcr: ima_log_string(ab, "pcr", args[0].from); result = kstrtoint(args[0].from, 10, &entry->pcr); if (result || INVALID_PCR(entry->pcr)) result = -EINVAL; else entry->flags |= IMA_PCR; break; case Opt_template: ima_log_string(ab, "template", args[0].from); if (entry->action != MEASURE) { result = -EINVAL; break; } template_desc = lookup_template_desc(args[0].from); if (!template_desc || entry->template) { result = -EINVAL; break; } /* * template_desc_init_fields() does nothing if * the template is already initialised, so * it's safe to do this unconditionally */ template_desc_init_fields(template_desc->fmt, &(template_desc->fields), &(template_desc->num_fields)); entry->template = template_desc; break; case Opt_err: ima_log_string(ab, "UNKNOWN", p); result = -EINVAL; break; } } if (!result && !ima_validate_rule(entry)) result = -EINVAL; else if (entry->action == APPRAISE) temp_ima_appraise |= ima_appraise_flag(entry->func); if (!result && entry->flags & IMA_MODSIG_ALLOWED) { template_desc = entry->template ? entry->template : ima_template_desc_current(); check_template_modsig(template_desc); } /* d-ngv2 template field recommended for unsigned fs-verity digests */ if (!result && entry->action == MEASURE && entry->flags & IMA_VERITY_REQUIRED) { template_desc = entry->template ? entry->template : ima_template_desc_current(); check_template_field(template_desc, "d-ngv2", "verity rules should include d-ngv2"); } audit_log_format(ab, "res=%d", !result); audit_log_end(ab); return result; } /** * ima_parse_add_rule - add a rule to ima_policy_rules * @rule: ima measurement policy rule * * Avoid locking by allowing just one writer at a time in ima_write_policy() * Returns the length of the rule parsed, an error code on failure */ ssize_t ima_parse_add_rule(char *rule) { static const char op[] = "update_policy"; char *p; struct ima_rule_entry *entry; ssize_t result, len; int audit_info = 0; p = strsep(&rule, "\n"); len = strlen(p) + 1; p += strspn(p, " \t"); if (*p == '#' || *p == '\0') return len; entry = kzalloc(sizeof(*entry), GFP_KERNEL); if (!entry) { integrity_audit_msg(AUDIT_INTEGRITY_STATUS, NULL, NULL, op, "-ENOMEM", -ENOMEM, audit_info); return -ENOMEM; } INIT_LIST_HEAD(&entry->list); result = ima_parse_rule(p, entry); if (result) { ima_free_rule(entry); integrity_audit_msg(AUDIT_INTEGRITY_STATUS, NULL, NULL, op, "invalid-policy", result, audit_info); return result; } list_add_tail(&entry->list, &ima_temp_rules); return len; } /** * ima_delete_rules() - called to cleanup invalid in-flight policy. * * We don't need locking as we operate on the temp list, which is * different from the active one. There is also only one user of * ima_delete_rules() at a time. */ void ima_delete_rules(void) { struct ima_rule_entry *entry, *tmp; temp_ima_appraise = 0; list_for_each_entry_safe(entry, tmp, &ima_temp_rules, list) { list_del(&entry->list); ima_free_rule(entry); } } #define __ima_hook_stringify(func, str) (#func), const char *const func_tokens[] = { __ima_hooks(__ima_hook_stringify) }; #ifdef CONFIG_IMA_READ_POLICY enum { mask_exec = 0, mask_write, mask_read, mask_append }; static const char *const mask_tokens[] = { "^MAY_EXEC", "^MAY_WRITE", "^MAY_READ", "^MAY_APPEND" }; void *ima_policy_start(struct seq_file *m, loff_t *pos) { loff_t l = *pos; struct ima_rule_entry *entry; struct list_head *ima_rules_tmp; rcu_read_lock(); ima_rules_tmp = rcu_dereference(ima_rules); list_for_each_entry_rcu(entry, ima_rules_tmp, list) { if (!l--) { rcu_read_unlock(); return entry; } } rcu_read_unlock(); return NULL; } void *ima_policy_next(struct seq_file *m, void *v, loff_t *pos) { struct ima_rule_entry *entry = v; rcu_read_lock(); entry = list_entry_rcu(entry->list.next, struct ima_rule_entry, list); rcu_read_unlock(); (*pos)++; return (&entry->list == &ima_default_rules || &entry->list == &ima_policy_rules) ? NULL : entry; } void ima_policy_stop(struct seq_file *m, void *v) { } #define pt(token) policy_tokens[token].pattern #define mt(token) mask_tokens[token] /* * policy_func_show - display the ima_hooks policy rule */ static void policy_func_show(struct seq_file *m, enum ima_hooks func) { if (func > 0 && func < MAX_CHECK) seq_printf(m, "func=%s ", func_tokens[func]); else seq_printf(m, "func=%d ", func); } static void ima_show_rule_opt_list(struct seq_file *m, const struct ima_rule_opt_list *opt_list) { size_t i; for (i = 0; i < opt_list->count; i++) seq_printf(m, "%s%s", i ? "|" : "", opt_list->items[i]); } static void ima_policy_show_appraise_algos(struct seq_file *m, unsigned int allowed_hashes) { int idx, list_size = 0; for (idx = 0; idx < HASH_ALGO__LAST; idx++) { if (!(allowed_hashes & (1U << idx))) continue; /* only add commas if the list contains multiple entries */ if (list_size++) seq_puts(m, ","); seq_puts(m, hash_algo_name[idx]); } } int ima_policy_show(struct seq_file *m, void *v) { struct ima_rule_entry *entry = v; int i; char tbuf[64] = {0,}; int offset = 0; rcu_read_lock(); /* Do not print rules with inactive LSM labels */ for (i = 0; i < MAX_LSM_RULES; i++) { if (entry->lsm[i].args_p && !entry->lsm[i].rule) { rcu_read_unlock(); return 0; } } if (entry->action & MEASURE) seq_puts(m, pt(Opt_measure)); if (entry->action & DONT_MEASURE) seq_puts(m, pt(Opt_dont_measure)); if (entry->action & APPRAISE) seq_puts(m, pt(Opt_appraise)); if (entry->action & DONT_APPRAISE) seq_puts(m, pt(Opt_dont_appraise)); if (entry->action & AUDIT) seq_puts(m, pt(Opt_audit)); if (entry->action & DONT_AUDIT) seq_puts(m, pt(Opt_dont_audit)); if (entry->action & HASH) seq_puts(m, pt(Opt_hash)); if (entry->action & DONT_HASH) seq_puts(m, pt(Opt_dont_hash)); seq_puts(m, " "); if (entry->flags & IMA_FUNC) policy_func_show(m, entry->func); if ((entry->flags & IMA_MASK) || (entry->flags & IMA_INMASK)) { if (entry->flags & IMA_MASK) offset = 1; if (entry->mask & MAY_EXEC) seq_printf(m, pt(Opt_mask), mt(mask_exec) + offset); if (entry->mask & MAY_WRITE) seq_printf(m, pt(Opt_mask), mt(mask_write) + offset); if (entry->mask & MAY_READ) seq_printf(m, pt(Opt_mask), mt(mask_read) + offset); if (entry->mask & MAY_APPEND) seq_printf(m, pt(Opt_mask), mt(mask_append) + offset); seq_puts(m, " "); } if (entry->flags & IMA_FSMAGIC) { snprintf(tbuf, sizeof(tbuf), "0x%lx", entry->fsmagic); seq_printf(m, pt(Opt_fsmagic), tbuf); seq_puts(m, " "); } if (entry->flags & IMA_FSNAME) { snprintf(tbuf, sizeof(tbuf), "%s", entry->fsname); seq_printf(m, pt(Opt_fsname), tbuf); seq_puts(m, " "); } if (entry->flags & IMA_FS_SUBTYPE) { snprintf(tbuf, sizeof(tbuf), "%s", entry->fs_subtype); seq_printf(m, pt(Opt_fs_subtype), tbuf); seq_puts(m, " "); } if (entry->flags & IMA_KEYRINGS) { seq_puts(m, "keyrings="); ima_show_rule_opt_list(m, entry->keyrings); seq_puts(m, " "); } if (entry->flags & IMA_LABEL) { seq_puts(m, "label="); ima_show_rule_opt_list(m, entry->label); seq_puts(m, " "); } if (entry->flags & IMA_PCR) { snprintf(tbuf, sizeof(tbuf), "%d", entry->pcr); seq_printf(m, pt(Opt_pcr), tbuf); seq_puts(m, " "); } if (entry->flags & IMA_FSUUID) { seq_printf(m, "fsuuid=%pU", &entry->fsuuid); seq_puts(m, " "); } if (entry->flags & IMA_UID) { snprintf(tbuf, sizeof(tbuf), "%d", __kuid_val(entry->uid)); if (entry->uid_op == &uid_gt) seq_printf(m, pt(Opt_uid_gt), tbuf); else if (entry->uid_op == &uid_lt) seq_printf(m, pt(Opt_uid_lt), tbuf); else seq_printf(m, pt(Opt_uid_eq), tbuf); seq_puts(m, " "); } if (entry->flags & IMA_EUID) { snprintf(tbuf, sizeof(tbuf), "%d", __kuid_val(entry->uid)); if (entry->uid_op == &uid_gt) seq_printf(m, pt(Opt_euid_gt), tbuf); else if (entry->uid_op == &uid_lt) seq_printf(m, pt(Opt_euid_lt), tbuf); else seq_printf(m, pt(Opt_euid_eq), tbuf); seq_puts(m, " "); } if (entry->flags & IMA_GID) { snprintf(tbuf, sizeof(tbuf), "%d", __kgid_val(entry->gid)); if (entry->gid_op == &gid_gt) seq_printf(m, pt(Opt_gid_gt), tbuf); else if (entry->gid_op == &gid_lt) seq_printf(m, pt(Opt_gid_lt), tbuf); else seq_printf(m, pt(Opt_gid_eq), tbuf); seq_puts(m, " "); } if (entry->flags & IMA_EGID) { snprintf(tbuf, sizeof(tbuf), "%d", __kgid_val(entry->gid)); if (entry->gid_op == &gid_gt) seq_printf(m, pt(Opt_egid_gt), tbuf); else if (entry->gid_op == &gid_lt) seq_printf(m, pt(Opt_egid_lt), tbuf); else seq_printf(m, pt(Opt_egid_eq), tbuf); seq_puts(m, " "); } if (entry->flags & IMA_FOWNER) { snprintf(tbuf, sizeof(tbuf), "%d", __kuid_val(entry->fowner)); if (entry->fowner_op == &vfsuid_gt_kuid) seq_printf(m, pt(Opt_fowner_gt), tbuf); else if (entry->fowner_op == &vfsuid_lt_kuid) seq_printf(m, pt(Opt_fowner_lt), tbuf); else seq_printf(m, pt(Opt_fowner_eq), tbuf); seq_puts(m, " "); } if (entry->flags & IMA_FGROUP) { snprintf(tbuf, sizeof(tbuf), "%d", __kgid_val(entry->fgroup)); if (entry->fgroup_op == &vfsgid_gt_kgid) seq_printf(m, pt(Opt_fgroup_gt), tbuf); else if (entry->fgroup_op == &vfsgid_lt_kgid) seq_printf(m, pt(Opt_fgroup_lt), tbuf); else seq_printf(m, pt(Opt_fgroup_eq), tbuf); seq_puts(m, " "); } if (entry->flags & IMA_VALIDATE_ALGOS) { seq_puts(m, "appraise_algos="); ima_policy_show_appraise_algos(m, entry->allowed_algos); seq_puts(m, " "); } for (i = 0; i < MAX_LSM_RULES; i++) { if (entry->lsm[i].rule) { switch (i) { case LSM_OBJ_USER: seq_printf(m, pt(Opt_obj_user), entry->lsm[i].args_p); break; case LSM_OBJ_ROLE: seq_printf(m, pt(Opt_obj_role), entry->lsm[i].args_p); break; case LSM_OBJ_TYPE: seq_printf(m, pt(Opt_obj_type), entry->lsm[i].args_p); break; case LSM_SUBJ_USER: seq_printf(m, pt(Opt_subj_user), entry->lsm[i].args_p); break; case LSM_SUBJ_ROLE: seq_printf(m, pt(Opt_subj_role), entry->lsm[i].args_p); break; case LSM_SUBJ_TYPE: seq_printf(m, pt(Opt_subj_type), entry->lsm[i].args_p); break; } seq_puts(m, " "); } } if (entry->template) seq_printf(m, "template=%s ", entry->template->name); if (entry->flags & IMA_DIGSIG_REQUIRED) { if (entry->flags & IMA_VERITY_REQUIRED) seq_puts(m, "appraise_type=sigv3 "); else if (entry->flags & IMA_MODSIG_ALLOWED) seq_puts(m, "appraise_type=imasig|modsig "); else seq_puts(m, "appraise_type=imasig "); } if (entry->flags & IMA_VERITY_REQUIRED) seq_puts(m, "digest_type=verity "); if (entry->flags & IMA_PERMIT_DIRECTIO) seq_puts(m, "permit_directio "); rcu_read_unlock(); seq_puts(m, "\n"); return 0; } #endif /* CONFIG_IMA_READ_POLICY */ #if defined(CONFIG_IMA_APPRAISE) && defined(CONFIG_INTEGRITY_TRUSTED_KEYRING) /* * ima_appraise_signature: whether IMA will appraise a given function using * an IMA digital signature. This is restricted to cases where the kernel * has a set of built-in trusted keys in order to avoid an attacker simply * loading additional keys. */ bool ima_appraise_signature(enum kernel_read_file_id id) { struct ima_rule_entry *entry; bool found = false; enum ima_hooks func; struct list_head *ima_rules_tmp; if (id >= READING_MAX_ID) return false; if (id == READING_KEXEC_IMAGE && !(ima_appraise & IMA_APPRAISE_ENFORCE) && security_locked_down(LOCKDOWN_KEXEC)) return false; func = read_idmap[id] ?: FILE_CHECK; rcu_read_lock(); ima_rules_tmp = rcu_dereference(ima_rules); list_for_each_entry_rcu(entry, ima_rules_tmp, list) { if (entry->action != APPRAISE) continue; /* * A generic entry will match, but otherwise require that it * match the func we're looking for */ if (entry->func && entry->func != func) continue; /* * We require this to be a digital signature, not a raw IMA * hash. */ if (entry->flags & IMA_DIGSIG_REQUIRED) found = true; /* * We've found a rule that matches, so break now even if it * didn't require a digital signature - a later rule that does * won't override it, so would be a false positive. */ break; } rcu_read_unlock(); return found; } #endif /* CONFIG_IMA_APPRAISE && CONFIG_INTEGRITY_TRUSTED_KEYRING */ |
| 5 5 8 3 2 45 5 3 12 12 4 8 14 14 1 41 6 45 2 3 1 1 14 30 1 6 4 1 2 1 31 24 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Portions Copyright (C) 1992 Drew Eckhardt */ #ifndef _LINUX_BLKDEV_H #define _LINUX_BLKDEV_H #include <linux/types.h> #include <linux/blk_types.h> #include <linux/device.h> #include <linux/list.h> #include <linux/llist.h> #include <linux/minmax.h> #include <linux/timer.h> #include <linux/workqueue.h> #include <linux/wait.h> #include <linux/bio.h> #include <linux/gfp.h> #include <linux/kdev_t.h> #include <linux/rcupdate.h> #include <linux/percpu-refcount.h> #include <linux/blkzoned.h> #include <linux/sched.h> #include <linux/sbitmap.h> #include <linux/uuid.h> #include <linux/xarray.h> #include <linux/file.h> #include <linux/lockdep.h> struct module; struct request_queue; struct elevator_queue; struct blk_trace; struct request; struct sg_io_hdr; struct blkcg_gq; struct blk_flush_queue; struct kiocb; struct pr_ops; struct rq_qos; struct blk_report_zones_args; struct blk_queue_stats; struct blk_stat_callback; struct blk_crypto_profile; extern const struct device_type disk_type; extern const struct device_type part_type; extern const struct class block_class; /* * Maximum number of blkcg policies allowed to be registered concurrently. * Defined here to simplify include dependency. */ #define BLKCG_MAX_POLS 6 #define DISK_MAX_PARTS 256 #define DISK_NAME_LEN 32 #define PARTITION_META_INFO_VOLNAMELTH 64 /* * Enough for the string representation of any kind of UUID plus NULL. * EFI UUID is 36 characters. MSDOS UUID is 11 characters. */ #define PARTITION_META_INFO_UUIDLTH (UUID_STRING_LEN + 1) struct partition_meta_info { char uuid[PARTITION_META_INFO_UUIDLTH]; u8 volname[PARTITION_META_INFO_VOLNAMELTH]; }; /** * DOC: genhd capability flags * * ``GENHD_FL_REMOVABLE``: indicates that the block device gives access to * removable media. When set, the device remains present even when media is not * inserted. Shall not be set for devices which are removed entirely when the * media is removed. * * ``GENHD_FL_HIDDEN``: the block device is hidden; it doesn't produce events, * doesn't appear in sysfs, and can't be opened from userspace or using * blkdev_get*. Used for the underlying components of multipath devices. * * ``GENHD_FL_NO_PART``: partition support is disabled. The kernel will not * scan for partitions from add_disk, and users can't add partitions manually. * */ enum { GENHD_FL_REMOVABLE = 1 << 0, GENHD_FL_HIDDEN = 1 << 1, GENHD_FL_NO_PART = 1 << 2, }; enum { DISK_EVENT_MEDIA_CHANGE = 1 << 0, /* media changed */ DISK_EVENT_EJECT_REQUEST = 1 << 1, /* eject requested */ }; enum { /* Poll even if events_poll_msecs is unset */ DISK_EVENT_FLAG_POLL = 1 << 0, /* Forward events to udev */ DISK_EVENT_FLAG_UEVENT = 1 << 1, /* Block event polling when open for exclusive write */ DISK_EVENT_FLAG_BLOCK_ON_EXCL_WRITE = 1 << 2, }; struct disk_events; struct badblocks; enum blk_integrity_checksum { BLK_INTEGRITY_CSUM_NONE = 0, BLK_INTEGRITY_CSUM_IP = 1, BLK_INTEGRITY_CSUM_CRC = 2, BLK_INTEGRITY_CSUM_CRC64 = 3, } __packed ; struct blk_integrity { unsigned char flags; enum blk_integrity_checksum csum_type; unsigned char metadata_size; unsigned char pi_offset; unsigned char interval_exp; unsigned char tag_size; unsigned char pi_tuple_size; }; typedef unsigned int __bitwise blk_mode_t; /* open for reading */ #define BLK_OPEN_READ ((__force blk_mode_t)(1 << 0)) /* open for writing */ #define BLK_OPEN_WRITE ((__force blk_mode_t)(1 << 1)) /* open exclusively (vs other exclusive openers */ #define BLK_OPEN_EXCL ((__force blk_mode_t)(1 << 2)) /* opened with O_NDELAY */ #define BLK_OPEN_NDELAY ((__force blk_mode_t)(1 << 3)) /* open for "writes" only for ioctls (specialy hack for floppy.c) */ #define BLK_OPEN_WRITE_IOCTL ((__force blk_mode_t)(1 << 4)) /* open is exclusive wrt all other BLK_OPEN_WRITE opens to the device */ #define BLK_OPEN_RESTRICT_WRITES ((__force blk_mode_t)(1 << 5)) /* return partition scanning errors */ #define BLK_OPEN_STRICT_SCAN ((__force blk_mode_t)(1 << 6)) struct gendisk { /* * major/first_minor/minors should not be set by any new driver, the * block core will take care of allocating them automatically. */ int major; int first_minor; int minors; char disk_name[DISK_NAME_LEN]; /* name of major driver */ unsigned short events; /* supported events */ unsigned short event_flags; /* flags related to event processing */ struct xarray part_tbl; struct block_device *part0; const struct block_device_operations *fops; struct request_queue *queue; void *private_data; struct bio_set bio_split; int flags; unsigned long state; #define GD_NEED_PART_SCAN 0 #define GD_READ_ONLY 1 #define GD_DEAD 2 #define GD_NATIVE_CAPACITY 3 #define GD_ADDED 4 #define GD_SUPPRESS_PART_SCAN 5 #define GD_OWNS_QUEUE 6 #define GD_ZONE_APPEND_USED 7 struct mutex open_mutex; /* open/close mutex */ unsigned open_partitions; /* number of open partitions */ struct backing_dev_info *bdi; struct kobject queue_kobj; /* the queue/ directory */ struct kobject *slave_dir; #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED struct list_head slave_bdevs; #endif struct timer_rand_state *random; struct disk_events *ev; #ifdef CONFIG_BLK_DEV_ZONED /* * Zoned block device information. Reads of this information must be * protected with blk_queue_enter() / blk_queue_exit(). Modifying this * information is only allowed while no requests are being processed. * See also blk_mq_freeze_queue() and blk_mq_unfreeze_queue(). */ unsigned int nr_zones; unsigned int zone_capacity; unsigned int last_zone_capacity; u8 __rcu *zones_cond; unsigned int zone_wplugs_hash_bits; atomic_t nr_zone_wplugs; spinlock_t zone_wplugs_lock; struct mempool *zone_wplugs_pool; struct hlist_head *zone_wplugs_hash; struct workqueue_struct *zone_wplugs_wq; #endif /* CONFIG_BLK_DEV_ZONED */ #if IS_ENABLED(CONFIG_CDROM) struct cdrom_device_info *cdi; #endif int node_id; struct badblocks *bb; struct lockdep_map lockdep_map; u64 diskseq; blk_mode_t open_mode; /* * Independent sector access ranges. This is always NULL for * devices that do not have multiple independent access ranges. */ struct blk_independent_access_ranges *ia_ranges; struct mutex rqos_state_mutex; /* rqos state change mutex */ }; /** * disk_openers - returns how many openers are there for a disk * @disk: disk to check * * This returns the number of openers for a disk. Note that this value is only * stable if disk->open_mutex is held. * * Note: Due to a quirk in the block layer open code, each open partition is * only counted once even if there are multiple openers. */ static inline unsigned int disk_openers(struct gendisk *disk) { return atomic_read(&disk->part0->bd_openers); } /** * disk_has_partscan - return %true if partition scanning is enabled on a disk * @disk: disk to check * * Returns %true if partitions scanning is enabled for @disk, or %false if * partition scanning is disabled either permanently or temporarily. */ static inline bool disk_has_partscan(struct gendisk *disk) { return !(disk->flags & (GENHD_FL_NO_PART | GENHD_FL_HIDDEN)) && !test_bit(GD_SUPPRESS_PART_SCAN, &disk->state); } /* * The gendisk is refcounted by the part0 block_device, and the bd_device * therein is also used for device model presentation in sysfs. */ #define dev_to_disk(device) \ (dev_to_bdev(device)->bd_disk) #define disk_to_dev(disk) \ (&((disk)->part0->bd_device)) #if IS_REACHABLE(CONFIG_CDROM) #define disk_to_cdi(disk) ((disk)->cdi) #else #define disk_to_cdi(disk) NULL #endif static inline dev_t disk_devt(struct gendisk *disk) { return MKDEV(disk->major, disk->first_minor); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE /* * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER) * however we constrain this to what we can validate and test. */ #define BLK_MAX_BLOCK_SIZE SZ_64K #else #define BLK_MAX_BLOCK_SIZE PAGE_SIZE #endif /* blk_validate_limits() validates bsize, so drivers don't usually need to */ static inline int blk_validate_block_size(unsigned long bsize) { if (bsize < 512 || bsize > BLK_MAX_BLOCK_SIZE || !is_power_of_2(bsize)) return -EINVAL; return 0; } static inline bool blk_op_is_passthrough(blk_opf_t op) { op &= REQ_OP_MASK; return op == REQ_OP_DRV_IN || op == REQ_OP_DRV_OUT; } /* flags set by the driver in queue_limits.features */ typedef unsigned int __bitwise blk_features_t; /* supports a volatile write cache */ #define BLK_FEAT_WRITE_CACHE ((__force blk_features_t)(1u << 0)) /* supports passing on the FUA bit */ #define BLK_FEAT_FUA ((__force blk_features_t)(1u << 1)) /* rotational device (hard drive or floppy) */ #define BLK_FEAT_ROTATIONAL ((__force blk_features_t)(1u << 2)) /* contributes to the random number pool */ #define BLK_FEAT_ADD_RANDOM ((__force blk_features_t)(1u << 3)) /* do disk/partitions IO accounting */ #define BLK_FEAT_IO_STAT ((__force blk_features_t)(1u << 4)) /* don't modify data until writeback is done */ #define BLK_FEAT_STABLE_WRITES ((__force blk_features_t)(1u << 5)) /* always completes in submit context */ #define BLK_FEAT_SYNCHRONOUS ((__force blk_features_t)(1u << 6)) /* supports REQ_NOWAIT */ #define BLK_FEAT_NOWAIT ((__force blk_features_t)(1u << 7)) /* supports DAX */ #define BLK_FEAT_DAX ((__force blk_features_t)(1u << 8)) /* supports I/O polling */ #define BLK_FEAT_POLL ((__force blk_features_t)(1u << 9)) /* is a zoned device */ #define BLK_FEAT_ZONED ((__force blk_features_t)(1u << 10)) /* supports PCI(e) p2p requests */ #define BLK_FEAT_PCI_P2PDMA ((__force blk_features_t)(1u << 12)) /* skip this queue in blk_mq_(un)quiesce_tagset */ #define BLK_FEAT_SKIP_TAGSET_QUIESCE ((__force blk_features_t)(1u << 13)) /* undocumented magic for bcache */ #define BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE \ ((__force blk_features_t)(1u << 15)) /* atomic writes enabled */ #define BLK_FEAT_ATOMIC_WRITES \ ((__force blk_features_t)(1u << 16)) /* * Flags automatically inherited when stacking limits. */ #define BLK_FEAT_INHERIT_MASK \ (BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA | BLK_FEAT_ROTATIONAL | \ BLK_FEAT_STABLE_WRITES | BLK_FEAT_ZONED | \ BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE) /* internal flags in queue_limits.flags */ typedef unsigned int __bitwise blk_flags_t; /* do not send FLUSH/FUA commands despite advertising a write cache */ #define BLK_FLAG_WRITE_CACHE_DISABLED ((__force blk_flags_t)(1u << 0)) /* I/O topology is misaligned */ #define BLK_FLAG_MISALIGNED ((__force blk_flags_t)(1u << 1)) /* passthrough command IO accounting */ #define BLK_FLAG_IOSTATS_PASSTHROUGH ((__force blk_flags_t)(1u << 2)) struct queue_limits { blk_features_t features; blk_flags_t flags; unsigned long seg_boundary_mask; unsigned long virt_boundary_mask; unsigned int max_hw_sectors; unsigned int max_dev_sectors; unsigned int chunk_sectors; unsigned int max_sectors; unsigned int max_user_sectors; unsigned int max_segment_size; unsigned int max_fast_segment_size; unsigned int physical_block_size; unsigned int logical_block_size; unsigned int alignment_offset; unsigned int io_min; unsigned int io_opt; unsigned int max_discard_sectors; unsigned int max_hw_discard_sectors; unsigned int max_user_discard_sectors; unsigned int max_secure_erase_sectors; unsigned int max_write_zeroes_sectors; unsigned int max_wzeroes_unmap_sectors; unsigned int max_hw_wzeroes_unmap_sectors; unsigned int max_user_wzeroes_unmap_sectors; unsigned int max_hw_zone_append_sectors; unsigned int max_zone_append_sectors; unsigned int discard_granularity; unsigned int discard_alignment; unsigned int zone_write_granularity; /* atomic write limits */ unsigned int atomic_write_hw_max; unsigned int atomic_write_max_sectors; unsigned int atomic_write_hw_boundary; unsigned int atomic_write_boundary_sectors; unsigned int atomic_write_hw_unit_min; unsigned int atomic_write_unit_min; unsigned int atomic_write_hw_unit_max; unsigned int atomic_write_unit_max; unsigned short max_segments; unsigned short max_integrity_segments; unsigned short max_discard_segments; unsigned short max_write_streams; unsigned int write_stream_granularity; unsigned int max_open_zones; unsigned int max_active_zones; /* * Drivers that set dma_alignment to less than 511 must be prepared to * handle individual bvec's that are not a multiple of a SECTOR_SIZE * due to possible offsets. */ unsigned int dma_alignment; unsigned int dma_pad_mask; struct blk_integrity integrity; }; typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx, void *data); int disk_report_zone(struct gendisk *disk, struct blk_zone *zone, unsigned int idx, struct blk_report_zones_args *args); int blkdev_get_zone_info(struct block_device *bdev, sector_t sector, struct blk_zone *zone); #define BLK_ALL_ZONES ((unsigned int)-1) int blkdev_report_zones(struct block_device *bdev, sector_t sector, unsigned int nr_zones, report_zones_cb cb, void *data); int blkdev_report_zones_cached(struct block_device *bdev, sector_t sector, unsigned int nr_zones, report_zones_cb cb, void *data); int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op, sector_t sectors, sector_t nr_sectors); int blk_revalidate_disk_zones(struct gendisk *disk); /* * Independent access ranges: struct blk_independent_access_range describes * a range of contiguous sectors that can be accessed using device command * execution resources that are independent from the resources used for * other access ranges. This is typically found with single-LUN multi-actuator * HDDs where each access range is served by a different set of heads. * The set of independent ranges supported by the device is defined using * struct blk_independent_access_ranges. The independent ranges must not overlap * and must include all sectors within the disk capacity (no sector holes * allowed). * For a device with multiple ranges, requests targeting sectors in different * ranges can be executed in parallel. A request can straddle an access range * boundary. */ struct blk_independent_access_range { struct kobject kobj; sector_t sector; sector_t nr_sectors; }; struct blk_independent_access_ranges { struct kobject kobj; bool sysfs_registered; unsigned int nr_ia_ranges; struct blk_independent_access_range ia_range[]; }; struct request_queue { /* * The queue owner gets to use this for whatever they like. * ll_rw_blk doesn't touch it. */ void *queuedata; struct elevator_queue *elevator; const struct blk_mq_ops *mq_ops; /* sw queues */ struct blk_mq_ctx __percpu *queue_ctx; /* * various queue flags, see QUEUE_* below */ unsigned long queue_flags; unsigned int __data_racy rq_timeout; unsigned int queue_depth; refcount_t refs; /* hw dispatch queues */ unsigned int nr_hw_queues; struct blk_mq_hw_ctx * __rcu *queue_hw_ctx; struct percpu_ref q_usage_counter; struct lock_class_key io_lock_cls_key; struct lockdep_map io_lockdep_map; struct lock_class_key q_lock_cls_key; struct lockdep_map q_lockdep_map; struct request *last_merge; spinlock_t queue_lock; int quiesce_depth; struct gendisk *disk; /* * mq queue kobject */ struct kobject *mq_kobj; struct queue_limits limits; #ifdef CONFIG_PM struct device *dev; enum rpm_status rpm_status; #endif /* * Number of contexts that have called blk_set_pm_only(). If this * counter is above zero then only RQF_PM requests are processed. */ atomic_t pm_only; struct blk_queue_stats *stats; struct rq_qos *rq_qos; struct mutex rq_qos_mutex; /* * ida allocated id for this queue. Used to index queues from * ioctx. */ int id; /* * queue settings */ unsigned long nr_requests; /* Max # of requests */ #ifdef CONFIG_BLK_INLINE_ENCRYPTION struct blk_crypto_profile *crypto_profile; struct kobject *crypto_kobject; #endif struct timer_list timeout; struct work_struct timeout_work; atomic_t nr_active_requests_shared_tags; struct blk_mq_tags *sched_shared_tags; struct list_head icq_list; #ifdef CONFIG_BLK_CGROUP DECLARE_BITMAP (blkcg_pols, BLKCG_MAX_POLS); struct blkcg_gq *root_blkg; struct list_head blkg_list; struct mutex blkcg_mutex; #endif int node; spinlock_t requeue_lock; struct list_head requeue_list; struct delayed_work requeue_work; #ifdef CONFIG_BLK_DEV_IO_TRACE struct blk_trace __rcu *blk_trace; #endif /* * for flush operations */ struct blk_flush_queue *fq; struct list_head flush_list; /* * Protects against I/O scheduler switching, particularly when updating * q->elevator. Since the elevator update code path may also modify q-> * nr_requests and wbt latency, this lock also protects the sysfs attrs * nr_requests and wbt_lat_usec. Additionally the nr_hw_queues update * may modify hctx tags, reserved-tags and cpumask, so this lock also * helps protect the hctx sysfs/debugfs attrs. To ensure proper locking * order during an elevator or nr_hw_queue update, first freeze the * queue, then acquire ->elevator_lock. */ struct mutex elevator_lock; struct mutex sysfs_lock; /* * Protects queue limits and also sysfs attribute read_ahead_kb. */ struct mutex limits_lock; /* * for reusing dead hctx instance in case of updating * nr_hw_queues */ struct list_head unused_hctx_list; spinlock_t unused_hctx_lock; int mq_freeze_depth; #ifdef CONFIG_BLK_DEV_THROTTLING /* Throttle data */ struct throtl_data *td; #endif struct rcu_head rcu_head; #ifdef CONFIG_LOCKDEP struct task_struct *mq_freeze_owner; int mq_freeze_owner_depth; /* * Records disk & queue state in current context, used in unfreeze * queue */ bool mq_freeze_disk_dead; bool mq_freeze_queue_dying; #endif wait_queue_head_t mq_freeze_wq; /* * Protect concurrent access to q_usage_counter by * percpu_ref_kill() and percpu_ref_reinit(). */ struct mutex mq_freeze_lock; struct blk_mq_tag_set *tag_set; struct list_head tag_set_list; struct dentry *debugfs_dir; struct dentry *sched_debugfs_dir; struct dentry *rqos_debugfs_dir; /* * Serializes all debugfs metadata operations using the above dentries. */ struct mutex debugfs_mutex; }; /* Keep blk_queue_flag_name[] in sync with the definitions below */ enum { QUEUE_FLAG_DYING, /* queue being torn down */ QUEUE_FLAG_NOMERGES, /* disable merge attempts */ QUEUE_FLAG_SAME_COMP, /* complete on same CPU-group */ QUEUE_FLAG_FAIL_IO, /* fake timeout */ QUEUE_FLAG_NOXMERGES, /* No extended merges */ QUEUE_FLAG_SAME_FORCE, /* force complete on same CPU */ QUEUE_FLAG_INIT_DONE, /* queue is initialized */ QUEUE_FLAG_STATS, /* track IO start and completion times */ QUEUE_FLAG_REGISTERED, /* queue has been registered to a disk */ QUEUE_FLAG_QUIESCED, /* queue has been quiesced */ QUEUE_FLAG_RQ_ALLOC_TIME, /* record rq->alloc_time_ns */ QUEUE_FLAG_HCTX_ACTIVE, /* at least one blk-mq hctx is active */ QUEUE_FLAG_SQ_SCHED, /* single queue style io dispatch */ QUEUE_FLAG_DISABLE_WBT_DEF, /* for sched to disable/enable wbt */ QUEUE_FLAG_NO_ELV_SWITCH, /* can't switch elevator any more */ QUEUE_FLAG_QOS_ENABLED, /* qos is enabled */ QUEUE_FLAG_BIO_ISSUE_TIME, /* record bio->issue_time_ns */ QUEUE_FLAG_MAX }; #define QUEUE_FLAG_MQ_DEFAULT (1UL << QUEUE_FLAG_SAME_COMP) void blk_queue_flag_set(unsigned int flag, struct request_queue *q); void blk_queue_flag_clear(unsigned int flag, struct request_queue *q); #define blk_queue_dying(q) test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags) #define blk_queue_init_done(q) test_bit(QUEUE_FLAG_INIT_DONE, &(q)->queue_flags) #define blk_queue_nomerges(q) test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags) #define blk_queue_noxmerges(q) \ test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags) #define blk_queue_nonrot(q) (!((q)->limits.features & BLK_FEAT_ROTATIONAL)) #define blk_queue_io_stat(q) ((q)->limits.features & BLK_FEAT_IO_STAT) #define blk_queue_passthrough_stat(q) \ ((q)->limits.flags & BLK_FLAG_IOSTATS_PASSTHROUGH) #define blk_queue_dax(q) ((q)->limits.features & BLK_FEAT_DAX) #define blk_queue_pci_p2pdma(q) ((q)->limits.features & BLK_FEAT_PCI_P2PDMA) #ifdef CONFIG_BLK_RQ_ALLOC_TIME #define blk_queue_rq_alloc_time(q) \ test_bit(QUEUE_FLAG_RQ_ALLOC_TIME, &(q)->queue_flags) #else #define blk_queue_rq_alloc_time(q) false #endif #define blk_noretry_request(rq) \ ((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \ REQ_FAILFAST_DRIVER)) #define blk_queue_quiesced(q) test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags) #define blk_queue_pm_only(q) atomic_read(&(q)->pm_only) #define blk_queue_registered(q) test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags) #define blk_queue_sq_sched(q) test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags) #define blk_queue_skip_tagset_quiesce(q) \ ((q)->limits.features & BLK_FEAT_SKIP_TAGSET_QUIESCE) #define blk_queue_disable_wbt(q) \ test_bit(QUEUE_FLAG_DISABLE_WBT_DEF, &(q)->queue_flags) #define blk_queue_no_elv_switch(q) \ test_bit(QUEUE_FLAG_NO_ELV_SWITCH, &(q)->queue_flags) extern void blk_set_pm_only(struct request_queue *q); extern void blk_clear_pm_only(struct request_queue *q); #define list_entry_rq(ptr) list_entry((ptr), struct request, queuelist) #define dma_map_bvec(dev, bv, dir, attrs) \ dma_map_page_attrs(dev, (bv)->bv_page, (bv)->bv_offset, (bv)->bv_len, \ (dir), (attrs)) static inline bool queue_is_mq(struct request_queue *q) { return q->mq_ops; } #ifdef CONFIG_PM static inline enum rpm_status queue_rpm_status(struct request_queue *q) { return q->rpm_status; } #else static inline enum rpm_status queue_rpm_status(struct request_queue *q) { return RPM_ACTIVE; } #endif static inline bool blk_queue_is_zoned(struct request_queue *q) { return IS_ENABLED(CONFIG_BLK_DEV_ZONED) && (q->limits.features & BLK_FEAT_ZONED); } static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector) { if (!blk_queue_is_zoned(disk->queue)) return 0; return sector >> ilog2(disk->queue->limits.chunk_sectors); } static inline unsigned int bdev_max_open_zones(struct block_device *bdev) { return bdev->bd_disk->queue->limits.max_open_zones; } static inline unsigned int bdev_max_active_zones(struct block_device *bdev) { return bdev->bd_disk->queue->limits.max_active_zones; } static inline unsigned int blk_queue_depth(struct request_queue *q) { if (q->queue_depth) return q->queue_depth; return q->nr_requests; } /* * default timeout for SG_IO if none specified */ #define BLK_DEFAULT_SG_TIMEOUT (60 * HZ) #define BLK_MIN_SG_TIMEOUT (7 * HZ) /* This should not be used directly - use rq_for_each_segment */ #define for_each_bio(_bio) \ for (; _bio; _bio = _bio->bi_next) int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk, const struct attribute_group **groups, struct fwnode_handle *fwnode); int __must_check device_add_disk(struct device *parent, struct gendisk *disk, const struct attribute_group **groups); static inline int __must_check add_disk(struct gendisk *disk) { return device_add_disk(NULL, disk, NULL); } void del_gendisk(struct gendisk *gp); void invalidate_disk(struct gendisk *disk); void set_disk_ro(struct gendisk *disk, bool read_only); void disk_uevent(struct gendisk *disk, enum kobject_action action); static inline u8 bdev_partno(const struct block_device *bdev) { return atomic_read(&bdev->__bd_flags) & BD_PARTNO; } static inline bool bdev_test_flag(const struct block_device *bdev, unsigned flag) { return atomic_read(&bdev->__bd_flags) & flag; } static inline void bdev_set_flag(struct block_device *bdev, unsigned flag) { atomic_or(flag, &bdev->__bd_flags); } static inline void bdev_clear_flag(struct block_device *bdev, unsigned flag) { atomic_andnot(flag, &bdev->__bd_flags); } static inline bool get_disk_ro(struct gendisk *disk) { return bdev_test_flag(disk->part0, BD_READ_ONLY) || test_bit(GD_READ_ONLY, &disk->state); } static inline bool bdev_read_only(struct block_device *bdev) { return bdev_test_flag(bdev, BD_READ_ONLY) || get_disk_ro(bdev->bd_disk); } bool set_capacity_and_notify(struct gendisk *disk, sector_t size); void disk_force_media_change(struct gendisk *disk); void bdev_mark_dead(struct block_device *bdev, bool surprise); void add_disk_randomness(struct gendisk *disk) __latent_entropy; void rand_initialize_disk(struct gendisk *disk); static inline sector_t get_start_sect(struct block_device *bdev) { return bdev->bd_start_sect; } static inline sector_t bdev_nr_sectors(struct block_device *bdev) { return bdev->bd_nr_sectors; } static inline loff_t bdev_nr_bytes(struct block_device *bdev) { return (loff_t)bdev_nr_sectors(bdev) << SECTOR_SHIFT; } static inline sector_t get_capacity(struct gendisk *disk) { return bdev_nr_sectors(disk->part0); } static inline u64 sb_bdev_nr_blocks(struct super_block *sb) { return bdev_nr_sectors(sb->s_bdev) >> (sb->s_blocksize_bits - SECTOR_SHIFT); } #ifdef CONFIG_BLK_DEV_ZONED static inline unsigned int disk_nr_zones(struct gendisk *disk) { return disk->nr_zones; } /** * bio_needs_zone_write_plugging - Check if a BIO needs to be handled with zone * write plugging * @bio: The BIO being submitted * * Return true whenever @bio execution needs to be handled through zone * write plugging (using blk_zone_plug_bio()). Return false otherwise. */ static inline bool bio_needs_zone_write_plugging(struct bio *bio) { enum req_op op = bio_op(bio); /* * Only zoned block devices have a zone write plug hash table. But not * all of them have one (e.g. DM devices may not need one). */ if (!bio->bi_bdev->bd_disk->zone_wplugs_hash) return false; /* Only write operations need zone write plugging. */ if (!op_is_write(op)) return false; /* Ignore empty flush */ if (op_is_flush(bio->bi_opf) && !bio_sectors(bio)) return false; /* Ignore BIOs that already have been handled by zone write plugging. */ if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING)) return false; /* * All zone write operations must be handled through zone write plugging * using blk_zone_plug_bio(). */ switch (op) { case REQ_OP_ZONE_APPEND: case REQ_OP_WRITE: case REQ_OP_WRITE_ZEROES: case REQ_OP_ZONE_FINISH: case REQ_OP_ZONE_RESET: case REQ_OP_ZONE_RESET_ALL: return true; default: return false; } } bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs); /** * disk_zone_capacity - returns the zone capacity of zone containing @sector * @disk: disk to work with * @sector: sector number within the querying zone * * Returns the zone capacity of a zone containing @sector. @sector can be any * sector in the zone. */ static inline unsigned int disk_zone_capacity(struct gendisk *disk, sector_t sector) { sector_t zone_sectors = disk->queue->limits.chunk_sectors; if (sector + zone_sectors >= get_capacity(disk)) return disk->last_zone_capacity; return disk->zone_capacity; } static inline unsigned int bdev_zone_capacity(struct block_device *bdev, sector_t pos) { return disk_zone_capacity(bdev->bd_disk, pos); } bool bdev_zone_is_seq(struct block_device *bdev, sector_t sector); #else /* CONFIG_BLK_DEV_ZONED */ static inline unsigned int disk_nr_zones(struct gendisk *disk) { return 0; } static inline bool bdev_zone_is_seq(struct block_device *bdev, sector_t sector) { return false; } static inline bool bio_needs_zone_write_plugging(struct bio *bio) { return false; } static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs) { return false; } #endif /* CONFIG_BLK_DEV_ZONED */ static inline unsigned int bdev_nr_zones(struct block_device *bdev) { return disk_nr_zones(bdev->bd_disk); } int bdev_disk_changed(struct gendisk *disk, bool invalidate); void put_disk(struct gendisk *disk); struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node, struct lock_class_key *lkclass); /** * blk_alloc_disk - allocate a gendisk structure * @lim: queue limits to be used for this disk. * @node_id: numa node to allocate on * * Allocate and pre-initialize a gendisk structure for use with BIO based * drivers. * * Returns an ERR_PTR on error, else the allocated disk. * * Context: can sleep */ #define blk_alloc_disk(lim, node_id) \ ({ \ static struct lock_class_key __key; \ \ __blk_alloc_disk(lim, node_id, &__key); \ }) int __register_blkdev(unsigned int major, const char *name, void (*probe)(dev_t devt)); #define register_blkdev(major, name) \ __register_blkdev(major, name, NULL) void unregister_blkdev(unsigned int major, const char *name); bool disk_check_media_change(struct gendisk *disk); void set_capacity(struct gendisk *disk, sector_t size); #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk); void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk); #else static inline int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk) { return 0; } static inline void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk) { } #endif /* CONFIG_BLOCK_HOLDER_DEPRECATED */ dev_t part_devt(struct gendisk *disk, u8 partno); void inc_diskseq(struct gendisk *disk); void blk_request_module(dev_t devt); extern int blk_register_queue(struct gendisk *disk); extern void blk_unregister_queue(struct gendisk *disk); void submit_bio_noacct(struct bio *bio); struct bio *bio_split_to_limits(struct bio *bio); struct bio *bio_submit_split_bioset(struct bio *bio, unsigned int split_sectors, struct bio_set *bs); extern int blk_lld_busy(struct request_queue *q); extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags); extern void blk_queue_exit(struct request_queue *q); extern void blk_sync_queue(struct request_queue *q); /* Helper to convert REQ_OP_XXX to its string format XXX */ extern const char *blk_op_str(enum req_op op); int blk_status_to_errno(blk_status_t status); blk_status_t errno_to_blk_status(int errno); const char *blk_status_to_str(blk_status_t status); /* only poll the hardware once, don't continue until a completion was found */ #define BLK_POLL_ONESHOT (1 << 0) int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags); int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob, unsigned int flags); static inline struct request_queue *bdev_get_queue(struct block_device *bdev) { return bdev->bd_queue; /* this is never NULL */ } /* Helper to convert BLK_ZONE_ZONE_XXX to its string format XXX */ const char *blk_zone_cond_str(enum blk_zone_cond zone_cond); static inline unsigned int bio_zone_no(struct bio *bio) { return disk_zone_no(bio->bi_bdev->bd_disk, bio->bi_iter.bi_sector); } static inline bool bio_straddles_zones(struct bio *bio) { return bio_sectors(bio) && bio_zone_no(bio) != disk_zone_no(bio->bi_bdev->bd_disk, bio_end_sector(bio) - 1); } /* * Return how much within the boundary is left to be used for I/O at a given * offset. */ static inline unsigned int blk_boundary_sectors_left(sector_t offset, unsigned int boundary_sectors) { if (unlikely(!is_power_of_2(boundary_sectors))) return boundary_sectors - sector_div(offset, boundary_sectors); return boundary_sectors - (offset & (boundary_sectors - 1)); } /** * queue_limits_start_update - start an atomic update of queue limits * @q: queue to update * * This functions starts an atomic update of the queue limits. It takes a lock * to prevent other updates and returns a snapshot of the current limits that * the caller can modify. The caller must call queue_limits_commit_update() * to finish the update. * * Context: process context. */ static inline struct queue_limits queue_limits_start_update(struct request_queue *q) { mutex_lock(&q->limits_lock); return q->limits; } int queue_limits_commit_update_frozen(struct request_queue *q, struct queue_limits *lim); int queue_limits_commit_update(struct request_queue *q, struct queue_limits *lim); int queue_limits_set(struct request_queue *q, struct queue_limits *lim); int blk_validate_limits(struct queue_limits *lim); /** * queue_limits_cancel_update - cancel an atomic update of queue limits * @q: queue to update * * This functions cancels an atomic update of the queue limits started by * queue_limits_start_update() and should be used when an error occurs after * starting update. */ static inline void queue_limits_cancel_update(struct request_queue *q) { mutex_unlock(&q->limits_lock); } /* * These helpers are for drivers that have sloppy feature negotiation and might * have to disable DISCARD, WRITE_ZEROES or SECURE_DISCARD from the I/O * completion handler when the device returned an indicator that the respective * feature is not actually supported. They are racy and the driver needs to * cope with that. Try to avoid this scheme if you can. */ static inline void blk_queue_disable_discard(struct request_queue *q) { q->limits.max_discard_sectors = 0; } static inline void blk_queue_disable_secure_erase(struct request_queue *q) { q->limits.max_secure_erase_sectors = 0; } static inline void blk_queue_disable_write_zeroes(struct request_queue *q) { q->limits.max_write_zeroes_sectors = 0; q->limits.max_wzeroes_unmap_sectors = 0; } /* * Access functions for manipulating queue properties */ extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth); extern void blk_set_stacking_limits(struct queue_limits *lim); extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, sector_t offset); void queue_limits_stack_bdev(struct queue_limits *t, struct block_device *bdev, sector_t offset, const char *pfx); extern void blk_queue_rq_timeout(struct request_queue *, unsigned int); struct blk_independent_access_ranges * disk_alloc_independent_access_ranges(struct gendisk *disk, int nr_ia_ranges); void disk_set_independent_access_ranges(struct gendisk *disk, struct blk_independent_access_ranges *iars); bool __must_check blk_get_queue(struct request_queue *); extern void blk_put_queue(struct request_queue *); void blk_mark_disk_dead(struct gendisk *disk); struct rq_list { struct request *head; struct request *tail; }; #ifdef CONFIG_BLOCK /* * blk_plug permits building a queue of related requests by holding the I/O * fragments for a short period. This allows merging of sequential requests * into single larger request. As the requests are moved from a per-task list to * the device's request_queue in a batch, this results in improved scalability * as the lock contention for request_queue lock is reduced. * * It is ok not to disable preemption when adding the request to the plug list * or when attempting a merge. For details, please see schedule() where * blk_flush_plug() is called. */ struct blk_plug { struct rq_list mq_list; /* blk-mq requests */ /* if ios_left is > 1, we can batch tag/rq allocations */ struct rq_list cached_rqs; u64 cur_ktime; unsigned short nr_ios; unsigned short rq_count; bool multiple_queues; bool has_elevator; struct list_head cb_list; /* md requires an unplug callback */ }; struct blk_plug_cb; typedef void (*blk_plug_cb_fn)(struct blk_plug_cb *, bool); struct blk_plug_cb { struct list_head list; blk_plug_cb_fn callback; void *data; }; extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug, void *data, int size); extern void blk_start_plug(struct blk_plug *); extern void blk_start_plug_nr_ios(struct blk_plug *, unsigned short); extern void blk_finish_plug(struct blk_plug *); void __blk_flush_plug(struct blk_plug *plug, bool from_schedule); static inline void blk_flush_plug(struct blk_plug *plug, bool async) { if (plug) __blk_flush_plug(plug, async); } /* * tsk == current here */ static inline void blk_plug_invalidate_ts(struct task_struct *tsk) { struct blk_plug *plug = tsk->plug; if (plug) plug->cur_ktime = 0; current->flags &= ~PF_BLOCK_TS; } int blkdev_issue_flush(struct block_device *bdev); long nr_blockdev_pages(void); #else /* CONFIG_BLOCK */ struct blk_plug { }; static inline void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios) { } static inline void blk_start_plug(struct blk_plug *plug) { } static inline void blk_finish_plug(struct blk_plug *plug) { } static inline void blk_flush_plug(struct blk_plug *plug, bool async) { } static inline void blk_plug_invalidate_ts(struct task_struct *tsk) { } static inline int blkdev_issue_flush(struct block_device *bdev) { return 0; } static inline long nr_blockdev_pages(void) { return 0; } #endif /* CONFIG_BLOCK */ extern void blk_io_schedule(void); int blkdev_issue_discard(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask); int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, struct bio **biop); int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp); #define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */ #define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */ #define BLKDEV_ZERO_KILLABLE (1 << 2) /* interruptible by fatal signals */ extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, struct bio **biop, unsigned flags); extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, unsigned flags); static inline int sb_issue_discard(struct super_block *sb, sector_t block, sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags) { return blkdev_issue_discard(sb->s_bdev, block << (sb->s_blocksize_bits - SECTOR_SHIFT), nr_blocks << (sb->s_blocksize_bits - SECTOR_SHIFT), gfp_mask); } static inline int sb_issue_zeroout(struct super_block *sb, sector_t block, sector_t nr_blocks, gfp_t gfp_mask) { return blkdev_issue_zeroout(sb->s_bdev, block << (sb->s_blocksize_bits - SECTOR_SHIFT), nr_blocks << (sb->s_blocksize_bits - SECTOR_SHIFT), gfp_mask, 0); } static inline bool bdev_is_partition(struct block_device *bdev) { return bdev_partno(bdev) != 0; } enum blk_default_limits { BLK_MAX_SEGMENTS = 128, BLK_SAFE_MAX_SECTORS = 255, BLK_MAX_SEGMENT_SIZE = 65536, BLK_SEG_BOUNDARY_MASK = 0xFFFFFFFFUL, }; static inline struct queue_limits *bdev_limits(struct block_device *bdev) { return &bdev_get_queue(bdev)->limits; } static inline unsigned long queue_segment_boundary(const struct request_queue *q) { return q->limits.seg_boundary_mask; } static inline unsigned long queue_virt_boundary(const struct request_queue *q) { return q->limits.virt_boundary_mask; } static inline unsigned int queue_max_sectors(const struct request_queue *q) { return q->limits.max_sectors; } static inline unsigned int queue_max_bytes(struct request_queue *q) { return min_t(unsigned int, queue_max_sectors(q), INT_MAX >> 9) << 9; } static inline unsigned int queue_max_hw_sectors(const struct request_queue *q) { return q->limits.max_hw_sectors; } static inline unsigned short queue_max_segments(const struct request_queue *q) { return q->limits.max_segments; } static inline unsigned short queue_max_discard_segments(const struct request_queue *q) { return q->limits.max_discard_segments; } static inline unsigned int queue_max_segment_size(const struct request_queue *q) { return q->limits.max_segment_size; } static inline bool queue_emulates_zone_append(struct request_queue *q) { return blk_queue_is_zoned(q) && !q->limits.max_hw_zone_append_sectors; } static inline bool bdev_emulates_zone_append(struct block_device *bdev) { return queue_emulates_zone_append(bdev_get_queue(bdev)); } static inline unsigned int bdev_max_zone_append_sectors(struct block_device *bdev) { return bdev_limits(bdev)->max_zone_append_sectors; } static inline unsigned int bdev_max_segments(struct block_device *bdev) { return queue_max_segments(bdev_get_queue(bdev)); } static inline unsigned short bdev_max_write_streams(struct block_device *bdev) { if (bdev_is_partition(bdev)) return 0; return bdev_limits(bdev)->max_write_streams; } static inline unsigned queue_logical_block_size(const struct request_queue *q) { return q->limits.logical_block_size; } static inline unsigned int bdev_logical_block_size(struct block_device *bdev) { return queue_logical_block_size(bdev_get_queue(bdev)); } static inline unsigned int queue_physical_block_size(const struct request_queue *q) { return q->limits.physical_block_size; } static inline unsigned int bdev_physical_block_size(struct block_device *bdev) { return queue_physical_block_size(bdev_get_queue(bdev)); } static inline unsigned int queue_io_min(const struct request_queue *q) { return q->limits.io_min; } static inline unsigned int bdev_io_min(struct block_device *bdev) { return queue_io_min(bdev_get_queue(bdev)); } static inline unsigned int queue_io_opt(const struct request_queue *q) { return q->limits.io_opt; } static inline unsigned int bdev_io_opt(struct block_device *bdev) { return queue_io_opt(bdev_get_queue(bdev)); } static inline unsigned int queue_zone_write_granularity(const struct request_queue *q) { return q->limits.zone_write_granularity; } static inline unsigned int bdev_zone_write_granularity(struct block_device *bdev) { return queue_zone_write_granularity(bdev_get_queue(bdev)); } int bdev_alignment_offset(struct block_device *bdev); unsigned int bdev_discard_alignment(struct block_device *bdev); static inline unsigned int bdev_max_discard_sectors(struct block_device *bdev) { return bdev_limits(bdev)->max_discard_sectors; } static inline unsigned int bdev_discard_granularity(struct block_device *bdev) { return bdev_limits(bdev)->discard_granularity; } static inline unsigned int bdev_max_secure_erase_sectors(struct block_device *bdev) { return bdev_limits(bdev)->max_secure_erase_sectors; } static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev) { return bdev_limits(bdev)->max_write_zeroes_sectors; } static inline unsigned int bdev_write_zeroes_unmap_sectors(struct block_device *bdev) { return bdev_limits(bdev)->max_wzeroes_unmap_sectors; } static inline bool bdev_nonrot(struct block_device *bdev) { return blk_queue_nonrot(bdev_get_queue(bdev)); } static inline bool bdev_synchronous(struct block_device *bdev) { return bdev->bd_disk->queue->limits.features & BLK_FEAT_SYNCHRONOUS; } static inline bool bdev_stable_writes(struct block_device *bdev) { struct request_queue *q = bdev_get_queue(bdev); if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && q->limits.integrity.csum_type != BLK_INTEGRITY_CSUM_NONE) return true; return q->limits.features & BLK_FEAT_STABLE_WRITES; } static inline bool blk_queue_write_cache(struct request_queue *q) { return (q->limits.features & BLK_FEAT_WRITE_CACHE) && !(q->limits.flags & BLK_FLAG_WRITE_CACHE_DISABLED); } static inline bool bdev_write_cache(struct block_device *bdev) { return blk_queue_write_cache(bdev_get_queue(bdev)); } static inline bool bdev_fua(struct block_device *bdev) { return bdev_limits(bdev)->features & BLK_FEAT_FUA; } static inline bool bdev_nowait(struct block_device *bdev) { return bdev->bd_disk->queue->limits.features & BLK_FEAT_NOWAIT; } static inline bool bdev_is_zoned(struct block_device *bdev) { return blk_queue_is_zoned(bdev_get_queue(bdev)); } static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec) { return disk_zone_no(bdev->bd_disk, sec); } static inline sector_t bdev_zone_sectors(struct block_device *bdev) { struct request_queue *q = bdev_get_queue(bdev); if (!blk_queue_is_zoned(q)) return 0; return q->limits.chunk_sectors; } static inline sector_t bdev_zone_start(struct block_device *bdev, sector_t sector) { return sector & ~(bdev_zone_sectors(bdev) - 1); } static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev, sector_t sector) { return sector & (bdev_zone_sectors(bdev) - 1); } static inline sector_t bio_offset_from_zone_start(struct bio *bio) { return bdev_offset_from_zone_start(bio->bi_bdev, bio->bi_iter.bi_sector); } static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sector) { return bdev_offset_from_zone_start(bdev, sector) == 0; } /* Check whether @sector is a multiple of the zone size. */ static inline bool bdev_is_zone_aligned(struct block_device *bdev, sector_t sector) { return bdev_is_zone_start(bdev, sector); } int blk_zone_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask); static inline unsigned int queue_dma_alignment(const struct request_queue *q) { return q->limits.dma_alignment; } static inline unsigned int queue_atomic_write_unit_max_bytes(const struct request_queue *q) { return q->limits.atomic_write_unit_max; } static inline unsigned int queue_atomic_write_unit_min_bytes(const struct request_queue *q) { return q->limits.atomic_write_unit_min; } static inline unsigned int queue_atomic_write_boundary_bytes(const struct request_queue *q) { return q->limits.atomic_write_boundary_sectors << SECTOR_SHIFT; } static inline unsigned int queue_atomic_write_max_bytes(const struct request_queue *q) { return q->limits.atomic_write_max_sectors << SECTOR_SHIFT; } static inline unsigned int bdev_dma_alignment(struct block_device *bdev) { return queue_dma_alignment(bdev_get_queue(bdev)); } static inline unsigned int blk_lim_dma_alignment_and_pad(struct queue_limits *lim) { return lim->dma_alignment | lim->dma_pad_mask; } static inline bool blk_rq_aligned(struct request_queue *q, unsigned long addr, unsigned int len) { unsigned int alignment = blk_lim_dma_alignment_and_pad(&q->limits); return !(addr & alignment) && !(len & alignment); } /* assumes size > 256 */ static inline unsigned int blksize_bits(unsigned int size) { return order_base_2(size >> SECTOR_SHIFT) + SECTOR_SHIFT; } int kblockd_schedule_work(struct work_struct *work); int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned long delay); #define MODULE_ALIAS_BLOCKDEV(major,minor) \ MODULE_ALIAS("block-major-" __stringify(major) "-" __stringify(minor)) #define MODULE_ALIAS_BLOCKDEV_MAJOR(major) \ MODULE_ALIAS("block-major-" __stringify(major) "-*") #ifdef CONFIG_BLK_INLINE_ENCRYPTION bool blk_crypto_register(struct blk_crypto_profile *profile, struct request_queue *q); #else /* CONFIG_BLK_INLINE_ENCRYPTION */ static inline bool blk_crypto_register(struct blk_crypto_profile *profile, struct request_queue *q) { return true; } #endif /* CONFIG_BLK_INLINE_ENCRYPTION */ enum blk_unique_id { /* these match the Designator Types specified in SPC */ BLK_UID_T10 = 1, BLK_UID_EUI64 = 2, BLK_UID_NAA = 3, }; struct block_device_operations { void (*submit_bio)(struct bio *bio); int (*poll_bio)(struct bio *bio, struct io_comp_batch *iob, unsigned int flags); int (*open)(struct gendisk *disk, blk_mode_t mode); void (*release)(struct gendisk *disk); int (*ioctl)(struct block_device *bdev, blk_mode_t mode, unsigned cmd, unsigned long arg); int (*compat_ioctl)(struct block_device *bdev, blk_mode_t mode, unsigned cmd, unsigned long arg); unsigned int (*check_events) (struct gendisk *disk, unsigned int clearing); void (*unlock_native_capacity) (struct gendisk *); int (*getgeo)(struct gendisk *, struct hd_geometry *); int (*set_read_only)(struct block_device *bdev, bool ro); void (*free_disk)(struct gendisk *disk); /* this callback is with swap_lock and sometimes page table lock held */ void (*swap_slot_free_notify) (struct block_device *, unsigned long); int (*report_zones)(struct gendisk *, sector_t sector, unsigned int nr_zones, struct blk_report_zones_args *args); char *(*devnode)(struct gendisk *disk, umode_t *mode); /* returns the length of the identifier or a negative errno: */ int (*get_unique_id)(struct gendisk *disk, u8 id[16], enum blk_unique_id id_type); struct module *owner; const struct pr_ops *pr_ops; /* * Special callback for probing GPT entry at a given sector. * Needed by Android devices, used by GPT scanner and MMC blk * driver. */ int (*alternative_gpt_sector)(struct gendisk *disk, sector_t *sector); }; #ifdef CONFIG_COMPAT extern int blkdev_compat_ptr_ioctl(struct block_device *, blk_mode_t, unsigned int, unsigned long); #else #define blkdev_compat_ptr_ioctl NULL #endif static inline void blk_wake_io_task(struct task_struct *waiter) { /* * If we're polling, the task itself is doing the completions. For * that case, we don't need to signal a wakeup, it's enough to just * mark us as RUNNING. */ if (waiter == current) __set_current_state(TASK_RUNNING); else wake_up_process(waiter); } unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op, unsigned long start_time); void bdev_end_io_acct(struct block_device *bdev, enum req_op op, unsigned int sectors, unsigned long start_time); unsigned long bio_start_io_acct(struct bio *bio); void bio_end_io_acct_remapped(struct bio *bio, unsigned long start_time, struct block_device *orig_bdev); /** * bio_end_io_acct - end I/O accounting for bio based drivers * @bio: bio to end account for * @start_time: start time returned by bio_start_io_acct() */ static inline void bio_end_io_acct(struct bio *bio, unsigned long start_time) { return bio_end_io_acct_remapped(bio, start_time, bio->bi_bdev); } int bdev_validate_blocksize(struct block_device *bdev, int block_size); int set_blocksize(struct file *file, int size); int lookup_bdev(const char *pathname, dev_t *dev); void blkdev_show(struct seq_file *seqf, off_t offset); #define BDEVNAME_SIZE 32 /* Largest string for a blockdev identifier */ #define BDEVT_SIZE 10 /* Largest string for MAJ:MIN for blkdev */ #ifdef CONFIG_BLOCK #define BLKDEV_MAJOR_MAX 512 #else #define BLKDEV_MAJOR_MAX 0 #endif struct blk_holder_ops { void (*mark_dead)(struct block_device *bdev, bool surprise); /* * Sync the file system mounted on the block device. */ void (*sync)(struct block_device *bdev); /* * Freeze the file system mounted on the block device. */ int (*freeze)(struct block_device *bdev); /* * Thaw the file system mounted on the block device. */ int (*thaw)(struct block_device *bdev); }; /* * For filesystems using @fs_holder_ops, the @holder argument passed to * helpers used to open and claim block devices via * bd_prepare_to_claim() must point to a superblock. */ extern const struct blk_holder_ops fs_holder_ops; /* * Return the correct open flags for blkdev_get_by_* for super block flags * as stored in sb->s_flags. */ #define sb_open_mode(flags) \ (BLK_OPEN_READ | BLK_OPEN_RESTRICT_WRITES | \ (((flags) & SB_RDONLY) ? 0 : BLK_OPEN_WRITE)) struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder, const struct blk_holder_ops *hops); struct file *bdev_file_open_by_path(const char *path, blk_mode_t mode, void *holder, const struct blk_holder_ops *hops); int bd_prepare_to_claim(struct block_device *bdev, void *holder, const struct blk_holder_ops *hops); void bd_abort_claiming(struct block_device *bdev, void *holder); struct block_device *I_BDEV(struct inode *inode); struct block_device *file_bdev(struct file *bdev_file); bool disk_live(struct gendisk *disk); unsigned int block_size(struct block_device *bdev); #ifdef CONFIG_BLOCK void invalidate_bdev(struct block_device *bdev); int sync_blockdev(struct block_device *bdev); int sync_blockdev_range(struct block_device *bdev, loff_t lstart, loff_t lend); int sync_blockdev_nowait(struct block_device *bdev); void sync_bdevs(bool wait); void bdev_statx(const struct path *path, struct kstat *stat, u32 request_mask); void printk_all_partitions(void); int __init early_lookup_bdev(const char *pathname, dev_t *dev); #else static inline void invalidate_bdev(struct block_device *bdev) { } static inline int sync_blockdev(struct block_device *bdev) { return 0; } static inline int sync_blockdev_nowait(struct block_device *bdev) { return 0; } static inline void sync_bdevs(bool wait) { } static inline void bdev_statx(const struct path *path, struct kstat *stat, u32 request_mask) { } static inline void printk_all_partitions(void) { } static inline int early_lookup_bdev(const char *pathname, dev_t *dev) { return -EINVAL; } #endif /* CONFIG_BLOCK */ int bdev_freeze(struct block_device *bdev); int bdev_thaw(struct block_device *bdev); void bdev_fput(struct file *bdev_file); struct io_comp_batch { struct rq_list req_list; bool need_ts; void (*complete)(struct io_comp_batch *); }; static inline bool blk_atomic_write_start_sect_aligned(sector_t sector, struct queue_limits *limits) { unsigned int alignment = max(limits->atomic_write_hw_unit_min, limits->atomic_write_hw_boundary); return IS_ALIGNED(sector, alignment >> SECTOR_SHIFT); } static inline bool bdev_can_atomic_write(struct block_device *bdev) { struct request_queue *bd_queue = bdev->bd_queue; struct queue_limits *limits = &bd_queue->limits; if (!limits->atomic_write_unit_min) return false; if (bdev_is_partition(bdev)) return blk_atomic_write_start_sect_aligned(bdev->bd_start_sect, limits); return true; } static inline unsigned int bdev_atomic_write_unit_min_bytes(struct block_device *bdev) { if (!bdev_can_atomic_write(bdev)) return 0; return queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev)); } static inline unsigned int bdev_atomic_write_unit_max_bytes(struct block_device *bdev) { if (!bdev_can_atomic_write(bdev)) return 0; return queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev)); } static inline int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim, unsigned *segs, unsigned max_bytes) { return bio_split_io_at(bio, lim, segs, max_bytes, lim->dma_alignment); } #define DEFINE_IO_COMP_BATCH(name) struct io_comp_batch name = { } #endif /* _LINUX_BLKDEV_H */ |
| 1513 1522 1494 1480 1482 1493 109 109 140 139 139 140 15 5 5 5 5 5 5 15 15 15 14 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 51 51 50 51 51 1 50 51 51 50 51 51 51 51 51 51 50 51 51 13 13 12 13 51 50 51 4 51 50 51 13 13 13 7 7 13 51 50 4 4 4 4 51 13 139 11 140 133 140 138 138 140 138 5 5 138 140 138 126 108 109 109 11 11 11 109 107 18 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 | // SPDX-License-Identifier: GPL-2.0-only /* * Resizable, Scalable, Concurrent Hash Table * * Copyright (c) 2015 Herbert Xu <herbert@gondor.apana.org.au> * Copyright (c) 2014-2015 Thomas Graf <tgraf@suug.ch> * Copyright (c) 2008-2014 Patrick McHardy <kaber@trash.net> * * Code partially derived from nft_hash * Rewritten with rehash code from br_multicast plus single list * pointer as suggested by Josh Triplett */ #include <linux/atomic.h> #include <linux/kernel.h> #include <linux/init.h> #include <linux/log2.h> #include <linux/sched.h> #include <linux/rculist.h> #include <linux/slab.h> #include <linux/vmalloc.h> #include <linux/mm.h> #include <linux/jhash.h> #include <linux/random.h> #include <linux/rhashtable.h> #include <linux/err.h> #include <linux/export.h> #define HASH_DEFAULT_SIZE 64UL #define HASH_MIN_SIZE 4U union nested_table { union nested_table __rcu *table; struct rhash_lock_head __rcu *bucket; }; static u32 head_hashfn(struct rhashtable *ht, const struct bucket_table *tbl, const struct rhash_head *he) { return rht_head_hashfn(ht, tbl, he, ht->p); } #ifdef CONFIG_PROVE_LOCKING #define ASSERT_RHT_MUTEX(HT) BUG_ON(!lockdep_rht_mutex_is_held(HT)) int lockdep_rht_mutex_is_held(struct rhashtable *ht) { return (debug_locks) ? lockdep_is_held(&ht->mutex) : 1; } EXPORT_SYMBOL_GPL(lockdep_rht_mutex_is_held); int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash) { if (!debug_locks) return 1; if (unlikely(tbl->nest)) return 1; return bit_spin_is_locked(0, (unsigned long *)&tbl->buckets[hash]); } EXPORT_SYMBOL_GPL(lockdep_rht_bucket_is_held); #else #define ASSERT_RHT_MUTEX(HT) #endif static inline union nested_table *nested_table_top( const struct bucket_table *tbl) { /* The top-level bucket entry does not need RCU protection * because it's set at the same time as tbl->nest. */ return (void *)rcu_dereference_protected(tbl->buckets[0], 1); } static void nested_table_free(union nested_table *ntbl, unsigned int size) { const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *)); const unsigned int len = 1 << shift; unsigned int i; ntbl = rcu_dereference_protected(ntbl->table, 1); if (!ntbl) return; if (size > len) { size >>= shift; for (i = 0; i < len; i++) nested_table_free(ntbl + i, size); } kfree(ntbl); } static void nested_bucket_table_free(const struct bucket_table *tbl) { unsigned int size = tbl->size >> tbl->nest; unsigned int len = 1 << tbl->nest; union nested_table *ntbl; unsigned int i; ntbl = nested_table_top(tbl); for (i = 0; i < len; i++) nested_table_free(ntbl + i, size); kfree(ntbl); } static void bucket_table_free(const struct bucket_table *tbl) { if (tbl->nest) nested_bucket_table_free(tbl); kvfree(tbl); } static void bucket_table_free_rcu(struct rcu_head *head) { bucket_table_free(container_of(head, struct bucket_table, rcu)); } static union nested_table *nested_table_alloc(struct rhashtable *ht, union nested_table __rcu **prev, bool leaf) { union nested_table *ntbl; int i; ntbl = rcu_dereference(*prev); if (ntbl) return ntbl; ntbl = alloc_hooks_tag(ht->alloc_tag, kmalloc_noprof(PAGE_SIZE, GFP_ATOMIC|__GFP_ZERO)); if (ntbl && leaf) { for (i = 0; i < PAGE_SIZE / sizeof(ntbl[0]); i++) INIT_RHT_NULLS_HEAD(ntbl[i].bucket); } if (cmpxchg((union nested_table **)prev, NULL, ntbl) == NULL) return ntbl; /* Raced with another thread. */ kfree(ntbl); return rcu_dereference(*prev); } static struct bucket_table *nested_bucket_table_alloc(struct rhashtable *ht, size_t nbuckets, gfp_t gfp) { const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *)); struct bucket_table *tbl; size_t size; if (nbuckets < (1 << (shift + 1))) return NULL; size = sizeof(*tbl) + sizeof(tbl->buckets[0]); tbl = alloc_hooks_tag(ht->alloc_tag, kmalloc_noprof(size, gfp|__GFP_ZERO)); if (!tbl) return NULL; if (!nested_table_alloc(ht, (union nested_table __rcu **)tbl->buckets, false)) { kfree(tbl); return NULL; } tbl->nest = (ilog2(nbuckets) - 1) % shift + 1; return tbl; } static struct bucket_table *bucket_table_alloc(struct rhashtable *ht, size_t nbuckets, gfp_t gfp) { struct bucket_table *tbl = NULL; size_t size; int i; static struct lock_class_key __key; tbl = alloc_hooks_tag(ht->alloc_tag, kvmalloc_node_align_noprof(struct_size(tbl, buckets, nbuckets), 1, gfp|__GFP_ZERO, NUMA_NO_NODE)); size = nbuckets; if (tbl == NULL && !gfpflags_allow_blocking(gfp)) { tbl = nested_bucket_table_alloc(ht, nbuckets, gfp); nbuckets = 0; } if (tbl == NULL) return NULL; lockdep_init_map(&tbl->dep_map, "rhashtable_bucket", &__key, 0); tbl->size = size; rcu_head_init(&tbl->rcu); INIT_LIST_HEAD(&tbl->walkers); tbl->hash_rnd = get_random_u32(); for (i = 0; i < nbuckets; i++) INIT_RHT_NULLS_HEAD(tbl->buckets[i]); return tbl; } static struct bucket_table *rhashtable_last_table(struct rhashtable *ht, struct bucket_table *tbl) { struct bucket_table *new_tbl; do { new_tbl = tbl; tbl = rht_dereference_rcu(tbl->future_tbl, ht); } while (tbl); return new_tbl; } static int rhashtable_rehash_one(struct rhashtable *ht, struct rhash_lock_head __rcu **bkt, unsigned int old_hash) { struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht); struct bucket_table *new_tbl = rhashtable_last_table(ht, old_tbl); int err = -EAGAIN; struct rhash_head *head, *next, *entry; struct rhash_head __rcu **pprev = NULL; unsigned int new_hash; unsigned long flags; if (new_tbl->nest) goto out; err = -ENOENT; rht_for_each_from(entry, rht_ptr(bkt, old_tbl, old_hash), old_tbl, old_hash) { err = 0; next = rht_dereference_bucket(entry->next, old_tbl, old_hash); if (rht_is_a_nulls(next)) break; pprev = &entry->next; } if (err) goto out; new_hash = head_hashfn(ht, new_tbl, entry); flags = rht_lock_nested(new_tbl, &new_tbl->buckets[new_hash], SINGLE_DEPTH_NESTING); head = rht_ptr(new_tbl->buckets + new_hash, new_tbl, new_hash); RCU_INIT_POINTER(entry->next, head); rht_assign_unlock(new_tbl, &new_tbl->buckets[new_hash], entry, flags); if (pprev) rcu_assign_pointer(*pprev, next); else /* Need to preserved the bit lock. */ rht_assign_locked(bkt, next); out: return err; } static int rhashtable_rehash_chain(struct rhashtable *ht, unsigned int old_hash) { struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht); struct rhash_lock_head __rcu **bkt = rht_bucket_var(old_tbl, old_hash); unsigned long flags; int err; if (!bkt) return 0; flags = rht_lock(old_tbl, bkt); while (!(err = rhashtable_rehash_one(ht, bkt, old_hash))) ; if (err == -ENOENT) err = 0; rht_unlock(old_tbl, bkt, flags); return err; } static int rhashtable_rehash_attach(struct rhashtable *ht, struct bucket_table *old_tbl, struct bucket_table *new_tbl) { /* Make insertions go into the new, empty table right away. Deletions * and lookups will be attempted in both tables until we synchronize. * As cmpxchg() provides strong barriers, we do not need * rcu_assign_pointer(). */ if (cmpxchg((struct bucket_table **)&old_tbl->future_tbl, NULL, new_tbl) != NULL) return -EEXIST; return 0; } static int rhashtable_rehash_table(struct rhashtable *ht) { struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht); struct bucket_table *new_tbl; struct rhashtable_walker *walker; unsigned int old_hash; int err; new_tbl = rht_dereference(old_tbl->future_tbl, ht); if (!new_tbl) return 0; for (old_hash = 0; old_hash < old_tbl->size; old_hash++) { err = rhashtable_rehash_chain(ht, old_hash); if (err) return err; cond_resched(); } /* Publish the new table pointer. */ rcu_assign_pointer(ht->tbl, new_tbl); spin_lock(&ht->lock); list_for_each_entry(walker, &old_tbl->walkers, list) walker->tbl = NULL; /* Wait for readers. All new readers will see the new * table, and thus no references to the old table will * remain. * We do this inside the locked region so that * rhashtable_walk_stop() can use rcu_head_after_call_rcu() * to check if it should not re-link the table. */ call_rcu(&old_tbl->rcu, bucket_table_free_rcu); spin_unlock(&ht->lock); return rht_dereference(new_tbl->future_tbl, ht) ? -EAGAIN : 0; } static int rhashtable_rehash_alloc(struct rhashtable *ht, struct bucket_table *old_tbl, unsigned int size) { struct bucket_table *new_tbl; int err; ASSERT_RHT_MUTEX(ht); new_tbl = bucket_table_alloc(ht, size, GFP_KERNEL); if (new_tbl == NULL) return -ENOMEM; err = rhashtable_rehash_attach(ht, old_tbl, new_tbl); if (err) bucket_table_free(new_tbl); return err; } /** * rhashtable_shrink - Shrink hash table while allowing concurrent lookups * @ht: the hash table to shrink * * This function shrinks the hash table to fit, i.e., the smallest * size would not cause it to expand right away automatically. * * The caller must ensure that no concurrent resizing occurs by holding * ht->mutex. * * The caller must ensure that no concurrent table mutations take place. * It is however valid to have concurrent lookups if they are RCU protected. * * It is valid to have concurrent insertions and deletions protected by per * bucket locks or concurrent RCU protected lookups and traversals. */ static int rhashtable_shrink(struct rhashtable *ht) { struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht); unsigned int nelems = atomic_read(&ht->nelems); unsigned int size = 0; if (nelems) size = roundup_pow_of_two(nelems * 3 / 2); if (size < ht->p.min_size) size = ht->p.min_size; if (old_tbl->size <= size) return 0; if (rht_dereference(old_tbl->future_tbl, ht)) return -EEXIST; return rhashtable_rehash_alloc(ht, old_tbl, size); } static void rht_deferred_worker(struct work_struct *work) { struct rhashtable *ht; struct bucket_table *tbl; int err = 0; ht = container_of(work, struct rhashtable, run_work); mutex_lock(&ht->mutex); tbl = rht_dereference(ht->tbl, ht); tbl = rhashtable_last_table(ht, tbl); if (rht_grow_above_75(ht, tbl)) err = rhashtable_rehash_alloc(ht, tbl, tbl->size * 2); else if (ht->p.automatic_shrinking && rht_shrink_below_30(ht, tbl)) err = rhashtable_shrink(ht); else if (tbl->nest) err = rhashtable_rehash_alloc(ht, tbl, tbl->size); if (!err || err == -EEXIST) { int nerr; nerr = rhashtable_rehash_table(ht); err = err ?: nerr; } mutex_unlock(&ht->mutex); if (err) schedule_work(&ht->run_work); } static int rhashtable_insert_rehash(struct rhashtable *ht, struct bucket_table *tbl) { struct bucket_table *old_tbl; struct bucket_table *new_tbl; unsigned int size; int err; old_tbl = rht_dereference_rcu(ht->tbl, ht); size = tbl->size; err = -EBUSY; if (rht_grow_above_75(ht, tbl)) size *= 2; /* Do not schedule more than one rehash */ else if (old_tbl != tbl) goto fail; err = -ENOMEM; new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC | __GFP_NOWARN); if (new_tbl == NULL) goto fail; err = rhashtable_rehash_attach(ht, tbl, new_tbl); if (err) { bucket_table_free(new_tbl); if (err == -EEXIST) err = 0; } else schedule_work(&ht->run_work); return err; fail: /* Do not fail the insert if someone else did a rehash. */ if (likely(rcu_access_pointer(tbl->future_tbl))) return 0; /* Schedule async rehash to retry allocation in process context. */ if (err == -ENOMEM) schedule_work(&ht->run_work); return err; } static void *rhashtable_lookup_one(struct rhashtable *ht, struct rhash_lock_head __rcu **bkt, struct bucket_table *tbl, unsigned int hash, const void *key, struct rhash_head *obj) { struct rhashtable_compare_arg arg = { .ht = ht, .key = key, }; struct rhash_head __rcu **pprev = NULL; struct rhash_head *head; int elasticity; elasticity = RHT_ELASTICITY; rht_for_each_from(head, rht_ptr(bkt, tbl, hash), tbl, hash) { struct rhlist_head *list; struct rhlist_head *plist; elasticity--; if (!key || (ht->p.obj_cmpfn ? ht->p.obj_cmpfn(&arg, rht_obj(ht, head)) : rhashtable_compare(&arg, rht_obj(ht, head)))) { pprev = &head->next; continue; } if (!ht->rhlist) return rht_obj(ht, head); list = container_of(obj, struct rhlist_head, rhead); plist = container_of(head, struct rhlist_head, rhead); RCU_INIT_POINTER(list->next, plist); head = rht_dereference_bucket(head->next, tbl, hash); RCU_INIT_POINTER(list->rhead.next, head); if (pprev) rcu_assign_pointer(*pprev, obj); else /* Need to preserve the bit lock */ rht_assign_locked(bkt, obj); return NULL; } if (elasticity <= 0) return ERR_PTR(-EAGAIN); return ERR_PTR(-ENOENT); } static struct bucket_table *rhashtable_insert_one( struct rhashtable *ht, struct rhash_lock_head __rcu **bkt, struct bucket_table *tbl, unsigned int hash, struct rhash_head *obj, void *data) { struct bucket_table *new_tbl; struct rhash_head *head; if (!IS_ERR_OR_NULL(data)) return ERR_PTR(-EEXIST); if (PTR_ERR(data) != -EAGAIN && PTR_ERR(data) != -ENOENT) return ERR_CAST(data); new_tbl = rht_dereference_rcu(tbl->future_tbl, ht); if (new_tbl) return new_tbl; if (PTR_ERR(data) != -ENOENT) return ERR_CAST(data); if (unlikely(rht_grow_above_max(ht, tbl))) return ERR_PTR(-E2BIG); if (unlikely(rht_grow_above_100(ht, tbl))) return ERR_PTR(-EAGAIN); head = rht_ptr(bkt, tbl, hash); RCU_INIT_POINTER(obj->next, head); if (ht->rhlist) { struct rhlist_head *list; list = container_of(obj, struct rhlist_head, rhead); RCU_INIT_POINTER(list->next, NULL); } /* bkt is always the head of the list, so it holds * the lock, which we need to preserve */ rht_assign_locked(bkt, obj); return NULL; } static void *rhashtable_try_insert(struct rhashtable *ht, const void *key, struct rhash_head *obj) { struct bucket_table *new_tbl; struct bucket_table *tbl; struct rhash_lock_head __rcu **bkt; unsigned long flags; unsigned int hash; void *data; new_tbl = rcu_dereference(ht->tbl); do { tbl = new_tbl; hash = rht_head_hashfn(ht, tbl, obj, ht->p); if (rcu_access_pointer(tbl->future_tbl)) /* Failure is OK */ bkt = rht_bucket_var(tbl, hash); else bkt = rht_bucket_insert(ht, tbl, hash); if (bkt == NULL) { new_tbl = rht_dereference_rcu(tbl->future_tbl, ht); data = ERR_PTR(-EAGAIN); } else { bool inserted; flags = rht_lock(tbl, bkt); data = rhashtable_lookup_one(ht, bkt, tbl, hash, key, obj); new_tbl = rhashtable_insert_one(ht, bkt, tbl, hash, obj, data); inserted = data && !new_tbl; if (inserted) atomic_inc(&ht->nelems); if (PTR_ERR(new_tbl) != -EEXIST) data = ERR_CAST(new_tbl); rht_unlock(tbl, bkt, flags); if (inserted && rht_grow_above_75(ht, tbl)) schedule_work(&ht->run_work); } } while (!IS_ERR_OR_NULL(new_tbl)); if (PTR_ERR(data) == -EAGAIN) data = ERR_PTR(rhashtable_insert_rehash(ht, tbl) ?: -EAGAIN); return data; } void *rhashtable_insert_slow(struct rhashtable *ht, const void *key, struct rhash_head *obj) { void *data; do { rcu_read_lock(); data = rhashtable_try_insert(ht, key, obj); rcu_read_unlock(); } while (PTR_ERR(data) == -EAGAIN); return data; } EXPORT_SYMBOL_GPL(rhashtable_insert_slow); /** * rhashtable_walk_enter - Initialise an iterator * @ht: Table to walk over * @iter: Hash table Iterator * * This function prepares a hash table walk. * * Note that if you restart a walk after rhashtable_walk_stop you * may see the same object twice. Also, you may miss objects if * there are removals in between rhashtable_walk_stop and the next * call to rhashtable_walk_start. * * For a completely stable walk you should construct your own data * structure outside the hash table. * * This function may be called from any process context, including * non-preemptible context, but cannot be called from softirq or * hardirq context. * * You must call rhashtable_walk_exit after this function returns. */ void rhashtable_walk_enter(struct rhashtable *ht, struct rhashtable_iter *iter) { iter->ht = ht; iter->p = NULL; iter->slot = 0; iter->skip = 0; iter->end_of_table = 0; spin_lock(&ht->lock); iter->walker.tbl = rcu_dereference_protected(ht->tbl, lockdep_is_held(&ht->lock)); list_add(&iter->walker.list, &iter->walker.tbl->walkers); spin_unlock(&ht->lock); } EXPORT_SYMBOL_GPL(rhashtable_walk_enter); /** * rhashtable_walk_exit - Free an iterator * @iter: Hash table Iterator * * This function frees resources allocated by rhashtable_walk_enter. */ void rhashtable_walk_exit(struct rhashtable_iter *iter) { spin_lock(&iter->ht->lock); if (iter->walker.tbl) list_del(&iter->walker.list); spin_unlock(&iter->ht->lock); } EXPORT_SYMBOL_GPL(rhashtable_walk_exit); /** * rhashtable_walk_start_check - Start a hash table walk * @iter: Hash table iterator * * Start a hash table walk at the current iterator position. Note that we take * the RCU lock in all cases including when we return an error. So you must * always call rhashtable_walk_stop to clean up. * * Returns zero if successful. * * Returns -EAGAIN if resize event occurred. Note that the iterator * will rewind back to the beginning and you may use it immediately * by calling rhashtable_walk_next. * * rhashtable_walk_start is defined as an inline variant that returns * void. This is preferred in cases where the caller would ignore * resize events and always continue. */ int rhashtable_walk_start_check(struct rhashtable_iter *iter) __acquires(RCU) { struct rhashtable *ht = iter->ht; bool rhlist = ht->rhlist; rcu_read_lock(); spin_lock(&ht->lock); if (iter->walker.tbl) list_del(&iter->walker.list); spin_unlock(&ht->lock); if (iter->end_of_table) return 0; if (!iter->walker.tbl) { iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht); iter->slot = 0; iter->skip = 0; return -EAGAIN; } if (iter->p && !rhlist) { /* * We need to validate that 'p' is still in the table, and * if so, update 'skip' */ struct rhash_head *p; int skip = 0; rht_for_each_rcu(p, iter->walker.tbl, iter->slot) { skip++; if (p == iter->p) { iter->skip = skip; goto found; } } iter->p = NULL; } else if (iter->p && rhlist) { /* Need to validate that 'list' is still in the table, and * if so, update 'skip' and 'p'. */ struct rhash_head *p; struct rhlist_head *list; int skip = 0; rht_for_each_rcu(p, iter->walker.tbl, iter->slot) { for (list = container_of(p, struct rhlist_head, rhead); list; list = rcu_dereference(list->next)) { skip++; if (list == iter->list) { iter->p = p; iter->skip = skip; goto found; } } } iter->p = NULL; } found: return 0; } EXPORT_SYMBOL_GPL(rhashtable_walk_start_check); /** * __rhashtable_walk_find_next - Find the next element in a table (or the first * one in case of a new walk). * * @iter: Hash table iterator * * Returns the found object or NULL when the end of the table is reached. * * Returns -EAGAIN if resize event occurred. */ static void *__rhashtable_walk_find_next(struct rhashtable_iter *iter) { struct bucket_table *tbl = iter->walker.tbl; struct rhlist_head *list = iter->list; struct rhashtable *ht = iter->ht; struct rhash_head *p = iter->p; bool rhlist = ht->rhlist; if (!tbl) return NULL; for (; iter->slot < tbl->size; iter->slot++) { int skip = iter->skip; rht_for_each_rcu(p, tbl, iter->slot) { if (rhlist) { list = container_of(p, struct rhlist_head, rhead); do { if (!skip) goto next; skip--; list = rcu_dereference(list->next); } while (list); continue; } if (!skip) break; skip--; } next: if (!rht_is_a_nulls(p)) { iter->skip++; iter->p = p; iter->list = list; return rht_obj(ht, rhlist ? &list->rhead : p); } iter->skip = 0; } iter->p = NULL; /* Ensure we see any new tables. */ smp_rmb(); iter->walker.tbl = rht_dereference_rcu(tbl->future_tbl, ht); if (iter->walker.tbl) { iter->slot = 0; iter->skip = 0; return ERR_PTR(-EAGAIN); } else { iter->end_of_table = true; } return NULL; } /** * rhashtable_walk_next - Return the next object and advance the iterator * @iter: Hash table iterator * * Note that you must call rhashtable_walk_stop when you are finished * with the walk. * * Returns the next object or NULL when the end of the table is reached. * * Returns -EAGAIN if resize event occurred. Note that the iterator * will rewind back to the beginning and you may continue to use it. */ void *rhashtable_walk_next(struct rhashtable_iter *iter) { struct rhlist_head *list = iter->list; struct rhashtable *ht = iter->ht; struct rhash_head *p = iter->p; bool rhlist = ht->rhlist; if (p) { if (!rhlist || !(list = rcu_dereference(list->next))) { p = rcu_dereference(p->next); list = container_of(p, struct rhlist_head, rhead); } if (!rht_is_a_nulls(p)) { iter->skip++; iter->p = p; iter->list = list; return rht_obj(ht, rhlist ? &list->rhead : p); } /* At the end of this slot, switch to next one and then find * next entry from that point. */ iter->skip = 0; iter->slot++; } return __rhashtable_walk_find_next(iter); } EXPORT_SYMBOL_GPL(rhashtable_walk_next); /** * rhashtable_walk_peek - Return the next object but don't advance the iterator * @iter: Hash table iterator * * Returns the next object or NULL when the end of the table is reached. * * Returns -EAGAIN if resize event occurred. Note that the iterator * will rewind back to the beginning and you may continue to use it. */ void *rhashtable_walk_peek(struct rhashtable_iter *iter) { struct rhlist_head *list = iter->list; struct rhashtable *ht = iter->ht; struct rhash_head *p = iter->p; if (p) return rht_obj(ht, ht->rhlist ? &list->rhead : p); /* No object found in current iter, find next one in the table. */ if (iter->skip) { /* A nonzero skip value points to the next entry in the table * beyond that last one that was found. Decrement skip so * we find the current value. __rhashtable_walk_find_next * will restore the original value of skip assuming that * the table hasn't changed. */ iter->skip--; } return __rhashtable_walk_find_next(iter); } EXPORT_SYMBOL_GPL(rhashtable_walk_peek); /** * rhashtable_walk_stop - Finish a hash table walk * @iter: Hash table iterator * * Finish a hash table walk. Does not reset the iterator to the start of the * hash table. */ void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU) { struct rhashtable *ht; struct bucket_table *tbl = iter->walker.tbl; if (!tbl) goto out; ht = iter->ht; spin_lock(&ht->lock); if (rcu_head_after_call_rcu(&tbl->rcu, bucket_table_free_rcu)) /* This bucket table is being freed, don't re-link it. */ iter->walker.tbl = NULL; else list_add(&iter->walker.list, &tbl->walkers); spin_unlock(&ht->lock); out: rcu_read_unlock(); } EXPORT_SYMBOL_GPL(rhashtable_walk_stop); static size_t rounded_hashtable_size(const struct rhashtable_params *params) { size_t retsize; if (params->nelem_hint) retsize = max(roundup_pow_of_two(params->nelem_hint * 4 / 3), (unsigned long)params->min_size); else retsize = max(HASH_DEFAULT_SIZE, (unsigned long)params->min_size); return retsize; } static u32 rhashtable_jhash2(const void *key, u32 length, u32 seed) { return jhash2(key, length, seed); } /** * rhashtable_init - initialize a new hash table * @ht: hash table to be initialized * @params: configuration parameters * * Initializes a new hash table based on the provided configuration * parameters. A table can be configured either with a variable or * fixed length key: * * Configuration Example 1: Fixed length keys * struct test_obj { * int key; * void * my_member; * struct rhash_head node; * }; * * struct rhashtable_params params = { * .head_offset = offsetof(struct test_obj, node), * .key_offset = offsetof(struct test_obj, key), * .key_len = sizeof(int), * .hashfn = jhash, * }; * * Configuration Example 2: Variable length keys * struct test_obj { * [...] * struct rhash_head node; * }; * * u32 my_hash_fn(const void *data, u32 len, u32 seed) * { * struct test_obj *obj = data; * * return [... hash ...]; * } * * struct rhashtable_params params = { * .head_offset = offsetof(struct test_obj, node), * .hashfn = jhash, * .obj_hashfn = my_hash_fn, * }; */ int rhashtable_init_noprof(struct rhashtable *ht, const struct rhashtable_params *params) { struct bucket_table *tbl; size_t size; if ((!params->key_len && !params->obj_hashfn) || (params->obj_hashfn && !params->obj_cmpfn)) return -EINVAL; memset(ht, 0, sizeof(*ht)); mutex_init(&ht->mutex); spin_lock_init(&ht->lock); memcpy(&ht->p, params, sizeof(*params)); alloc_tag_record(ht->alloc_tag); if (params->min_size) ht->p.min_size = roundup_pow_of_two(params->min_size); /* Cap total entries at 2^31 to avoid nelems overflow. */ ht->max_elems = 1u << 31; if (params->max_size) { ht->p.max_size = rounddown_pow_of_two(params->max_size); if (ht->p.max_size < ht->max_elems / 2) ht->max_elems = ht->p.max_size * 2; } ht->p.min_size = max_t(u16, ht->p.min_size, HASH_MIN_SIZE); size = rounded_hashtable_size(&ht->p); ht->key_len = ht->p.key_len; if (!params->hashfn) { ht->p.hashfn = jhash; if (!(ht->key_len & (sizeof(u32) - 1))) { ht->key_len /= sizeof(u32); ht->p.hashfn = rhashtable_jhash2; } } /* * This is api initialization and thus we need to guarantee the * initial rhashtable allocation. Upon failure, retry with the * smallest possible size with __GFP_NOFAIL semantics. */ tbl = bucket_table_alloc(ht, size, GFP_KERNEL); if (unlikely(tbl == NULL)) { size = max_t(u16, ht->p.min_size, HASH_MIN_SIZE); tbl = bucket_table_alloc(ht, size, GFP_KERNEL | __GFP_NOFAIL); } atomic_set(&ht->nelems, 0); RCU_INIT_POINTER(ht->tbl, tbl); INIT_WORK(&ht->run_work, rht_deferred_worker); return 0; } EXPORT_SYMBOL_GPL(rhashtable_init_noprof); /** * rhltable_init - initialize a new hash list table * @hlt: hash list table to be initialized * @params: configuration parameters * * Initializes a new hash list table. * * See documentation for rhashtable_init. */ int rhltable_init_noprof(struct rhltable *hlt, const struct rhashtable_params *params) { int err; err = rhashtable_init_noprof(&hlt->ht, params); hlt->ht.rhlist = true; return err; } EXPORT_SYMBOL_GPL(rhltable_init_noprof); static void rhashtable_free_one(struct rhashtable *ht, struct rhash_head *obj, void (*free_fn)(void *ptr, void *arg), void *arg) { struct rhlist_head *list; if (!ht->rhlist) { free_fn(rht_obj(ht, obj), arg); return; } list = container_of(obj, struct rhlist_head, rhead); do { obj = &list->rhead; list = rht_dereference(list->next, ht); free_fn(rht_obj(ht, obj), arg); } while (list); } /** * rhashtable_free_and_destroy - free elements and destroy hash table * @ht: the hash table to destroy * @free_fn: callback to release resources of element * @arg: pointer passed to free_fn * * Stops an eventual async resize. If defined, invokes free_fn for each * element to releasal resources. Please note that RCU protected * readers may still be accessing the elements. Releasing of resources * must occur in a compatible manner. Then frees the bucket array. * * This function will eventually sleep to wait for an async resize * to complete. The caller is responsible that no further write operations * occurs in parallel. */ void rhashtable_free_and_destroy(struct rhashtable *ht, void (*free_fn)(void *ptr, void *arg), void *arg) { struct bucket_table *tbl, *next_tbl; unsigned int i; cancel_work_sync(&ht->run_work); mutex_lock(&ht->mutex); tbl = rht_dereference(ht->tbl, ht); restart: if (free_fn) { for (i = 0; i < tbl->size; i++) { struct rhash_head *pos, *next; cond_resched(); for (pos = rht_ptr_exclusive(rht_bucket(tbl, i)), next = !rht_is_a_nulls(pos) ? rht_dereference(pos->next, ht) : NULL; !rht_is_a_nulls(pos); pos = next, next = !rht_is_a_nulls(pos) ? rht_dereference(pos->next, ht) : NULL) rhashtable_free_one(ht, pos, free_fn, arg); } } next_tbl = rht_dereference(tbl->future_tbl, ht); bucket_table_free(tbl); if (next_tbl) { tbl = next_tbl; goto restart; } mutex_unlock(&ht->mutex); } EXPORT_SYMBOL_GPL(rhashtable_free_and_destroy); void rhashtable_destroy(struct rhashtable *ht) { return rhashtable_free_and_destroy(ht, NULL, NULL); } EXPORT_SYMBOL_GPL(rhashtable_destroy); struct rhash_lock_head __rcu **__rht_bucket_nested( const struct bucket_table *tbl, unsigned int hash) { const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *)); unsigned int index = hash & ((1 << tbl->nest) - 1); unsigned int size = tbl->size >> tbl->nest; unsigned int subhash = hash; union nested_table *ntbl; ntbl = nested_table_top(tbl); ntbl = rht_dereference_bucket_rcu(ntbl[index].table, tbl, hash); subhash >>= tbl->nest; while (ntbl && size > (1 << shift)) { index = subhash & ((1 << shift) - 1); ntbl = rht_dereference_bucket_rcu(ntbl[index].table, tbl, hash); size >>= shift; subhash >>= shift; } if (!ntbl) return NULL; return &ntbl[subhash].bucket; } EXPORT_SYMBOL_GPL(__rht_bucket_nested); struct rhash_lock_head __rcu **rht_bucket_nested( const struct bucket_table *tbl, unsigned int hash) { static struct rhash_lock_head __rcu *rhnull; if (!rhnull) INIT_RHT_NULLS_HEAD(rhnull); return __rht_bucket_nested(tbl, hash) ?: &rhnull; } EXPORT_SYMBOL_GPL(rht_bucket_nested); struct rhash_lock_head __rcu **rht_bucket_nested_insert( struct rhashtable *ht, struct bucket_table *tbl, unsigned int hash) { const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *)); unsigned int index = hash & ((1 << tbl->nest) - 1); unsigned int size = tbl->size >> tbl->nest; union nested_table *ntbl; ntbl = nested_table_top(tbl); hash >>= tbl->nest; ntbl = nested_table_alloc(ht, &ntbl[index].table, size <= (1 << shift)); while (ntbl && size > (1 << shift)) { index = hash & ((1 << shift) - 1); size >>= shift; hash >>= shift; ntbl = nested_table_alloc(ht, &ntbl[index].table, size <= (1 << shift)); } if (!ntbl) return NULL; return &ntbl[hash].bucket; } EXPORT_SYMBOL_GPL(rht_bucket_nested_insert); |
| 2 7 1 1 341 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* * Copyright (c) 2014 Mahesh Bandewar <maheshb@google.com> */ #ifndef __IPVLAN_H #define __IPVLAN_H #include <linux/kernel.h> #include <linux/types.h> #include <linux/module.h> #include <linux/init.h> #include <linux/rculist.h> #include <linux/notifier.h> #include <linux/netdevice.h> #include <linux/etherdevice.h> #include <linux/if_arp.h> #include <linux/if_link.h> #include <linux/if_vlan.h> #include <linux/ip.h> #include <linux/inetdevice.h> #include <linux/netfilter.h> #include <net/ip.h> #include <net/ip6_route.h> #include <net/netns/generic.h> #include <net/rtnetlink.h> #include <net/route.h> #include <net/addrconf.h> #include <net/l3mdev.h> #define IPVLAN_DRV "ipvlan" #define IPV_DRV_VER "0.1" #define IPVLAN_HASH_SIZE (1 << BITS_PER_BYTE) #define IPVLAN_HASH_MASK (IPVLAN_HASH_SIZE - 1) #define IPVLAN_MAC_FILTER_BITS 8 #define IPVLAN_MAC_FILTER_SIZE (1 << IPVLAN_MAC_FILTER_BITS) #define IPVLAN_MAC_FILTER_MASK (IPVLAN_MAC_FILTER_SIZE - 1) #define IPVLAN_QBACKLOG_LIMIT 1000 typedef enum { IPVL_IPV6 = 0, IPVL_ICMPV6, IPVL_IPV4, IPVL_ARP, } ipvl_hdr_type; struct ipvl_pcpu_stats { u64_stats_t rx_pkts; u64_stats_t rx_bytes; u64_stats_t rx_mcast; u64_stats_t tx_pkts; u64_stats_t tx_bytes; struct u64_stats_sync syncp; u32 rx_errs; u32 tx_drps; }; struct ipvl_port; struct ipvl_dev { struct net_device *dev; struct list_head pnode; struct ipvl_port *port; struct net_device *phy_dev; struct list_head addrs; struct ipvl_pcpu_stats __percpu *pcpu_stats; DECLARE_BITMAP(mac_filters, IPVLAN_MAC_FILTER_SIZE); netdev_features_t sfeatures; u32 msg_enable; }; struct ipvl_addr { struct ipvl_dev *master; /* Back pointer to master */ union { struct in6_addr ip6; /* IPv6 address on logical interface */ struct in_addr ip4; /* IPv4 address on logical interface */ } ipu; #define ip6addr ipu.ip6 #define ip4addr ipu.ip4 struct hlist_node hlnode; /* Hash-table linkage */ struct list_head anode; /* logical-interface linkage */ ipvl_hdr_type atype; struct rcu_head rcu; }; struct ipvl_port { struct net_device *dev; possible_net_t pnet; struct hlist_head hlhead[IPVLAN_HASH_SIZE]; spinlock_t addrs_lock; /* guards hash-table and addrs */ struct list_head ipvlans; u16 mode; u16 flags; u16 dev_id_start; struct work_struct wq; struct sk_buff_head backlog; int count; struct ida ida; netdevice_tracker dev_tracker; }; struct ipvl_skb_cb { bool tx_pkt; }; #define IPVL_SKB_CB(_skb) ((struct ipvl_skb_cb *)&((_skb)->cb[0])) static inline struct ipvl_port *ipvlan_port_get_rcu(const struct net_device *d) { return rcu_dereference(d->rx_handler_data); } static inline struct ipvl_port *ipvlan_port_get_rcu_bh(const struct net_device *d) { return rcu_dereference_bh(d->rx_handler_data); } static inline struct ipvl_port *ipvlan_port_get_rtnl(const struct net_device *d) { return rtnl_dereference(d->rx_handler_data); } static inline bool ipvlan_is_private(const struct ipvl_port *port) { return !!(port->flags & IPVLAN_F_PRIVATE); } static inline void ipvlan_mark_private(struct ipvl_port *port) { port->flags |= IPVLAN_F_PRIVATE; } static inline void ipvlan_clear_private(struct ipvl_port *port) { port->flags &= ~IPVLAN_F_PRIVATE; } static inline bool ipvlan_is_vepa(const struct ipvl_port *port) { return !!(port->flags & IPVLAN_F_VEPA); } static inline void ipvlan_mark_vepa(struct ipvl_port *port) { port->flags |= IPVLAN_F_VEPA; } static inline void ipvlan_clear_vepa(struct ipvl_port *port) { port->flags &= ~IPVLAN_F_VEPA; } void ipvlan_init_secret(void); unsigned int ipvlan_mac_hash(const unsigned char *addr); rx_handler_result_t ipvlan_handle_frame(struct sk_buff **pskb); void ipvlan_process_multicast(struct work_struct *work); int ipvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev); void ipvlan_ht_addr_add(struct ipvl_dev *ipvlan, struct ipvl_addr *addr); struct ipvl_addr *ipvlan_find_addr(const struct ipvl_dev *ipvlan, const void *iaddr, bool is_v6); bool ipvlan_addr_busy(struct ipvl_port *port, void *iaddr, bool is_v6); void ipvlan_ht_addr_del(struct ipvl_addr *addr); struct ipvl_addr *ipvlan_addr_lookup(struct ipvl_port *port, void *lyr3h, int addr_type, bool use_dest); void *ipvlan_get_L3_hdr(struct ipvl_port *port, struct sk_buff *skb, int *type); void ipvlan_count_rx(const struct ipvl_dev *ipvlan, unsigned int len, bool success, bool mcast); int ipvlan_link_new(struct net_device *dev, struct rtnl_newlink_params *params, struct netlink_ext_ack *extack); void ipvlan_link_delete(struct net_device *dev, struct list_head *head); void ipvlan_link_setup(struct net_device *dev); int ipvlan_link_register(struct rtnl_link_ops *ops); #ifdef CONFIG_IPVLAN_L3S int ipvlan_l3s_register(struct ipvl_port *port); void ipvlan_l3s_unregister(struct ipvl_port *port); void ipvlan_migrate_l3s_hook(struct net *oldnet, struct net *newnet); int ipvlan_l3s_init(void); void ipvlan_l3s_cleanup(void); #else static inline int ipvlan_l3s_register(struct ipvl_port *port) { return -ENOTSUPP; } static inline void ipvlan_l3s_unregister(struct ipvl_port *port) { } static inline void ipvlan_migrate_l3s_hook(struct net *oldnet, struct net *newnet) { } static inline int ipvlan_l3s_init(void) { return 0; } static inline void ipvlan_l3s_cleanup(void) { } #endif /* CONFIG_IPVLAN_L3S */ static inline bool netif_is_ipvlan_port(const struct net_device *dev) { return rcu_access_pointer(dev->rx_handler) == ipvlan_handle_frame; } #endif /* __IPVLAN_H */ |
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 | #ifndef INTERNAL_IO_WQ_H #define INTERNAL_IO_WQ_H #include <linux/refcount.h> #include <linux/io_uring_types.h> struct io_wq; enum { IO_WQ_WORK_CANCEL = 1, IO_WQ_WORK_HASHED = 2, IO_WQ_WORK_UNBOUND = 4, IO_WQ_WORK_CONCURRENT = 16, IO_WQ_HASH_SHIFT = 24, /* upper 8 bits are used for hash key */ }; enum io_wq_cancel { IO_WQ_CANCEL_OK, /* cancelled before started */ IO_WQ_CANCEL_RUNNING, /* found, running, and attempted cancelled */ IO_WQ_CANCEL_NOTFOUND, /* work not found */ }; struct io_wq_hash { refcount_t refs; unsigned long map; struct wait_queue_head wait; }; static inline void io_wq_put_hash(struct io_wq_hash *hash) { if (refcount_dec_and_test(&hash->refs)) kfree(hash); } struct io_wq_data { struct io_wq_hash *hash; struct task_struct *task; }; struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data); void io_wq_exit_start(struct io_wq *wq); void io_wq_put_and_exit(struct io_wq *wq); void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work); void io_wq_hash_work(struct io_wq_work *work, void *val); int io_wq_cpu_affinity(struct io_uring_task *tctx, cpumask_var_t mask); int io_wq_max_workers(struct io_wq *wq, int *new_count); bool io_wq_worker_stopped(void); static inline bool __io_wq_is_hashed(unsigned int work_flags) { return work_flags & IO_WQ_WORK_HASHED; } static inline bool io_wq_is_hashed(struct io_wq_work *work) { return __io_wq_is_hashed(atomic_read(&work->flags)); } typedef bool (work_cancel_fn)(struct io_wq_work *, void *); enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel, void *data, bool cancel_all); #if defined(CONFIG_IO_WQ) extern void io_wq_worker_sleeping(struct task_struct *); extern void io_wq_worker_running(struct task_struct *); #else static inline void io_wq_worker_sleeping(struct task_struct *tsk) { } static inline void io_wq_worker_running(struct task_struct *tsk) { } #endif static inline bool io_wq_current_is_worker(void) { return in_task() && (current->flags & PF_IO_WORKER) && current->worker_private; } #endif |
| 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 2 2 2 2 1 2 2 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 | // SPDX-License-Identifier: GPL-2.0-only /* * Line 6 Linux USB driver * * Copyright (C) 2004-2010 Markus Grabner (line6@grabner-graz.at) */ #include <linux/kernel.h> #include <linux/module.h> #include <linux/export.h> #include <linux/slab.h> #include <linux/usb.h> #include <sound/core.h> #include <sound/initval.h> #include <sound/hwdep.h> #include "capture.h" #include "driver.h" #include "midi.h" #include "playback.h" #define DRIVER_AUTHOR "Markus Grabner <line6@grabner-graz.at>" #define DRIVER_DESC "Line 6 USB Driver" /* This is Line 6's MIDI manufacturer ID. */ const unsigned char line6_midi_id[3] = { 0x00, 0x01, 0x0c }; EXPORT_SYMBOL_GPL(line6_midi_id); /* Code to request version of POD, Variax interface (and maybe other devices). */ static const char line6_request_version[] = { 0xf0, 0x7e, 0x7f, 0x06, 0x01, 0xf7 }; /* Class for asynchronous messages. */ struct message { struct usb_line6 *line6; const char *buffer; int size; int done; }; /* Forward declarations. */ static void line6_data_received(struct urb *urb); static int line6_send_raw_message_async_part(struct message *msg, struct urb *urb); /* Start to listen on endpoint. */ static int line6_start_listen(struct usb_line6 *line6) { int err; if (line6->properties->capabilities & LINE6_CAP_CONTROL_MIDI) { usb_fill_int_urb(line6->urb_listen, line6->usbdev, usb_rcvintpipe(line6->usbdev, line6->properties->ep_ctrl_r), line6->buffer_listen, LINE6_BUFSIZE_LISTEN, line6_data_received, line6, line6->interval); } else { usb_fill_bulk_urb(line6->urb_listen, line6->usbdev, usb_rcvbulkpipe(line6->usbdev, line6->properties->ep_ctrl_r), line6->buffer_listen, LINE6_BUFSIZE_LISTEN, line6_data_received, line6); } /* sanity checks of EP before actually submitting */ if (usb_urb_ep_type_check(line6->urb_listen)) { dev_err(line6->ifcdev, "invalid control EP\n"); return -EINVAL; } line6->urb_listen->actual_length = 0; err = usb_submit_urb(line6->urb_listen, GFP_ATOMIC); return err; } /* Stop listening on endpoint. */ static void line6_stop_listen(struct usb_line6 *line6) { usb_kill_urb(line6->urb_listen); } /* Send raw message in pieces of wMaxPacketSize bytes. */ int line6_send_raw_message(struct usb_line6 *line6, const char *buffer, int size) { int i, done = 0; const struct line6_properties *properties = line6->properties; for (i = 0; i < size; i += line6->max_packet_size) { int partial; const char *frag_buf = buffer + i; int frag_size = min(line6->max_packet_size, size - i); int retval; if (properties->capabilities & LINE6_CAP_CONTROL_MIDI) { retval = usb_interrupt_msg(line6->usbdev, usb_sndintpipe(line6->usbdev, properties->ep_ctrl_w), (char *)frag_buf, frag_size, &partial, LINE6_TIMEOUT); } else { retval = usb_bulk_msg(line6->usbdev, usb_sndbulkpipe(line6->usbdev, properties->ep_ctrl_w), (char *)frag_buf, frag_size, &partial, LINE6_TIMEOUT); } if (retval) { dev_err(line6->ifcdev, "usb_bulk_msg failed (%d)\n", retval); break; } done += frag_size; } return done; } EXPORT_SYMBOL_GPL(line6_send_raw_message); /* Notification of completion of asynchronous request transmission. */ static void line6_async_request_sent(struct urb *urb) { struct message *msg = (struct message *)urb->context; if (msg->done >= msg->size) { usb_free_urb(urb); kfree(msg); } else line6_send_raw_message_async_part(msg, urb); } /* Asynchronously send part of a raw message. */ static int line6_send_raw_message_async_part(struct message *msg, struct urb *urb) { int retval; struct usb_line6 *line6 = msg->line6; int done = msg->done; int bytes = min(msg->size - done, line6->max_packet_size); if (line6->properties->capabilities & LINE6_CAP_CONTROL_MIDI) { usb_fill_int_urb(urb, line6->usbdev, usb_sndintpipe(line6->usbdev, line6->properties->ep_ctrl_w), (char *)msg->buffer + done, bytes, line6_async_request_sent, msg, line6->interval); } else { usb_fill_bulk_urb(urb, line6->usbdev, usb_sndbulkpipe(line6->usbdev, line6->properties->ep_ctrl_w), (char *)msg->buffer + done, bytes, line6_async_request_sent, msg); } msg->done += bytes; /* sanity checks of EP before actually submitting */ retval = usb_urb_ep_type_check(urb); if (retval < 0) goto error; retval = usb_submit_urb(urb, GFP_ATOMIC); if (retval < 0) goto error; return 0; error: dev_err(line6->ifcdev, "%s: usb_submit_urb failed (%d)\n", __func__, retval); usb_free_urb(urb); kfree(msg); return retval; } /* Asynchronously send raw message. */ int line6_send_raw_message_async(struct usb_line6 *line6, const char *buffer, int size) { struct message *msg; struct urb *urb; /* create message: */ msg = kzalloc(sizeof(struct message), GFP_ATOMIC); if (msg == NULL) return -ENOMEM; /* create URB: */ urb = usb_alloc_urb(0, GFP_ATOMIC); if (urb == NULL) { kfree(msg); return -ENOMEM; } /* set message data: */ msg->line6 = line6; msg->buffer = buffer; msg->size = size; msg->done = 0; /* start sending: */ return line6_send_raw_message_async_part(msg, urb); } EXPORT_SYMBOL_GPL(line6_send_raw_message_async); /* Send asynchronous device version request. */ int line6_version_request_async(struct usb_line6 *line6) { char *buffer; int retval; buffer = kmemdup(line6_request_version, sizeof(line6_request_version), GFP_ATOMIC); if (buffer == NULL) return -ENOMEM; retval = line6_send_raw_message_async(line6, buffer, sizeof(line6_request_version)); kfree(buffer); return retval; } EXPORT_SYMBOL_GPL(line6_version_request_async); /* Send sysex message in pieces of wMaxPacketSize bytes. */ int line6_send_sysex_message(struct usb_line6 *line6, const char *buffer, int size) { return line6_send_raw_message(line6, buffer, size + SYSEX_EXTRA_SIZE) - SYSEX_EXTRA_SIZE; } EXPORT_SYMBOL_GPL(line6_send_sysex_message); /* Allocate buffer for sysex message and prepare header. @param code sysex message code @param size number of bytes between code and sysex end */ char *line6_alloc_sysex_buffer(struct usb_line6 *line6, int code1, int code2, int size) { char *buffer = kmalloc(size + SYSEX_EXTRA_SIZE, GFP_ATOMIC); if (!buffer) return NULL; buffer[0] = LINE6_SYSEX_BEGIN; memcpy(buffer + 1, line6_midi_id, sizeof(line6_midi_id)); buffer[sizeof(line6_midi_id) + 1] = code1; buffer[sizeof(line6_midi_id) + 2] = code2; buffer[sizeof(line6_midi_id) + 3 + size] = LINE6_SYSEX_END; return buffer; } EXPORT_SYMBOL_GPL(line6_alloc_sysex_buffer); /* Notification of data received from the Line 6 device. */ static void line6_data_received(struct urb *urb) { struct usb_line6 *line6 = (struct usb_line6 *)urb->context; struct midi_buffer *mb = &line6->line6midi->midibuf_in; int done; if (urb->status == -ESHUTDOWN) return; if (line6->properties->capabilities & LINE6_CAP_CONTROL_MIDI) { scoped_guard(spinlock_irqsave, &line6->line6midi->lock) { done = line6_midibuf_write(mb, urb->transfer_buffer, urb->actual_length); if (done < urb->actual_length) { line6_midibuf_ignore(mb, done); dev_dbg(line6->ifcdev, "%d %d buffer overflow - message skipped\n", done, urb->actual_length); } } for (;;) { scoped_guard(spinlock_irqsave, &line6->line6midi->lock) { done = line6_midibuf_read(mb, line6->buffer_message, LINE6_MIDI_MESSAGE_MAXLEN, LINE6_MIDIBUF_READ_RX); } if (done <= 0) break; line6->message_length = done; line6_midi_receive(line6, line6->buffer_message, done); if (line6->process_message) line6->process_message(line6); } } else { line6->buffer_message = urb->transfer_buffer; line6->message_length = urb->actual_length; if (line6->process_message) line6->process_message(line6); line6->buffer_message = NULL; } line6_start_listen(line6); } #define LINE6_READ_WRITE_STATUS_DELAY 2 /* milliseconds */ #define LINE6_READ_WRITE_MAX_RETRIES 50 /* Read data from device. */ int line6_read_data(struct usb_line6 *line6, unsigned address, void *data, unsigned datalen) { struct usb_device *usbdev = line6->usbdev; int ret; u8 len; unsigned count; if (address > 0xffff || datalen > 0xff) return -EINVAL; /* query the serial number: */ ret = usb_control_msg_send(usbdev, 0, 0x67, USB_TYPE_VENDOR | USB_RECIP_DEVICE | USB_DIR_OUT, (datalen << 8) | 0x21, address, NULL, 0, LINE6_TIMEOUT, GFP_KERNEL); if (ret) { dev_err(line6->ifcdev, "read request failed (error %d)\n", ret); goto exit; } /* Wait for data length. We'll get 0xff until length arrives. */ for (count = 0; count < LINE6_READ_WRITE_MAX_RETRIES; count++) { mdelay(LINE6_READ_WRITE_STATUS_DELAY); ret = usb_control_msg_recv(usbdev, 0, 0x67, USB_TYPE_VENDOR | USB_RECIP_DEVICE | USB_DIR_IN, 0x0012, 0x0000, &len, 1, LINE6_TIMEOUT, GFP_KERNEL); if (ret) { dev_err(line6->ifcdev, "receive length failed (error %d)\n", ret); goto exit; } if (len != 0xff) break; } ret = -EIO; if (len == 0xff) { dev_err(line6->ifcdev, "read failed after %d retries\n", count); goto exit; } else if (len != datalen) { /* should be equal or something went wrong */ dev_err(line6->ifcdev, "length mismatch (expected %d, got %d)\n", (int)datalen, len); goto exit; } /* receive the result: */ ret = usb_control_msg_recv(usbdev, 0, 0x67, USB_TYPE_VENDOR | USB_RECIP_DEVICE | USB_DIR_IN, 0x0013, 0x0000, data, datalen, LINE6_TIMEOUT, GFP_KERNEL); if (ret) dev_err(line6->ifcdev, "read failed (error %d)\n", ret); exit: return ret; } EXPORT_SYMBOL_GPL(line6_read_data); /* Write data to device. */ int line6_write_data(struct usb_line6 *line6, unsigned address, void *data, unsigned datalen) { struct usb_device *usbdev = line6->usbdev; int ret; unsigned char *status; int count; if (address > 0xffff || datalen > 0xffff) return -EINVAL; status = kmalloc(1, GFP_KERNEL); if (!status) return -ENOMEM; ret = usb_control_msg_send(usbdev, 0, 0x67, USB_TYPE_VENDOR | USB_RECIP_DEVICE | USB_DIR_OUT, 0x0022, address, data, datalen, LINE6_TIMEOUT, GFP_KERNEL); if (ret) { dev_err(line6->ifcdev, "write request failed (error %d)\n", ret); goto exit; } for (count = 0; count < LINE6_READ_WRITE_MAX_RETRIES; count++) { mdelay(LINE6_READ_WRITE_STATUS_DELAY); ret = usb_control_msg_recv(usbdev, 0, 0x67, USB_TYPE_VENDOR | USB_RECIP_DEVICE | USB_DIR_IN, 0x0012, 0x0000, status, 1, LINE6_TIMEOUT, GFP_KERNEL); if (ret) { dev_err(line6->ifcdev, "receiving status failed (error %d)\n", ret); goto exit; } if (*status != 0xff) break; } if (*status == 0xff) { dev_err(line6->ifcdev, "write failed after %d retries\n", count); ret = -EIO; } else if (*status != 0) { dev_err(line6->ifcdev, "write failed (error %d)\n", ret); ret = -EIO; } exit: kfree(status); return ret; } EXPORT_SYMBOL_GPL(line6_write_data); /* Read Line 6 device serial number. (POD, TonePort, GuitarPort) */ int line6_read_serial_number(struct usb_line6 *line6, u32 *serial_number) { return line6_read_data(line6, 0x80d0, serial_number, sizeof(*serial_number)); } EXPORT_SYMBOL_GPL(line6_read_serial_number); /* Card destructor. */ static void line6_destruct(struct snd_card *card) { struct usb_line6 *line6 = card->private_data; struct usb_device *usbdev = line6->usbdev; /* Free buffer memory first. We cannot depend on the existence of private * data from the (podhd) module, it may be gone already during this call */ kfree(line6->buffer_message); kfree(line6->buffer_listen); /* then free URBs: */ usb_free_urb(line6->urb_listen); line6->urb_listen = NULL; /* decrement reference counters: */ usb_put_dev(usbdev); } static void line6_get_usb_properties(struct usb_line6 *line6) { struct usb_device *usbdev = line6->usbdev; const struct line6_properties *properties = line6->properties; int pipe; struct usb_host_endpoint *ep = NULL; if (properties->capabilities & LINE6_CAP_CONTROL) { if (properties->capabilities & LINE6_CAP_CONTROL_MIDI) { pipe = usb_rcvintpipe(line6->usbdev, line6->properties->ep_ctrl_r); } else { pipe = usb_rcvbulkpipe(line6->usbdev, line6->properties->ep_ctrl_r); } ep = usbdev->ep_in[usb_pipeendpoint(pipe)]; } /* Control data transfer properties */ if (ep) { line6->interval = ep->desc.bInterval; line6->max_packet_size = le16_to_cpu(ep->desc.wMaxPacketSize); } else { if (properties->capabilities & LINE6_CAP_CONTROL) { dev_err(line6->ifcdev, "endpoint not available, using fallback values"); } line6->interval = LINE6_FALLBACK_INTERVAL; line6->max_packet_size = LINE6_FALLBACK_MAXPACKETSIZE; } /* Isochronous transfer properties */ if (usbdev->speed == USB_SPEED_LOW) { line6->intervals_per_second = USB_LOW_INTERVALS_PER_SECOND; line6->iso_buffers = USB_LOW_ISO_BUFFERS; } else { line6->intervals_per_second = USB_HIGH_INTERVALS_PER_SECOND; line6->iso_buffers = USB_HIGH_ISO_BUFFERS; } } /* Enable buffering of incoming messages, flush the buffer */ static int line6_hwdep_open(struct snd_hwdep *hw, struct file *file) { struct usb_line6 *line6 = hw->private_data; /* NOTE: hwdep layer provides atomicity here */ line6->messages.active = 1; line6->messages.nonblock = file->f_flags & O_NONBLOCK ? 1 : 0; return 0; } /* Stop buffering */ static int line6_hwdep_release(struct snd_hwdep *hw, struct file *file) { struct usb_line6 *line6 = hw->private_data; line6->messages.active = 0; return 0; } /* Read from circular buffer, return to user */ static long line6_hwdep_read(struct snd_hwdep *hwdep, char __user *buf, long count, loff_t *offset) { struct usb_line6 *line6 = hwdep->private_data; long rv = 0; unsigned int out_count; if (mutex_lock_interruptible(&line6->messages.read_lock)) return -ERESTARTSYS; while (kfifo_len(&line6->messages.fifo) == 0) { mutex_unlock(&line6->messages.read_lock); if (line6->messages.nonblock) return -EAGAIN; rv = wait_event_interruptible( line6->messages.wait_queue, kfifo_len(&line6->messages.fifo) != 0); if (rv < 0) return rv; if (mutex_lock_interruptible(&line6->messages.read_lock)) return -ERESTARTSYS; } if (kfifo_peek_len(&line6->messages.fifo) > count) { /* Buffer too small; allow re-read of the current item... */ rv = -EINVAL; } else { rv = kfifo_to_user(&line6->messages.fifo, buf, count, &out_count); if (rv == 0) rv = out_count; } mutex_unlock(&line6->messages.read_lock); return rv; } /* Write directly (no buffering) to device by user*/ static long line6_hwdep_write(struct snd_hwdep *hwdep, const char __user *data, long count, loff_t *offset) { struct usb_line6 *line6 = hwdep->private_data; int rv; char *data_copy; if (count > line6->max_packet_size * LINE6_RAW_MESSAGES_MAXCOUNT) { /* This is an arbitrary limit - still better than nothing... */ return -EINVAL; } data_copy = memdup_user(data, count); if (IS_ERR(data_copy)) return PTR_ERR(data_copy); rv = line6_send_raw_message(line6, data_copy, count); kfree(data_copy); return rv; } static __poll_t line6_hwdep_poll(struct snd_hwdep *hwdep, struct file *file, poll_table *wait) { struct usb_line6 *line6 = hwdep->private_data; poll_wait(file, &line6->messages.wait_queue, wait); guard(mutex)(&line6->messages.read_lock); return kfifo_len(&line6->messages.fifo) == 0 ? 0 : EPOLLIN | EPOLLRDNORM; } static const struct snd_hwdep_ops hwdep_ops = { .open = line6_hwdep_open, .release = line6_hwdep_release, .read = line6_hwdep_read, .write = line6_hwdep_write, .poll = line6_hwdep_poll, }; /* Insert into circular buffer */ static void line6_hwdep_push_message(struct usb_line6 *line6) { if (!line6->messages.active) return; if (kfifo_avail(&line6->messages.fifo) >= line6->message_length) { /* No race condition here, there's only one writer */ kfifo_in(&line6->messages.fifo, line6->buffer_message, line6->message_length); } /* else TODO: signal overflow */ wake_up_interruptible(&line6->messages.wait_queue); } static int line6_hwdep_init(struct usb_line6 *line6) { int err; struct snd_hwdep *hwdep; /* TODO: usb_driver_claim_interface(); */ line6->process_message = line6_hwdep_push_message; line6->messages.active = 0; init_waitqueue_head(&line6->messages.wait_queue); mutex_init(&line6->messages.read_lock); INIT_KFIFO(line6->messages.fifo); err = snd_hwdep_new(line6->card, "config", 0, &hwdep); if (err < 0) goto end; strscpy(hwdep->name, "config"); hwdep->iface = SNDRV_HWDEP_IFACE_LINE6; hwdep->ops = hwdep_ops; hwdep->private_data = line6; hwdep->exclusive = true; end: return err; } static int line6_init_cap_control(struct usb_line6 *line6) { int ret; /* initialize USB buffers: */ line6->buffer_listen = kzalloc(LINE6_BUFSIZE_LISTEN, GFP_KERNEL); if (!line6->buffer_listen) return -ENOMEM; line6->urb_listen = usb_alloc_urb(0, GFP_KERNEL); if (!line6->urb_listen) return -ENOMEM; if (line6->properties->capabilities & LINE6_CAP_CONTROL_MIDI) { line6->buffer_message = kzalloc(LINE6_MIDI_MESSAGE_MAXLEN, GFP_KERNEL); if (!line6->buffer_message) return -ENOMEM; ret = line6_init_midi(line6); if (ret < 0) return ret; } else { ret = line6_hwdep_init(line6); if (ret < 0) return ret; } ret = line6_start_listen(line6); if (ret < 0) { dev_err(line6->ifcdev, "cannot start listening: %d\n", ret); return ret; } return 0; } static void line6_startup_work(struct work_struct *work) { struct usb_line6 *line6 = container_of(work, struct usb_line6, startup_work.work); if (line6->startup) line6->startup(line6); } /* Probe USB device. */ int line6_probe(struct usb_interface *interface, const struct usb_device_id *id, const char *driver_name, const struct line6_properties *properties, int (*private_init)(struct usb_line6 *, const struct usb_device_id *id), size_t data_size) { struct usb_device *usbdev = interface_to_usbdev(interface); struct snd_card *card; struct usb_line6 *line6; int interface_number; int ret; if (WARN_ON(data_size < sizeof(*line6))) return -EINVAL; /* we don't handle multiple configurations */ if (usbdev->descriptor.bNumConfigurations != 1) return -ENODEV; ret = snd_card_new(&interface->dev, SNDRV_DEFAULT_IDX1, SNDRV_DEFAULT_STR1, THIS_MODULE, data_size, &card); if (ret < 0) return ret; /* store basic data: */ line6 = card->private_data; line6->card = card; line6->properties = properties; line6->usbdev = usbdev; line6->ifcdev = &interface->dev; INIT_DELAYED_WORK(&line6->startup_work, line6_startup_work); strscpy(card->id, properties->id); strscpy(card->driver, driver_name); strscpy(card->shortname, properties->name); sprintf(card->longname, "Line 6 %s at USB %s", properties->name, dev_name(line6->ifcdev)); card->private_free = line6_destruct; usb_set_intfdata(interface, line6); /* increment reference counters: */ usb_get_dev(usbdev); /* initialize device info: */ dev_info(&interface->dev, "Line 6 %s found\n", properties->name); /* query interface number */ interface_number = interface->cur_altsetting->desc.bInterfaceNumber; /* TODO reserves the bus bandwidth even without actual transfer */ ret = usb_set_interface(usbdev, interface_number, properties->altsetting); if (ret < 0) { dev_err(&interface->dev, "set_interface failed\n"); goto error; } line6_get_usb_properties(line6); if (properties->capabilities & LINE6_CAP_CONTROL) { ret = line6_init_cap_control(line6); if (ret < 0) goto error; } /* initialize device data based on device: */ ret = private_init(line6, id); if (ret < 0) goto error; /* creation of additional special files should go here */ dev_info(&interface->dev, "Line 6 %s now attached\n", properties->name); return 0; error: /* we can call disconnect callback here because no close-sync is * needed yet at this point */ line6_disconnect(interface); return ret; } EXPORT_SYMBOL_GPL(line6_probe); /* Line 6 device disconnected. */ void line6_disconnect(struct usb_interface *interface) { struct usb_line6 *line6 = usb_get_intfdata(interface); struct usb_device *usbdev = interface_to_usbdev(interface); if (!line6) return; if (WARN_ON(usbdev != line6->usbdev)) return; cancel_delayed_work_sync(&line6->startup_work); if (line6->urb_listen != NULL) line6_stop_listen(line6); snd_card_disconnect(line6->card); if (line6->line6pcm) line6_pcm_disconnect(line6->line6pcm); if (line6->disconnect) line6->disconnect(line6); dev_info(&interface->dev, "Line 6 %s now disconnected\n", line6->properties->name); /* make sure the device isn't destructed twice: */ usb_set_intfdata(interface, NULL); snd_card_free_when_closed(line6->card); } EXPORT_SYMBOL_GPL(line6_disconnect); #ifdef CONFIG_PM /* Suspend Line 6 device. */ int line6_suspend(struct usb_interface *interface, pm_message_t message) { struct usb_line6 *line6 = usb_get_intfdata(interface); struct snd_line6_pcm *line6pcm = line6->line6pcm; snd_power_change_state(line6->card, SNDRV_CTL_POWER_D3hot); if (line6->properties->capabilities & LINE6_CAP_CONTROL) line6_stop_listen(line6); if (line6pcm != NULL) line6pcm->flags = 0; return 0; } EXPORT_SYMBOL_GPL(line6_suspend); /* Resume Line 6 device. */ int line6_resume(struct usb_interface *interface) { struct usb_line6 *line6 = usb_get_intfdata(interface); if (line6->properties->capabilities & LINE6_CAP_CONTROL) line6_start_listen(line6); snd_power_change_state(line6->card, SNDRV_CTL_POWER_D0); return 0; } EXPORT_SYMBOL_GPL(line6_resume); #endif /* CONFIG_PM */ MODULE_AUTHOR(DRIVER_AUTHOR); MODULE_DESCRIPTION(DRIVER_DESC); MODULE_LICENSE("GPL"); |
| 241 241 241 241 241 241 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 | // SPDX-License-Identifier: GPL-2.0-only /* * Device tree integration for the pin control subsystem * * Copyright (C) 2012 NVIDIA CORPORATION. All rights reserved. */ #include <linux/device.h> #include <linux/of.h> #include <linux/pinctrl/pinctrl.h> #include <linux/slab.h> #include "core.h" #include "devicetree.h" /** * struct pinctrl_dt_map - mapping table chunk parsed from device tree * @node: list node for struct pinctrl's @dt_maps field * @pctldev: the pin controller that allocated this struct, and will free it * @map: the mapping table entries * @num_maps: number of mapping table entries */ struct pinctrl_dt_map { struct list_head node; struct pinctrl_dev *pctldev; struct pinctrl_map *map; unsigned int num_maps; }; static void dt_free_map(struct pinctrl_dev *pctldev, struct pinctrl_map *map, unsigned int num_maps) { int i; for (i = 0; i < num_maps; ++i) { kfree_const(map[i].dev_name); map[i].dev_name = NULL; } if (pctldev) { const struct pinctrl_ops *ops = pctldev->desc->pctlops; if (ops->dt_free_map) ops->dt_free_map(pctldev, map, num_maps); } else { /* There is no pctldev for PIN_MAP_TYPE_DUMMY_STATE */ kfree(map); } } void pinctrl_dt_free_maps(struct pinctrl *p) { struct pinctrl_dt_map *dt_map, *n1; list_for_each_entry_safe(dt_map, n1, &p->dt_maps, node) { pinctrl_unregister_mappings(dt_map->map); list_del(&dt_map->node); dt_free_map(dt_map->pctldev, dt_map->map, dt_map->num_maps); kfree(dt_map); } of_node_put(p->dev->of_node); } static int dt_remember_or_free_map(struct pinctrl *p, const char *statename, struct pinctrl_dev *pctldev, struct pinctrl_map *map, unsigned int num_maps) { int i; struct pinctrl_dt_map *dt_map; /* Initialize common mapping table entry fields */ for (i = 0; i < num_maps; i++) { const char *devname; devname = kstrdup_const(dev_name(p->dev), GFP_KERNEL); if (!devname) goto err_free_map; map[i].dev_name = devname; map[i].name = statename; if (pctldev) map[i].ctrl_dev_name = dev_name(pctldev->dev); } /* Remember the converted mapping table entries */ dt_map = kzalloc(sizeof(*dt_map), GFP_KERNEL); if (!dt_map) goto err_free_map; dt_map->pctldev = pctldev; dt_map->map = map; dt_map->num_maps = num_maps; list_add_tail(&dt_map->node, &p->dt_maps); return pinctrl_register_mappings(map, num_maps); err_free_map: dt_free_map(pctldev, map, num_maps); return -ENOMEM; } struct pinctrl_dev *of_pinctrl_get(struct device_node *np) { return get_pinctrl_dev_from_of_node(np); } EXPORT_SYMBOL_GPL(of_pinctrl_get); static int dt_to_map_one_config(struct pinctrl *p, struct pinctrl_dev *hog_pctldev, const char *statename, struct device_node *np_config) { struct pinctrl_dev *pctldev = NULL; struct device_node *np_pctldev; const struct pinctrl_ops *ops; int ret; struct pinctrl_map *map; unsigned int num_maps; bool allow_default = false; /* Find the pin controller containing np_config */ np_pctldev = of_node_get(np_config); for (;;) { if (!allow_default) allow_default = of_property_read_bool(np_pctldev, "pinctrl-use-default"); np_pctldev = of_get_next_parent(np_pctldev); if (!np_pctldev || of_node_is_root(np_pctldev)) { of_node_put(np_pctldev); ret = -ENODEV; /* keep deferring if modules are enabled */ if (IS_ENABLED(CONFIG_MODULES) && !allow_default && ret < 0) ret = -EPROBE_DEFER; return ret; } /* If we're creating a hog we can use the passed pctldev */ if (hog_pctldev && (np_pctldev == p->dev->of_node)) { pctldev = hog_pctldev; break; } pctldev = get_pinctrl_dev_from_of_node(np_pctldev); if (pctldev) break; /* * Do not defer probing of hogs (circular loop) * * Return 1 to let the caller catch the case. */ if (np_pctldev == p->dev->of_node) { of_node_put(np_pctldev); return 1; } } of_node_put(np_pctldev); /* * Call pinctrl driver to parse device tree node, and * generate mapping table entries */ ops = pctldev->desc->pctlops; if (!ops->dt_node_to_map) { dev_err(p->dev, "pctldev %s doesn't support DT\n", dev_name(pctldev->dev)); return -ENODEV; } ret = ops->dt_node_to_map(pctldev, np_config, &map, &num_maps); if (ret < 0) return ret; else if (num_maps == 0) { /* * If we have no valid maps (maybe caused by empty pinctrl node * or typing error) ther is no need remember this, so just * return. */ dev_info(p->dev, "there is not valid maps for state %s\n", statename); return 0; } /* Stash the mapping table chunk away for later use */ return dt_remember_or_free_map(p, statename, pctldev, map, num_maps); } static int dt_remember_dummy_state(struct pinctrl *p, const char *statename) { struct pinctrl_map *map; map = kzalloc(sizeof(*map), GFP_KERNEL); if (!map) return -ENOMEM; /* There is no pctldev for PIN_MAP_TYPE_DUMMY_STATE */ map->type = PIN_MAP_TYPE_DUMMY_STATE; return dt_remember_or_free_map(p, statename, NULL, map, 1); } int pinctrl_dt_to_map(struct pinctrl *p, struct pinctrl_dev *pctldev) { struct device_node *np = p->dev->of_node; int state, ret; char *propname; struct property *prop; const char *statename; const __be32 *list; int size, config; phandle phandle; struct device_node *np_config; /* CONFIG_OF enabled, p->dev not instantiated from DT */ if (!np) { if (of_have_populated_dt()) dev_dbg(p->dev, "no of_node; not parsing pinctrl DT\n"); return 0; } /* We may store pointers to property names within the node */ of_node_get(np); /* For each defined state ID */ for (state = 0; ; state++) { /* Retrieve the pinctrl-* property */ propname = kasprintf(GFP_KERNEL, "pinctrl-%d", state); if (!propname) { ret = -ENOMEM; goto err; } prop = of_find_property(np, propname, &size); kfree(propname); if (!prop) { if (state == 0) { ret = -ENODEV; goto err; } break; } list = prop->value; size /= sizeof(*list); /* Determine whether pinctrl-names property names the state */ ret = of_property_read_string_index(np, "pinctrl-names", state, &statename); /* * If not, statename is just the integer state ID. But rather * than dynamically allocate it and have to free it later, * just point part way into the property name for the string. */ if (ret < 0) statename = prop->name + strlen("pinctrl-"); /* For every referenced pin configuration node in it */ for (config = 0; config < size; config++) { phandle = be32_to_cpup(list++); /* Look up the pin configuration node */ np_config = of_find_node_by_phandle(phandle); if (!np_config) { dev_err(p->dev, "prop %s index %i invalid phandle\n", prop->name, config); ret = -EINVAL; goto err; } /* Parse the node */ ret = dt_to_map_one_config(p, pctldev, statename, np_config); of_node_put(np_config); if (ret == 1) continue; if (ret < 0) goto err; } /* No entries in DT? Generate a dummy state table entry */ if (!size) { ret = dt_remember_dummy_state(p, statename); if (ret < 0) goto err; } } return 0; err: pinctrl_dt_free_maps(p); return ret; } /* * For pinctrl binding, typically #pinctrl-cells is for the pin controller * device, so either parent or grandparent. See pinctrl-bindings.txt. */ static int pinctrl_find_cells_size(const struct device_node *np) { const char *cells_name = "#pinctrl-cells"; int cells_size, error; error = of_property_read_u32(np->parent, cells_name, &cells_size); if (error) { error = of_property_read_u32(np->parent->parent, cells_name, &cells_size); if (error) return -ENOENT; } return cells_size; } /** * pinctrl_get_list_and_count - Gets the list and it's cell size and number * @np: pointer to device node with the property * @list_name: property that contains the list * @list: pointer for the list found * @cells_size: pointer for the cell size found * @nr_elements: pointer for the number of elements found * * Typically np is a single pinctrl entry containing the list. */ static int pinctrl_get_list_and_count(const struct device_node *np, const char *list_name, const __be32 **list, int *cells_size, int *nr_elements) { int size; *cells_size = 0; *nr_elements = 0; *list = of_get_property(np, list_name, &size); if (!*list) return -ENOENT; *cells_size = pinctrl_find_cells_size(np); if (*cells_size < 0) return -ENOENT; /* First element is always the index within the pinctrl device */ *nr_elements = (size / sizeof(**list)) / (*cells_size + 1); return 0; } /** * pinctrl_count_index_with_args - Count number of elements in a pinctrl entry * @np: pointer to device node with the property * @list_name: property that contains the list * * Counts the number of elements in a pinctrl array consisting of an index * within the controller and a number of u32 entries specified for each * entry. Note that device_node is always for the parent pin controller device. */ int pinctrl_count_index_with_args(const struct device_node *np, const char *list_name) { const __be32 *list; int size, nr_cells, error; error = pinctrl_get_list_and_count(np, list_name, &list, &nr_cells, &size); if (error) return error; return size; } EXPORT_SYMBOL_GPL(pinctrl_count_index_with_args); /** * pinctrl_copy_args - Populates of_phandle_args based on index * @np: pointer to device node with the property * @list: pointer to a list with the elements * @index: entry within the list of elements * @nr_cells: number of cells in the list * @nr_elem: number of elements for each entry in the list * @out_args: returned values * * Populates the of_phandle_args based on the index in the list. */ static int pinctrl_copy_args(const struct device_node *np, const __be32 *list, int index, int nr_cells, int nr_elem, struct of_phandle_args *out_args) { int i; memset(out_args, 0, sizeof(*out_args)); out_args->np = (struct device_node *)np; out_args->args_count = nr_cells + 1; if (index >= nr_elem) return -EINVAL; list += index * (nr_cells + 1); for (i = 0; i < nr_cells + 1; i++) out_args->args[i] = be32_to_cpup(list++); return 0; } /** * pinctrl_parse_index_with_args - Find a node pointed by index in a list * @np: pointer to device node with the property * @list_name: property that contains the list * @index: index within the list * @out_args: entries in the list pointed by index * * Finds the selected element in a pinctrl array consisting of an index * within the controller and a number of u32 entries specified for each * entry. Note that device_node is always for the parent pin controller device. */ int pinctrl_parse_index_with_args(const struct device_node *np, const char *list_name, int index, struct of_phandle_args *out_args) { const __be32 *list; int nr_elem, nr_cells, error; error = pinctrl_get_list_and_count(np, list_name, &list, &nr_cells, &nr_elem); if (error || !nr_cells) return error; error = pinctrl_copy_args(np, list, index, nr_cells, nr_elem, out_args); if (error) return error; return 0; } EXPORT_SYMBOL_GPL(pinctrl_parse_index_with_args); |
| 9 10 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 1 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 13 6 5 5 5 5 5 13 11 11 2 2 1 1 1 1 1 1 1 1 1 3 3 3 1 2 2 1 2 2 9 9 9 9 9 9 4 2 7 1 3 1 1 7 1 6 1 9 5 5 5 5 5 5 5 1 4 4 3 2 3 2 2 9 9 5 3 9 9 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 | // SPDX-License-Identifier: GPL-2.0-only #include "cgroup-internal.h" #include <linux/ctype.h> #include <linux/kmod.h> #include <linux/sort.h> #include <linux/delay.h> #include <linux/mm.h> #include <linux/sched/signal.h> #include <linux/sched/task.h> #include <linux/magic.h> #include <linux/slab.h> #include <linux/string.h> #include <linux/vmalloc.h> #include <linux/delayacct.h> #include <linux/pid_namespace.h> #include <linux/cgroupstats.h> #include <linux/fs_parser.h> #include <trace/events/cgroup.h> /* * pidlists linger the following amount before being destroyed. The goal * is avoiding frequent destruction in the middle of consecutive read calls * Expiring in the middle is a performance problem not a correctness one. * 1 sec should be enough. */ #define CGROUP_PIDLIST_DESTROY_DELAY HZ /* Controllers blocked by the commandline in v1 */ static u16 cgroup_no_v1_mask; /* disable named v1 mounts */ static bool cgroup_no_v1_named; /* Show unavailable controllers in /proc/cgroups */ static bool proc_show_all; /* * pidlist destructions need to be flushed on cgroup destruction. Use a * separate workqueue as flush domain. */ static struct workqueue_struct *cgroup_pidlist_destroy_wq; /* protects cgroup_subsys->release_agent_path */ static DEFINE_SPINLOCK(release_agent_path_lock); bool cgroup1_ssid_disabled(int ssid) { return cgroup_no_v1_mask & (1 << ssid); } static bool cgroup1_subsys_absent(struct cgroup_subsys *ss) { /* Check also dfl_cftypes for file-less controllers, i.e. perf_event */ return ss->legacy_cftypes == NULL && ss->dfl_cftypes; } /** * cgroup_attach_task_all - attach task 'tsk' to all cgroups of task 'from' * @from: attach to all cgroups of a given task * @tsk: the task to be attached * * Return: %0 on success or a negative errno code on failure */ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) { struct cgroup_root *root; int retval = 0; cgroup_lock(); cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL); for_each_root(root) { struct cgroup *from_cgrp; spin_lock_irq(&css_set_lock); from_cgrp = task_cgroup_from_root(from, root); spin_unlock_irq(&css_set_lock); retval = cgroup_attach_task(from_cgrp, tsk, false); if (retval) break; } cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL); cgroup_unlock(); return retval; } EXPORT_SYMBOL_GPL(cgroup_attach_task_all); /** * cgroup_transfer_tasks - move tasks from one cgroup to another * @to: cgroup to which the tasks will be moved * @from: cgroup in which the tasks currently reside * * Locking rules between cgroup_post_fork() and the migration path * guarantee that, if a task is forking while being migrated, the new child * is guaranteed to be either visible in the source cgroup after the * parent's migration is complete or put into the target cgroup. No task * can slip out of migration through forking. * * Return: %0 on success or a negative errno code on failure */ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) { DEFINE_CGROUP_MGCTX(mgctx); struct cgrp_cset_link *link; struct css_task_iter it; struct task_struct *task; int ret; if (cgroup_on_dfl(to)) return -EINVAL; ret = cgroup_migrate_vet_dst(to); if (ret) return ret; cgroup_lock(); cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL); /* all tasks in @from are being moved, all csets are source */ spin_lock_irq(&css_set_lock); list_for_each_entry(link, &from->cset_links, cset_link) cgroup_migrate_add_src(link->cset, to, &mgctx); spin_unlock_irq(&css_set_lock); ret = cgroup_migrate_prepare_dst(&mgctx); if (ret) goto out_err; /* * Migrate tasks one-by-one until @from is empty. This fails iff * ->can_attach() fails. */ do { css_task_iter_start(&from->self, 0, &it); do { task = css_task_iter_next(&it); } while (task && (task->flags & PF_EXITING)); if (task) get_task_struct(task); css_task_iter_end(&it); if (task) { ret = cgroup_migrate(task, false, &mgctx); if (!ret) TRACE_CGROUP_PATH(transfer_tasks, to, task, false); put_task_struct(task); } } while (task && !ret); out_err: cgroup_migrate_finish(&mgctx); cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL); cgroup_unlock(); return ret; } /* * Stuff for reading the 'tasks'/'procs' files. * * Reading this file can return large amounts of data if a cgroup has * *lots* of attached tasks. So it may need several calls to read(), * but we cannot guarantee that the information we produce is correct * unless we produce it entirely atomically. * */ /* which pidlist file are we talking about? */ enum cgroup_filetype { CGROUP_FILE_PROCS, CGROUP_FILE_TASKS, }; /* * A pidlist is a list of pids that virtually represents the contents of one * of the cgroup files ("procs" or "tasks"). We keep a list of such pidlists, * a pair (one each for procs, tasks) for each pid namespace that's relevant * to the cgroup. */ struct cgroup_pidlist { /* * used to find which pidlist is wanted. doesn't change as long as * this particular list stays in the list. */ struct { enum cgroup_filetype type; struct pid_namespace *ns; } key; /* array of xids */ pid_t *list; /* how many elements the above list has */ int length; /* each of these stored in a list by its cgroup */ struct list_head links; /* pointer to the cgroup we belong to, for list removal purposes */ struct cgroup *owner; /* for delayed destruction */ struct delayed_work destroy_dwork; }; /* * Used to destroy all pidlists lingering waiting for destroy timer. None * should be left afterwards. */ void cgroup1_pidlist_destroy_all(struct cgroup *cgrp) { struct cgroup_pidlist *l, *tmp_l; mutex_lock(&cgrp->pidlist_mutex); list_for_each_entry_safe(l, tmp_l, &cgrp->pidlists, links) mod_delayed_work(cgroup_pidlist_destroy_wq, &l->destroy_dwork, 0); mutex_unlock(&cgrp->pidlist_mutex); flush_workqueue(cgroup_pidlist_destroy_wq); BUG_ON(!list_empty(&cgrp->pidlists)); } static void cgroup_pidlist_destroy_work_fn(struct work_struct *work) { struct delayed_work *dwork = to_delayed_work(work); struct cgroup_pidlist *l = container_of(dwork, struct cgroup_pidlist, destroy_dwork); struct cgroup_pidlist *tofree = NULL; mutex_lock(&l->owner->pidlist_mutex); /* * Destroy iff we didn't get queued again. The state won't change * as destroy_dwork can only be queued while locked. */ if (!delayed_work_pending(dwork)) { list_del(&l->links); kvfree(l->list); put_pid_ns(l->key.ns); tofree = l; } mutex_unlock(&l->owner->pidlist_mutex); kfree(tofree); } /* * pidlist_uniq - given a kmalloc()ed list, strip out all duplicate entries * Returns the number of unique elements. */ static int pidlist_uniq(pid_t *list, int length) { int src, dest = 1; /* * we presume the 0th element is unique, so i starts at 1. trivial * edge cases first; no work needs to be done for either */ if (length == 0 || length == 1) return length; /* src and dest walk down the list; dest counts unique elements */ for (src = 1; src < length; src++) { /* find next unique element */ while (list[src] == list[src-1]) { src++; if (src == length) goto after; } /* dest always points to where the next unique element goes */ list[dest] = list[src]; dest++; } after: return dest; } /* * The two pid files - task and cgroup.procs - guaranteed that the result * is sorted, which forced this whole pidlist fiasco. As pid order is * different per namespace, each namespace needs differently sorted list, * making it impossible to use, for example, single rbtree of member tasks * sorted by task pointer. As pidlists can be fairly large, allocating one * per open file is dangerous, so cgroup had to implement shared pool of * pidlists keyed by cgroup and namespace. */ static int cmppid(const void *a, const void *b) { return *(pid_t *)a - *(pid_t *)b; } static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp, enum cgroup_filetype type) { struct cgroup_pidlist *l; /* don't need task_nsproxy() if we're looking at ourself */ struct pid_namespace *ns = task_active_pid_ns(current); lockdep_assert_held(&cgrp->pidlist_mutex); list_for_each_entry(l, &cgrp->pidlists, links) if (l->key.type == type && l->key.ns == ns) return l; return NULL; } /* * find the appropriate pidlist for our purpose (given procs vs tasks) * returns with the lock on that pidlist already held, and takes care * of the use count, or returns NULL with no locks held if we're out of * memory. */ static struct cgroup_pidlist *cgroup_pidlist_find_create(struct cgroup *cgrp, enum cgroup_filetype type) { struct cgroup_pidlist *l; lockdep_assert_held(&cgrp->pidlist_mutex); l = cgroup_pidlist_find(cgrp, type); if (l) return l; /* entry not found; create a new one */ l = kzalloc(sizeof(struct cgroup_pidlist), GFP_KERNEL); if (!l) return l; INIT_DELAYED_WORK(&l->destroy_dwork, cgroup_pidlist_destroy_work_fn); l->key.type = type; /* don't need task_nsproxy() if we're looking at ourself */ l->key.ns = get_pid_ns(task_active_pid_ns(current)); l->owner = cgrp; list_add(&l->links, &cgrp->pidlists); return l; } /* * Load a cgroup's pidarray with either procs' tgids or tasks' pids */ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type, struct cgroup_pidlist **lp) { pid_t *array; int length; int pid, n = 0; /* used for populating the array */ struct css_task_iter it; struct task_struct *tsk; struct cgroup_pidlist *l; lockdep_assert_held(&cgrp->pidlist_mutex); /* * If cgroup gets more users after we read count, we won't have * enough space - tough. This race is indistinguishable to the * caller from the case that the additional cgroup users didn't * show up until sometime later on. */ length = cgroup_task_count(cgrp); array = kvmalloc_array(length, sizeof(pid_t), GFP_KERNEL); if (!array) return -ENOMEM; /* now, populate the array */ css_task_iter_start(&cgrp->self, 0, &it); while ((tsk = css_task_iter_next(&it))) { if (unlikely(n == length)) break; /* get tgid or pid for procs or tasks file respectively */ if (type == CGROUP_FILE_PROCS) pid = task_tgid_vnr(tsk); else pid = task_pid_vnr(tsk); if (pid > 0) /* make sure to only use valid results */ array[n++] = pid; } css_task_iter_end(&it); length = n; /* now sort & strip out duplicates (tgids or recycled thread PIDs) */ sort(array, length, sizeof(pid_t), cmppid, NULL); length = pidlist_uniq(array, length); l = cgroup_pidlist_find_create(cgrp, type); if (!l) { kvfree(array); return -ENOMEM; } /* store array, freeing old if necessary */ kvfree(l->list); l->list = array; l->length = length; *lp = l; return 0; } /* * seq_file methods for the tasks/procs files. The seq_file position is the * next pid to display; the seq_file iterator is a pointer to the pid * in the cgroup->l->list array. */ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos) { /* * Initially we receive a position value that corresponds to * one more than the last pid shown (or 0 on the first call or * after a seek to the start). Use a binary-search to find the * next pid to display, if any */ struct kernfs_open_file *of = s->private; struct cgroup_file_ctx *ctx = of->priv; struct cgroup *cgrp = seq_css(s)->cgroup; struct cgroup_pidlist *l; enum cgroup_filetype type = seq_cft(s)->private; int index = 0, pid = *pos; int *iter, ret; mutex_lock(&cgrp->pidlist_mutex); /* * !NULL @ctx->procs1.pidlist indicates that this isn't the first * start() after open. If the matching pidlist is around, we can use * that. Look for it. Note that @ctx->procs1.pidlist can't be used * directly. It could already have been destroyed. */ if (ctx->procs1.pidlist) ctx->procs1.pidlist = cgroup_pidlist_find(cgrp, type); /* * Either this is the first start() after open or the matching * pidlist has been destroyed inbetween. Create a new one. */ if (!ctx->procs1.pidlist) { ret = pidlist_array_load(cgrp, type, &ctx->procs1.pidlist); if (ret) return ERR_PTR(ret); } l = ctx->procs1.pidlist; if (pid) { int end = l->length; while (index < end) { int mid = (index + end) / 2; if (l->list[mid] == pid) { index = mid; break; } else if (l->list[mid] < pid) index = mid + 1; else end = mid; } } /* If we're off the end of the array, we're done */ if (index >= l->length) return NULL; /* Update the abstract position to be the actual pid that we found */ iter = l->list + index; *pos = *iter; return iter; } static void cgroup_pidlist_stop(struct seq_file *s, void *v) { struct kernfs_open_file *of = s->private; struct cgroup_file_ctx *ctx = of->priv; struct cgroup_pidlist *l = ctx->procs1.pidlist; if (l) mod_delayed_work(cgroup_pidlist_destroy_wq, &l->destroy_dwork, CGROUP_PIDLIST_DESTROY_DELAY); mutex_unlock(&seq_css(s)->cgroup->pidlist_mutex); } static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos) { struct kernfs_open_file *of = s->private; struct cgroup_file_ctx *ctx = of->priv; struct cgroup_pidlist *l = ctx->procs1.pidlist; pid_t *p = v; pid_t *end = l->list + l->length; /* * Advance to the next pid in the array. If this goes off the * end, we're done */ p++; if (p >= end) { (*pos)++; return NULL; } else { *pos = *p; return p; } } static int cgroup_pidlist_show(struct seq_file *s, void *v) { seq_printf(s, "%d\n", *(int *)v); return 0; } static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off, bool threadgroup) { struct cgroup *cgrp; struct task_struct *task; const struct cred *cred, *tcred; ssize_t ret; enum cgroup_attach_lock_mode lock_mode; cgrp = cgroup_kn_lock_live(of->kn, false); if (!cgrp) return -ENODEV; task = cgroup_procs_write_start(buf, threadgroup, &lock_mode); ret = PTR_ERR_OR_ZERO(task); if (ret) goto out_unlock; /* * Even if we're attaching all tasks in the thread group, we only need * to check permissions on one of them. Check permissions using the * credentials from file open to protect against inherited fd attacks. */ cred = of->file->f_cred; tcred = get_task_cred(task); if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && !uid_eq(cred->euid, tcred->uid) && !uid_eq(cred->euid, tcred->suid)) ret = -EACCES; put_cred(tcred); if (ret) goto out_finish; ret = cgroup_attach_task(cgrp, task, threadgroup); out_finish: cgroup_procs_write_finish(task, lock_mode); out_unlock: cgroup_kn_unlock(of->kn); return ret ?: nbytes; } static ssize_t cgroup1_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { return __cgroup1_procs_write(of, buf, nbytes, off, true); } static ssize_t cgroup1_tasks_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { return __cgroup1_procs_write(of, buf, nbytes, off, false); } static ssize_t cgroup_release_agent_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct cgroup *cgrp; struct cgroup_file_ctx *ctx; BUILD_BUG_ON(sizeof(cgrp->root->release_agent_path) < PATH_MAX); /* * Release agent gets called with all capabilities, * require capabilities to set release agent. */ ctx = of->priv; if ((ctx->ns->user_ns != &init_user_ns) || !file_ns_capable(of->file, &init_user_ns, CAP_SYS_ADMIN)) return -EPERM; cgrp = cgroup_kn_lock_live(of->kn, false); if (!cgrp) return -ENODEV; spin_lock(&release_agent_path_lock); strscpy(cgrp->root->release_agent_path, strstrip(buf), sizeof(cgrp->root->release_agent_path)); spin_unlock(&release_agent_path_lock); cgroup_kn_unlock(of->kn); return nbytes; } static int cgroup_release_agent_show(struct seq_file *seq, void *v) { struct cgroup *cgrp = seq_css(seq)->cgroup; spin_lock(&release_agent_path_lock); seq_puts(seq, cgrp->root->release_agent_path); spin_unlock(&release_agent_path_lock); seq_putc(seq, '\n'); return 0; } static int cgroup_sane_behavior_show(struct seq_file *seq, void *v) { seq_puts(seq, "0\n"); return 0; } static u64 cgroup_read_notify_on_release(struct cgroup_subsys_state *css, struct cftype *cft) { return notify_on_release(css->cgroup); } static int cgroup_write_notify_on_release(struct cgroup_subsys_state *css, struct cftype *cft, u64 val) { if (val) set_bit(CGRP_NOTIFY_ON_RELEASE, &css->cgroup->flags); else clear_bit(CGRP_NOTIFY_ON_RELEASE, &css->cgroup->flags); return 0; } static u64 cgroup_clone_children_read(struct cgroup_subsys_state *css, struct cftype *cft) { return test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags); } static int cgroup_clone_children_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val) { if (val) set_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags); else clear_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags); return 0; } /* cgroup core interface files for the legacy hierarchies */ struct cftype cgroup1_base_files[] = { { .name = "cgroup.procs", .seq_start = cgroup_pidlist_start, .seq_next = cgroup_pidlist_next, .seq_stop = cgroup_pidlist_stop, .seq_show = cgroup_pidlist_show, .private = CGROUP_FILE_PROCS, .write = cgroup1_procs_write, }, { .name = "cgroup.clone_children", .read_u64 = cgroup_clone_children_read, .write_u64 = cgroup_clone_children_write, }, { .name = "cgroup.sane_behavior", .flags = CFTYPE_ONLY_ON_ROOT, .seq_show = cgroup_sane_behavior_show, }, { .name = "tasks", .seq_start = cgroup_pidlist_start, .seq_next = cgroup_pidlist_next, .seq_stop = cgroup_pidlist_stop, .seq_show = cgroup_pidlist_show, .private = CGROUP_FILE_TASKS, .write = cgroup1_tasks_write, }, { .name = "notify_on_release", .read_u64 = cgroup_read_notify_on_release, .write_u64 = cgroup_write_notify_on_release, }, { .name = "release_agent", .flags = CFTYPE_ONLY_ON_ROOT, .seq_show = cgroup_release_agent_show, .write = cgroup_release_agent_write, .max_write_len = PATH_MAX - 1, }, { } /* terminate */ }; /* Display information about each subsystem and each hierarchy */ int proc_cgroupstats_show(struct seq_file *m, void *v) { struct cgroup_subsys *ss; bool cgrp_v1_visible = false; int i; seq_puts(m, "#subsys_name\thierarchy\tnum_cgroups\tenabled\n"); /* * Grab the subsystems state racily. No need to add avenue to * cgroup_mutex contention. */ for_each_subsys(ss, i) { cgrp_v1_visible |= ss->root != &cgrp_dfl_root; if (!proc_show_all && cgroup1_subsys_absent(ss)) continue; seq_printf(m, "%s\t%d\t%d\t%d\n", ss->legacy_name, ss->root->hierarchy_id, atomic_read(&ss->root->nr_cgrps), cgroup_ssid_enabled(i)); } if (cgrp_dfl_visible && !cgrp_v1_visible) pr_info_once("/proc/cgroups lists only v1 controllers, use cgroup.controllers of root cgroup for v2 info\n"); return 0; } /** * cgroupstats_build - build and fill cgroupstats * @stats: cgroupstats to fill information into * @dentry: A dentry entry belonging to the cgroup for which stats have * been requested. * * Build and fill cgroupstats so that taskstats can export it to user * space. * * Return: %0 on success or a negative errno code on failure */ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry) { struct kernfs_node *kn = kernfs_node_from_dentry(dentry); struct cgroup *cgrp; struct css_task_iter it; struct task_struct *tsk; /* it should be kernfs_node belonging to cgroupfs and is a directory */ if (dentry->d_sb->s_type != &cgroup_fs_type || !kn || kernfs_type(kn) != KERNFS_DIR) return -EINVAL; /* * We aren't being called from kernfs and there's no guarantee on * @kn->priv's validity. For this and css_tryget_online_from_dir(), * @kn->priv is RCU safe. Let's do the RCU dancing. */ rcu_read_lock(); cgrp = rcu_dereference(*(void __rcu __force **)&kn->priv); if (!cgrp || !cgroup_tryget(cgrp)) { rcu_read_unlock(); return -ENOENT; } rcu_read_unlock(); css_task_iter_start(&cgrp->self, 0, &it); while ((tsk = css_task_iter_next(&it))) { switch (READ_ONCE(tsk->__state)) { case TASK_RUNNING: stats->nr_running++; break; case TASK_INTERRUPTIBLE: stats->nr_sleeping++; break; case TASK_UNINTERRUPTIBLE: stats->nr_uninterruptible++; break; case TASK_STOPPED: stats->nr_stopped++; break; default: if (tsk->in_iowait) stats->nr_io_wait++; break; } } css_task_iter_end(&it); cgroup_put(cgrp); return 0; } void cgroup1_check_for_release(struct cgroup *cgrp) { if (notify_on_release(cgrp) && !cgroup_is_populated(cgrp) && !css_has_online_children(&cgrp->self) && !cgroup_is_dead(cgrp)) schedule_work(&cgrp->release_agent_work); } /* * Notify userspace when a cgroup is released, by running the * configured release agent with the name of the cgroup (path * relative to the root of cgroup file system) as the argument. * * Most likely, this user command will try to rmdir this cgroup. * * This races with the possibility that some other task will be * attached to this cgroup before it is removed, or that some other * user task will 'mkdir' a child cgroup of this cgroup. That's ok. * The presumed 'rmdir' will fail quietly if this cgroup is no longer * unused, and this cgroup will be reprieved from its death sentence, * to continue to serve a useful existence. Next time it's released, * we will get notified again, if it still has 'notify_on_release' set. * * The final arg to call_usermodehelper() is UMH_WAIT_EXEC, which * means only wait until the task is successfully execve()'d. The * separate release agent task is forked by call_usermodehelper(), * then control in this thread returns here, without waiting for the * release agent task. We don't bother to wait because the caller of * this routine has no use for the exit status of the release agent * task, so no sense holding our caller up for that. */ void cgroup1_release_agent(struct work_struct *work) { struct cgroup *cgrp = container_of(work, struct cgroup, release_agent_work); char *pathbuf, *agentbuf; char *argv[3], *envp[3]; int ret; /* snoop agent path and exit early if empty */ if (!cgrp->root->release_agent_path[0]) return; /* prepare argument buffers */ pathbuf = kmalloc(PATH_MAX, GFP_KERNEL); agentbuf = kmalloc(PATH_MAX, GFP_KERNEL); if (!pathbuf || !agentbuf) goto out_free; spin_lock(&release_agent_path_lock); strscpy(agentbuf, cgrp->root->release_agent_path, PATH_MAX); spin_unlock(&release_agent_path_lock); if (!agentbuf[0]) goto out_free; ret = cgroup_path_ns(cgrp, pathbuf, PATH_MAX, &init_cgroup_ns); if (ret < 0) goto out_free; argv[0] = agentbuf; argv[1] = pathbuf; argv[2] = NULL; /* minimal command environment */ envp[0] = "HOME=/"; envp[1] = "PATH=/sbin:/bin:/usr/sbin:/usr/bin"; envp[2] = NULL; call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC); out_free: kfree(agentbuf); kfree(pathbuf); } /* * cgroup_rename - Only allow simple rename of directories in place. */ static int cgroup1_rename(struct kernfs_node *kn, struct kernfs_node *new_parent, const char *new_name_str) { struct cgroup *cgrp = kn->priv; int ret; /* do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable */ if (strchr(new_name_str, '\n')) return -EINVAL; if (kernfs_type(kn) != KERNFS_DIR) return -ENOTDIR; if (rcu_access_pointer(kn->__parent) != new_parent) return -EIO; /* * We're gonna grab cgroup_mutex which nests outside kernfs * active_ref. kernfs_rename() doesn't require active_ref * protection. Break them before grabbing cgroup_mutex. */ kernfs_break_active_protection(new_parent); kernfs_break_active_protection(kn); cgroup_lock(); ret = kernfs_rename(kn, new_parent, new_name_str); if (!ret) TRACE_CGROUP_PATH(rename, cgrp); cgroup_unlock(); kernfs_unbreak_active_protection(kn); kernfs_unbreak_active_protection(new_parent); return ret; } static int cgroup1_show_options(struct seq_file *seq, struct kernfs_root *kf_root) { struct cgroup_root *root = cgroup_root_from_kf(kf_root); struct cgroup_subsys *ss; int ssid; for_each_subsys(ss, ssid) if (root->subsys_mask & (1 << ssid)) seq_show_option(seq, ss->legacy_name, NULL); if (root->flags & CGRP_ROOT_NOPREFIX) seq_puts(seq, ",noprefix"); if (root->flags & CGRP_ROOT_XATTR) seq_puts(seq, ",xattr"); if (root->flags & CGRP_ROOT_CPUSET_V2_MODE) seq_puts(seq, ",cpuset_v2_mode"); if (root->flags & CGRP_ROOT_FAVOR_DYNMODS) seq_puts(seq, ",favordynmods"); spin_lock(&release_agent_path_lock); if (strlen(root->release_agent_path)) seq_show_option(seq, "release_agent", root->release_agent_path); spin_unlock(&release_agent_path_lock); if (test_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags)) seq_puts(seq, ",clone_children"); if (strlen(root->name)) seq_show_option(seq, "name", root->name); return 0; } enum cgroup1_param { Opt_all, Opt_clone_children, Opt_cpuset_v2_mode, Opt_name, Opt_none, Opt_noprefix, Opt_release_agent, Opt_xattr, Opt_favordynmods, Opt_nofavordynmods, }; const struct fs_parameter_spec cgroup1_fs_parameters[] = { fsparam_flag ("all", Opt_all), fsparam_flag ("clone_children", Opt_clone_children), fsparam_flag ("cpuset_v2_mode", Opt_cpuset_v2_mode), fsparam_string("name", Opt_name), fsparam_flag ("none", Opt_none), fsparam_flag ("noprefix", Opt_noprefix), fsparam_string("release_agent", Opt_release_agent), fsparam_flag ("xattr", Opt_xattr), fsparam_flag ("favordynmods", Opt_favordynmods), fsparam_flag ("nofavordynmods", Opt_nofavordynmods), {} }; int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param) { struct cgroup_fs_context *ctx = cgroup_fc2context(fc); struct cgroup_subsys *ss; struct fs_parse_result result; int opt, i; opt = fs_parse(fc, cgroup1_fs_parameters, param, &result); if (opt == -ENOPARAM) { int ret; ret = vfs_parse_fs_param_source(fc, param); if (ret != -ENOPARAM) return ret; for_each_subsys(ss, i) { if (strcmp(param->key, ss->legacy_name) || cgroup1_subsys_absent(ss)) continue; if (!cgroup_ssid_enabled(i) || cgroup1_ssid_disabled(i)) return invalfc(fc, "Disabled controller '%s'", param->key); ctx->subsys_mask |= (1 << i); return 0; } return invalfc(fc, "Unknown subsys name '%s'", param->key); } if (opt < 0) return opt; switch (opt) { case Opt_none: /* Explicitly have no subsystems */ ctx->none = true; break; case Opt_all: ctx->all_ss = true; break; case Opt_noprefix: ctx->flags |= CGRP_ROOT_NOPREFIX; break; case Opt_clone_children: ctx->cpuset_clone_children = true; break; case Opt_cpuset_v2_mode: ctx->flags |= CGRP_ROOT_CPUSET_V2_MODE; break; case Opt_xattr: ctx->flags |= CGRP_ROOT_XATTR; break; case Opt_favordynmods: ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS; break; case Opt_nofavordynmods: ctx->flags &= ~CGRP_ROOT_FAVOR_DYNMODS; break; case Opt_release_agent: /* Specifying two release agents is forbidden */ if (ctx->release_agent) return invalfc(fc, "release_agent respecified"); /* * Release agent gets called with all capabilities, * require capabilities to set release agent. */ if ((fc->user_ns != &init_user_ns) || !capable(CAP_SYS_ADMIN)) return invalfc(fc, "Setting release_agent not allowed"); ctx->release_agent = param->string; param->string = NULL; break; case Opt_name: /* blocked by boot param? */ if (cgroup_no_v1_named) return -ENOENT; /* Can't specify an empty name */ if (!param->size) return invalfc(fc, "Empty name"); if (param->size > MAX_CGROUP_ROOT_NAMELEN - 1) return invalfc(fc, "Name too long"); /* Must match [\w.-]+ */ for (i = 0; i < param->size; i++) { char c = param->string[i]; if (isalnum(c)) continue; if ((c == '.') || (c == '-') || (c == '_')) continue; return invalfc(fc, "Invalid name"); } /* Specifying two names is forbidden */ if (ctx->name) return invalfc(fc, "name respecified"); ctx->name = param->string; param->string = NULL; break; } return 0; } static int check_cgroupfs_options(struct fs_context *fc) { struct cgroup_fs_context *ctx = cgroup_fc2context(fc); u16 mask = U16_MAX; u16 enabled = 0; struct cgroup_subsys *ss; int i; #ifdef CONFIG_CPUSETS mask = ~((u16)1 << cpuset_cgrp_id); #endif for_each_subsys(ss, i) if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i) && !cgroup1_subsys_absent(ss)) enabled |= 1 << i; ctx->subsys_mask &= enabled; /* * In absence of 'none', 'name=' and subsystem name options, * let's default to 'all'. */ if (!ctx->subsys_mask && !ctx->none && !ctx->name) ctx->all_ss = true; if (ctx->all_ss) { /* Mutually exclusive option 'all' + subsystem name */ if (ctx->subsys_mask) return invalfc(fc, "subsys name conflicts with all"); /* 'all' => select all the subsystems */ ctx->subsys_mask = enabled; } /* * We either have to specify by name or by subsystems. (So all * empty hierarchies must have a name). */ if (!ctx->subsys_mask && !ctx->name) return invalfc(fc, "Need name or subsystem set"); /* * Option noprefix was introduced just for backward compatibility * with the old cpuset, so we allow noprefix only if mounting just * the cpuset subsystem. */ if ((ctx->flags & CGRP_ROOT_NOPREFIX) && (ctx->subsys_mask & mask)) return invalfc(fc, "noprefix used incorrectly"); /* Can't specify "none" and some subsystems */ if (ctx->subsys_mask && ctx->none) return invalfc(fc, "none used incorrectly"); return 0; } int cgroup1_reconfigure(struct fs_context *fc) { struct cgroup_fs_context *ctx = cgroup_fc2context(fc); struct kernfs_root *kf_root = kernfs_root_from_sb(fc->root->d_sb); struct cgroup_root *root = cgroup_root_from_kf(kf_root); int ret = 0; u16 added_mask, removed_mask; cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp); /* See what subsystems are wanted */ ret = check_cgroupfs_options(fc); if (ret) goto out_unlock; if (ctx->subsys_mask != root->subsys_mask || ctx->release_agent) pr_warn("option changes via remount are deprecated (pid=%d comm=%s)\n", task_tgid_nr(current), current->comm); added_mask = ctx->subsys_mask & ~root->subsys_mask; removed_mask = root->subsys_mask & ~ctx->subsys_mask; /* Don't allow flags or name to change at remount */ if ((ctx->flags ^ root->flags) || (ctx->name && strcmp(ctx->name, root->name))) { errorfc(fc, "option or name mismatch, new: 0x%x \"%s\", old: 0x%x \"%s\"", ctx->flags, ctx->name ?: "", root->flags, root->name); ret = -EINVAL; goto out_unlock; } /* remounting is not allowed for populated hierarchies */ if (!list_empty(&root->cgrp.self.children)) { ret = -EBUSY; goto out_unlock; } ret = rebind_subsystems(root, added_mask); if (ret) goto out_unlock; WARN_ON(rebind_subsystems(&cgrp_dfl_root, removed_mask)); if (ctx->release_agent) { spin_lock(&release_agent_path_lock); strscpy(root->release_agent_path, ctx->release_agent); spin_unlock(&release_agent_path_lock); } trace_cgroup_remount(root); out_unlock: cgroup_unlock(); return ret; } struct kernfs_syscall_ops cgroup1_kf_syscall_ops = { .rename = cgroup1_rename, .show_options = cgroup1_show_options, .mkdir = cgroup_mkdir, .rmdir = cgroup_rmdir, .show_path = cgroup_show_path, }; /* * The guts of cgroup1 mount - find or create cgroup_root to use. * Called with cgroup_mutex held; returns 0 on success, -E... on * error and positive - in case when the candidate is busy dying. * On success it stashes a reference to cgroup_root into given * cgroup_fs_context; that reference is *NOT* counting towards the * cgroup_root refcount. */ static int cgroup1_root_to_use(struct fs_context *fc) { struct cgroup_fs_context *ctx = cgroup_fc2context(fc); struct cgroup_root *root; struct cgroup_subsys *ss; int i, ret; /* First find the desired set of subsystems */ ret = check_cgroupfs_options(fc); if (ret) return ret; /* * Destruction of cgroup root is asynchronous, so subsystems may * still be dying after the previous unmount. Let's drain the * dying subsystems. We just need to ensure that the ones * unmounted previously finish dying and don't care about new ones * starting. Testing ref liveliness is good enough. */ for_each_subsys(ss, i) { if (!(ctx->subsys_mask & (1 << i)) || ss->root == &cgrp_dfl_root) continue; if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) return 1; /* restart */ cgroup_put(&ss->root->cgrp); } for_each_root(root) { bool name_match = false; if (root == &cgrp_dfl_root) continue; /* * If we asked for a name then it must match. Also, if * name matches but sybsys_mask doesn't, we should fail. * Remember whether name matched. */ if (ctx->name) { if (strcmp(ctx->name, root->name)) continue; name_match = true; } /* * If we asked for subsystems (or explicitly for no * subsystems) then they must match. */ if ((ctx->subsys_mask || ctx->none) && (ctx->subsys_mask != root->subsys_mask)) { if (!name_match) continue; return -EBUSY; } if (root->flags ^ ctx->flags) pr_warn("new mount options do not match the existing superblock, will be ignored\n"); ctx->root = root; return 0; } /* * No such thing, create a new one. name= matching without subsys * specification is allowed for already existing hierarchies but we * can't create new one without subsys specification. */ if (!ctx->subsys_mask && !ctx->none) return invalfc(fc, "No subsys list or none specified"); /* Hierarchies may only be created in the initial cgroup namespace. */ if (ctx->ns != &init_cgroup_ns) return -EPERM; root = kzalloc(sizeof(*root), GFP_KERNEL); if (!root) return -ENOMEM; ctx->root = root; init_cgroup_root(ctx); ret = cgroup_setup_root(root, ctx->subsys_mask); if (!ret) cgroup_favor_dynmods(root, ctx->flags & CGRP_ROOT_FAVOR_DYNMODS); else cgroup_free_root(root); return ret; } int cgroup1_get_tree(struct fs_context *fc) { struct cgroup_fs_context *ctx = cgroup_fc2context(fc); int ret; /* Check if the caller has permission to mount. */ if (!ns_capable(ctx->ns->user_ns, CAP_SYS_ADMIN)) return -EPERM; cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp); ret = cgroup1_root_to_use(fc); if (!ret && !percpu_ref_tryget_live(&ctx->root->cgrp.self.refcnt)) ret = 1; /* restart */ cgroup_unlock(); if (!ret) ret = cgroup_do_get_tree(fc); if (!ret && percpu_ref_is_dying(&ctx->root->cgrp.self.refcnt)) { fc_drop_locked(fc); ret = 1; } if (unlikely(ret > 0)) { msleep(10); return restart_syscall(); } return ret; } /** * task_get_cgroup1 - Acquires the associated cgroup of a task within a * specific cgroup1 hierarchy. The cgroup1 hierarchy is identified by its * hierarchy ID. * @tsk: The target task * @hierarchy_id: The ID of a cgroup1 hierarchy * * On success, the cgroup is returned. On failure, ERR_PTR is returned. * We limit it to cgroup1 only. */ struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id) { struct cgroup *cgrp = ERR_PTR(-ENOENT); struct cgroup_root *root; unsigned long flags; rcu_read_lock(); for_each_root(root) { /* cgroup1 only*/ if (root == &cgrp_dfl_root) continue; if (root->hierarchy_id != hierarchy_id) continue; spin_lock_irqsave(&css_set_lock, flags); cgrp = task_cgroup_from_root(tsk, root); if (!cgrp || !cgroup_tryget(cgrp)) cgrp = ERR_PTR(-ENOENT); spin_unlock_irqrestore(&css_set_lock, flags); break; } rcu_read_unlock(); return cgrp; } static int __init cgroup1_wq_init(void) { /* * Used to destroy pidlists and separate to serve as flush domain. * Cap @max_active to 1 too. */ cgroup_pidlist_destroy_wq = alloc_workqueue("cgroup_pidlist_destroy", WQ_PERCPU, 1); BUG_ON(!cgroup_pidlist_destroy_wq); return 0; } core_initcall(cgroup1_wq_init); static int __init cgroup_no_v1(char *str) { struct cgroup_subsys *ss; char *token; int i; while ((token = strsep(&str, ",")) != NULL) { if (!*token) continue; if (!strcmp(token, "all")) { cgroup_no_v1_mask = U16_MAX; continue; } if (!strcmp(token, "named")) { cgroup_no_v1_named = true; continue; } for_each_subsys(ss, i) { if (strcmp(token, ss->name) && strcmp(token, ss->legacy_name)) continue; cgroup_no_v1_mask |= 1 << i; break; } } return 1; } __setup("cgroup_no_v1=", cgroup_no_v1); static int __init cgroup_v1_proc(char *str) { return (kstrtobool(str, &proc_show_all) == 0); } __setup("cgroup_v1_proc=", cgroup_v1_proc); |
| 40 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | /* SPDX-License-Identifier: GPL-2.0-only */ #ifndef __ASM_VDSO_PROCESSOR_H #define __ASM_VDSO_PROCESSOR_H #ifndef __ASSEMBLER__ #include <asm/barrier.h> #include <asm/errata_list.h> #include <asm/insn-def.h> static inline void cpu_relax(void) { #ifdef __riscv_muldiv int dummy; /* In lieu of a halt instruction, induce a long-latency stall. */ __asm__ __volatile__ ("div %0, %0, zero" : "=r" (dummy)); #endif /* * Reduce instruction retirement. * This assumes the PC changes. */ ALT_RISCV_PAUSE(); barrier(); } #endif /* __ASSEMBLER__ */ #endif /* __ASM_VDSO_PROCESSOR_H */ |
| 21 25 70 84 21 1 5 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* fs/ internal definitions * * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved. * Written by David Howells (dhowells@redhat.com) */ struct super_block; struct file_system_type; struct iomap; struct iomap_ops; struct linux_binprm; struct path; struct mount; struct shrink_control; struct fs_context; struct pipe_inode_info; struct iov_iter; struct mnt_idmap; struct ns_common; /* * block/bdev.c */ #ifdef CONFIG_BLOCK extern void __init bdev_cache_init(void); #else static inline void bdev_cache_init(void) { } #endif /* CONFIG_BLOCK */ /* * buffer.c */ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len, get_block_t *get_block, const struct iomap *iomap); /* * char_dev.c */ extern void __init chrdev_init(void); /* * fs_context.c */ extern const struct fs_context_operations legacy_fs_context_ops; extern int parse_monolithic_mount_data(struct fs_context *, void *); extern void vfs_clean_context(struct fs_context *fc); extern int finish_clean_context(struct fs_context *fc); /* * namei.c */ extern int filename_lookup(int dfd, struct filename *name, unsigned flags, struct path *path, const struct path *root); int do_rmdir(int dfd, struct filename *name); int do_unlinkat(int dfd, struct filename *name); int may_linkat(struct mnt_idmap *idmap, const struct path *link); int do_renameat2(int olddfd, struct filename *oldname, int newdfd, struct filename *newname, unsigned int flags); int do_mkdirat(int dfd, struct filename *name, umode_t mode); int do_symlinkat(struct filename *from, int newdfd, struct filename *to); int do_linkat(int olddfd, struct filename *old, int newdfd, struct filename *new, int flags); int vfs_tmpfile(struct mnt_idmap *idmap, const struct path *parentpath, struct file *file, umode_t mode); struct dentry *d_hash_and_lookup(struct dentry *, struct qstr *); struct dentry *start_dirop(struct dentry *parent, struct qstr *name, unsigned int lookup_flags); int lookup_noperm_common(struct qstr *qname, struct dentry *base); /* * namespace.c */ extern struct vfsmount *lookup_mnt(const struct path *); extern int finish_automount(struct vfsmount *, const struct path *); extern int sb_prepare_remount_readonly(struct super_block *); extern void __init mnt_init(void); int mnt_get_write_access_file(struct file *file); void mnt_put_write_access_file(struct file *file); extern void dissolve_on_fput(struct vfsmount *); extern bool may_mount(void); int path_mount(const char *dev_name, const struct path *path, const char *type_page, unsigned long flags, void *data_page); int path_umount(const struct path *path, int flags); int show_path(struct seq_file *m, struct dentry *root); /* * fs_struct.c */ extern void chroot_fs_refs(const struct path *, const struct path *); /* * file_table.c */ struct file *alloc_empty_file(int flags, const struct cred *cred); struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred); struct file *alloc_empty_backing_file(int flags, const struct cred *cred); void backing_file_set_user_path(struct file *f, const struct path *path); static inline void file_put_write_access(struct file *file) { put_write_access(file->f_inode); mnt_put_write_access(file->f_path.mnt); if (unlikely(file->f_mode & FMODE_BACKING)) mnt_put_write_access(backing_file_user_path(file)->mnt); } static inline void put_file_access(struct file *file) { if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) { i_readcount_dec(file->f_inode); } else if (file->f_mode & FMODE_WRITER) { file_put_write_access(file); } } void fput_close_sync(struct file *); void fput_close(struct file *); /* * super.c */ extern int reconfigure_super(struct fs_context *); extern bool super_trylock_shared(struct super_block *sb); struct super_block *user_get_super(dev_t, bool excl); void put_super(struct super_block *sb); extern bool mount_capable(struct fs_context *); int sb_init_dio_done_wq(struct super_block *sb); /* * Prepare superblock for changing its read-only state (i.e., either remount * read-write superblock read-only or vice versa). After this function returns * mnt_is_readonly() will return true for any mount of the superblock if its * caller is able to observe any changes done by the remount. This holds until * sb_end_ro_state_change() is called. */ static inline void sb_start_ro_state_change(struct super_block *sb) { WRITE_ONCE(sb->s_readonly_remount, 1); /* * For RO->RW transition, the barrier pairs with the barrier in * mnt_is_readonly() making sure if mnt_is_readonly() sees SB_RDONLY * cleared, it will see s_readonly_remount set. * For RW->RO transition, the barrier pairs with the barrier in * mnt_get_write_access() before the mnt_is_readonly() check. * The barrier makes sure if mnt_get_write_access() sees MNT_WRITE_HOLD * already cleared, it will see s_readonly_remount set. */ smp_wmb(); } /* * Ends section changing read-only state of the superblock. After this function * returns if mnt_is_readonly() returns false, the caller will be able to * observe all the changes remount did to the superblock. */ static inline void sb_end_ro_state_change(struct super_block *sb) { /* * This barrier provides release semantics that pairs with * the smp_rmb() acquire semantics in mnt_is_readonly(). * This barrier pair ensure that when mnt_is_readonly() sees * 0 for sb->s_readonly_remount, it will also see all the * preceding flag changes that were made during the RO state * change. */ smp_wmb(); WRITE_ONCE(sb->s_readonly_remount, 0); } /* * open.c */ struct open_flags { int open_flag; umode_t mode; int acc_mode; int intent; int lookup_flags; }; extern struct file *do_filp_open(int dfd, struct filename *pathname, const struct open_flags *op); extern struct file *do_file_open_root(const struct path *, const char *, const struct open_flags *); extern struct open_how build_open_how(int flags, umode_t mode); extern int build_open_flags(const struct open_how *how, struct open_flags *op); struct file *file_close_fd_locked(struct files_struct *files, unsigned fd); int do_ftruncate(struct file *file, loff_t length, int small); int do_sys_ftruncate(unsigned int fd, loff_t length, int small); int chmod_common(const struct path *path, umode_t mode); int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group, int flag); int chown_common(const struct path *path, uid_t user, gid_t group); extern int vfs_open(const struct path *, struct file *); /* * inode.c */ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc); int dentry_needs_remove_privs(struct mnt_idmap *, struct dentry *dentry); bool in_group_or_capable(struct mnt_idmap *idmap, const struct inode *inode, vfsgid_t vfsgid); /* * fs-writeback.c */ extern long get_nr_dirty_inodes(void); /* * dcache.c */ extern int d_set_mounted(struct dentry *dentry); extern long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc); extern struct dentry *d_alloc_cursor(struct dentry *); extern struct dentry * d_alloc_pseudo(struct super_block *, const struct qstr *); extern char *simple_dname(struct dentry *, char *, int); extern void dput_to_list(struct dentry *, struct list_head *); extern void shrink_dentry_list(struct list_head *); extern void shrink_dcache_for_umount(struct super_block *); extern struct dentry *__d_lookup(const struct dentry *, const struct qstr *); extern struct dentry *__d_lookup_rcu(const struct dentry *parent, const struct qstr *name, unsigned *seq); /* * pipe.c */ extern const struct file_operations pipefifo_fops; /* * fs_pin.c */ extern void group_pin_kill(struct hlist_head *p); extern void mnt_pin_kill(struct mount *m); /* * fs/nsfs.c */ extern const struct dentry_operations ns_dentry_operations; int open_namespace(struct ns_common *ns); /* * fs/stat.c: */ int do_statx(int dfd, struct filename *filename, unsigned int flags, unsigned int mask, struct statx __user *buffer); int do_statx_fd(int fd, unsigned int flags, unsigned int mask, struct statx __user *buffer); /* * fs/splice.c: */ ssize_t splice_file_to_pipe(struct file *in, struct pipe_inode_info *opipe, loff_t *offset, size_t len, unsigned int flags); /* * fs/xattr.c: */ struct xattr_name { char name[XATTR_NAME_MAX + 1]; }; struct kernel_xattr_ctx { /* Value of attribute */ union { const void __user *cvalue; void __user *value; }; void *kvalue; size_t size; /* Attribute name */ struct xattr_name *kname; unsigned int flags; }; ssize_t file_getxattr(struct file *file, struct kernel_xattr_ctx *ctx); ssize_t filename_getxattr(int dfd, struct filename *filename, unsigned int lookup_flags, struct kernel_xattr_ctx *ctx); int file_setxattr(struct file *file, struct kernel_xattr_ctx *ctx); int filename_setxattr(int dfd, struct filename *filename, unsigned int lookup_flags, struct kernel_xattr_ctx *ctx); int setxattr_copy(const char __user *name, struct kernel_xattr_ctx *ctx); int import_xattr_name(struct xattr_name *kname, const char __user *name); int may_write_xattr(struct mnt_idmap *idmap, struct inode *inode); #ifdef CONFIG_FS_POSIX_ACL int do_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, const char *acl_name, const void *kvalue, size_t size); ssize_t do_get_acl(struct mnt_idmap *idmap, struct dentry *dentry, const char *acl_name, void *kvalue, size_t size); #else static inline int do_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, const char *acl_name, const void *kvalue, size_t size) { return -EOPNOTSUPP; } static inline ssize_t do_get_acl(struct mnt_idmap *idmap, struct dentry *dentry, const char *acl_name, void *kvalue, size_t size) { return -EOPNOTSUPP; } #endif ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *pos); /* * fs/attr.c */ struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns); struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap); void mnt_idmap_put(struct mnt_idmap *idmap); struct stashed_operations { struct dentry *(*stash_dentry)(struct dentry **stashed, struct dentry *dentry); void (*put_data)(void *data); int (*init_inode)(struct inode *inode, void *data); }; int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data, struct path *path); void stashed_dentry_prune(struct dentry *dentry); struct dentry *stash_dentry(struct dentry **stashed, struct dentry *dentry); struct dentry *stashed_dentry_get(struct dentry **stashed); /** * path_mounted - check whether path is mounted * @path: path to check * * Determine whether @path refers to the root of a mount. * * Return: true if @path is the root of a mount, false if not. */ static inline bool path_mounted(const struct path *path) { return path->mnt->mnt_root == path->dentry; } void file_f_owner_release(struct file *file); bool file_seek_cur_needs_f_lock(struct file *file); int statmount_mnt_idmap(struct mnt_idmap *idmap, struct seq_file *seq, bool uid_map); struct dentry *find_next_child(struct dentry *parent, struct dentry *prev); int anon_inode_getattr(struct mnt_idmap *idmap, const struct path *path, struct kstat *stat, u32 request_mask, unsigned int query_flags); int anon_inode_setattr(struct mnt_idmap *idmap, struct dentry *dentry, struct iattr *attr); void pidfs_get_root(struct path *path); void nsfs_get_root(struct path *path); |
| 1 12 12 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 | /* SPDX-License-Identifier: GPL-2.0-only */ #ifndef _INPUT_MT_H #define _INPUT_MT_H /* * Input Multitouch Library * * Copyright (c) 2010 Henrik Rydberg */ #include <linux/input.h> #define TRKID_MAX 0xffff #define INPUT_MT_POINTER 0x0001 /* pointer device, e.g. trackpad */ #define INPUT_MT_DIRECT 0x0002 /* direct device, e.g. touchscreen */ #define INPUT_MT_DROP_UNUSED 0x0004 /* drop contacts not seen in frame */ #define INPUT_MT_TRACK 0x0008 /* use in-kernel tracking */ #define INPUT_MT_SEMI_MT 0x0010 /* semi-mt device, finger count handled manually */ #define INPUT_MT_TOTAL_FORCE 0x0020 /* calculate total force from slots pressure */ /** * struct input_mt_slot - represents the state of an input MT slot * @abs: holds current values of ABS_MT axes for this slot * @frame: last frame at which input_mt_report_slot_state() was called * @key: optional driver designation of this slot */ struct input_mt_slot { int abs[ABS_MT_LAST - ABS_MT_FIRST + 1]; unsigned int frame; unsigned int key; }; /** * struct input_mt - state of tracked contacts * @trkid: stores MT tracking ID for the next contact * @num_slots: number of MT slots the device uses * @slot: MT slot currently being transmitted * @flags: input_mt operation flags * @frame: increases every time input_mt_sync_frame() is called * @red: reduced cost matrix for in-kernel tracking * @slots: array of slots holding current values of tracked contacts */ struct input_mt { int trkid; int num_slots; int slot; unsigned int flags; unsigned int frame; int *red; struct input_mt_slot slots[] __counted_by(num_slots); }; static inline void input_mt_set_value(struct input_mt_slot *slot, unsigned code, int value) { slot->abs[code - ABS_MT_FIRST] = value; } static inline int input_mt_get_value(const struct input_mt_slot *slot, unsigned code) { return slot->abs[code - ABS_MT_FIRST]; } static inline bool input_mt_is_active(const struct input_mt_slot *slot) { return input_mt_get_value(slot, ABS_MT_TRACKING_ID) >= 0; } static inline bool input_mt_is_used(const struct input_mt *mt, const struct input_mt_slot *slot) { return slot->frame == mt->frame; } int input_mt_init_slots(struct input_dev *dev, unsigned int num_slots, unsigned int flags); void input_mt_destroy_slots(struct input_dev *dev); static inline int input_mt_new_trkid(struct input_mt *mt) { return mt->trkid++ & TRKID_MAX; } static inline void input_mt_slot(struct input_dev *dev, int slot) { input_event(dev, EV_ABS, ABS_MT_SLOT, slot); } static inline bool input_is_mt_value(int axis) { return axis >= ABS_MT_FIRST && axis <= ABS_MT_LAST; } static inline bool input_is_mt_axis(int axis) { return axis == ABS_MT_SLOT || input_is_mt_value(axis); } bool input_mt_report_slot_state(struct input_dev *dev, unsigned int tool_type, bool active); static inline void input_mt_report_slot_inactive(struct input_dev *dev) { input_mt_report_slot_state(dev, 0, false); } void input_mt_report_finger_count(struct input_dev *dev, int count); void input_mt_report_pointer_emulation(struct input_dev *dev, bool use_count); void input_mt_drop_unused(struct input_dev *dev); void input_mt_sync_frame(struct input_dev *dev); /** * struct input_mt_pos - contact position * @x: horizontal coordinate * @y: vertical coordinate */ struct input_mt_pos { s16 x, y; }; int input_mt_assign_slots(struct input_dev *dev, int *slots, const struct input_mt_pos *pos, int num_pos, int dmax); int input_mt_get_slot_by_key(struct input_dev *dev, int key); #endif |
| 3 2 2 2 1 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _NET_IP6_TUNNEL_H #define _NET_IP6_TUNNEL_H #include <linux/ipv6.h> #include <linux/netdevice.h> #include <linux/if_tunnel.h> #include <linux/ip6_tunnel.h> #include <net/ip_tunnels.h> #include <net/dst_cache.h> #define IP6TUNNEL_ERR_TIMEO (30*HZ) /* capable of sending packets */ #define IP6_TNL_F_CAP_XMIT 0x10000 /* capable of receiving packets */ #define IP6_TNL_F_CAP_RCV 0x20000 /* determine capability on a per-packet basis */ #define IP6_TNL_F_CAP_PER_PACKET 0x40000 struct __ip6_tnl_parm { char name[IFNAMSIZ]; /* name of tunnel device */ int link; /* ifindex of underlying L2 interface */ __u8 proto; /* tunnel protocol */ __u8 encap_limit; /* encapsulation limit for tunnel */ __u8 hop_limit; /* hop limit for tunnel */ bool collect_md; __be32 flowinfo; /* traffic class and flowlabel for tunnel */ __u32 flags; /* tunnel flags */ struct in6_addr laddr; /* local tunnel end-point address */ struct in6_addr raddr; /* remote tunnel end-point address */ IP_TUNNEL_DECLARE_FLAGS(i_flags); IP_TUNNEL_DECLARE_FLAGS(o_flags); __be32 i_key; __be32 o_key; __u32 fwmark; __u32 index; /* ERSPAN type II index */ __u8 erspan_ver; /* ERSPAN version */ __u8 dir; /* direction */ __u16 hwid; /* hwid */ }; /* IPv6 tunnel */ struct ip6_tnl { struct ip6_tnl __rcu *next; /* next tunnel in list */ struct net_device *dev; /* virtual device associated with tunnel */ netdevice_tracker dev_tracker; struct net *net; /* netns for packet i/o */ struct __ip6_tnl_parm parms; /* tunnel configuration parameters */ struct flowi fl; /* flowi template for xmit */ struct dst_cache dst_cache; /* cached dst */ struct gro_cells gro_cells; int err_count; unsigned long err_time; /* These fields used only by GRE */ __u32 i_seqno; /* The last seen seqno */ atomic_t o_seqno; /* The last output seqno */ int hlen; /* tun_hlen + encap_hlen */ int tun_hlen; /* Precalculated header length */ int encap_hlen; /* Encap header length (FOU,GUE) */ struct ip_tunnel_encap encap; int mlink; }; struct ip6_tnl_encap_ops { size_t (*encap_hlen)(struct ip_tunnel_encap *e); int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e, u8 *protocol, struct flowi6 *fl6); int (*err_handler)(struct sk_buff *skb, struct inet6_skb_parm *opt, u8 type, u8 code, int offset, __be32 info); }; #ifdef CONFIG_INET extern const struct ip6_tnl_encap_ops __rcu * ip6tun_encaps[MAX_IPTUN_ENCAP_OPS]; int ip6_tnl_encap_add_ops(const struct ip6_tnl_encap_ops *ops, unsigned int num); int ip6_tnl_encap_del_ops(const struct ip6_tnl_encap_ops *ops, unsigned int num); int ip6_tnl_encap_setup(struct ip6_tnl *t, struct ip_tunnel_encap *ipencap); static inline int ip6_encap_hlen(struct ip_tunnel_encap *e) { const struct ip6_tnl_encap_ops *ops; int hlen = -EINVAL; if (e->type == TUNNEL_ENCAP_NONE) return 0; if (e->type >= MAX_IPTUN_ENCAP_OPS) return -EINVAL; rcu_read_lock(); ops = rcu_dereference(ip6tun_encaps[e->type]); if (likely(ops && ops->encap_hlen)) hlen = ops->encap_hlen(e); rcu_read_unlock(); return hlen; } static inline int ip6_tnl_encap(struct sk_buff *skb, struct ip6_tnl *t, u8 *protocol, struct flowi6 *fl6) { const struct ip6_tnl_encap_ops *ops; int ret = -EINVAL; if (t->encap.type == TUNNEL_ENCAP_NONE) return 0; if (t->encap.type >= MAX_IPTUN_ENCAP_OPS) return -EINVAL; rcu_read_lock(); ops = rcu_dereference(ip6tun_encaps[t->encap.type]); if (likely(ops && ops->build_header)) ret = ops->build_header(skb, &t->encap, protocol, fl6); rcu_read_unlock(); return ret; } /* Tunnel encapsulation limit destination sub-option */ struct ipv6_tlv_tnl_enc_lim { __u8 type; /* type-code for option */ __u8 length; /* option length */ __u8 encap_limit; /* tunnel encapsulation limit */ } __packed; int ip6_tnl_rcv_ctl(struct ip6_tnl *t, const struct in6_addr *laddr, const struct in6_addr *raddr); int ip6_tnl_rcv(struct ip6_tnl *tunnel, struct sk_buff *skb, const struct tnl_ptk_info *tpi, struct metadata_dst *tun_dst, bool log_ecn_error); int ip6_tnl_xmit_ctl(struct ip6_tnl *t, const struct in6_addr *laddr, const struct in6_addr *raddr); int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield, struct flowi6 *fl6, int encap_limit, __u32 *pmtu, __u8 proto); __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw); __u32 ip6_tnl_get_cap(struct ip6_tnl *t, const struct in6_addr *laddr, const struct in6_addr *raddr); struct net *ip6_tnl_get_link_net(const struct net_device *dev); int ip6_tnl_get_iflink(const struct net_device *dev); int ip6_tnl_change_mtu(struct net_device *dev, int new_mtu); static inline void ip6tunnel_xmit(struct sock *sk, struct sk_buff *skb, struct net_device *dev, u16 ip6cb_flags) { int pkt_len, err; memset(skb->cb, 0, sizeof(struct inet6_skb_parm)); IP6CB(skb)->flags = ip6cb_flags; pkt_len = skb->len - skb_inner_network_offset(skb); err = ip6_local_out(skb_dst_dev_net(skb), sk, skb); if (dev) { if (unlikely(net_xmit_eval(err))) pkt_len = -1; iptunnel_xmit_stats(dev, pkt_len); } } #endif #endif |
| 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 | // SPDX-License-Identifier: GPL-2.0-only /* * Ram backed block device driver. * * Copyright (C) 2007 Nick Piggin * Copyright (C) 2007 Novell Inc. * * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright * of their respective owners. */ #include <linux/init.h> #include <linux/initrd.h> #include <linux/module.h> #include <linux/moduleparam.h> #include <linux/major.h> #include <linux/blkdev.h> #include <linux/bio.h> #include <linux/highmem.h> #include <linux/mutex.h> #include <linux/pagemap.h> #include <linux/xarray.h> #include <linux/fs.h> #include <linux/slab.h> #include <linux/backing-dev.h> #include <linux/debugfs.h> #include <linux/uaccess.h> /* * Each block ramdisk device has a xarray brd_pages of pages that stores * the pages containing the block device's contents. */ struct brd_device { int brd_number; struct gendisk *brd_disk; struct list_head brd_list; /* * Backing store of pages. This is the contents of the block device. */ struct xarray brd_pages; u64 brd_nr_pages; }; /* * Look up and return a brd's page with reference grabbed for a given sector. */ static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector) { struct page *page; XA_STATE(xas, &brd->brd_pages, sector >> PAGE_SECTORS_SHIFT); rcu_read_lock(); repeat: page = xas_load(&xas); if (xas_retry(&xas, page)) { xas_reset(&xas); goto repeat; } if (!page) goto out; if (!get_page_unless_zero(page)) { xas_reset(&xas); goto repeat; } if (unlikely(page != xas_reload(&xas))) { put_page(page); xas_reset(&xas); goto repeat; } out: rcu_read_unlock(); return page; } /* * Insert a new page for a given sector, if one does not already exist. * The returned page will grab reference. */ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector, blk_opf_t opf) { gfp_t gfp = (opf & REQ_NOWAIT) ? GFP_NOWAIT : GFP_NOIO; struct page *page, *ret; page = alloc_page(gfp | __GFP_ZERO | __GFP_HIGHMEM); if (!page) return ERR_PTR(-ENOMEM); xa_lock(&brd->brd_pages); ret = __xa_cmpxchg(&brd->brd_pages, sector >> PAGE_SECTORS_SHIFT, NULL, page, gfp); if (!ret) { brd->brd_nr_pages++; get_page(page); xa_unlock(&brd->brd_pages); return page; } if (!xa_is_err(ret)) { get_page(ret); xa_unlock(&brd->brd_pages); put_page(page); return ret; } xa_unlock(&brd->brd_pages); put_page(page); return ERR_PTR(xa_err(ret)); } /* * Free all backing store pages and xarray. This must only be called when * there are no other users of the device. */ static void brd_free_pages(struct brd_device *brd) { struct page *page; pgoff_t idx; xa_for_each(&brd->brd_pages, idx, page) { put_page(page); cond_resched(); } xa_destroy(&brd->brd_pages); } /* * Process a single segment. The segment is capped to not cross page boundaries * in both the bio and the brd backing memory. */ static bool brd_rw_bvec(struct brd_device *brd, struct bio *bio) { struct bio_vec bv = bio_iter_iovec(bio, bio->bi_iter); sector_t sector = bio->bi_iter.bi_sector; u32 offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT; blk_opf_t opf = bio->bi_opf; struct page *page; void *kaddr; bv.bv_len = min_t(u32, bv.bv_len, PAGE_SIZE - offset); page = brd_lookup_page(brd, sector); if (!page && op_is_write(opf)) { page = brd_insert_page(brd, sector, opf); if (IS_ERR(page)) goto out_error; } kaddr = bvec_kmap_local(&bv); if (op_is_write(opf)) { memcpy_to_page(page, offset, kaddr, bv.bv_len); } else { if (page) memcpy_from_page(kaddr, page, offset, bv.bv_len); else memset(kaddr, 0, bv.bv_len); } kunmap_local(kaddr); bio_advance_iter_single(bio, &bio->bi_iter, bv.bv_len); if (page) put_page(page); return true; out_error: if (PTR_ERR(page) == -ENOMEM && (opf & REQ_NOWAIT)) bio_wouldblock_error(bio); else bio_io_error(bio); return false; } static void brd_do_discard(struct brd_device *brd, sector_t sector, u32 size) { sector_t aligned_sector = round_up(sector, PAGE_SECTORS); sector_t aligned_end = round_down( sector + (size >> SECTOR_SHIFT), PAGE_SECTORS); struct page *page; if (aligned_end <= aligned_sector) return; xa_lock(&brd->brd_pages); while (aligned_sector < aligned_end && aligned_sector < rd_size * 2) { page = __xa_erase(&brd->brd_pages, aligned_sector >> PAGE_SECTORS_SHIFT); if (page) { put_page(page); brd->brd_nr_pages--; } aligned_sector += PAGE_SECTORS; } xa_unlock(&brd->brd_pages); } static void brd_submit_bio(struct bio *bio) { struct brd_device *brd = bio->bi_bdev->bd_disk->private_data; if (unlikely(op_is_discard(bio->bi_opf))) { brd_do_discard(brd, bio->bi_iter.bi_sector, bio->bi_iter.bi_size); bio_endio(bio); return; } do { if (!brd_rw_bvec(brd, bio)) return; } while (bio->bi_iter.bi_size); bio_endio(bio); } static const struct block_device_operations brd_fops = { .owner = THIS_MODULE, .submit_bio = brd_submit_bio, }; /* * And now the modules code and kernel interface. */ static int rd_nr = CONFIG_BLK_DEV_RAM_COUNT; module_param(rd_nr, int, 0444); MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices"); unsigned long rd_size = CONFIG_BLK_DEV_RAM_SIZE; module_param(rd_size, ulong, 0444); MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes."); static int max_part = 1; module_param(max_part, int, 0444); MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices"); MODULE_DESCRIPTION("Ram backed block device driver"); MODULE_LICENSE("GPL"); MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR); MODULE_ALIAS("rd"); #ifndef MODULE /* Legacy boot options - nonmodular */ static int __init ramdisk_size(char *str) { rd_size = simple_strtol(str, NULL, 0); return 1; } __setup("ramdisk_size=", ramdisk_size); #endif /* * The device scheme is derived from loop.c. Keep them in synch where possible * (should share code eventually). */ static LIST_HEAD(brd_devices); static DEFINE_MUTEX(brd_devices_mutex); static struct dentry *brd_debugfs_dir; static struct brd_device *brd_find_or_alloc_device(int i) { struct brd_device *brd; mutex_lock(&brd_devices_mutex); list_for_each_entry(brd, &brd_devices, brd_list) { if (brd->brd_number == i) { mutex_unlock(&brd_devices_mutex); return ERR_PTR(-EEXIST); } } brd = kzalloc(sizeof(*brd), GFP_KERNEL); if (!brd) { mutex_unlock(&brd_devices_mutex); return ERR_PTR(-ENOMEM); } brd->brd_number = i; list_add_tail(&brd->brd_list, &brd_devices); mutex_unlock(&brd_devices_mutex); return brd; } static void brd_free_device(struct brd_device *brd) { mutex_lock(&brd_devices_mutex); list_del(&brd->brd_list); mutex_unlock(&brd_devices_mutex); kfree(brd); } static int brd_alloc(int i) { struct brd_device *brd; struct gendisk *disk; char buf[DISK_NAME_LEN]; int err = -ENOMEM; struct queue_limits lim = { /* * This is so fdisk will align partitions on 4k, because of * direct_access API needing 4k alignment, returning a PFN * (This is only a problem on very small devices <= 4M, * otherwise fdisk will align on 1M. Regardless this call * is harmless) */ .physical_block_size = PAGE_SIZE, .max_hw_discard_sectors = UINT_MAX, .max_discard_segments = 1, .discard_granularity = PAGE_SIZE, .features = BLK_FEAT_SYNCHRONOUS | BLK_FEAT_NOWAIT, }; brd = brd_find_or_alloc_device(i); if (IS_ERR(brd)) return PTR_ERR(brd); xa_init(&brd->brd_pages); snprintf(buf, DISK_NAME_LEN, "ram%d", i); if (!IS_ERR_OR_NULL(brd_debugfs_dir)) debugfs_create_u64(buf, 0444, brd_debugfs_dir, &brd->brd_nr_pages); disk = brd->brd_disk = blk_alloc_disk(&lim, NUMA_NO_NODE); if (IS_ERR(disk)) { err = PTR_ERR(disk); goto out_free_dev; } disk->major = RAMDISK_MAJOR; disk->first_minor = i * max_part; disk->minors = max_part; disk->fops = &brd_fops; disk->private_data = brd; strscpy(disk->disk_name, buf, DISK_NAME_LEN); set_capacity(disk, rd_size * 2); err = add_disk(disk); if (err) goto out_cleanup_disk; return 0; out_cleanup_disk: put_disk(disk); out_free_dev: brd_free_device(brd); return err; } static void brd_probe(dev_t dev) { brd_alloc(MINOR(dev) / max_part); } static void brd_cleanup(void) { struct brd_device *brd, *next; debugfs_remove_recursive(brd_debugfs_dir); list_for_each_entry_safe(brd, next, &brd_devices, brd_list) { del_gendisk(brd->brd_disk); put_disk(brd->brd_disk); brd_free_pages(brd); brd_free_device(brd); } } static inline void brd_check_and_reset_par(void) { if (unlikely(!max_part)) max_part = 1; /* * make sure 'max_part' can be divided exactly by (1U << MINORBITS), * otherwise, it is possiable to get same dev_t when adding partitions. */ if ((1U << MINORBITS) % max_part != 0) max_part = 1UL << fls(max_part); if (max_part > DISK_MAX_PARTS) { pr_info("brd: max_part can't be larger than %d, reset max_part = %d.\n", DISK_MAX_PARTS, DISK_MAX_PARTS); max_part = DISK_MAX_PARTS; } } static int __init brd_init(void) { int err, i; /* * brd module now has a feature to instantiate underlying device * structure on-demand, provided that there is an access dev node. * * (1) if rd_nr is specified, create that many upfront. else * it defaults to CONFIG_BLK_DEV_RAM_COUNT * (2) User can further extend brd devices by create dev node themselves * and have kernel automatically instantiate actual device * on-demand. Example: * mknod /path/devnod_name b 1 X # 1 is the rd major * fdisk -l /path/devnod_name * If (X / max_part) was not already created it will be created * dynamically. */ brd_check_and_reset_par(); brd_debugfs_dir = debugfs_create_dir("ramdisk_pages", NULL); if (__register_blkdev(RAMDISK_MAJOR, "ramdisk", brd_probe)) { err = -EIO; goto out_free; } for (i = 0; i < rd_nr; i++) brd_alloc(i); pr_info("brd: module loaded\n"); return 0; out_free: brd_cleanup(); pr_info("brd: module NOT loaded !!!\n"); return err; } static void __exit brd_exit(void) { unregister_blkdev(RAMDISK_MAJOR, "ramdisk"); brd_cleanup(); pr_info("brd: module unloaded\n"); } module_init(brd_init); module_exit(brd_exit); |
| 13 13 105 32 15 1 4 15 4 15 4 4 11 15 1 1 1 1 119 3 1 3 3 3 1 5 75 75 74 75 14 75 75 31 6 31 30 18 18 1 18 17 5 5 21 23 17 17 17 17 2 15 15 15 15 14 15 15 15 15 15 15 15 14 15 15 15 4 4 4 1 3 4 3 1 1 1 1 1 1 1 1 4 4 4 2 4 3 3 3 4 7 6 7 7 6 7 1 7 5 5 5 2 2 1 1 1 1 1 1 5 2 5 2 5 5 1 5 4 1 4 4 3 3 3 4 1 4 4 2 2 19 9 19 19 17 12 5 5 5 5 5 15 15 15 15 15 15 15 15 15 15 15 17 17 17 17 10 5 16 17 4 4 4 4 4 15 15 15 15 15 15 15 10 5 5 5 5 5 5 5 5 16 3 7 6 6 6 1 5 5 6 17 17 17 17 16 16 1 1 15 14 15 2 2 2 2 2 2 2 31 3 3 3 3 1 3 1 3 1 3 3 67 105 106 106 106 45 68 106 106 106 106 7 60 60 36 4 106 106 106 106 106 7 7 7 7 7 7 7 7 7 7 7 7 7 7 4 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 3 4 8 7 7 7 7 4 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 63 63 62 63 63 62 63 1 1 1 1 1 28 28 28 28 4 4 4 4 4 1 1 3 3 3 3 4 4 3 3 3 3 1 1 3 3 3 3 3 8 5 4 3 3 5 3 14 8 8 4 4 9 8 8 9 1 1 1 1 2 2 1 1 1 1 1 47 43 2 2 1 1 1 1 1 1 1 1 1 1 28 28 1 3 3 1 1 2 2 1 1 1 2 2 1 1 47 45 45 26 26 45 46 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 14 14 13 13 14 13 14 13 14 14 14 14 14 14 13 14 3 3 14 14 14 14 44 46 46 46 106 106 4 3 4 1 4 2 46 47 47 32 32 32 32 32 31 31 10 12 9 9 9 9 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3773 3774 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 3841 3842 3843 3844 3845 3846 3847 3848 3849 3850 3851 3852 3853 3854 3855 3856 3857 3858 3859 3860 3861 3862 3863 3864 3865 3866 3867 3868 3869 3870 3871 3872 3873 3874 3875 3876 3877 3878 3879 3880 3881 3882 3883 3884 3885 3886 3887 3888 3889 3890 3891 3892 3893 3894 3895 3896 3897 3898 3899 3900 3901 3902 3903 3904 3905 3906 3907 3908 3909 3910 3911 3912 3913 3914 3915 3916 3917 3918 3919 3920 3921 3922 3923 3924 3925 3926 3927 3928 3929 3930 3931 3932 3933 3934 3935 3936 3937 3938 3939 3940 3941 3942 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 3953 3954 3955 3956 3957 3958 3959 3960 3961 3962 3963 3964 3965 3966 3967 3968 3969 3970 3971 3972 3973 3974 3975 3976 3977 3978 3979 3980 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009 4010 4011 4012 4013 4014 4015 4016 4017 4018 4019 4020 4021 4022 4023 4024 4025 4026 4027 4028 4029 4030 4031 4032 4033 4034 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046 4047 4048 4049 4050 4051 4052 4053 4054 4055 4056 4057 4058 4059 4060 4061 4062 4063 4064 4065 4066 4067 4068 4069 4070 4071 4072 4073 4074 4075 4076 4077 4078 4079 4080 4081 4082 4083 4084 4085 4086 4087 4088 4089 4090 4091 4092 4093 4094 4095 4096 4097 4098 4099 4100 4101 4102 4103 4104 4105 4106 4107 4108 4109 4110 4111 4112 4113 4114 4115 4116 4117 4118 4119 4120 4121 4122 4123 4124 4125 4126 4127 4128 4129 4130 4131 4132 4133 4134 4135 4136 4137 4138 4139 4140 4141 4142 4143 4144 4145 4146 4147 4148 4149 4150 4151 4152 4153 4154 4155 4156 4157 4158 4159 4160 4161 4162 4163 4164 4165 4166 4167 4168 4169 4170 4171 4172 4173 4174 4175 4176 4177 4178 4179 4180 4181 4182 4183 4184 4185 4186 4187 4188 4189 4190 4191 4192 4193 4194 4195 4196 4197 4198 4199 4200 4201 4202 4203 4204 4205 4206 4207 4208 4209 4210 4211 4212 4213 4214 4215 4216 4217 4218 4219 4220 4221 4222 4223 4224 4225 4226 4227 4228 4229 4230 4231 4232 4233 4234 4235 4236 4237 4238 4239 4240 4241 4242 4243 4244 4245 4246 4247 4248 4249 4250 4251 4252 4253 4254 4255 4256 4257 4258 4259 4260 4261 4262 4263 4264 4265 4266 4267 4268 4269 4270 4271 4272 4273 4274 4275 4276 4277 4278 4279 4280 4281 4282 4283 4284 4285 4286 4287 4288 4289 4290 4291 4292 4293 4294 4295 4296 4297 4298 4299 4300 4301 4302 4303 4304 4305 4306 4307 4308 4309 4310 4311 4312 4313 4314 4315 4316 4317 4318 4319 4320 4321 4322 4323 4324 4325 4326 4327 4328 4329 4330 4331 4332 4333 4334 4335 4336 4337 4338 4339 4340 4341 4342 4343 4344 4345 4346 4347 4348 4349 4350 4351 4352 4353 4354 4355 4356 4357 4358 4359 4360 4361 4362 4363 4364 4365 4366 4367 4368 4369 4370 4371 4372 4373 4374 4375 4376 4377 4378 4379 4380 4381 4382 4383 4384 4385 4386 4387 4388 4389 4390 4391 4392 4393 4394 4395 4396 4397 4398 4399 4400 4401 4402 4403 4404 4405 4406 4407 4408 4409 4410 4411 4412 4413 4414 4415 4416 4417 4418 4419 4420 4421 4422 4423 4424 4425 4426 4427 4428 4429 4430 4431 4432 4433 4434 4435 4436 4437 4438 4439 4440 4441 4442 4443 4444 4445 4446 4447 4448 4449 4450 4451 4452 4453 4454 4455 4456 4457 4458 4459 4460 4461 4462 4463 4464 4465 4466 4467 4468 4469 4470 4471 4472 4473 4474 4475 4476 4477 4478 4479 4480 4481 4482 4483 4484 4485 4486 4487 4488 4489 4490 4491 4492 4493 4494 4495 4496 4497 4498 4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 4509 4510 4511 4512 4513 4514 4515 4516 4517 4518 4519 4520 4521 4522 4523 4524 4525 4526 4527 4528 4529 4530 4531 4532 4533 4534 4535 4536 4537 4538 4539 4540 4541 4542 4543 4544 4545 4546 4547 4548 4549 4550 4551 4552 4553 4554 4555 4556 4557 4558 4559 4560 4561 4562 4563 4564 4565 4566 4567 4568 4569 4570 4571 4572 4573 4574 4575 4576 4577 4578 4579 4580 4581 4582 4583 4584 4585 4586 4587 4588 4589 4590 4591 4592 4593 4594 4595 4596 4597 4598 4599 4600 4601 4602 4603 4604 4605 4606 4607 4608 4609 4610 4611 4612 4613 4614 4615 4616 4617 4618 4619 4620 4621 4622 4623 4624 4625 4626 4627 4628 4629 4630 4631 4632 4633 4634 4635 4636 4637 4638 4639 4640 4641 4642 4643 4644 4645 4646 4647 4648 4649 4650 4651 4652 4653 4654 4655 4656 4657 4658 4659 4660 4661 4662 4663 4664 4665 4666 4667 4668 4669 4670 4671 4672 4673 4674 4675 4676 4677 4678 4679 4680 4681 4682 4683 4684 4685 4686 4687 4688 4689 4690 4691 4692 4693 4694 4695 4696 4697 4698 4699 4700 4701 4702 4703 4704 4705 4706 4707 4708 4709 4710 4711 4712 4713 4714 4715 4716 4717 4718 4719 4720 4721 4722 4723 4724 4725 4726 4727 4728 4729 4730 4731 4732 4733 4734 4735 4736 4737 4738 4739 4740 4741 4742 4743 4744 4745 4746 4747 4748 4749 4750 4751 4752 4753 4754 4755 4756 4757 4758 4759 4760 4761 4762 4763 4764 4765 4766 4767 4768 4769 4770 4771 4772 4773 4774 4775 4776 4777 4778 4779 4780 4781 4782 4783 4784 4785 4786 4787 4788 4789 4790 4791 4792 4793 4794 4795 4796 4797 4798 4799 4800 4801 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816 4817 4818 4819 4820 4821 4822 4823 4824 4825 4826 4827 4828 4829 4830 4831 4832 4833 4834 4835 4836 4837 4838 4839 4840 4841 4842 4843 4844 4845 4846 4847 4848 4849 4850 4851 4852 4853 4854 4855 4856 4857 4858 4859 4860 4861 4862 4863 4864 4865 4866 4867 4868 4869 4870 4871 4872 4873 4874 4875 4876 4877 4878 4879 4880 4881 4882 4883 4884 4885 4886 4887 4888 4889 4890 4891 4892 4893 4894 4895 4896 4897 4898 4899 4900 4901 4902 4903 4904 4905 4906 4907 4908 4909 4910 4911 4912 4913 4914 4915 4916 4917 4918 4919 4920 4921 4922 4923 4924 4925 4926 4927 4928 4929 4930 4931 4932 4933 4934 4935 4936 4937 4938 4939 4940 4941 4942 4943 4944 4945 4946 4947 4948 4949 4950 4951 4952 4953 4954 4955 4956 4957 4958 4959 4960 4961 4962 4963 4964 4965 4966 4967 4968 4969 4970 4971 4972 4973 4974 4975 4976 4977 4978 4979 4980 4981 4982 4983 4984 4985 4986 4987 4988 4989 4990 4991 4992 4993 4994 4995 4996 4997 4998 4999 5000 5001 5002 5003 5004 5005 5006 5007 5008 5009 5010 5011 5012 5013 5014 5015 5016 5017 5018 5019 5020 5021 5022 5023 5024 5025 5026 5027 5028 5029 5030 5031 5032 5033 5034 5035 5036 5037 5038 5039 5040 5041 5042 5043 5044 5045 5046 5047 5048 5049 5050 5051 5052 5053 5054 5055 5056 5057 5058 5059 5060 5061 5062 5063 5064 5065 5066 5067 5068 5069 5070 5071 5072 5073 5074 5075 5076 5077 5078 5079 5080 5081 5082 5083 5084 5085 5086 5087 5088 5089 5090 5091 5092 5093 5094 5095 5096 5097 5098 5099 5100 5101 5102 5103 5104 5105 5106 5107 5108 5109 5110 5111 5112 5113 5114 5115 5116 5117 5118 5119 5120 5121 5122 5123 5124 5125 5126 5127 5128 5129 5130 5131 5132 5133 5134 5135 5136 5137 5138 5139 5140 5141 5142 5143 5144 5145 5146 5147 5148 5149 5150 5151 5152 5153 5154 5155 5156 5157 5158 5159 5160 5161 5162 5163 5164 5165 5166 5167 5168 5169 5170 5171 5172 5173 5174 5175 5176 5177 5178 5179 5180 5181 5182 5183 5184 5185 5186 5187 5188 5189 5190 5191 5192 5193 5194 5195 5196 5197 5198 5199 5200 5201 5202 5203 5204 5205 5206 5207 5208 5209 5210 5211 5212 5213 5214 5215 5216 5217 5218 5219 5220 5221 5222 5223 5224 5225 5226 5227 5228 5229 5230 5231 5232 5233 5234 5235 5236 5237 5238 5239 5240 5241 5242 5243 5244 5245 5246 5247 5248 5249 5250 5251 5252 5253 5254 5255 5256 5257 5258 5259 5260 5261 5262 5263 5264 5265 5266 5267 5268 5269 5270 5271 5272 5273 5274 5275 5276 5277 5278 5279 5280 5281 5282 5283 5284 5285 5286 5287 5288 5289 5290 5291 5292 5293 5294 5295 5296 5297 5298 5299 5300 5301 5302 5303 5304 5305 5306 5307 5308 5309 5310 5311 5312 5313 5314 5315 5316 5317 5318 5319 5320 5321 5322 5323 5324 5325 5326 5327 5328 5329 5330 5331 5332 5333 5334 5335 5336 5337 5338 5339 5340 5341 5342 5343 5344 5345 5346 5347 5348 5349 5350 5351 5352 5353 5354 5355 5356 5357 5358 5359 5360 5361 5362 5363 5364 5365 5366 5367 5368 5369 5370 5371 5372 5373 5374 5375 5376 5377 5378 5379 5380 5381 5382 5383 5384 5385 5386 5387 5388 5389 5390 5391 5392 5393 5394 5395 5396 5397 5398 5399 5400 5401 5402 5403 5404 5405 5406 5407 5408 5409 5410 5411 5412 5413 5414 5415 5416 5417 5418 5419 5420 5421 5422 5423 5424 5425 5426 5427 5428 5429 5430 5431 5432 5433 5434 5435 5436 5437 5438 5439 5440 5441 5442 5443 5444 5445 5446 5447 5448 5449 5450 5451 5452 5453 5454 5455 5456 5457 5458 5459 5460 5461 5462 5463 5464 5465 5466 5467 5468 5469 5470 5471 5472 5473 5474 5475 5476 5477 5478 5479 5480 5481 5482 5483 5484 5485 5486 5487 5488 5489 5490 5491 5492 5493 5494 5495 5496 5497 5498 5499 5500 5501 5502 5503 5504 5505 5506 5507 5508 5509 5510 5511 5512 5513 5514 5515 5516 5517 5518 5519 5520 5521 5522 5523 5524 5525 5526 5527 5528 5529 5530 5531 5532 5533 5534 5535 5536 5537 5538 5539 5540 5541 5542 5543 5544 5545 5546 5547 5548 5549 5550 5551 5552 5553 5554 5555 5556 5557 5558 5559 5560 5561 5562 5563 5564 5565 5566 5567 5568 5569 5570 5571 5572 5573 5574 5575 5576 5577 5578 5579 5580 5581 5582 5583 5584 5585 5586 5587 5588 5589 5590 5591 5592 5593 5594 5595 5596 5597 5598 5599 5600 5601 5602 5603 5604 5605 5606 5607 5608 5609 5610 5611 5612 5613 5614 5615 5616 5617 5618 5619 5620 5621 5622 5623 5624 5625 5626 5627 5628 5629 5630 5631 5632 5633 5634 5635 5636 5637 5638 5639 5640 5641 5642 5643 5644 5645 5646 5647 5648 5649 5650 5651 5652 5653 5654 5655 5656 5657 5658 5659 5660 5661 5662 5663 5664 5665 5666 5667 5668 5669 5670 5671 5672 5673 5674 5675 5676 5677 5678 5679 5680 5681 5682 5683 5684 5685 5686 5687 5688 5689 5690 5691 5692 5693 5694 5695 5696 5697 5698 5699 5700 5701 5702 5703 5704 5705 5706 5707 5708 5709 5710 5711 5712 5713 5714 5715 5716 5717 5718 5719 5720 5721 5722 5723 5724 5725 5726 5727 5728 5729 5730 5731 5732 5733 5734 5735 5736 5737 5738 5739 5740 5741 5742 5743 5744 5745 5746 5747 5748 5749 5750 5751 5752 5753 5754 5755 5756 5757 5758 5759 5760 5761 5762 5763 5764 5765 5766 5767 5768 5769 5770 5771 5772 5773 5774 5775 5776 5777 5778 5779 5780 5781 5782 5783 5784 5785 5786 5787 5788 5789 5790 5791 5792 5793 5794 5795 5796 5797 5798 5799 5800 5801 5802 5803 5804 5805 5806 5807 5808 5809 5810 5811 5812 5813 5814 5815 5816 5817 5818 5819 5820 5821 5822 5823 5824 5825 5826 5827 5828 5829 5830 5831 5832 5833 5834 5835 5836 5837 5838 5839 5840 5841 5842 5843 5844 5845 5846 5847 5848 5849 5850 5851 5852 5853 5854 5855 5856 5857 5858 5859 5860 5861 5862 5863 5864 5865 5866 5867 5868 5869 5870 5871 5872 5873 5874 5875 5876 5877 5878 5879 5880 5881 5882 5883 5884 5885 5886 5887 5888 5889 5890 5891 5892 5893 5894 5895 5896 5897 5898 5899 5900 5901 5902 5903 5904 5905 5906 5907 5908 5909 5910 5911 5912 5913 5914 5915 5916 5917 5918 5919 5920 5921 5922 5923 5924 5925 5926 5927 5928 5929 5930 5931 5932 5933 5934 5935 5936 5937 5938 5939 5940 5941 5942 5943 5944 5945 5946 5947 5948 5949 5950 5951 5952 5953 5954 5955 5956 5957 5958 5959 5960 5961 5962 5963 5964 5965 5966 5967 5968 5969 5970 5971 5972 5973 5974 5975 5976 5977 5978 5979 5980 5981 5982 5983 5984 5985 5986 5987 5988 5989 5990 5991 5992 5993 5994 5995 5996 5997 5998 5999 6000 6001 6002 6003 | /* * Resizable virtual memory filesystem for Linux. * * Copyright (C) 2000 Linus Torvalds. * 2000 Transmeta Corp. * 2000-2001 Christoph Rohland * 2000-2001 SAP AG * 2002 Red Hat Inc. * Copyright (C) 2002-2011 Hugh Dickins. * Copyright (C) 2011 Google Inc. * Copyright (C) 2002-2005 VERITAS Software Corporation. * Copyright (C) 2004 Andi Kleen, SuSE Labs * * Extended attribute support for tmpfs: * Copyright (c) 2004, Luke Kenneth Casson Leighton <lkcl@lkcl.net> * Copyright (c) 2004 Red Hat, Inc., James Morris <jmorris@redhat.com> * * tiny-shmem: * Copyright (c) 2004, 2008 Matt Mackall <mpm@selenic.com> * * This file is released under the GPL. */ #include <linux/fs.h> #include <linux/init.h> #include <linux/vfs.h> #include <linux/mount.h> #include <linux/ramfs.h> #include <linux/pagemap.h> #include <linux/file.h> #include <linux/fileattr.h> #include <linux/mm.h> #include <linux/random.h> #include <linux/sched/signal.h> #include <linux/export.h> #include <linux/shmem_fs.h> #include <linux/swap.h> #include <linux/uio.h> #include <linux/hugetlb.h> #include <linux/fs_parser.h> #include <linux/swapfile.h> #include <linux/iversion.h> #include <linux/unicode.h> #include "swap.h" static struct vfsmount *shm_mnt __ro_after_init; #ifdef CONFIG_SHMEM /* * This virtual memory filesystem is heavily based on the ramfs. It * extends ramfs by the ability to use swap and honor resource limits * which makes it a completely usable filesystem. */ #include <linux/xattr.h> #include <linux/exportfs.h> #include <linux/posix_acl.h> #include <linux/posix_acl_xattr.h> #include <linux/mman.h> #include <linux/string.h> #include <linux/slab.h> #include <linux/backing-dev.h> #include <linux/writeback.h> #include <linux/pagevec.h> #include <linux/percpu_counter.h> #include <linux/falloc.h> #include <linux/splice.h> #include <linux/security.h> #include <linux/leafops.h> #include <linux/mempolicy.h> #include <linux/namei.h> #include <linux/ctype.h> #include <linux/migrate.h> #include <linux/highmem.h> #include <linux/seq_file.h> #include <linux/magic.h> #include <linux/syscalls.h> #include <linux/fcntl.h> #include <uapi/linux/memfd.h> #include <linux/rmap.h> #include <linux/uuid.h> #include <linux/quotaops.h> #include <linux/rcupdate_wait.h> #include <linux/uaccess.h> #include "internal.h" #define VM_ACCT(size) (PAGE_ALIGN(size) >> PAGE_SHIFT) /* Pretend that each entry is of this size in directory's i_size */ #define BOGO_DIRENT_SIZE 20 /* Pretend that one inode + its dentry occupy this much memory */ #define BOGO_INODE_SIZE 1024 /* Symlink up to this size is kmalloc'ed instead of using a swappable page */ #define SHORT_SYMLINK_LEN 128 /* * shmem_fallocate communicates with shmem_fault or shmem_writeout via * inode->i_private (with i_rwsem making sure that it has only one user at * a time): we would prefer not to enlarge the shmem inode just for that. */ struct shmem_falloc { wait_queue_head_t *waitq; /* faults into hole wait for punch to end */ pgoff_t start; /* start of range currently being fallocated */ pgoff_t next; /* the next page offset to be fallocated */ pgoff_t nr_falloced; /* how many new pages have been fallocated */ pgoff_t nr_unswapped; /* how often writeout refused to swap out */ }; struct shmem_options { unsigned long long blocks; unsigned long long inodes; struct mempolicy *mpol; kuid_t uid; kgid_t gid; umode_t mode; bool full_inums; int huge; int seen; bool noswap; unsigned short quota_types; struct shmem_quota_limits qlimits; #if IS_ENABLED(CONFIG_UNICODE) struct unicode_map *encoding; bool strict_encoding; #endif #define SHMEM_SEEN_BLOCKS 1 #define SHMEM_SEEN_INODES 2 #define SHMEM_SEEN_HUGE 4 #define SHMEM_SEEN_INUMS 8 #define SHMEM_SEEN_QUOTA 16 }; #ifdef CONFIG_TRANSPARENT_HUGEPAGE static unsigned long huge_shmem_orders_always __read_mostly; static unsigned long huge_shmem_orders_madvise __read_mostly; static unsigned long huge_shmem_orders_inherit __read_mostly; static unsigned long huge_shmem_orders_within_size __read_mostly; static bool shmem_orders_configured __initdata; #endif #ifdef CONFIG_TMPFS static unsigned long shmem_default_max_blocks(void) { return totalram_pages() / 2; } static unsigned long shmem_default_max_inodes(void) { unsigned long nr_pages = totalram_pages(); return min3(nr_pages - totalhigh_pages(), nr_pages / 2, ULONG_MAX / BOGO_INODE_SIZE); } #endif static int shmem_swapin_folio(struct inode *inode, pgoff_t index, struct folio **foliop, enum sgp_type sgp, gfp_t gfp, struct vm_area_struct *vma, vm_fault_t *fault_type); static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb) { return sb->s_fs_info; } /* * shmem_file_setup pre-accounts the whole fixed size of a VM object, * for shared memory and for shared anonymous (/dev/zero) mappings * (unless MAP_NORESERVE and sysctl_overcommit_memory <= 1), * consistent with the pre-accounting of private mappings ... */ static inline int shmem_acct_size(unsigned long flags, loff_t size) { return (flags & SHMEM_F_NORESERVE) ? 0 : security_vm_enough_memory_mm(current->mm, VM_ACCT(size)); } static inline void shmem_unacct_size(unsigned long flags, loff_t size) { if (!(flags & SHMEM_F_NORESERVE)) vm_unacct_memory(VM_ACCT(size)); } static inline int shmem_reacct_size(unsigned long flags, loff_t oldsize, loff_t newsize) { if (!(flags & SHMEM_F_NORESERVE)) { if (VM_ACCT(newsize) > VM_ACCT(oldsize)) return security_vm_enough_memory_mm(current->mm, VM_ACCT(newsize) - VM_ACCT(oldsize)); else if (VM_ACCT(newsize) < VM_ACCT(oldsize)) vm_unacct_memory(VM_ACCT(oldsize) - VM_ACCT(newsize)); } return 0; } /* * ... whereas tmpfs objects are accounted incrementally as * pages are allocated, in order to allow large sparse files. * shmem_get_folio reports shmem_acct_blocks failure as -ENOSPC not -ENOMEM, * so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM. */ static inline int shmem_acct_blocks(unsigned long flags, long pages) { if (!(flags & SHMEM_F_NORESERVE)) return 0; return security_vm_enough_memory_mm(current->mm, pages * VM_ACCT(PAGE_SIZE)); } static inline void shmem_unacct_blocks(unsigned long flags, long pages) { if (flags & SHMEM_F_NORESERVE) vm_unacct_memory(pages * VM_ACCT(PAGE_SIZE)); } int shmem_inode_acct_blocks(struct inode *inode, long pages) { struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); int err = -ENOSPC; if (shmem_acct_blocks(info->flags, pages)) return err; might_sleep(); /* when quotas */ if (sbinfo->max_blocks) { if (!percpu_counter_limited_add(&sbinfo->used_blocks, sbinfo->max_blocks, pages)) goto unacct; err = dquot_alloc_block_nodirty(inode, pages); if (err) { percpu_counter_sub(&sbinfo->used_blocks, pages); goto unacct; } } else { err = dquot_alloc_block_nodirty(inode, pages); if (err) goto unacct; } return 0; unacct: shmem_unacct_blocks(info->flags, pages); return err; } static void shmem_inode_unacct_blocks(struct inode *inode, long pages) { struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); might_sleep(); /* when quotas */ dquot_free_block_nodirty(inode, pages); if (sbinfo->max_blocks) percpu_counter_sub(&sbinfo->used_blocks, pages); shmem_unacct_blocks(info->flags, pages); } static const struct super_operations shmem_ops; static const struct address_space_operations shmem_aops; static const struct file_operations shmem_file_operations; static const struct inode_operations shmem_inode_operations; static const struct inode_operations shmem_dir_inode_operations; static const struct inode_operations shmem_special_inode_operations; static const struct vm_operations_struct shmem_vm_ops; static const struct vm_operations_struct shmem_anon_vm_ops; static struct file_system_type shmem_fs_type; bool shmem_mapping(const struct address_space *mapping) { return mapping->a_ops == &shmem_aops; } EXPORT_SYMBOL_GPL(shmem_mapping); bool vma_is_anon_shmem(const struct vm_area_struct *vma) { return vma->vm_ops == &shmem_anon_vm_ops; } bool vma_is_shmem(const struct vm_area_struct *vma) { return vma_is_anon_shmem(vma) || vma->vm_ops == &shmem_vm_ops; } static LIST_HEAD(shmem_swaplist); static DEFINE_SPINLOCK(shmem_swaplist_lock); #ifdef CONFIG_TMPFS_QUOTA static int shmem_enable_quotas(struct super_block *sb, unsigned short quota_types) { int type, err = 0; sb_dqopt(sb)->flags |= DQUOT_QUOTA_SYS_FILE | DQUOT_NOLIST_DIRTY; for (type = 0; type < SHMEM_MAXQUOTAS; type++) { if (!(quota_types & (1 << type))) continue; err = dquot_load_quota_sb(sb, type, QFMT_SHMEM, DQUOT_USAGE_ENABLED | DQUOT_LIMITS_ENABLED); if (err) goto out_err; } return 0; out_err: pr_warn("tmpfs: failed to enable quota tracking (type=%d, err=%d)\n", type, err); for (type--; type >= 0; type--) dquot_quota_off(sb, type); return err; } static void shmem_disable_quotas(struct super_block *sb) { int type; for (type = 0; type < SHMEM_MAXQUOTAS; type++) dquot_quota_off(sb, type); } static struct dquot __rcu **shmem_get_dquots(struct inode *inode) { return SHMEM_I(inode)->i_dquot; } #endif /* CONFIG_TMPFS_QUOTA */ /* * shmem_reserve_inode() performs bookkeeping to reserve a shmem inode, and * produces a novel ino for the newly allocated inode. * * It may also be called when making a hard link to permit the space needed by * each dentry. However, in that case, no new inode number is needed since that * internally draws from another pool of inode numbers (currently global * get_next_ino()). This case is indicated by passing NULL as inop. */ #define SHMEM_INO_BATCH 1024 static int shmem_reserve_inode(struct super_block *sb, ino_t *inop) { struct shmem_sb_info *sbinfo = SHMEM_SB(sb); ino_t ino; if (!(sb->s_flags & SB_KERNMOUNT)) { raw_spin_lock(&sbinfo->stat_lock); if (sbinfo->max_inodes) { if (sbinfo->free_ispace < BOGO_INODE_SIZE) { raw_spin_unlock(&sbinfo->stat_lock); return -ENOSPC; } sbinfo->free_ispace -= BOGO_INODE_SIZE; } if (inop) { ino = sbinfo->next_ino++; if (unlikely(is_zero_ino(ino))) ino = sbinfo->next_ino++; if (unlikely(!sbinfo->full_inums && ino > UINT_MAX)) { /* * Emulate get_next_ino uint wraparound for * compatibility */ if (IS_ENABLED(CONFIG_64BIT)) pr_warn("%s: inode number overflow on device %d, consider using inode64 mount option\n", __func__, MINOR(sb->s_dev)); sbinfo->next_ino = 1; ino = sbinfo->next_ino++; } *inop = ino; } raw_spin_unlock(&sbinfo->stat_lock); } else if (inop) { /* * __shmem_file_setup, one of our callers, is lock-free: it * doesn't hold stat_lock in shmem_reserve_inode since * max_inodes is always 0, and is called from potentially * unknown contexts. As such, use a per-cpu batched allocator * which doesn't require the per-sb stat_lock unless we are at * the batch boundary. * * We don't need to worry about inode{32,64} since SB_KERNMOUNT * shmem mounts are not exposed to userspace, so we don't need * to worry about things like glibc compatibility. */ ino_t *next_ino; next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu()); ino = *next_ino; if (unlikely(ino % SHMEM_INO_BATCH == 0)) { raw_spin_lock(&sbinfo->stat_lock); ino = sbinfo->next_ino; sbinfo->next_ino += SHMEM_INO_BATCH; raw_spin_unlock(&sbinfo->stat_lock); if (unlikely(is_zero_ino(ino))) ino++; } *inop = ino; *next_ino = ++ino; put_cpu(); } return 0; } static void shmem_free_inode(struct super_block *sb, size_t freed_ispace) { struct shmem_sb_info *sbinfo = SHMEM_SB(sb); if (sbinfo->max_inodes) { raw_spin_lock(&sbinfo->stat_lock); sbinfo->free_ispace += BOGO_INODE_SIZE + freed_ispace; raw_spin_unlock(&sbinfo->stat_lock); } } /** * shmem_recalc_inode - recalculate the block usage of an inode * @inode: inode to recalc * @alloced: the change in number of pages allocated to inode * @swapped: the change in number of pages swapped from inode * * We have to calculate the free blocks since the mm can drop * undirtied hole pages behind our back. * * But normally info->alloced == inode->i_mapping->nrpages + info->swapped * So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped) * * Return: true if swapped was incremented from 0, for shmem_writeout(). */ bool shmem_recalc_inode(struct inode *inode, long alloced, long swapped) { struct shmem_inode_info *info = SHMEM_I(inode); bool first_swapped = false; long freed; spin_lock(&info->lock); info->alloced += alloced; info->swapped += swapped; freed = info->alloced - info->swapped - READ_ONCE(inode->i_mapping->nrpages); /* * Special case: whereas normally shmem_recalc_inode() is called * after i_mapping->nrpages has already been adjusted (up or down), * shmem_writeout() has to raise swapped before nrpages is lowered - * to stop a racing shmem_recalc_inode() from thinking that a page has * been freed. Compensate here, to avoid the need for a followup call. */ if (swapped > 0) { if (info->swapped == swapped) first_swapped = true; freed += swapped; } if (freed > 0) info->alloced -= freed; spin_unlock(&info->lock); /* The quota case may block */ if (freed > 0) shmem_inode_unacct_blocks(inode, freed); return first_swapped; } bool shmem_charge(struct inode *inode, long pages) { struct address_space *mapping = inode->i_mapping; if (shmem_inode_acct_blocks(inode, pages)) return false; /* nrpages adjustment first, then shmem_recalc_inode() when balanced */ xa_lock_irq(&mapping->i_pages); mapping->nrpages += pages; xa_unlock_irq(&mapping->i_pages); shmem_recalc_inode(inode, pages, 0); return true; } void shmem_uncharge(struct inode *inode, long pages) { /* pages argument is currently unused: keep it to help debugging */ /* nrpages adjustment done by __filemap_remove_folio() or caller */ shmem_recalc_inode(inode, 0, 0); } /* * Replace item expected in xarray by a new item, while holding xa_lock. */ static int shmem_replace_entry(struct address_space *mapping, pgoff_t index, void *expected, void *replacement) { XA_STATE(xas, &mapping->i_pages, index); void *item; VM_BUG_ON(!expected); VM_BUG_ON(!replacement); item = xas_load(&xas); if (item != expected) return -ENOENT; xas_store(&xas, replacement); return 0; } /* * Sometimes, before we decide whether to proceed or to fail, we must check * that an entry was not already brought back or split by a racing thread. * * Checking folio is not enough: by the time a swapcache folio is locked, it * might be reused, and again be swapcache, using the same swap as before. * Returns the swap entry's order if it still presents, else returns -1. */ static int shmem_confirm_swap(struct address_space *mapping, pgoff_t index, swp_entry_t swap) { XA_STATE(xas, &mapping->i_pages, index); int ret = -1; void *entry; rcu_read_lock(); do { entry = xas_load(&xas); if (entry == swp_to_radix_entry(swap)) ret = xas_get_order(&xas); } while (xas_retry(&xas, entry)); rcu_read_unlock(); return ret; } /* * Definitions for "huge tmpfs": tmpfs mounted with the huge= option * * SHMEM_HUGE_NEVER: * disables huge pages for the mount; * SHMEM_HUGE_ALWAYS: * enables huge pages for the mount; * SHMEM_HUGE_WITHIN_SIZE: * only allocate huge pages if the page will be fully within i_size, * also respect madvise() hints; * SHMEM_HUGE_ADVISE: * only allocate huge pages if requested with madvise(); */ #define SHMEM_HUGE_NEVER 0 #define SHMEM_HUGE_ALWAYS 1 #define SHMEM_HUGE_WITHIN_SIZE 2 #define SHMEM_HUGE_ADVISE 3 /* * Special values. * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled: * * SHMEM_HUGE_DENY: * disables huge on shm_mnt and all mounts, for emergency use; * SHMEM_HUGE_FORCE: * enables huge on shm_mnt and all mounts, w/o needing option, for testing; * */ #define SHMEM_HUGE_DENY (-1) #define SHMEM_HUGE_FORCE (-2) #ifdef CONFIG_TRANSPARENT_HUGEPAGE /* ifdef here to avoid bloating shmem.o when not necessary */ #if defined(CONFIG_TRANSPARENT_HUGEPAGE_SHMEM_HUGE_NEVER) #define SHMEM_HUGE_DEFAULT SHMEM_HUGE_NEVER #elif defined(CONFIG_TRANSPARENT_HUGEPAGE_SHMEM_HUGE_ALWAYS) #define SHMEM_HUGE_DEFAULT SHMEM_HUGE_ALWAYS #elif defined(CONFIG_TRANSPARENT_HUGEPAGE_SHMEM_HUGE_WITHIN_SIZE) #define SHMEM_HUGE_DEFAULT SHMEM_HUGE_WITHIN_SIZE #elif defined(CONFIG_TRANSPARENT_HUGEPAGE_SHMEM_HUGE_ADVISE) #define SHMEM_HUGE_DEFAULT SHMEM_HUGE_ADVISE #else #define SHMEM_HUGE_DEFAULT SHMEM_HUGE_NEVER #endif static int shmem_huge __read_mostly = SHMEM_HUGE_DEFAULT; #undef SHMEM_HUGE_DEFAULT #if defined(CONFIG_TRANSPARENT_HUGEPAGE_TMPFS_HUGE_NEVER) #define TMPFS_HUGE_DEFAULT SHMEM_HUGE_NEVER #elif defined(CONFIG_TRANSPARENT_HUGEPAGE_TMPFS_HUGE_ALWAYS) #define TMPFS_HUGE_DEFAULT SHMEM_HUGE_ALWAYS #elif defined(CONFIG_TRANSPARENT_HUGEPAGE_TMPFS_HUGE_WITHIN_SIZE) #define TMPFS_HUGE_DEFAULT SHMEM_HUGE_WITHIN_SIZE #elif defined(CONFIG_TRANSPARENT_HUGEPAGE_TMPFS_HUGE_ADVISE) #define TMPFS_HUGE_DEFAULT SHMEM_HUGE_ADVISE #else #define TMPFS_HUGE_DEFAULT SHMEM_HUGE_NEVER #endif static int tmpfs_huge __read_mostly = TMPFS_HUGE_DEFAULT; #undef TMPFS_HUGE_DEFAULT static unsigned int shmem_get_orders_within_size(struct inode *inode, unsigned long within_size_orders, pgoff_t index, loff_t write_end) { pgoff_t aligned_index; unsigned long order; loff_t i_size; order = highest_order(within_size_orders); while (within_size_orders) { aligned_index = round_up(index + 1, 1 << order); i_size = max(write_end, i_size_read(inode)); i_size = round_up(i_size, PAGE_SIZE); if (i_size >> PAGE_SHIFT >= aligned_index) return within_size_orders; order = next_order(&within_size_orders, order); } return 0; } static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index, loff_t write_end, bool shmem_huge_force, struct vm_area_struct *vma, vm_flags_t vm_flags) { unsigned int maybe_pmd_order = HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER ? 0 : BIT(HPAGE_PMD_ORDER); unsigned long within_size_orders; if (!S_ISREG(inode->i_mode)) return 0; if (shmem_huge == SHMEM_HUGE_DENY) return 0; if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE) return maybe_pmd_order; /* * The huge order allocation for anon shmem is controlled through * the mTHP interface, so we still use PMD-sized huge order to * check whether global control is enabled. * * For tmpfs with 'huge=always' or 'huge=within_size' mount option, * we will always try PMD-sized order first. If that failed, it will * fall back to small large folios. */ switch (SHMEM_SB(inode->i_sb)->huge) { case SHMEM_HUGE_ALWAYS: return THP_ORDERS_ALL_FILE_DEFAULT; case SHMEM_HUGE_WITHIN_SIZE: within_size_orders = shmem_get_orders_within_size(inode, THP_ORDERS_ALL_FILE_DEFAULT, index, write_end); if (within_size_orders > 0) return within_size_orders; fallthrough; case SHMEM_HUGE_ADVISE: if (vm_flags & VM_HUGEPAGE) return THP_ORDERS_ALL_FILE_DEFAULT; fallthrough; default: return 0; } } static int shmem_parse_huge(const char *str) { int huge; if (!str) return -EINVAL; if (!strcmp(str, "never")) huge = SHMEM_HUGE_NEVER; else if (!strcmp(str, "always")) huge = SHMEM_HUGE_ALWAYS; else if (!strcmp(str, "within_size")) huge = SHMEM_HUGE_WITHIN_SIZE; else if (!strcmp(str, "advise")) huge = SHMEM_HUGE_ADVISE; else if (!strcmp(str, "deny")) huge = SHMEM_HUGE_DENY; else if (!strcmp(str, "force")) huge = SHMEM_HUGE_FORCE; else return -EINVAL; if (!has_transparent_hugepage() && huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY) return -EINVAL; /* Do not override huge allocation policy with non-PMD sized mTHP */ if (huge == SHMEM_HUGE_FORCE && huge_shmem_orders_inherit != BIT(HPAGE_PMD_ORDER)) return -EINVAL; return huge; } #if defined(CONFIG_SYSFS) || defined(CONFIG_TMPFS) static const char *shmem_format_huge(int huge) { switch (huge) { case SHMEM_HUGE_NEVER: return "never"; case SHMEM_HUGE_ALWAYS: return "always"; case SHMEM_HUGE_WITHIN_SIZE: return "within_size"; case SHMEM_HUGE_ADVISE: return "advise"; case SHMEM_HUGE_DENY: return "deny"; case SHMEM_HUGE_FORCE: return "force"; default: VM_BUG_ON(1); return "bad_val"; } } #endif static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo, struct shrink_control *sc, unsigned long nr_to_free) { LIST_HEAD(list), *pos, *next; struct inode *inode; struct shmem_inode_info *info; struct folio *folio; unsigned long batch = sc ? sc->nr_to_scan : 128; unsigned long split = 0, freed = 0; if (list_empty(&sbinfo->shrinklist)) return SHRINK_STOP; spin_lock(&sbinfo->shrinklist_lock); list_for_each_safe(pos, next, &sbinfo->shrinklist) { info = list_entry(pos, struct shmem_inode_info, shrinklist); /* pin the inode */ inode = igrab(&info->vfs_inode); /* inode is about to be evicted */ if (!inode) { list_del_init(&info->shrinklist); goto next; } list_move(&info->shrinklist, &list); next: sbinfo->shrinklist_len--; if (!--batch) break; } spin_unlock(&sbinfo->shrinklist_lock); list_for_each_safe(pos, next, &list) { pgoff_t next, end; loff_t i_size; int ret; info = list_entry(pos, struct shmem_inode_info, shrinklist); inode = &info->vfs_inode; if (nr_to_free && freed >= nr_to_free) goto move_back; i_size = i_size_read(inode); folio = filemap_get_entry(inode->i_mapping, i_size / PAGE_SIZE); if (!folio || xa_is_value(folio)) goto drop; /* No large folio at the end of the file: nothing to split */ if (!folio_test_large(folio)) { folio_put(folio); goto drop; } /* Check if there is anything to gain from splitting */ next = folio_next_index(folio); end = shmem_fallocend(inode, DIV_ROUND_UP(i_size, PAGE_SIZE)); if (end <= folio->index || end >= next) { folio_put(folio); goto drop; } /* * Move the inode on the list back to shrinklist if we failed * to lock the page at this time. * * Waiting for the lock may lead to deadlock in the * reclaim path. */ if (!folio_trylock(folio)) { folio_put(folio); goto move_back; } ret = split_folio(folio); folio_unlock(folio); folio_put(folio); /* If split failed move the inode on the list back to shrinklist */ if (ret) goto move_back; freed += next - end; split++; drop: list_del_init(&info->shrinklist); goto put; move_back: /* * Make sure the inode is either on the global list or deleted * from any local list before iput() since it could be deleted * in another thread once we put the inode (then the local list * is corrupted). */ spin_lock(&sbinfo->shrinklist_lock); list_move(&info->shrinklist, &sbinfo->shrinklist); sbinfo->shrinklist_len++; spin_unlock(&sbinfo->shrinklist_lock); put: iput(inode); } return split; } static long shmem_unused_huge_scan(struct super_block *sb, struct shrink_control *sc) { struct shmem_sb_info *sbinfo = SHMEM_SB(sb); if (!READ_ONCE(sbinfo->shrinklist_len)) return SHRINK_STOP; return shmem_unused_huge_shrink(sbinfo, sc, 0); } static long shmem_unused_huge_count(struct super_block *sb, struct shrink_control *sc) { struct shmem_sb_info *sbinfo = SHMEM_SB(sb); return READ_ONCE(sbinfo->shrinklist_len); } #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ #define shmem_huge SHMEM_HUGE_DENY static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo, struct shrink_control *sc, unsigned long nr_to_free) { return 0; } static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index, loff_t write_end, bool shmem_huge_force, struct vm_area_struct *vma, vm_flags_t vm_flags) { return 0; } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ static void shmem_update_stats(struct folio *folio, int nr_pages) { if (folio_test_pmd_mappable(folio)) lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, nr_pages); lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); lruvec_stat_mod_folio(folio, NR_SHMEM, nr_pages); } /* * Somewhat like filemap_add_folio, but error if expected item has gone. */ int shmem_add_to_page_cache(struct folio *folio, struct address_space *mapping, pgoff_t index, void *expected, gfp_t gfp) { XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio)); unsigned long nr = folio_nr_pages(folio); swp_entry_t iter, swap; void *entry; VM_BUG_ON_FOLIO(index != round_down(index, nr), folio); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio); folio_ref_add(folio, nr); folio->mapping = mapping; folio->index = index; gfp &= GFP_RECLAIM_MASK; folio_throttle_swaprate(folio, gfp); swap = radix_to_swp_entry(expected); do { iter = swap; xas_lock_irq(&xas); xas_for_each_conflict(&xas, entry) { /* * The range must either be empty, or filled with * expected swap entries. Shmem swap entries are never * partially freed without split of both entry and * folio, so there shouldn't be any holes. */ if (!expected || entry != swp_to_radix_entry(iter)) { xas_set_err(&xas, -EEXIST); goto unlock; } iter.val += 1 << xas_get_order(&xas); } if (expected && iter.val - nr != swap.val) { xas_set_err(&xas, -EEXIST); goto unlock; } xas_store(&xas, folio); if (xas_error(&xas)) goto unlock; shmem_update_stats(folio, nr); mapping->nrpages += nr; unlock: xas_unlock_irq(&xas); } while (xas_nomem(&xas, gfp)); if (xas_error(&xas)) { folio->mapping = NULL; folio_ref_sub(folio, nr); return xas_error(&xas); } return 0; } /* * Somewhat like filemap_remove_folio, but substitutes swap for @folio. */ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap) { struct address_space *mapping = folio->mapping; long nr = folio_nr_pages(folio); int error; xa_lock_irq(&mapping->i_pages); error = shmem_replace_entry(mapping, folio->index, folio, radswap); folio->mapping = NULL; mapping->nrpages -= nr; shmem_update_stats(folio, -nr); xa_unlock_irq(&mapping->i_pages); folio_put_refs(folio, nr); BUG_ON(error); } /* * Remove swap entry from page cache, free the swap and its page cache. Returns * the number of pages being freed. 0 means entry not found in XArray (0 pages * being freed). */ static long shmem_free_swap(struct address_space *mapping, pgoff_t index, void *radswap) { int order = xa_get_order(&mapping->i_pages, index); void *old; old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0); if (old != radswap) return 0; free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order); return 1 << order; } /* * Determine (in bytes) how many of the shmem object's pages mapped by the * given offsets are swapped out. * * This is safe to call without i_rwsem or the i_pages lock thanks to RCU, * as long as the inode doesn't go away and racy results are not a problem. */ unsigned long shmem_partial_swap_usage(struct address_space *mapping, pgoff_t start, pgoff_t end) { XA_STATE(xas, &mapping->i_pages, start); struct folio *folio; unsigned long swapped = 0; unsigned long max = end - 1; rcu_read_lock(); xas_for_each(&xas, folio, max) { if (xas_retry(&xas, folio)) continue; if (xa_is_value(folio)) swapped += 1 << xas_get_order(&xas); if (xas.xa_index == max) break; if (need_resched()) { xas_pause(&xas); cond_resched_rcu(); } } rcu_read_unlock(); return swapped << PAGE_SHIFT; } /* * Determine (in bytes) how many of the shmem object's pages mapped by the * given vma is swapped out. * * This is safe to call without i_rwsem or the i_pages lock thanks to RCU, * as long as the inode doesn't go away and racy results are not a problem. */ unsigned long shmem_swap_usage(struct vm_area_struct *vma) { struct inode *inode = file_inode(vma->vm_file); struct shmem_inode_info *info = SHMEM_I(inode); struct address_space *mapping = inode->i_mapping; unsigned long swapped; /* Be careful as we don't hold info->lock */ swapped = READ_ONCE(info->swapped); /* * The easier cases are when the shmem object has nothing in swap, or * the vma maps it whole. Then we can simply use the stats that we * already track. */ if (!swapped) return 0; if (!vma->vm_pgoff && vma->vm_end - vma->vm_start >= inode->i_size) return swapped << PAGE_SHIFT; /* Here comes the more involved part */ return shmem_partial_swap_usage(mapping, vma->vm_pgoff, vma->vm_pgoff + vma_pages(vma)); } /* * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists. */ void shmem_unlock_mapping(struct address_space *mapping) { struct folio_batch fbatch; pgoff_t index = 0; folio_batch_init(&fbatch); /* * Minor point, but we might as well stop if someone else SHM_LOCKs it. */ while (!mapping_unevictable(mapping) && filemap_get_folios(mapping, &index, ~0UL, &fbatch)) { check_move_unevictable_folios(&fbatch); folio_batch_release(&fbatch); cond_resched(); } } static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index) { struct folio *folio; /* * At first avoid shmem_get_folio(,,,SGP_READ): that fails * beyond i_size, and reports fallocated folios as holes. */ folio = filemap_get_entry(inode->i_mapping, index); if (!folio) return folio; if (!xa_is_value(folio)) { folio_lock(folio); if (folio->mapping == inode->i_mapping) return folio; /* The folio has been swapped out */ folio_unlock(folio); folio_put(folio); } /* * But read a folio back from swap if any of it is within i_size * (although in some cases this is just a waste of time). */ folio = NULL; shmem_get_folio(inode, index, 0, &folio, SGP_READ); return folio; } /* * Remove range of pages and swap entries from page cache, and free them. * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. */ static void shmem_undo_range(struct inode *inode, loff_t lstart, uoff_t lend, bool unfalloc) { struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT; pgoff_t end = (lend + 1) >> PAGE_SHIFT; struct folio_batch fbatch; pgoff_t indices[PAGEVEC_SIZE]; struct folio *folio; bool same_folio; long nr_swaps_freed = 0; pgoff_t index; int i; if (lend == -1) end = -1; /* unsigned, so actually very big */ if (info->fallocend > start && info->fallocend <= end && !unfalloc) info->fallocend = start; folio_batch_init(&fbatch); index = start; while (index < end && find_lock_entries(mapping, &index, end - 1, &fbatch, indices)) { for (i = 0; i < folio_batch_count(&fbatch); i++) { folio = fbatch.folios[i]; if (xa_is_value(folio)) { if (unfalloc) continue; nr_swaps_freed += shmem_free_swap(mapping, indices[i], folio); continue; } if (!unfalloc || !folio_test_uptodate(folio)) truncate_inode_folio(mapping, folio); folio_unlock(folio); } folio_batch_remove_exceptionals(&fbatch); folio_batch_release(&fbatch); cond_resched(); } /* * When undoing a failed fallocate, we want none of the partial folio * zeroing and splitting below, but shall want to truncate the whole * folio when !uptodate indicates that it was added by this fallocate, * even when [lstart, lend] covers only a part of the folio. */ if (unfalloc) goto whole_folios; same_folio = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT); folio = shmem_get_partial_folio(inode, lstart >> PAGE_SHIFT); if (folio) { same_folio = lend < folio_next_pos(folio); folio_mark_dirty(folio); if (!truncate_inode_partial_folio(folio, lstart, lend)) { start = folio_next_index(folio); if (same_folio) end = folio->index; } folio_unlock(folio); folio_put(folio); folio = NULL; } if (!same_folio) folio = shmem_get_partial_folio(inode, lend >> PAGE_SHIFT); if (folio) { folio_mark_dirty(folio); if (!truncate_inode_partial_folio(folio, lstart, lend)) end = folio->index; folio_unlock(folio); folio_put(folio); } whole_folios: index = start; while (index < end) { cond_resched(); if (!find_get_entries(mapping, &index, end - 1, &fbatch, indices)) { /* If all gone or hole-punch or unfalloc, we're done */ if (index == start || end != -1) break; /* But if truncating, restart to make sure all gone */ index = start; continue; } for (i = 0; i < folio_batch_count(&fbatch); i++) { folio = fbatch.folios[i]; if (xa_is_value(folio)) { long swaps_freed; if (unfalloc) continue; swaps_freed = shmem_free_swap(mapping, indices[i], folio); if (!swaps_freed) { /* Swap was replaced by page: retry */ index = indices[i]; break; } nr_swaps_freed += swaps_freed; continue; } folio_lock(folio); if (!unfalloc || !folio_test_uptodate(folio)) { if (folio_mapping(folio) != mapping) { /* Page was replaced by swap: retry */ folio_unlock(folio); index = indices[i]; break; } VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio); if (!folio_test_large(folio)) { truncate_inode_folio(mapping, folio); } else if (truncate_inode_partial_folio(folio, lstart, lend)) { /* * If we split a page, reset the loop so * that we pick up the new sub pages. * Otherwise the THP was entirely * dropped or the target range was * zeroed, so just continue the loop as * is. */ if (!folio_test_large(folio)) { folio_unlock(folio); index = start; break; } } } folio_unlock(folio); } folio_batch_remove_exceptionals(&fbatch); folio_batch_release(&fbatch); } shmem_recalc_inode(inode, 0, -nr_swaps_freed); } void shmem_truncate_range(struct inode *inode, loff_t lstart, uoff_t lend) { shmem_undo_range(inode, lstart, lend, false); inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode)); inode_inc_iversion(inode); } EXPORT_SYMBOL_GPL(shmem_truncate_range); static int shmem_getattr(struct mnt_idmap *idmap, const struct path *path, struct kstat *stat, u32 request_mask, unsigned int query_flags) { struct inode *inode = path->dentry->d_inode; struct shmem_inode_info *info = SHMEM_I(inode); if (info->alloced - info->swapped != inode->i_mapping->nrpages) shmem_recalc_inode(inode, 0, 0); if (info->fsflags & FS_APPEND_FL) stat->attributes |= STATX_ATTR_APPEND; if (info->fsflags & FS_IMMUTABLE_FL) stat->attributes |= STATX_ATTR_IMMUTABLE; if (info->fsflags & FS_NODUMP_FL) stat->attributes |= STATX_ATTR_NODUMP; stat->attributes_mask |= (STATX_ATTR_APPEND | STATX_ATTR_IMMUTABLE | STATX_ATTR_NODUMP); generic_fillattr(idmap, request_mask, inode, stat); if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0)) stat->blksize = HPAGE_PMD_SIZE; if (request_mask & STATX_BTIME) { stat->result_mask |= STATX_BTIME; stat->btime.tv_sec = info->i_crtime.tv_sec; stat->btime.tv_nsec = info->i_crtime.tv_nsec; } return 0; } static int shmem_setattr(struct mnt_idmap *idmap, struct dentry *dentry, struct iattr *attr) { struct inode *inode = d_inode(dentry); struct shmem_inode_info *info = SHMEM_I(inode); int error; bool update_mtime = false; bool update_ctime = true; error = setattr_prepare(idmap, dentry, attr); if (error) return error; if ((info->seals & F_SEAL_EXEC) && (attr->ia_valid & ATTR_MODE)) { if ((inode->i_mode ^ attr->ia_mode) & 0111) { return -EPERM; } } if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) { loff_t oldsize = inode->i_size; loff_t newsize = attr->ia_size; /* protected by i_rwsem */ if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) || (newsize > oldsize && (info->seals & F_SEAL_GROW))) return -EPERM; if (newsize != oldsize) { if (info->flags & SHMEM_F_MAPPING_FROZEN) return -EPERM; error = shmem_reacct_size(SHMEM_I(inode)->flags, oldsize, newsize); if (error) return error; i_size_write(inode, newsize); update_mtime = true; } else { update_ctime = false; } if (newsize <= oldsize) { loff_t holebegin = round_up(newsize, PAGE_SIZE); if (oldsize > holebegin) unmap_mapping_range(inode->i_mapping, holebegin, 0, 1); if (info->alloced) shmem_truncate_range(inode, newsize, (loff_t)-1); /* unmap again to remove racily COWed private pages */ if (oldsize > holebegin) unmap_mapping_range(inode->i_mapping, holebegin, 0, 1); } } if (is_quota_modification(idmap, inode, attr)) { error = dquot_initialize(inode); if (error) return error; } /* Transfer quota accounting */ if (i_uid_needs_update(idmap, attr, inode) || i_gid_needs_update(idmap, attr, inode)) { error = dquot_transfer(idmap, inode, attr); if (error) return error; } setattr_copy(idmap, inode, attr); if (attr->ia_valid & ATTR_MODE) error = posix_acl_chmod(idmap, dentry, inode->i_mode); if (!error && update_ctime) { inode_set_ctime_current(inode); if (update_mtime) inode_set_mtime_to_ts(inode, inode_get_ctime(inode)); inode_inc_iversion(inode); } return error; } static void shmem_evict_inode(struct inode *inode) { struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); size_t freed = 0; if (shmem_mapping(inode->i_mapping)) { shmem_unacct_size(info->flags, inode->i_size); inode->i_size = 0; mapping_set_exiting(inode->i_mapping); shmem_truncate_range(inode, 0, (loff_t)-1); if (!list_empty(&info->shrinklist)) { spin_lock(&sbinfo->shrinklist_lock); if (!list_empty(&info->shrinklist)) { list_del_init(&info->shrinklist); sbinfo->shrinklist_len--; } spin_unlock(&sbinfo->shrinklist_lock); } while (!list_empty(&info->swaplist)) { /* Wait while shmem_unuse() is scanning this inode... */ wait_var_event(&info->stop_eviction, !atomic_read(&info->stop_eviction)); spin_lock(&shmem_swaplist_lock); /* ...but beware of the race if we peeked too early */ if (!atomic_read(&info->stop_eviction)) list_del_init(&info->swaplist); spin_unlock(&shmem_swaplist_lock); } } simple_xattrs_free(&info->xattrs, sbinfo->max_inodes ? &freed : NULL); shmem_free_inode(inode->i_sb, freed); WARN_ON(inode->i_blocks); clear_inode(inode); #ifdef CONFIG_TMPFS_QUOTA dquot_free_inode(inode); dquot_drop(inode); #endif } static unsigned int shmem_find_swap_entries(struct address_space *mapping, pgoff_t start, struct folio_batch *fbatch, pgoff_t *indices, unsigned int type) { XA_STATE(xas, &mapping->i_pages, start); struct folio *folio; swp_entry_t entry; rcu_read_lock(); xas_for_each(&xas, folio, ULONG_MAX) { if (xas_retry(&xas, folio)) continue; if (!xa_is_value(folio)) continue; entry = radix_to_swp_entry(folio); /* * swapin error entries can be found in the mapping. But they're * deliberately ignored here as we've done everything we can do. */ if (swp_type(entry) != type) continue; indices[folio_batch_count(fbatch)] = xas.xa_index; if (!folio_batch_add(fbatch, folio)) break; if (need_resched()) { xas_pause(&xas); cond_resched_rcu(); } } rcu_read_unlock(); return folio_batch_count(fbatch); } /* * Move the swapped pages for an inode to page cache. Returns the count * of pages swapped in, or the error in case of failure. */ static int shmem_unuse_swap_entries(struct inode *inode, struct folio_batch *fbatch, pgoff_t *indices) { int i = 0; int ret = 0; int error = 0; struct address_space *mapping = inode->i_mapping; for (i = 0; i < folio_batch_count(fbatch); i++) { struct folio *folio = fbatch->folios[i]; error = shmem_swapin_folio(inode, indices[i], &folio, SGP_CACHE, mapping_gfp_mask(mapping), NULL, NULL); if (error == 0) { folio_unlock(folio); folio_put(folio); ret++; } if (error == -ENOMEM) break; error = 0; } return error ? error : ret; } /* * If swap found in inode, free it and move page from swapcache to filecache. */ static int shmem_unuse_inode(struct inode *inode, unsigned int type) { struct address_space *mapping = inode->i_mapping; pgoff_t start = 0; struct folio_batch fbatch; pgoff_t indices[PAGEVEC_SIZE]; int ret = 0; do { folio_batch_init(&fbatch); if (!shmem_find_swap_entries(mapping, start, &fbatch, indices, type)) { ret = 0; break; } ret = shmem_unuse_swap_entries(inode, &fbatch, indices); if (ret < 0) break; start = indices[folio_batch_count(&fbatch) - 1]; } while (true); return ret; } /* * Read all the shared memory data that resides in the swap * device 'type' back into memory, so the swap device can be * unused. */ int shmem_unuse(unsigned int type) { struct shmem_inode_info *info, *next; int error = 0; if (list_empty(&shmem_swaplist)) return 0; spin_lock(&shmem_swaplist_lock); start_over: list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) { if (!info->swapped) { list_del_init(&info->swaplist); continue; } /* * Drop the swaplist mutex while searching the inode for swap; * but before doing so, make sure shmem_evict_inode() will not * remove placeholder inode from swaplist, nor let it be freed * (igrab() would protect from unlink, but not from unmount). */ atomic_inc(&info->stop_eviction); spin_unlock(&shmem_swaplist_lock); error = shmem_unuse_inode(&info->vfs_inode, type); cond_resched(); spin_lock(&shmem_swaplist_lock); if (atomic_dec_and_test(&info->stop_eviction)) wake_up_var(&info->stop_eviction); if (error) break; if (list_empty(&info->swaplist)) goto start_over; next = list_next_entry(info, swaplist); if (!info->swapped) list_del_init(&info->swaplist); } spin_unlock(&shmem_swaplist_lock); return error; } /** * shmem_writeout - Write the folio to swap * @folio: The folio to write * @plug: swap plug * @folio_list: list to put back folios on split * * Move the folio from the page cache to the swap cache. */ int shmem_writeout(struct folio *folio, struct swap_iocb **plug, struct list_head *folio_list) { struct address_space *mapping = folio->mapping; struct inode *inode = mapping->host; struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); pgoff_t index; int nr_pages; bool split = false; if ((info->flags & SHMEM_F_LOCKED) || sbinfo->noswap) goto redirty; if (!total_swap_pages) goto redirty; /* * If CONFIG_THP_SWAP is not enabled, the large folio should be * split when swapping. * * And shrinkage of pages beyond i_size does not split swap, so * swapout of a large folio crossing i_size needs to split too * (unless fallocate has been used to preallocate beyond EOF). */ if (folio_test_large(folio)) { index = shmem_fallocend(inode, DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE)); if ((index > folio->index && index < folio_next_index(folio)) || !IS_ENABLED(CONFIG_THP_SWAP)) split = true; } if (split) { try_split: /* Ensure the subpages are still dirty */ folio_test_set_dirty(folio); if (split_folio_to_list(folio, folio_list)) goto redirty; folio_clear_dirty(folio); } index = folio->index; nr_pages = folio_nr_pages(folio); /* * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC * value into swapfile.c, the only way we can correctly account for a * fallocated folio arriving here is now to initialize it and write it. * * That's okay for a folio already fallocated earlier, but if we have * not yet completed the fallocation, then (a) we want to keep track * of this folio in case we have to undo it, and (b) it may not be a * good idea to continue anyway, once we're pushing into swap. So * reactivate the folio, and let shmem_fallocate() quit when too many. */ if (!folio_test_uptodate(folio)) { if (inode->i_private) { struct shmem_falloc *shmem_falloc; spin_lock(&inode->i_lock); shmem_falloc = inode->i_private; if (shmem_falloc && !shmem_falloc->waitq && index >= shmem_falloc->start && index < shmem_falloc->next) shmem_falloc->nr_unswapped += nr_pages; else shmem_falloc = NULL; spin_unlock(&inode->i_lock); if (shmem_falloc) goto redirty; } folio_zero_range(folio, 0, folio_size(folio)); flush_dcache_folio(folio); folio_mark_uptodate(folio); } if (!folio_alloc_swap(folio)) { bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages); int error; /* * Add inode to shmem_unuse()'s list of swapped-out inodes, * if it's not already there. Do it now before the folio is * removed from page cache, when its pagelock no longer * protects the inode from eviction. And do it now, after * we've incremented swapped, because shmem_unuse() will * prune a !swapped inode from the swaplist. */ if (first_swapped) { spin_lock(&shmem_swaplist_lock); if (list_empty(&info->swaplist)) list_add(&info->swaplist, &shmem_swaplist); spin_unlock(&shmem_swaplist_lock); } swap_shmem_alloc(folio->swap, nr_pages); shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap)); BUG_ON(folio_mapped(folio)); error = swap_writeout(folio, plug); if (error != AOP_WRITEPAGE_ACTIVATE) { /* folio has been unlocked */ return error; } /* * The intention here is to avoid holding on to the swap when * zswap was unable to compress and unable to writeback; but * it will be appropriate if other reactivate cases are added. */ error = shmem_add_to_page_cache(folio, mapping, index, swp_to_radix_entry(folio->swap), __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN); /* Swap entry might be erased by racing shmem_free_swap() */ if (!error) { shmem_recalc_inode(inode, 0, -nr_pages); swap_free_nr(folio->swap, nr_pages); } /* * The swap_cache_del_folio() below could be left for * shrink_folio_list()'s folio_free_swap() to dispose of; * but I'm a little nervous about letting this folio out of * shmem_writeout() in a hybrid half-tmpfs-half-swap state * e.g. folio_mapping(folio) might give an unexpected answer. */ swap_cache_del_folio(folio); goto redirty; } if (nr_pages > 1) goto try_split; redirty: folio_mark_dirty(folio); return AOP_WRITEPAGE_ACTIVATE; /* Return with folio locked */ } EXPORT_SYMBOL_GPL(shmem_writeout); #if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS) static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol) { char buffer[64]; if (!mpol || mpol->mode == MPOL_DEFAULT) return; /* show nothing */ mpol_to_str(buffer, sizeof(buffer), mpol); seq_printf(seq, ",mpol=%s", buffer); } static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo) { struct mempolicy *mpol = NULL; if (sbinfo->mpol) { raw_spin_lock(&sbinfo->stat_lock); /* prevent replace/use races */ mpol = sbinfo->mpol; mpol_get(mpol); raw_spin_unlock(&sbinfo->stat_lock); } return mpol; } #else /* !CONFIG_NUMA || !CONFIG_TMPFS */ static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol) { } static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo) { return NULL; } #endif /* CONFIG_NUMA && CONFIG_TMPFS */ static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info, pgoff_t index, unsigned int order, pgoff_t *ilx); static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp, struct shmem_inode_info *info, pgoff_t index) { struct mempolicy *mpol; pgoff_t ilx; struct folio *folio; mpol = shmem_get_pgoff_policy(info, index, 0, &ilx); folio = swap_cluster_readahead(swap, gfp, mpol, ilx); mpol_cond_put(mpol); return folio; } /* * Make sure huge_gfp is always more limited than limit_gfp. * Some of the flags set permissions, while others set limitations. */ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp) { gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM; gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY; gfp_t zoneflags = limit_gfp & GFP_ZONEMASK; gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK); /* Allow allocations only from the originally specified zones. */ result |= zoneflags; /* * Minimize the result gfp by taking the union with the deny flags, * and the intersection of the allow flags. */ result |= (limit_gfp & denyflags); result |= (huge_gfp & limit_gfp) & allowflags; return result; } #ifdef CONFIG_TRANSPARENT_HUGEPAGE bool shmem_hpage_pmd_enabled(void) { if (shmem_huge == SHMEM_HUGE_DENY) return false; if (test_bit(HPAGE_PMD_ORDER, &huge_shmem_orders_always)) return true; if (test_bit(HPAGE_PMD_ORDER, &huge_shmem_orders_madvise)) return true; if (test_bit(HPAGE_PMD_ORDER, &huge_shmem_orders_within_size)) return true; if (test_bit(HPAGE_PMD_ORDER, &huge_shmem_orders_inherit) && shmem_huge != SHMEM_HUGE_NEVER) return true; return false; } unsigned long shmem_allowable_huge_orders(struct inode *inode, struct vm_area_struct *vma, pgoff_t index, loff_t write_end, bool shmem_huge_force) { unsigned long mask = READ_ONCE(huge_shmem_orders_always); unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size); vm_flags_t vm_flags = vma ? vma->vm_flags : 0; unsigned int global_orders; if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags, shmem_huge_force))) return 0; global_orders = shmem_huge_global_enabled(inode, index, write_end, shmem_huge_force, vma, vm_flags); /* Tmpfs huge pages allocation */ if (!vma || !vma_is_anon_shmem(vma)) return global_orders; /* * Following the 'deny' semantics of the top level, force the huge * option off from all mounts. */ if (shmem_huge == SHMEM_HUGE_DENY) return 0; /* * Only allow inherit orders if the top-level value is 'force', which * means non-PMD sized THP can not override 'huge' mount option now. */ if (shmem_huge == SHMEM_HUGE_FORCE) return READ_ONCE(huge_shmem_orders_inherit); /* Allow mTHP that will be fully within i_size. */ mask |= shmem_get_orders_within_size(inode, within_size_orders, index, 0); if (vm_flags & VM_HUGEPAGE) mask |= READ_ONCE(huge_shmem_orders_madvise); if (global_orders > 0) mask |= READ_ONCE(huge_shmem_orders_inherit); return THP_ORDERS_ALL_FILE_DEFAULT & mask; } static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf, struct address_space *mapping, pgoff_t index, unsigned long orders) { struct vm_area_struct *vma = vmf ? vmf->vma : NULL; pgoff_t aligned_index; unsigned long pages; int order; if (vma) { orders = thp_vma_suitable_orders(vma, vmf->address, orders); if (!orders) return 0; } /* Find the highest order that can add into the page cache */ order = highest_order(orders); while (orders) { pages = 1UL << order; aligned_index = round_down(index, pages); /* * Check for conflict before waiting on a huge allocation. * Conflict might be that a huge page has just been allocated * and added to page cache by a racing thread, or that there * is already at least one small page in the huge extent. * Be careful to retry when appropriate, but not forever! * Elsewhere -EEXIST would be the right code, but not here. */ if (!xa_find(&mapping->i_pages, &aligned_index, aligned_index + pages - 1, XA_PRESENT)) break; order = next_order(&orders, order); } return orders; } #else static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf, struct address_space *mapping, pgoff_t index, unsigned long orders) { return 0; } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ static struct folio *shmem_alloc_folio(gfp_t gfp, int order, struct shmem_inode_info *info, pgoff_t index) { struct mempolicy *mpol; pgoff_t ilx; struct folio *folio; mpol = shmem_get_pgoff_policy(info, index, order, &ilx); folio = folio_alloc_mpol(gfp, order, mpol, ilx, numa_node_id()); mpol_cond_put(mpol); return folio; } static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, gfp_t gfp, struct inode *inode, pgoff_t index, struct mm_struct *fault_mm, unsigned long orders) { struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); unsigned long suitable_orders = 0; struct folio *folio = NULL; pgoff_t aligned_index; long pages; int error, order; if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) orders = 0; if (orders > 0) { suitable_orders = shmem_suitable_orders(inode, vmf, mapping, index, orders); order = highest_order(suitable_orders); while (suitable_orders) { pages = 1UL << order; aligned_index = round_down(index, pages); folio = shmem_alloc_folio(gfp, order, info, aligned_index); if (folio) { index = aligned_index; goto allocated; } if (pages == HPAGE_PMD_NR) count_vm_event(THP_FILE_FALLBACK); count_mthp_stat(order, MTHP_STAT_SHMEM_FALLBACK); order = next_order(&suitable_orders, order); } } else { pages = 1; folio = shmem_alloc_folio(gfp, 0, info, index); } if (!folio) return ERR_PTR(-ENOMEM); allocated: __folio_set_locked(folio); __folio_set_swapbacked(folio); gfp &= GFP_RECLAIM_MASK; error = mem_cgroup_charge(folio, fault_mm, gfp); if (error) { if (xa_find(&mapping->i_pages, &index, index + pages - 1, XA_PRESENT)) { error = -EEXIST; } else if (pages > 1) { if (pages == HPAGE_PMD_NR) { count_vm_event(THP_FILE_FALLBACK); count_vm_event(THP_FILE_FALLBACK_CHARGE); } count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_FALLBACK); count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_FALLBACK_CHARGE); } goto unlock; } error = shmem_add_to_page_cache(folio, mapping, index, NULL, gfp); if (error) goto unlock; error = shmem_inode_acct_blocks(inode, pages); if (error) { struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); long freed; /* * Try to reclaim some space by splitting a few * large folios beyond i_size on the filesystem. */ shmem_unused_huge_shrink(sbinfo, NULL, pages); /* * And do a shmem_recalc_inode() to account for freed pages: * except our folio is there in cache, so not quite balanced. */ spin_lock(&info->lock); freed = pages + info->alloced - info->swapped - READ_ONCE(mapping->nrpages); if (freed > 0) info->alloced -= freed; spin_unlock(&info->lock); if (freed > 0) shmem_inode_unacct_blocks(inode, freed); error = shmem_inode_acct_blocks(inode, pages); if (error) { filemap_remove_folio(folio); goto unlock; } } shmem_recalc_inode(inode, pages, 0); folio_add_lru(folio); return folio; unlock: folio_unlock(folio); folio_put(folio); return ERR_PTR(error); } static struct folio *shmem_swap_alloc_folio(struct inode *inode, struct vm_area_struct *vma, pgoff_t index, swp_entry_t entry, int order, gfp_t gfp) { struct shmem_inode_info *info = SHMEM_I(inode); int nr_pages = 1 << order; struct folio *new; gfp_t alloc_gfp; void *shadow; /* * We have arrived here because our zones are constrained, so don't * limit chance of success with further cpuset and node constraints. */ gfp &= ~GFP_CONSTRAINT_MASK; alloc_gfp = gfp; if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { if (WARN_ON_ONCE(order)) return ERR_PTR(-EINVAL); } else if (order) { /* * If uffd is active for the vma, we need per-page fault * fidelity to maintain the uffd semantics, then fallback * to swapin order-0 folio, as well as for zswap case. * Any existing sub folio in the swap cache also blocks * mTHP swapin. */ if ((vma && unlikely(userfaultfd_armed(vma))) || !zswap_never_enabled() || non_swapcache_batch(entry, nr_pages) != nr_pages) goto fallback; alloc_gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp); } retry: new = shmem_alloc_folio(alloc_gfp, order, info, index); if (!new) { new = ERR_PTR(-ENOMEM); goto fallback; } if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL, alloc_gfp, entry)) { folio_put(new); new = ERR_PTR(-ENOMEM); goto fallback; } /* * Prevent parallel swapin from proceeding with the swap cache flag. * * Of course there is another possible concurrent scenario as well, * that is to say, the swap cache flag of a large folio has already * been set by swapcache_prepare(), while another thread may have * already split the large swap entry stored in the shmem mapping. * In this case, shmem_add_to_page_cache() will help identify the * concurrent swapin and return -EEXIST. */ if (swapcache_prepare(entry, nr_pages)) { folio_put(new); new = ERR_PTR(-EEXIST); /* Try smaller folio to avoid cache conflict */ goto fallback; } __folio_set_locked(new); __folio_set_swapbacked(new); new->swap = entry; memcg1_swapin(entry, nr_pages); shadow = swap_cache_get_shadow(entry); if (shadow) workingset_refault(new, shadow); folio_add_lru(new); swap_read_folio(new, NULL); return new; fallback: /* Order 0 swapin failed, nothing to fallback to, abort */ if (!order) return new; entry.val += index - round_down(index, nr_pages); alloc_gfp = gfp; nr_pages = 1; order = 0; goto retry; } /* * When a page is moved from swapcache to shmem filecache (either by the * usual swapin of shmem_get_folio_gfp(), or by the less common swapoff of * shmem_unuse_inode()), it may have been read in earlier from swap, in * ignorance of the mapping it belongs to. If that mapping has special * constraints (like the gma500 GEM driver, which requires RAM below 4GB), * we may need to copy to a suitable page before moving to filecache. * * In a future release, this may well be extended to respect cpuset and * NUMA mempolicy, and applied also to anonymous pages in do_swap_page(); * but for now it is a simple matter of zone. */ static bool shmem_should_replace_folio(struct folio *folio, gfp_t gfp) { return folio_zonenum(folio) > gfp_zone(gfp); } static int shmem_replace_folio(struct folio **foliop, gfp_t gfp, struct shmem_inode_info *info, pgoff_t index, struct vm_area_struct *vma) { struct swap_cluster_info *ci; struct folio *new, *old = *foliop; swp_entry_t entry = old->swap; int nr_pages = folio_nr_pages(old); int error = 0; /* * We have arrived here because our zones are constrained, so don't * limit chance of success by further cpuset and node constraints. */ gfp &= ~GFP_CONSTRAINT_MASK; #ifdef CONFIG_TRANSPARENT_HUGEPAGE if (nr_pages > 1) { gfp_t huge_gfp = vma_thp_gfp_mask(vma); gfp = limit_gfp_mask(huge_gfp, gfp); } #endif new = shmem_alloc_folio(gfp, folio_order(old), info, index); if (!new) return -ENOMEM; folio_ref_add(new, nr_pages); folio_copy(new, old); flush_dcache_folio(new); __folio_set_locked(new); __folio_set_swapbacked(new); folio_mark_uptodate(new); new->swap = entry; folio_set_swapcache(new); ci = swap_cluster_get_and_lock_irq(old); __swap_cache_replace_folio(ci, old, new); mem_cgroup_replace_folio(old, new); shmem_update_stats(new, nr_pages); shmem_update_stats(old, -nr_pages); swap_cluster_unlock_irq(ci); folio_add_lru(new); *foliop = new; folio_clear_swapcache(old); old->private = NULL; folio_unlock(old); /* * The old folio are removed from swap cache, drop the 'nr_pages' * reference, as well as one temporary reference getting from swap * cache. */ folio_put_refs(old, nr_pages + 1); return error; } static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index, struct folio *folio, swp_entry_t swap, bool skip_swapcache) { struct address_space *mapping = inode->i_mapping; swp_entry_t swapin_error; void *old; int nr_pages; swapin_error = make_poisoned_swp_entry(); old = xa_cmpxchg_irq(&mapping->i_pages, index, swp_to_radix_entry(swap), swp_to_radix_entry(swapin_error), 0); if (old != swp_to_radix_entry(swap)) return; nr_pages = folio_nr_pages(folio); folio_wait_writeback(folio); if (!skip_swapcache) swap_cache_del_folio(folio); /* * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks) * in shmem_evict_inode(). */ shmem_recalc_inode(inode, -nr_pages, -nr_pages); swap_free_nr(swap, nr_pages); } static int shmem_split_large_entry(struct inode *inode, pgoff_t index, swp_entry_t swap, gfp_t gfp) { struct address_space *mapping = inode->i_mapping; XA_STATE_ORDER(xas, &mapping->i_pages, index, 0); int split_order = 0; int i; /* Convert user data gfp flags to xarray node gfp flags */ gfp &= GFP_RECLAIM_MASK; for (;;) { void *old = NULL; int cur_order; pgoff_t swap_index; xas_lock_irq(&xas); old = xas_load(&xas); if (!xa_is_value(old) || swp_to_radix_entry(swap) != old) { xas_set_err(&xas, -EEXIST); goto unlock; } cur_order = xas_get_order(&xas); if (!cur_order) goto unlock; /* Try to split large swap entry in pagecache */ swap_index = round_down(index, 1 << cur_order); split_order = xas_try_split_min_order(cur_order); while (cur_order > 0) { pgoff_t aligned_index = round_down(index, 1 << cur_order); pgoff_t swap_offset = aligned_index - swap_index; xas_set_order(&xas, index, split_order); xas_try_split(&xas, old, cur_order); if (xas_error(&xas)) goto unlock; /* * Re-set the swap entry after splitting, and the swap * offset of the original large entry must be continuous. */ for (i = 0; i < 1 << cur_order; i += (1 << split_order)) { swp_entry_t tmp; tmp = swp_entry(swp_type(swap), swp_offset(swap) + swap_offset + i); __xa_store(&mapping->i_pages, aligned_index + i, swp_to_radix_entry(tmp), 0); } cur_order = split_order; split_order = xas_try_split_min_order(split_order); } unlock: xas_unlock_irq(&xas); if (!xas_nomem(&xas, gfp)) break; } if (xas_error(&xas)) return xas_error(&xas); return 0; } /* * Swap in the folio pointed to by *foliop. * Caller has to make sure that *foliop contains a valid swapped folio. * Returns 0 and the folio in foliop if success. On failure, returns the * error code and NULL in *foliop. */ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, struct folio **foliop, enum sgp_type sgp, gfp_t gfp, struct vm_area_struct *vma, vm_fault_t *fault_type) { struct address_space *mapping = inode->i_mapping; struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL; struct shmem_inode_info *info = SHMEM_I(inode); swp_entry_t swap; softleaf_t index_entry; struct swap_info_struct *si; struct folio *folio = NULL; bool skip_swapcache = false; int error, nr_pages, order; pgoff_t offset; VM_BUG_ON(!*foliop || !xa_is_value(*foliop)); index_entry = radix_to_swp_entry(*foliop); swap = index_entry; *foliop = NULL; if (softleaf_is_poison_marker(index_entry)) return -EIO; si = get_swap_device(index_entry); order = shmem_confirm_swap(mapping, index, index_entry); if (unlikely(!si)) { if (order < 0) return -EEXIST; else return -EINVAL; } if (unlikely(order < 0)) { put_swap_device(si); return -EEXIST; } /* index may point to the middle of a large entry, get the sub entry */ if (order) { offset = index - round_down(index, 1 << order); swap = swp_entry(swp_type(swap), swp_offset(swap) + offset); } /* Look it up and read it in.. */ folio = swap_cache_get_folio(swap); if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) { /* Direct swapin skipping swap cache & readahead */ folio = shmem_swap_alloc_folio(inode, vma, index, index_entry, order, gfp); if (IS_ERR(folio)) { error = PTR_ERR(folio); folio = NULL; goto failed; } skip_swapcache = true; } else { /* Cached swapin only supports order 0 folio */ folio = shmem_swapin_cluster(swap, gfp, info, index); if (!folio) { error = -ENOMEM; goto failed; } } if (fault_type) { *fault_type |= VM_FAULT_MAJOR; count_vm_event(PGMAJFAULT); count_memcg_event_mm(fault_mm, PGMAJFAULT); } } else { swap_update_readahead(folio, NULL, 0); } if (order > folio_order(folio)) { /* * Swapin may get smaller folios due to various reasons: * It may fallback to order 0 due to memory pressure or race, * swap readahead may swap in order 0 folios into swapcache * asynchronously, while the shmem mapping can still stores * large swap entries. In such cases, we should split the * large swap entry to prevent possible data corruption. */ error = shmem_split_large_entry(inode, index, index_entry, gfp); if (error) goto failed_nolock; } /* * If the folio is large, round down swap and index by folio size. * No matter what race occurs, the swap layer ensures we either get * a valid folio that has its swap entry aligned by size, or a * temporarily invalid one which we'll abort very soon and retry. * * shmem_add_to_page_cache ensures the whole range contains expected * entries and prevents any corruption, so any race split is fine * too, it will succeed as long as the entries are still there. */ nr_pages = folio_nr_pages(folio); if (nr_pages > 1) { swap.val = round_down(swap.val, nr_pages); index = round_down(index, nr_pages); } /* * We have to do this with the folio locked to prevent races. * The shmem_confirm_swap below only checks if the first swap * entry matches the folio, that's enough to ensure the folio * is not used outside of shmem, as shmem swap entries * and swap cache folios are never partially freed. */ folio_lock(folio); if ((!skip_swapcache && !folio_test_swapcache(folio)) || shmem_confirm_swap(mapping, index, swap) < 0 || folio->swap.val != swap.val) { error = -EEXIST; goto unlock; } if (!folio_test_uptodate(folio)) { error = -EIO; goto failed; } folio_wait_writeback(folio); /* * Some architectures may have to restore extra metadata to the * folio after reading from swap. */ arch_swap_restore(folio_swap(swap, folio), folio); if (shmem_should_replace_folio(folio, gfp)) { error = shmem_replace_folio(&folio, gfp, info, index, vma); if (error) goto failed; } error = shmem_add_to_page_cache(folio, mapping, index, swp_to_radix_entry(swap), gfp); if (error) goto failed; shmem_recalc_inode(inode, 0, -nr_pages); if (sgp == SGP_WRITE) folio_mark_accessed(folio); if (skip_swapcache) { folio->swap.val = 0; swapcache_clear(si, swap, nr_pages); } else { swap_cache_del_folio(folio); } folio_mark_dirty(folio); swap_free_nr(swap, nr_pages); put_swap_device(si); *foliop = folio; return 0; failed: if (shmem_confirm_swap(mapping, index, swap) < 0) error = -EEXIST; if (error == -EIO) shmem_set_folio_swapin_error(inode, index, folio, swap, skip_swapcache); unlock: if (folio) folio_unlock(folio); failed_nolock: if (skip_swapcache) swapcache_clear(si, folio->swap, folio_nr_pages(folio)); if (folio) folio_put(folio); put_swap_device(si); return error; } /* * shmem_get_folio_gfp - find page in cache, or get from swap, or allocate * * If we allocate a new one we do not mark it dirty. That's up to the * vm. If we swap it in we mark it dirty since we also free the swap * entry since a page cannot live in both the swap and page cache. * * vmf and fault_type are only supplied by shmem_fault: otherwise they are NULL. */ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, loff_t write_end, struct folio **foliop, enum sgp_type sgp, gfp_t gfp, struct vm_fault *vmf, vm_fault_t *fault_type) { struct vm_area_struct *vma = vmf ? vmf->vma : NULL; struct mm_struct *fault_mm; struct folio *folio; int error; bool alloced; unsigned long orders = 0; if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping))) return -EINVAL; if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT)) return -EFBIG; repeat: if (sgp <= SGP_CACHE && ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) return -EINVAL; alloced = false; fault_mm = vma ? vma->vm_mm : NULL; folio = filemap_get_entry(inode->i_mapping, index); if (folio && vma && userfaultfd_minor(vma)) { if (!xa_is_value(folio)) folio_put(folio); *fault_type = handle_userfault(vmf, VM_UFFD_MINOR); return 0; } if (xa_is_value(folio)) { error = shmem_swapin_folio(inode, index, &folio, sgp, gfp, vma, fault_type); if (error == -EEXIST) goto repeat; *foliop = folio; return error; } if (folio) { folio_lock(folio); /* Has the folio been truncated or swapped out? */ if (unlikely(folio->mapping != inode->i_mapping)) { folio_unlock(folio); folio_put(folio); goto repeat; } if (sgp == SGP_WRITE) folio_mark_accessed(folio); if (folio_test_uptodate(folio)) goto out; /* fallocated folio */ if (sgp != SGP_READ) goto clear; folio_unlock(folio); folio_put(folio); } /* * SGP_READ: succeed on hole, with NULL folio, letting caller zero. * SGP_NOALLOC: fail on hole, with NULL folio, letting caller fail. */ *foliop = NULL; if (sgp == SGP_READ) return 0; if (sgp == SGP_NOALLOC) return -ENOENT; /* * Fast cache lookup and swap lookup did not find it: allocate. */ if (vma && userfaultfd_missing(vma)) { *fault_type = handle_userfault(vmf, VM_UFFD_MISSING); return 0; } /* Find hugepage orders that are allowed for anonymous shmem and tmpfs. */ orders = shmem_allowable_huge_orders(inode, vma, index, write_end, false); if (orders > 0) { gfp_t huge_gfp; huge_gfp = vma_thp_gfp_mask(vma); huge_gfp = limit_gfp_mask(huge_gfp, gfp); folio = shmem_alloc_and_add_folio(vmf, huge_gfp, inode, index, fault_mm, orders); if (!IS_ERR(folio)) { if (folio_test_pmd_mappable(folio)) count_vm_event(THP_FILE_ALLOC); count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_ALLOC); goto alloced; } if (PTR_ERR(folio) == -EEXIST) goto repeat; } folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, 0); if (IS_ERR(folio)) { error = PTR_ERR(folio); if (error == -EEXIST) goto repeat; folio = NULL; goto unlock; } alloced: alloced = true; if (folio_test_large(folio) && DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) < folio_next_index(folio)) { struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); struct shmem_inode_info *info = SHMEM_I(inode); /* * Part of the large folio is beyond i_size: subject * to shrink under memory pressure. */ spin_lock(&sbinfo->shrinklist_lock); /* * _careful to defend against unlocked access to * ->shrink_list in shmem_unused_huge_shrink() */ if (list_empty_careful(&info->shrinklist)) { list_add_tail(&info->shrinklist, &sbinfo->shrinklist); sbinfo->shrinklist_len++; } spin_unlock(&sbinfo->shrinklist_lock); } if (sgp == SGP_WRITE) folio_set_referenced(folio); /* * Let SGP_FALLOC use the SGP_WRITE optimization on a new folio. */ if (sgp == SGP_FALLOC) sgp = SGP_WRITE; clear: /* * Let SGP_WRITE caller clear ends if write does not fill folio; * but SGP_FALLOC on a folio fallocated earlier must initialize * it now, lest undo on failure cancel our earlier guarantee. */ if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) { long i, n = folio_nr_pages(folio); for (i = 0; i < n; i++) clear_highpage(folio_page(folio, i)); flush_dcache_folio(folio); folio_mark_uptodate(folio); } /* Perhaps the file has been truncated since we checked */ if (sgp <= SGP_CACHE && ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) { error = -EINVAL; goto unlock; } out: *foliop = folio; return 0; /* * Error recovery. */ unlock: if (alloced) filemap_remove_folio(folio); shmem_recalc_inode(inode, 0, 0); if (folio) { folio_unlock(folio); folio_put(folio); } return error; } /** * shmem_get_folio - find, and lock a shmem folio. * @inode: inode to search * @index: the page index. * @write_end: end of a write, could extend inode size * @foliop: pointer to the folio if found * @sgp: SGP_* flags to control behavior * * Looks up the page cache entry at @inode & @index. If a folio is * present, it is returned locked with an increased refcount. * * If the caller modifies data in the folio, it must call folio_mark_dirty() * before unlocking the folio to ensure that the folio is not reclaimed. * There is no need to reserve space before calling folio_mark_dirty(). * * When no folio is found, the behavior depends on @sgp: * - for SGP_READ, *@foliop is %NULL and 0 is returned * - for SGP_NOALLOC, *@foliop is %NULL and -ENOENT is returned * - for all other flags a new folio is allocated, inserted into the * page cache and returned locked in @foliop. * * Context: May sleep. * Return: 0 if successful, else a negative error code. */ int shmem_get_folio(struct inode *inode, pgoff_t index, loff_t write_end, struct folio **foliop, enum sgp_type sgp) { return shmem_get_folio_gfp(inode, index, write_end, foliop, sgp, mapping_gfp_mask(inode->i_mapping), NULL, NULL); } EXPORT_SYMBOL_GPL(shmem_get_folio); /* * This is like autoremove_wake_function, but it removes the wait queue * entry unconditionally - even if something else had already woken the * target. */ static int synchronous_wake_function(wait_queue_entry_t *wait, unsigned int mode, int sync, void *key) { int ret = default_wake_function(wait, mode, sync, key); list_del_init(&wait->entry); return ret; } /* * Trinity finds that probing a hole which tmpfs is punching can * prevent the hole-punch from ever completing: which in turn * locks writers out with its hold on i_rwsem. So refrain from * faulting pages into the hole while it's being punched. Although * shmem_undo_range() does remove the additions, it may be unable to * keep up, as each new page needs its own unmap_mapping_range() call, * and the i_mmap tree grows ever slower to scan if new vmas are added. * * It does not matter if we sometimes reach this check just before the * hole-punch begins, so that one fault then races with the punch: * we just need to make racing faults a rare case. * * The implementation below would be much simpler if we just used a * standard mutex or completion: but we cannot take i_rwsem in fault, * and bloating every shmem inode for this unlikely case would be sad. */ static vm_fault_t shmem_falloc_wait(struct vm_fault *vmf, struct inode *inode) { struct shmem_falloc *shmem_falloc; struct file *fpin = NULL; vm_fault_t ret = 0; spin_lock(&inode->i_lock); shmem_falloc = inode->i_private; if (shmem_falloc && shmem_falloc->waitq && vmf->pgoff >= shmem_falloc->start && vmf->pgoff < shmem_falloc->next) { wait_queue_head_t *shmem_falloc_waitq; DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function); ret = VM_FAULT_NOPAGE; fpin = maybe_unlock_mmap_for_io(vmf, NULL); shmem_falloc_waitq = shmem_falloc->waitq; prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait, TASK_UNINTERRUPTIBLE); spin_unlock(&inode->i_lock); schedule(); /* * shmem_falloc_waitq points into the shmem_fallocate() * stack of the hole-punching task: shmem_falloc_waitq * is usually invalid by the time we reach here, but * finish_wait() does not dereference it in that case; * though i_lock needed lest racing with wake_up_all(). */ spin_lock(&inode->i_lock); finish_wait(shmem_falloc_waitq, &shmem_fault_wait); } spin_unlock(&inode->i_lock); if (fpin) { fput(fpin); ret = VM_FAULT_RETRY; } return ret; } static vm_fault_t shmem_fault(struct vm_fault *vmf) { struct inode *inode = file_inode(vmf->vma->vm_file); gfp_t gfp = mapping_gfp_mask(inode->i_mapping); struct folio *folio = NULL; vm_fault_t ret = 0; int err; /* * Trinity finds that probing a hole which tmpfs is punching can * prevent the hole-punch from ever completing: noted in i_private. */ if (unlikely(inode->i_private)) { ret = shmem_falloc_wait(vmf, inode); if (ret) return ret; } WARN_ON_ONCE(vmf->page != NULL); err = shmem_get_folio_gfp(inode, vmf->pgoff, 0, &folio, SGP_CACHE, gfp, vmf, &ret); if (err) return vmf_error(err); if (folio) { vmf->page = folio_file_page(folio, vmf->pgoff); ret |= VM_FAULT_LOCKED; } return ret; } unsigned long shmem_get_unmapped_area(struct file *file, unsigned long uaddr, unsigned long len, unsigned long pgoff, unsigned long flags) { unsigned long addr; unsigned long offset; unsigned long inflated_len; unsigned long inflated_addr; unsigned long inflated_offset; unsigned long hpage_size; if (len > TASK_SIZE) return -ENOMEM; addr = mm_get_unmapped_area(file, uaddr, len, pgoff, flags); if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) return addr; if (IS_ERR_VALUE(addr)) return addr; if (addr & ~PAGE_MASK) return addr; if (addr > TASK_SIZE - len) return addr; if (shmem_huge == SHMEM_HUGE_DENY) return addr; if (flags & MAP_FIXED) return addr; /* * Our priority is to support MAP_SHARED mapped hugely; * and support MAP_PRIVATE mapped hugely too, until it is COWed. * But if caller specified an address hint and we allocated area there * successfully, respect that as before. */ if (uaddr == addr) return addr; hpage_size = HPAGE_PMD_SIZE; if (shmem_huge != SHMEM_HUGE_FORCE) { struct super_block *sb; unsigned long __maybe_unused hpage_orders; int order = 0; if (file) { VM_BUG_ON(file->f_op != &shmem_file_operations); sb = file_inode(file)->i_sb; } else { /* * Called directly from mm/mmap.c, or drivers/char/mem.c * for "/dev/zero", to create a shared anonymous object. */ if (IS_ERR(shm_mnt)) return addr; sb = shm_mnt->mnt_sb; /* * Find the highest mTHP order used for anonymous shmem to * provide a suitable alignment address. */ #ifdef CONFIG_TRANSPARENT_HUGEPAGE hpage_orders = READ_ONCE(huge_shmem_orders_always); hpage_orders |= READ_ONCE(huge_shmem_orders_within_size); hpage_orders |= READ_ONCE(huge_shmem_orders_madvise); if (SHMEM_SB(sb)->huge != SHMEM_HUGE_NEVER) hpage_orders |= READ_ONCE(huge_shmem_orders_inherit); if (hpage_orders > 0) { order = highest_order(hpage_orders); hpage_size = PAGE_SIZE << order; } #endif } if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER && !order) return addr; } if (len < hpage_size) return addr; offset = (pgoff << PAGE_SHIFT) & (hpage_size - 1); if (offset && offset + len < 2 * hpage_size) return addr; if ((addr & (hpage_size - 1)) == offset) return addr; inflated_len = len + hpage_size - PAGE_SIZE; if (inflated_len > TASK_SIZE) return addr; if (inflated_len < len) return addr; inflated_addr = mm_get_unmapped_area(NULL, uaddr, inflated_len, 0, flags); if (IS_ERR_VALUE(inflated_addr)) return addr; if (inflated_addr & ~PAGE_MASK) return addr; inflated_offset = inflated_addr & (hpage_size - 1); inflated_addr += offset - inflated_offset; if (inflated_offset > offset) inflated_addr += hpage_size; if (inflated_addr > TASK_SIZE - len) return addr; return inflated_addr; } #ifdef CONFIG_NUMA static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol) { struct inode *inode = file_inode(vma->vm_file); return mpol_set_shared_policy(&SHMEM_I(inode)->policy, vma, mpol); } static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned long addr, pgoff_t *ilx) { struct inode *inode = file_inode(vma->vm_file); pgoff_t index; /* * Bias interleave by inode number to distribute better across nodes; * but this interface is independent of which page order is used, so * supplies only that bias, letting caller apply the offset (adjusted * by page order, as in shmem_get_pgoff_policy() and get_vma_policy()). */ *ilx = inode->i_ino; index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index); } static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info, pgoff_t index, unsigned int order, pgoff_t *ilx) { struct mempolicy *mpol; /* Bias interleave by inode number to distribute better across nodes */ *ilx = info->vfs_inode.i_ino + (index >> order); mpol = mpol_shared_policy_lookup(&info->policy, index); return mpol ? mpol : get_task_policy(current); } #else static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info, pgoff_t index, unsigned int order, pgoff_t *ilx) { *ilx = 0; return NULL; } #endif /* CONFIG_NUMA */ int shmem_lock(struct file *file, int lock, struct ucounts *ucounts) { struct inode *inode = file_inode(file); struct shmem_inode_info *info = SHMEM_I(inode); int retval = -ENOMEM; /* * What serializes the accesses to info->flags? * ipc_lock_object() when called from shmctl_do_lock(), * no serialization needed when called from shm_destroy(). */ if (lock && !(info->flags & SHMEM_F_LOCKED)) { if (!user_shm_lock(inode->i_size, ucounts)) goto out_nomem; info->flags |= SHMEM_F_LOCKED; mapping_set_unevictable(file->f_mapping); } if (!lock && (info->flags & SHMEM_F_LOCKED) && ucounts) { user_shm_unlock(inode->i_size, ucounts); info->flags &= ~SHMEM_F_LOCKED; mapping_clear_unevictable(file->f_mapping); } retval = 0; out_nomem: return retval; } static int shmem_mmap_prepare(struct vm_area_desc *desc) { struct file *file = desc->file; struct inode *inode = file_inode(file); file_accessed(file); /* This is anonymous shared memory if it is unlinked at the time of mmap */ if (inode->i_nlink) desc->vm_ops = &shmem_vm_ops; else desc->vm_ops = &shmem_anon_vm_ops; return 0; } static int shmem_file_open(struct inode *inode, struct file *file) { file->f_mode |= FMODE_CAN_ODIRECT; return generic_file_open(inode, file); } #ifdef CONFIG_TMPFS_XATTR static int shmem_initxattrs(struct inode *, const struct xattr *, void *); #if IS_ENABLED(CONFIG_UNICODE) /* * shmem_inode_casefold_flags - Deal with casefold file attribute flag * * The casefold file attribute needs some special checks. I can just be added to * an empty dir, and can't be removed from a non-empty dir. */ static int shmem_inode_casefold_flags(struct inode *inode, unsigned int fsflags, struct dentry *dentry, unsigned int *i_flags) { unsigned int old = inode->i_flags; struct super_block *sb = inode->i_sb; if (fsflags & FS_CASEFOLD_FL) { if (!(old & S_CASEFOLD)) { if (!sb->s_encoding) return -EOPNOTSUPP; if (!S_ISDIR(inode->i_mode)) return -ENOTDIR; if (dentry && !simple_empty(dentry)) return -ENOTEMPTY; } *i_flags = *i_flags | S_CASEFOLD; } else if (old & S_CASEFOLD) { if (dentry && !simple_empty(dentry)) return -ENOTEMPTY; } return 0; } #else static int shmem_inode_casefold_flags(struct inode *inode, unsigned int fsflags, struct dentry *dentry, unsigned int *i_flags) { if (fsflags & FS_CASEFOLD_FL) return -EOPNOTSUPP; return 0; } #endif /* * chattr's fsflags are unrelated to extended attributes, * but tmpfs has chosen to enable them under the same config option. */ static int shmem_set_inode_flags(struct inode *inode, unsigned int fsflags, struct dentry *dentry) { unsigned int i_flags = 0; int ret; ret = shmem_inode_casefold_flags(inode, fsflags, dentry, &i_flags); if (ret) return ret; if (fsflags & FS_NOATIME_FL) i_flags |= S_NOATIME; if (fsflags & FS_APPEND_FL) i_flags |= S_APPEND; if (fsflags & FS_IMMUTABLE_FL) i_flags |= S_IMMUTABLE; /* * But FS_NODUMP_FL does not require any action in i_flags. */ inode_set_flags(inode, i_flags, S_NOATIME | S_APPEND | S_IMMUTABLE | S_CASEFOLD); return 0; } #else static void shmem_set_inode_flags(struct inode *inode, unsigned int fsflags, struct dentry *dentry) { } #define shmem_initxattrs NULL #endif static struct offset_ctx *shmem_get_offset_ctx(struct inode *inode) { return &SHMEM_I(inode)->dir_offsets; } static struct inode *__shmem_get_inode(struct mnt_idmap *idmap, struct super_block *sb, struct inode *dir, umode_t mode, dev_t dev, unsigned long flags) { struct inode *inode; struct shmem_inode_info *info; struct shmem_sb_info *sbinfo = SHMEM_SB(sb); ino_t ino; int err; err = shmem_reserve_inode(sb, &ino); if (err) return ERR_PTR(err); inode = new_inode(sb); if (!inode) { shmem_free_inode(sb, 0); return ERR_PTR(-ENOSPC); } inode->i_ino = ino; inode_init_owner(idmap, inode, dir, mode); inode->i_blocks = 0; simple_inode_init_ts(inode); inode->i_generation = get_random_u32(); info = SHMEM_I(inode); memset(info, 0, (char *)inode - (char *)info); spin_lock_init(&info->lock); atomic_set(&info->stop_eviction, 0); info->seals = F_SEAL_SEAL; info->flags = (flags & VM_NORESERVE) ? SHMEM_F_NORESERVE : 0; info->i_crtime = inode_get_mtime(inode); info->fsflags = (dir == NULL) ? 0 : SHMEM_I(dir)->fsflags & SHMEM_FL_INHERITED; if (info->fsflags) shmem_set_inode_flags(inode, info->fsflags, NULL); INIT_LIST_HEAD(&info->shrinklist); INIT_LIST_HEAD(&info->swaplist); simple_xattrs_init(&info->xattrs); cache_no_acl(inode); if (sbinfo->noswap) mapping_set_unevictable(inode->i_mapping); /* Don't consider 'deny' for emergencies and 'force' for testing */ if (sbinfo->huge) mapping_set_large_folios(inode->i_mapping); switch (mode & S_IFMT) { default: inode->i_op = &shmem_special_inode_operations; init_special_inode(inode, mode, dev); break; case S_IFREG: inode->i_mapping->a_ops = &shmem_aops; inode->i_op = &shmem_inode_operations; inode->i_fop = &shmem_file_operations; mpol_shared_policy_init(&info->policy, shmem_get_sbmpol(sbinfo)); break; case S_IFDIR: inc_nlink(inode); /* Some things misbehave if size == 0 on a directory */ inode->i_size = 2 * BOGO_DIRENT_SIZE; inode->i_op = &shmem_dir_inode_operations; inode->i_fop = &simple_offset_dir_operations; simple_offset_init(shmem_get_offset_ctx(inode)); break; case S_IFLNK: /* * Must not load anything in the rbtree, * mpol_free_shared_policy will not be called. */ mpol_shared_policy_init(&info->policy, NULL); break; } lockdep_annotate_inode_mutex_key(inode); return inode; } #ifdef CONFIG_TMPFS_QUOTA static struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block *sb, struct inode *dir, umode_t mode, dev_t dev, unsigned long flags) { int err; struct inode *inode; inode = __shmem_get_inode(idmap, sb, dir, mode, dev, flags); if (IS_ERR(inode)) return inode; err = dquot_initialize(inode); if (err) goto errout; err = dquot_alloc_inode(inode); if (err) { dquot_drop(inode); goto errout; } return inode; errout: inode->i_flags |= S_NOQUOTA; iput(inode); return ERR_PTR(err); } #else static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block *sb, struct inode *dir, umode_t mode, dev_t dev, unsigned long flags) { return __shmem_get_inode(idmap, sb, dir, mode, dev, flags); } #endif /* CONFIG_TMPFS_QUOTA */ #ifdef CONFIG_USERFAULTFD int shmem_mfill_atomic_pte(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, uffd_flags_t flags, struct folio **foliop) { struct inode *inode = file_inode(dst_vma->vm_file); struct shmem_inode_info *info = SHMEM_I(inode); struct address_space *mapping = inode->i_mapping; gfp_t gfp = mapping_gfp_mask(mapping); pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); void *page_kaddr; struct folio *folio; int ret; pgoff_t max_off; if (shmem_inode_acct_blocks(inode, 1)) { /* * We may have got a page, returned -ENOENT triggering a retry, * and now we find ourselves with -ENOMEM. Release the page, to * avoid a BUG_ON in our caller. */ if (unlikely(*foliop)) { folio_put(*foliop); *foliop = NULL; } return -ENOMEM; } if (!*foliop) { ret = -ENOMEM; folio = shmem_alloc_folio(gfp, 0, info, pgoff); if (!folio) goto out_unacct_blocks; if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) { page_kaddr = kmap_local_folio(folio, 0); /* * The read mmap_lock is held here. Despite the * mmap_lock being read recursive a deadlock is still * possible if a writer has taken a lock. For example: * * process A thread 1 takes read lock on own mmap_lock * process A thread 2 calls mmap, blocks taking write lock * process B thread 1 takes page fault, read lock on own mmap lock * process B thread 2 calls mmap, blocks taking write lock * process A thread 1 blocks taking read lock on process B * process B thread 1 blocks taking read lock on process A * * Disable page faults to prevent potential deadlock * and retry the copy outside the mmap_lock. */ pagefault_disable(); ret = copy_from_user(page_kaddr, (const void __user *)src_addr, PAGE_SIZE); pagefault_enable(); kunmap_local(page_kaddr); /* fallback to copy_from_user outside mmap_lock */ if (unlikely(ret)) { *foliop = folio; ret = -ENOENT; /* don't free the page */ goto out_unacct_blocks; } flush_dcache_folio(folio); } else { /* ZEROPAGE */ clear_user_highpage(&folio->page, dst_addr); } } else { folio = *foliop; VM_BUG_ON_FOLIO(folio_test_large(folio), folio); *foliop = NULL; } VM_BUG_ON(folio_test_locked(folio)); VM_BUG_ON(folio_test_swapbacked(folio)); __folio_set_locked(folio); __folio_set_swapbacked(folio); __folio_mark_uptodate(folio); ret = -EFAULT; max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); if (unlikely(pgoff >= max_off)) goto out_release; ret = mem_cgroup_charge(folio, dst_vma->vm_mm, gfp); if (ret) goto out_release; ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp); if (ret) goto out_release; ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, &folio->page, true, flags); if (ret) goto out_delete_from_cache; shmem_recalc_inode(inode, 1, 0); folio_unlock(folio); return 0; out_delete_from_cache: filemap_remove_folio(folio); out_release: folio_unlock(folio); folio_put(folio); out_unacct_blocks: shmem_inode_unacct_blocks(inode, 1); return ret; } #endif /* CONFIG_USERFAULTFD */ #ifdef CONFIG_TMPFS static const struct inode_operations shmem_symlink_inode_operations; static const struct inode_operations shmem_short_symlink_operations; static int shmem_write_begin(const struct kiocb *iocb, struct address_space *mapping, loff_t pos, unsigned len, struct folio **foliop, void **fsdata) { struct inode *inode = mapping->host; struct shmem_inode_info *info = SHMEM_I(inode); pgoff_t index = pos >> PAGE_SHIFT; struct folio *folio; int ret = 0; /* i_rwsem is held by caller */ if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) { if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) return -EPERM; if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size) return -EPERM; } if (unlikely((info->flags & SHMEM_F_MAPPING_FROZEN) && pos + len > inode->i_size)) return -EPERM; ret = shmem_get_folio(inode, index, pos + len, &folio, SGP_WRITE); if (ret) return ret; if (folio_contain_hwpoisoned_page(folio)) { folio_unlock(folio); folio_put(folio); return -EIO; } *foliop = folio; return 0; } static int shmem_write_end(const struct kiocb *iocb, struct address_space *mapping, loff_t pos, unsigned len, unsigned copied, struct folio *folio, void *fsdata) { struct inode *inode = mapping->host; if (pos + copied > inode->i_size) i_size_write(inode, pos + copied); if (!folio_test_uptodate(folio)) { if (copied < folio_size(folio)) { size_t from = offset_in_folio(folio, pos); folio_zero_segments(folio, 0, from, from + copied, folio_size(folio)); } folio_mark_uptodate(folio); } folio_mark_dirty(folio); folio_unlock(folio); folio_put(folio); return copied; } static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) { struct file *file = iocb->ki_filp; struct inode *inode = file_inode(file); struct address_space *mapping = inode->i_mapping; pgoff_t index; unsigned long offset; int error = 0; ssize_t retval = 0; for (;;) { struct folio *folio = NULL; struct page *page = NULL; unsigned long nr, ret; loff_t end_offset, i_size = i_size_read(inode); bool fallback_page_copy = false; size_t fsize; if (unlikely(iocb->ki_pos >= i_size)) break; index = iocb->ki_pos >> PAGE_SHIFT; error = shmem_get_folio(inode, index, 0, &folio, SGP_READ); if (error) { if (error == -EINVAL) error = 0; break; } if (folio) { folio_unlock(folio); page = folio_file_page(folio, index); if (PageHWPoison(page)) { folio_put(folio); error = -EIO; break; } if (folio_test_large(folio) && folio_test_has_hwpoisoned(folio)) fallback_page_copy = true; } /* * We must evaluate after, since reads (unlike writes) * are called without i_rwsem protection against truncate */ i_size = i_size_read(inode); if (unlikely(iocb->ki_pos >= i_size)) { if (folio) folio_put(folio); break; } end_offset = min_t(loff_t, i_size, iocb->ki_pos + to->count); if (folio && likely(!fallback_page_copy)) fsize = folio_size(folio); else fsize = PAGE_SIZE; offset = iocb->ki_pos & (fsize - 1); nr = min_t(loff_t, end_offset - iocb->ki_pos, fsize - offset); if (folio) { /* * If users can be writing to this page using arbitrary * virtual addresses, take care about potential aliasing * before reading the page on the kernel side. */ if (mapping_writably_mapped(mapping)) { if (likely(!fallback_page_copy)) flush_dcache_folio(folio); else flush_dcache_page(page); } /* * Mark the folio accessed if we read the beginning. */ if (!offset) folio_mark_accessed(folio); /* * Ok, we have the page, and it's up-to-date, so * now we can copy it to user space... */ if (likely(!fallback_page_copy)) ret = copy_folio_to_iter(folio, offset, nr, to); else ret = copy_page_to_iter(page, offset, nr, to); folio_put(folio); } else if (user_backed_iter(to)) { /* * Copy to user tends to be so well optimized, but * clear_user() not so much, that it is noticeably * faster to copy the zero page instead of clearing. */ ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to); } else { /* * But submitting the same page twice in a row to * splice() - or others? - can result in confusion: * so don't attempt that optimization on pipes etc. */ ret = iov_iter_zero(nr, to); } retval += ret; iocb->ki_pos += ret; if (!iov_iter_count(to)) break; if (ret < nr) { error = -EFAULT; break; } cond_resched(); } file_accessed(file); return retval ? retval : error; } static ssize_t shmem_file_write_iter(struct kiocb *iocb, struct iov_iter *from) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; ssize_t ret; inode_lock(inode); ret = generic_write_checks(iocb, from); if (ret <= 0) goto unlock; ret = file_remove_privs(file); if (ret) goto unlock; ret = file_update_time(file); if (ret) goto unlock; ret = generic_perform_write(iocb, from); unlock: inode_unlock(inode); return ret; } static bool zero_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { return true; } static void zero_pipe_buf_release(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { } static bool zero_pipe_buf_try_steal(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { return false; } static const struct pipe_buf_operations zero_pipe_buf_ops = { .release = zero_pipe_buf_release, .try_steal = zero_pipe_buf_try_steal, .get = zero_pipe_buf_get, }; static size_t splice_zeropage_into_pipe(struct pipe_inode_info *pipe, loff_t fpos, size_t size) { size_t offset = fpos & ~PAGE_MASK; size = min_t(size_t, size, PAGE_SIZE - offset); if (!pipe_is_full(pipe)) { struct pipe_buffer *buf = pipe_head_buf(pipe); *buf = (struct pipe_buffer) { .ops = &zero_pipe_buf_ops, .page = ZERO_PAGE(0), .offset = offset, .len = size, }; pipe->head++; } return size; } static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags) { struct inode *inode = file_inode(in); struct address_space *mapping = inode->i_mapping; struct folio *folio = NULL; size_t total_spliced = 0, used, npages, n, part; loff_t isize; int error = 0; /* Work out how much data we can actually add into the pipe */ used = pipe_buf_usage(pipe); npages = max_t(ssize_t, pipe->max_usage - used, 0); len = min_t(size_t, len, npages * PAGE_SIZE); do { bool fallback_page_splice = false; struct page *page = NULL; pgoff_t index; size_t size; if (*ppos >= i_size_read(inode)) break; index = *ppos >> PAGE_SHIFT; error = shmem_get_folio(inode, index, 0, &folio, SGP_READ); if (error) { if (error == -EINVAL) error = 0; break; } if (folio) { folio_unlock(folio); page = folio_file_page(folio, index); if (PageHWPoison(page)) { error = -EIO; break; } if (folio_test_large(folio) && folio_test_has_hwpoisoned(folio)) fallback_page_splice = true; } /* * i_size must be checked after we know the pages are Uptodate. * * Checking i_size after the check allows us to calculate * the correct value for "nr", which means the zero-filled * part of the page is not copied back to userspace (unless * another truncate extends the file - this is desired though). */ isize = i_size_read(inode); if (unlikely(*ppos >= isize)) break; /* * Fallback to PAGE_SIZE splice if the large folio has hwpoisoned * pages. */ size = len; if (unlikely(fallback_page_splice)) { size_t offset = *ppos & ~PAGE_MASK; size = umin(size, PAGE_SIZE - offset); } part = min_t(loff_t, isize - *ppos, size); if (folio) { /* * If users can be writing to this page using arbitrary * virtual addresses, take care about potential aliasing * before reading the page on the kernel side. */ if (mapping_writably_mapped(mapping)) { if (likely(!fallback_page_splice)) flush_dcache_folio(folio); else flush_dcache_page(page); } folio_mark_accessed(folio); /* * Ok, we have the page, and it's up-to-date, so we can * now splice it into the pipe. */ n = splice_folio_into_pipe(pipe, folio, *ppos, part); folio_put(folio); folio = NULL; } else { n = splice_zeropage_into_pipe(pipe, *ppos, part); } if (!n) break; len -= n; total_spliced += n; *ppos += n; in->f_ra.prev_pos = *ppos; if (pipe_is_full(pipe)) break; cond_resched(); } while (len); if (folio) folio_put(folio); file_accessed(in); return total_spliced ? total_spliced : error; } static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence) { struct address_space *mapping = file->f_mapping; struct inode *inode = mapping->host; if (whence != SEEK_DATA && whence != SEEK_HOLE) return generic_file_llseek_size(file, offset, whence, MAX_LFS_FILESIZE, i_size_read(inode)); if (offset < 0) return -ENXIO; inode_lock(inode); /* We're holding i_rwsem so we can access i_size directly */ offset = mapping_seek_hole_data(mapping, offset, inode->i_size, whence); if (offset >= 0) offset = vfs_setpos(file, offset, MAX_LFS_FILESIZE); inode_unlock(inode); return offset; } static long shmem_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { struct inode *inode = file_inode(file); struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_falloc shmem_falloc; pgoff_t start, index, end, undo_fallocend; int error; if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) return -EOPNOTSUPP; inode_lock(inode); if (info->flags & SHMEM_F_MAPPING_FROZEN) { error = -EPERM; goto out; } if (mode & FALLOC_FL_PUNCH_HOLE) { struct address_space *mapping = file->f_mapping; loff_t unmap_start = round_up(offset, PAGE_SIZE); loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1; DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq); /* protected by i_rwsem */ if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) { error = -EPERM; goto out; } shmem_falloc.waitq = &shmem_falloc_waitq; shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT; shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; spin_lock(&inode->i_lock); inode->i_private = &shmem_falloc; spin_unlock(&inode->i_lock); if ((u64)unmap_end > (u64)unmap_start) unmap_mapping_range(mapping, unmap_start, 1 + unmap_end - unmap_start, 0); shmem_truncate_range(inode, offset, offset + len - 1); /* No need to unmap again: hole-punching leaves COWed pages */ spin_lock(&inode->i_lock); inode->i_private = NULL; wake_up_all(&shmem_falloc_waitq); WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.head)); spin_unlock(&inode->i_lock); error = 0; goto out; } /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */ error = inode_newsize_ok(inode, offset + len); if (error) goto out; if ((info->seals & F_SEAL_GROW) && offset + len > inode->i_size) { error = -EPERM; goto out; } start = offset >> PAGE_SHIFT; end = (offset + len + PAGE_SIZE - 1) >> PAGE_SHIFT; /* Try to avoid a swapstorm if len is impossible to satisfy */ if (sbinfo->max_blocks && end - start > sbinfo->max_blocks) { error = -ENOSPC; goto out; } shmem_falloc.waitq = NULL; shmem_falloc.start = start; shmem_falloc.next = start; shmem_falloc.nr_falloced = 0; shmem_falloc.nr_unswapped = 0; spin_lock(&inode->i_lock); inode->i_private = &shmem_falloc; spin_unlock(&inode->i_lock); /* * info->fallocend is only relevant when huge pages might be * involved: to prevent split_huge_page() freeing fallocated * pages when FALLOC_FL_KEEP_SIZE committed beyond i_size. */ undo_fallocend = info->fallocend; if (info->fallocend < end) info->fallocend = end; for (index = start; index < end; ) { struct folio *folio; /* * Check for fatal signal so that we abort early in OOM * situations. We don't want to abort in case of non-fatal * signals as large fallocate can take noticeable time and * e.g. periodic timers may result in fallocate constantly * restarting. */ if (fatal_signal_pending(current)) error = -EINTR; else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced) error = -ENOMEM; else error = shmem_get_folio(inode, index, offset + len, &folio, SGP_FALLOC); if (error) { info->fallocend = undo_fallocend; /* Remove the !uptodate folios we added */ if (index > start) { shmem_undo_range(inode, (loff_t)start << PAGE_SHIFT, ((loff_t)index << PAGE_SHIFT) - 1, true); } goto undone; } /* * Here is a more important optimization than it appears: * a second SGP_FALLOC on the same large folio will clear it, * making it uptodate and un-undoable if we fail later. */ index = folio_next_index(folio); /* Beware 32-bit wraparound */ if (!index) index--; /* * Inform shmem_writeout() how far we have reached. * No need for lock or barrier: we have the page lock. */ if (!folio_test_uptodate(folio)) shmem_falloc.nr_falloced += index - shmem_falloc.next; shmem_falloc.next = index; /* * If !uptodate, leave it that way so that freeable folios * can be recognized if we need to rollback on error later. * But mark it dirty so that memory pressure will swap rather * than free the folios we are allocating (and SGP_CACHE folios * might still be clean: we now need to mark those dirty too). */ folio_mark_dirty(folio); folio_unlock(folio); folio_put(folio); cond_resched(); } if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) i_size_write(inode, offset + len); undone: spin_lock(&inode->i_lock); inode->i_private = NULL; spin_unlock(&inode->i_lock); out: if (!error) file_modified(file); inode_unlock(inode); return error; } static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf) { struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb); buf->f_type = TMPFS_MAGIC; buf->f_bsize = PAGE_SIZE; buf->f_namelen = NAME_MAX; if (sbinfo->max_blocks) { buf->f_blocks = sbinfo->max_blocks; buf->f_bavail = buf->f_bfree = sbinfo->max_blocks - percpu_counter_sum(&sbinfo->used_blocks); } if (sbinfo->max_inodes) { buf->f_files = sbinfo->max_inodes; buf->f_ffree = sbinfo->free_ispace / BOGO_INODE_SIZE; } /* else leave those fields 0 like simple_statfs */ buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b); return 0; } /* * File creation. Allocate an inode, and we're done.. */ static int shmem_mknod(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) { struct inode *inode; int error; if (!generic_ci_validate_strict_name(dir, &dentry->d_name)) return -EINVAL; inode = shmem_get_inode(idmap, dir->i_sb, dir, mode, dev, VM_NORESERVE); if (IS_ERR(inode)) return PTR_ERR(inode); error = simple_acl_create(dir, inode); if (error) goto out_iput; error = security_inode_init_security(inode, dir, &dentry->d_name, shmem_initxattrs, NULL); if (error && error != -EOPNOTSUPP) goto out_iput; error = simple_offset_add(shmem_get_offset_ctx(dir), dentry); if (error) goto out_iput; dir->i_size += BOGO_DIRENT_SIZE; inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir)); inode_inc_iversion(dir); d_make_persistent(dentry, inode); return error; out_iput: iput(inode); return error; } static int shmem_tmpfile(struct mnt_idmap *idmap, struct inode *dir, struct file *file, umode_t mode) { struct inode *inode; int error; inode = shmem_get_inode(idmap, dir->i_sb, dir, mode, 0, VM_NORESERVE); if (IS_ERR(inode)) { error = PTR_ERR(inode); goto err_out; } error = security_inode_init_security(inode, dir, NULL, shmem_initxattrs, NULL); if (error && error != -EOPNOTSUPP) goto out_iput; error = simple_acl_create(dir, inode); if (error) goto out_iput; d_tmpfile(file, inode); err_out: return finish_open_simple(file, error); out_iput: iput(inode); return error; } static struct dentry *shmem_mkdir(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, umode_t mode) { int error; error = shmem_mknod(idmap, dir, dentry, mode | S_IFDIR, 0); if (error) return ERR_PTR(error); inc_nlink(dir); return NULL; } static int shmem_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, umode_t mode, bool excl) { return shmem_mknod(idmap, dir, dentry, mode | S_IFREG, 0); } /* * Link a file.. */ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry) { struct inode *inode = d_inode(old_dentry); int ret; /* * No ordinary (disk based) filesystem counts links as inodes; * but each new link needs a new dentry, pinning lowmem, and * tmpfs dentries cannot be pruned until they are unlinked. * But if an O_TMPFILE file is linked into the tmpfs, the * first link must skip that, to get the accounting right. */ if (inode->i_nlink) { ret = shmem_reserve_inode(inode->i_sb, NULL); if (ret) return ret; } ret = simple_offset_add(shmem_get_offset_ctx(dir), dentry); if (ret) { if (inode->i_nlink) shmem_free_inode(inode->i_sb, 0); return ret; } dir->i_size += BOGO_DIRENT_SIZE; inode_inc_iversion(dir); return simple_link(old_dentry, dir, dentry); } static int shmem_unlink(struct inode *dir, struct dentry *dentry) { struct inode *inode = d_inode(dentry); if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)) shmem_free_inode(inode->i_sb, 0); simple_offset_remove(shmem_get_offset_ctx(dir), dentry); dir->i_size -= BOGO_DIRENT_SIZE; inode_inc_iversion(dir); simple_unlink(dir, dentry); /* * For now, VFS can't deal with case-insensitive negative dentries, so * we invalidate them */ if (IS_ENABLED(CONFIG_UNICODE) && IS_CASEFOLDED(dir)) d_invalidate(dentry); return 0; } static int shmem_rmdir(struct inode *dir, struct dentry *dentry) { if (!simple_empty(dentry)) return -ENOTEMPTY; drop_nlink(d_inode(dentry)); drop_nlink(dir); return shmem_unlink(dir, dentry); } static int shmem_whiteout(struct mnt_idmap *idmap, struct inode *old_dir, struct dentry *old_dentry) { struct dentry *whiteout; int error; whiteout = d_alloc(old_dentry->d_parent, &old_dentry->d_name); if (!whiteout) return -ENOMEM; error = shmem_mknod(idmap, old_dir, whiteout, S_IFCHR | WHITEOUT_MODE, WHITEOUT_DEV); dput(whiteout); return error; } /* * The VFS layer already does all the dentry stuff for rename, * we just have to decrement the usage count for the target if * it exists so that the VFS layer correctly free's it when it * gets overwritten. */ static int shmem_rename2(struct mnt_idmap *idmap, struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry, unsigned int flags) { struct inode *inode = d_inode(old_dentry); int they_are_dirs = S_ISDIR(inode->i_mode); bool had_offset = false; int error; if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT)) return -EINVAL; if (flags & RENAME_EXCHANGE) return simple_offset_rename_exchange(old_dir, old_dentry, new_dir, new_dentry); if (!simple_empty(new_dentry)) return -ENOTEMPTY; error = simple_offset_add(shmem_get_offset_ctx(new_dir), new_dentry); if (error == -EBUSY) had_offset = true; else if (unlikely(error)) return error; if (flags & RENAME_WHITEOUT) { error = shmem_whiteout(idmap, old_dir, old_dentry); if (error) { if (!had_offset) simple_offset_remove(shmem_get_offset_ctx(new_dir), new_dentry); return error; } } simple_offset_rename(old_dir, old_dentry, new_dir, new_dentry); if (d_really_is_positive(new_dentry)) { (void) shmem_unlink(new_dir, new_dentry); if (they_are_dirs) { drop_nlink(d_inode(new_dentry)); drop_nlink(old_dir); } } else if (they_are_dirs) { drop_nlink(old_dir); inc_nlink(new_dir); } old_dir->i_size -= BOGO_DIRENT_SIZE; new_dir->i_size += BOGO_DIRENT_SIZE; simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry); inode_inc_iversion(old_dir); inode_inc_iversion(new_dir); return 0; } static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, const char *symname) { int error; int len; struct inode *inode; struct folio *folio; char *link; len = strlen(symname) + 1; if (len > PAGE_SIZE) return -ENAMETOOLONG; inode = shmem_get_inode(idmap, dir->i_sb, dir, S_IFLNK | 0777, 0, VM_NORESERVE); if (IS_ERR(inode)) return PTR_ERR(inode); error = security_inode_init_security(inode, dir, &dentry->d_name, shmem_initxattrs, NULL); if (error && error != -EOPNOTSUPP) goto out_iput; error = simple_offset_add(shmem_get_offset_ctx(dir), dentry); if (error) goto out_iput; inode->i_size = len-1; if (len <= SHORT_SYMLINK_LEN) { link = kmemdup(symname, len, GFP_KERNEL); if (!link) { error = -ENOMEM; goto out_remove_offset; } inode->i_op = &shmem_short_symlink_operations; inode_set_cached_link(inode, link, len - 1); } else { inode_nohighmem(inode); inode->i_mapping->a_ops = &shmem_aops; error = shmem_get_folio(inode, 0, 0, &folio, SGP_WRITE); if (error) goto out_remove_offset; inode->i_op = &shmem_symlink_inode_operations; memcpy(folio_address(folio), symname, len); folio_mark_uptodate(folio); folio_mark_dirty(folio); folio_unlock(folio); folio_put(folio); } dir->i_size += BOGO_DIRENT_SIZE; inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir)); inode_inc_iversion(dir); d_make_persistent(dentry, inode); return 0; out_remove_offset: simple_offset_remove(shmem_get_offset_ctx(dir), dentry); out_iput: iput(inode); return error; } static void shmem_put_link(void *arg) { folio_mark_accessed(arg); folio_put(arg); } static const char *shmem_get_link(struct dentry *dentry, struct inode *inode, struct delayed_call *done) { struct folio *folio = NULL; int error; if (!dentry) { folio = filemap_get_folio(inode->i_mapping, 0); if (IS_ERR(folio)) return ERR_PTR(-ECHILD); if (PageHWPoison(folio_page(folio, 0)) || !folio_test_uptodate(folio)) { folio_put(folio); return ERR_PTR(-ECHILD); } } else { error = shmem_get_folio(inode, 0, 0, &folio, SGP_READ); if (error) return ERR_PTR(error); if (!folio) return ERR_PTR(-ECHILD); if (PageHWPoison(folio_page(folio, 0))) { folio_unlock(folio); folio_put(folio); return ERR_PTR(-ECHILD); } folio_unlock(folio); } set_delayed_call(done, shmem_put_link, folio); return folio_address(folio); } #ifdef CONFIG_TMPFS_XATTR static int shmem_fileattr_get(struct dentry *dentry, struct file_kattr *fa) { struct shmem_inode_info *info = SHMEM_I(d_inode(dentry)); fileattr_fill_flags(fa, info->fsflags & SHMEM_FL_USER_VISIBLE); return 0; } static int shmem_fileattr_set(struct mnt_idmap *idmap, struct dentry *dentry, struct file_kattr *fa) { struct inode *inode = d_inode(dentry); struct shmem_inode_info *info = SHMEM_I(inode); int ret, flags; if (fileattr_has_fsx(fa)) return -EOPNOTSUPP; if (fa->flags & ~SHMEM_FL_USER_MODIFIABLE) return -EOPNOTSUPP; flags = (info->fsflags & ~SHMEM_FL_USER_MODIFIABLE) | (fa->flags & SHMEM_FL_USER_MODIFIABLE); ret = shmem_set_inode_flags(inode, flags, dentry); if (ret) return ret; info->fsflags = flags; inode_set_ctime_current(inode); inode_inc_iversion(inode); return 0; } /* * Superblocks without xattr inode operations may get some security.* xattr * support from the LSM "for free". As soon as we have any other xattrs * like ACLs, we also need to implement the security.* handlers at * filesystem level, though. */ /* * Callback for security_inode_init_security() for acquiring xattrs. */ static int shmem_initxattrs(struct inode *inode, const struct xattr *xattr_array, void *fs_info) { struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); const struct xattr *xattr; struct simple_xattr *new_xattr; size_t ispace = 0; size_t len; if (sbinfo->max_inodes) { for (xattr = xattr_array; xattr->name != NULL; xattr++) { ispace += simple_xattr_space(xattr->name, xattr->value_len + XATTR_SECURITY_PREFIX_LEN); } if (ispace) { raw_spin_lock(&sbinfo->stat_lock); if (sbinfo->free_ispace < ispace) ispace = 0; else sbinfo->free_ispace -= ispace; raw_spin_unlock(&sbinfo->stat_lock); if (!ispace) return -ENOSPC; } } for (xattr = xattr_array; xattr->name != NULL; xattr++) { new_xattr = simple_xattr_alloc(xattr->value, xattr->value_len); if (!new_xattr) break; len = strlen(xattr->name) + 1; new_xattr->name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len, GFP_KERNEL_ACCOUNT); if (!new_xattr->name) { kvfree(new_xattr); break; } memcpy(new_xattr->name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN); memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN, xattr->name, len); simple_xattr_add(&info->xattrs, new_xattr); } if (xattr->name != NULL) { if (ispace) { raw_spin_lock(&sbinfo->stat_lock); sbinfo->free_ispace += ispace; raw_spin_unlock(&sbinfo->stat_lock); } simple_xattrs_free(&info->xattrs, NULL); return -ENOMEM; } return 0; } static int shmem_xattr_handler_get(const struct xattr_handler *handler, struct dentry *unused, struct inode *inode, const char *name, void *buffer, size_t size) { struct shmem_inode_info *info = SHMEM_I(inode); name = xattr_full_name(handler, name); return simple_xattr_get(&info->xattrs, name, buffer, size); } static int shmem_xattr_handler_set(const struct xattr_handler *handler, struct mnt_idmap *idmap, struct dentry *unused, struct inode *inode, const char *name, const void *value, size_t size, int flags) { struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); struct simple_xattr *old_xattr; size_t ispace = 0; name = xattr_full_name(handler, name); if (value && sbinfo->max_inodes) { ispace = simple_xattr_space(name, size); raw_spin_lock(&sbinfo->stat_lock); if (sbinfo->free_ispace < ispace) ispace = 0; else sbinfo->free_ispace -= ispace; raw_spin_unlock(&sbinfo->stat_lock); if (!ispace) return -ENOSPC; } old_xattr = simple_xattr_set(&info->xattrs, name, value, size, flags); if (!IS_ERR(old_xattr)) { ispace = 0; if (old_xattr && sbinfo->max_inodes) ispace = simple_xattr_space(old_xattr->name, old_xattr->size); simple_xattr_free(old_xattr); old_xattr = NULL; inode_set_ctime_current(inode); inode_inc_iversion(inode); } if (ispace) { raw_spin_lock(&sbinfo->stat_lock); sbinfo->free_ispace += ispace; raw_spin_unlock(&sbinfo->stat_lock); } return PTR_ERR(old_xattr); } static const struct xattr_handler shmem_security_xattr_handler = { .prefix = XATTR_SECURITY_PREFIX, .get = shmem_xattr_handler_get, .set = shmem_xattr_handler_set, }; static const struct xattr_handler shmem_trusted_xattr_handler = { .prefix = XATTR_TRUSTED_PREFIX, .get = shmem_xattr_handler_get, .set = shmem_xattr_handler_set, }; static const struct xattr_handler shmem_user_xattr_handler = { .prefix = XATTR_USER_PREFIX, .get = shmem_xattr_handler_get, .set = shmem_xattr_handler_set, }; static const struct xattr_handler * const shmem_xattr_handlers[] = { &shmem_security_xattr_handler, &shmem_trusted_xattr_handler, &shmem_user_xattr_handler, NULL }; static ssize_t shmem_listxattr(struct dentry *dentry, char *buffer, size_t size) { struct shmem_inode_info *info = SHMEM_I(d_inode(dentry)); return simple_xattr_list(d_inode(dentry), &info->xattrs, buffer, size); } #endif /* CONFIG_TMPFS_XATTR */ static const struct inode_operations shmem_short_symlink_operations = { .getattr = shmem_getattr, .setattr = shmem_setattr, .get_link = simple_get_link, #ifdef CONFIG_TMPFS_XATTR .listxattr = shmem_listxattr, #endif }; static const struct inode_operations shmem_symlink_inode_operations = { .getattr = shmem_getattr, .setattr = shmem_setattr, .get_link = shmem_get_link, #ifdef CONFIG_TMPFS_XATTR .listxattr = shmem_listxattr, #endif }; static struct dentry *shmem_get_parent(struct dentry *child) { return ERR_PTR(-ESTALE); } static int shmem_match(struct inode *ino, void *vfh) { __u32 *fh = vfh; __u64 inum = fh[2]; inum = (inum << 32) | fh[1]; return ino->i_ino == inum && fh[0] == ino->i_generation; } /* Find any alias of inode, but prefer a hashed alias */ static struct dentry *shmem_find_alias(struct inode *inode) { struct dentry *alias = d_find_alias(inode); return alias ?: d_find_any_alias(inode); } static struct dentry *shmem_fh_to_dentry(struct super_block *sb, struct fid *fid, int fh_len, int fh_type) { struct inode *inode; struct dentry *dentry = NULL; u64 inum; if (fh_len < 3) return NULL; inum = fid->raw[2]; inum = (inum << 32) | fid->raw[1]; inode = ilookup5(sb, (unsigned long)(inum + fid->raw[0]), shmem_match, fid->raw); if (inode) { dentry = shmem_find_alias(inode); iput(inode); } return dentry; } static int shmem_encode_fh(struct inode *inode, __u32 *fh, int *len, struct inode *parent) { if (*len < 3) { *len = 3; return FILEID_INVALID; } if (inode_unhashed(inode)) { /* Unfortunately insert_inode_hash is not idempotent, * so as we hash inodes here rather than at creation * time, we need a lock to ensure we only try * to do it once */ static DEFINE_SPINLOCK(lock); spin_lock(&lock); if (inode_unhashed(inode)) __insert_inode_hash(inode, inode->i_ino + inode->i_generation); spin_unlock(&lock); } fh[0] = inode->i_generation; fh[1] = inode->i_ino; fh[2] = ((__u64)inode->i_ino) >> 32; *len = 3; return 1; } static const struct export_operations shmem_export_ops = { .get_parent = shmem_get_parent, .encode_fh = shmem_encode_fh, .fh_to_dentry = shmem_fh_to_dentry, }; enum shmem_param { Opt_gid, Opt_huge, Opt_mode, Opt_mpol, Opt_nr_blocks, Opt_nr_inodes, Opt_size, Opt_uid, Opt_inode32, Opt_inode64, Opt_noswap, Opt_quota, Opt_usrquota, Opt_grpquota, Opt_usrquota_block_hardlimit, Opt_usrquota_inode_hardlimit, Opt_grpquota_block_hardlimit, Opt_grpquota_inode_hardlimit, Opt_casefold_version, Opt_casefold, Opt_strict_encoding, }; static const struct constant_table shmem_param_enums_huge[] = { {"never", SHMEM_HUGE_NEVER }, {"always", SHMEM_HUGE_ALWAYS }, {"within_size", SHMEM_HUGE_WITHIN_SIZE }, {"advise", SHMEM_HUGE_ADVISE }, {} }; const struct fs_parameter_spec shmem_fs_parameters[] = { fsparam_gid ("gid", Opt_gid), fsparam_enum ("huge", Opt_huge, shmem_param_enums_huge), fsparam_u32oct("mode", Opt_mode), fsparam_string("mpol", Opt_mpol), fsparam_string("nr_blocks", Opt_nr_blocks), fsparam_string("nr_inodes", Opt_nr_inodes), fsparam_string("size", Opt_size), fsparam_uid ("uid", Opt_uid), fsparam_flag ("inode32", Opt_inode32), fsparam_flag ("inode64", Opt_inode64), fsparam_flag ("noswap", Opt_noswap), #ifdef CONFIG_TMPFS_QUOTA fsparam_flag ("quota", Opt_quota), fsparam_flag ("usrquota", Opt_usrquota), fsparam_flag ("grpquota", Opt_grpquota), fsparam_string("usrquota_block_hardlimit", Opt_usrquota_block_hardlimit), fsparam_string("usrquota_inode_hardlimit", Opt_usrquota_inode_hardlimit), fsparam_string("grpquota_block_hardlimit", Opt_grpquota_block_hardlimit), fsparam_string("grpquota_inode_hardlimit", Opt_grpquota_inode_hardlimit), #endif fsparam_string("casefold", Opt_casefold_version), fsparam_flag ("casefold", Opt_casefold), fsparam_flag ("strict_encoding", Opt_strict_encoding), {} }; #if IS_ENABLED(CONFIG_UNICODE) static int shmem_parse_opt_casefold(struct fs_context *fc, struct fs_parameter *param, bool latest_version) { struct shmem_options *ctx = fc->fs_private; int version = UTF8_LATEST; struct unicode_map *encoding; char *version_str = param->string + 5; if (!latest_version) { if (strncmp(param->string, "utf8-", 5)) return invalfc(fc, "Only UTF-8 encodings are supported " "in the format: utf8-<version number>"); version = utf8_parse_version(version_str); if (version < 0) return invalfc(fc, "Invalid UTF-8 version: %s", version_str); } encoding = utf8_load(version); if (IS_ERR(encoding)) { return invalfc(fc, "Failed loading UTF-8 version: utf8-%u.%u.%u\n", unicode_major(version), unicode_minor(version), unicode_rev(version)); } pr_info("tmpfs: Using encoding : utf8-%u.%u.%u\n", unicode_major(version), unicode_minor(version), unicode_rev(version)); ctx->encoding = encoding; return 0; } #else static int shmem_parse_opt_casefold(struct fs_context *fc, struct fs_parameter *param, bool latest_version) { return invalfc(fc, "tmpfs: Kernel not built with CONFIG_UNICODE\n"); } #endif static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param) { struct shmem_options *ctx = fc->fs_private; struct fs_parse_result result; unsigned long long size; char *rest; int opt; kuid_t kuid; kgid_t kgid; opt = fs_parse(fc, shmem_fs_parameters, param, &result); if (opt < 0) return opt; switch (opt) { case Opt_size: size = memparse(param->string, &rest); if (*rest == '%') { size <<= PAGE_SHIFT; size *= totalram_pages(); do_div(size, 100); rest++; } if (*rest) goto bad_value; ctx->blocks = DIV_ROUND_UP(size, PAGE_SIZE); ctx->seen |= SHMEM_SEEN_BLOCKS; break; case Opt_nr_blocks: ctx->blocks = memparse(param->string, &rest); if (*rest || ctx->blocks > LONG_MAX) goto bad_value; ctx->seen |= SHMEM_SEEN_BLOCKS; break; case Opt_nr_inodes: ctx->inodes = memparse(param->string, &rest); if (*rest || ctx->inodes > ULONG_MAX / BOGO_INODE_SIZE) goto bad_value; ctx->seen |= SHMEM_SEEN_INODES; break; case Opt_mode: ctx->mode = result.uint_32 & 07777; break; case Opt_uid: kuid = result.uid; /* * The requested uid must be representable in the * filesystem's idmapping. */ if (!kuid_has_mapping(fc->user_ns, kuid)) goto bad_value; ctx->uid = kuid; break; case Opt_gid: kgid = result.gid; /* * The requested gid must be representable in the * filesystem's idmapping. */ if (!kgid_has_mapping(fc->user_ns, kgid)) goto bad_value; ctx->gid = kgid; break; case Opt_huge: ctx->huge = result.uint_32; if (ctx->huge != SHMEM_HUGE_NEVER && !(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && has_transparent_hugepage())) goto unsupported_parameter; ctx->seen |= SHMEM_SEEN_HUGE; break; case Opt_mpol: if (IS_ENABLED(CONFIG_NUMA)) { mpol_put(ctx->mpol); ctx->mpol = NULL; if (mpol_parse_str(param->string, &ctx->mpol)) goto bad_value; break; } goto unsupported_parameter; case Opt_inode32: ctx->full_inums = false; ctx->seen |= SHMEM_SEEN_INUMS; break; case Opt_inode64: if (sizeof(ino_t) < 8) { return invalfc(fc, "Cannot use inode64 with <64bit inums in kernel\n"); } ctx->full_inums = true; ctx->seen |= SHMEM_SEEN_INUMS; break; case Opt_noswap: if ((fc->user_ns != &init_user_ns) || !capable(CAP_SYS_ADMIN)) { return invalfc(fc, "Turning off swap in unprivileged tmpfs mounts unsupported"); } ctx->noswap = true; break; case Opt_quota: if (fc->user_ns != &init_user_ns) return invalfc(fc, "Quotas in unprivileged tmpfs mounts are unsupported"); ctx->seen |= SHMEM_SEEN_QUOTA; ctx->quota_types |= (QTYPE_MASK_USR | QTYPE_MASK_GRP); break; case Opt_usrquota: if (fc->user_ns != &init_user_ns) return invalfc(fc, "Quotas in unprivileged tmpfs mounts are unsupported"); ctx->seen |= SHMEM_SEEN_QUOTA; ctx->quota_types |= QTYPE_MASK_USR; break; case Opt_grpquota: if (fc->user_ns != &init_user_ns) return invalfc(fc, "Quotas in unprivileged tmpfs mounts are unsupported"); ctx->seen |= SHMEM_SEEN_QUOTA; ctx->quota_types |= QTYPE_MASK_GRP; break; case Opt_usrquota_block_hardlimit: size = memparse(param->string, &rest); if (*rest || !size) goto bad_value; if (size > SHMEM_QUOTA_MAX_SPC_LIMIT) return invalfc(fc, "User quota block hardlimit too large."); ctx->qlimits.usrquota_bhardlimit = size; break; case Opt_grpquota_block_hardlimit: size = memparse(param->string, &rest); if (*rest || !size) goto bad_value; if (size > SHMEM_QUOTA_MAX_SPC_LIMIT) return invalfc(fc, "Group quota block hardlimit too large."); ctx->qlimits.grpquota_bhardlimit = size; break; case Opt_usrquota_inode_hardlimit: size = memparse(param->string, &rest); if (*rest || !size) goto bad_value; if (size > SHMEM_QUOTA_MAX_INO_LIMIT) return invalfc(fc, "User quota inode hardlimit too large."); ctx->qlimits.usrquota_ihardlimit = size; break; case Opt_grpquota_inode_hardlimit: size = memparse(param->string, &rest); if (*rest || !size) goto bad_value; if (size > SHMEM_QUOTA_MAX_INO_LIMIT) return invalfc(fc, "Group quota inode hardlimit too large."); ctx->qlimits.grpquota_ihardlimit = size; break; case Opt_casefold_version: return shmem_parse_opt_casefold(fc, param, false); case Opt_casefold: return shmem_parse_opt_casefold(fc, param, true); case Opt_strict_encoding: #if IS_ENABLED(CONFIG_UNICODE) ctx->strict_encoding = true; break; #else return invalfc(fc, "tmpfs: Kernel not built with CONFIG_UNICODE\n"); #endif } return 0; unsupported_parameter: return invalfc(fc, "Unsupported parameter '%s'", param->key); bad_value: return invalfc(fc, "Bad value for '%s'", param->key); } static char *shmem_next_opt(char **s) { char *sbegin = *s; char *p; if (sbegin == NULL) return NULL; /* * NUL-terminate this option: unfortunately, * mount options form a comma-separated list, * but mpol's nodelist may also contain commas. */ for (;;) { p = strchr(*s, ','); if (p == NULL) break; *s = p + 1; if (!isdigit(*(p+1))) { *p = '\0'; return sbegin; } } *s = NULL; return sbegin; } static int shmem_parse_monolithic(struct fs_context *fc, void *data) { return vfs_parse_monolithic_sep(fc, data, shmem_next_opt); } /* * Reconfigure a shmem filesystem. */ static int shmem_reconfigure(struct fs_context *fc) { struct shmem_options *ctx = fc->fs_private; struct shmem_sb_info *sbinfo = SHMEM_SB(fc->root->d_sb); unsigned long used_isp; struct mempolicy *mpol = NULL; const char *err; raw_spin_lock(&sbinfo->stat_lock); used_isp = sbinfo->max_inodes * BOGO_INODE_SIZE - sbinfo->free_ispace; if ((ctx->seen & SHMEM_SEEN_BLOCKS) && ctx->blocks) { if (!sbinfo->max_blocks) { err = "Cannot retroactively limit size"; goto out; } if (percpu_counter_compare(&sbinfo->used_blocks, ctx->blocks) > 0) { err = "Too small a size for current use"; goto out; } } if ((ctx->seen & SHMEM_SEEN_INODES) && ctx->inodes) { if (!sbinfo->max_inodes) { err = "Cannot retroactively limit inodes"; goto out; } if (ctx->inodes * BOGO_INODE_SIZE < used_isp) { err = "Too few inodes for current use"; goto out; } } if ((ctx->seen & SHMEM_SEEN_INUMS) && !ctx->full_inums && sbinfo->next_ino > UINT_MAX) { err = "Current inum too high to switch to 32-bit inums"; goto out; } /* * "noswap" doesn't use fsparam_flag_no, i.e. there's no "swap" * counterpart for (re-)enabling swap. */ if (ctx->noswap && !sbinfo->noswap) { err = "Cannot disable swap on remount"; goto out; } if (ctx->seen & SHMEM_SEEN_QUOTA && !sb_any_quota_loaded(fc->root->d_sb)) { err = "Cannot enable quota on remount"; goto out; } #ifdef CONFIG_TMPFS_QUOTA #define CHANGED_LIMIT(name) \ (ctx->qlimits.name## hardlimit && \ (ctx->qlimits.name## hardlimit != sbinfo->qlimits.name## hardlimit)) if (CHANGED_LIMIT(usrquota_b) || CHANGED_LIMIT(usrquota_i) || CHANGED_LIMIT(grpquota_b) || CHANGED_LIMIT(grpquota_i)) { err = "Cannot change global quota limit on remount"; goto out; } #endif /* CONFIG_TMPFS_QUOTA */ if (ctx->seen & SHMEM_SEEN_HUGE) sbinfo->huge = ctx->huge; if (ctx->seen & SHMEM_SEEN_INUMS) sbinfo->full_inums = ctx->full_inums; if (ctx->seen & SHMEM_SEEN_BLOCKS) sbinfo->max_blocks = ctx->blocks; if (ctx->seen & SHMEM_SEEN_INODES) { sbinfo->max_inodes = ctx->inodes; sbinfo->free_ispace = ctx->inodes * BOGO_INODE_SIZE - used_isp; } /* * Preserve previous mempolicy unless mpol remount option was specified. */ if (ctx->mpol) { mpol = sbinfo->mpol; sbinfo->mpol = ctx->mpol; /* transfers initial ref */ ctx->mpol = NULL; } if (ctx->noswap) sbinfo->noswap = true; raw_spin_unlock(&sbinfo->stat_lock); mpol_put(mpol); return 0; out: raw_spin_unlock(&sbinfo->stat_lock); return invalfc(fc, "%s", err); } static int shmem_show_options(struct seq_file *seq, struct dentry *root) { struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb); struct mempolicy *mpol; if (sbinfo->max_blocks != shmem_default_max_blocks()) seq_printf(seq, ",size=%luk", K(sbinfo->max_blocks)); if (sbinfo->max_inodes != shmem_default_max_inodes()) seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes); if (sbinfo->mode != (0777 | S_ISVTX)) seq_printf(seq, ",mode=%03ho", sbinfo->mode); if (!uid_eq(sbinfo->uid, GLOBAL_ROOT_UID)) seq_printf(seq, ",uid=%u", from_kuid_munged(&init_user_ns, sbinfo->uid)); if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID)) seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, sbinfo->gid)); /* * Showing inode{64,32} might be useful even if it's the system default, * since then people don't have to resort to checking both here and * /proc/config.gz to confirm 64-bit inums were successfully applied * (which may not even exist if IKCONFIG_PROC isn't enabled). * * We hide it when inode64 isn't the default and we are using 32-bit * inodes, since that probably just means the feature isn't even under * consideration. * * As such: * * +-----------------+-----------------+ * | TMPFS_INODE64=y | TMPFS_INODE64=n | * +------------------+-----------------+-----------------+ * | full_inums=true | show | show | * | full_inums=false | show | hide | * +------------------+-----------------+-----------------+ * */ if (IS_ENABLED(CONFIG_TMPFS_INODE64) || sbinfo->full_inums) seq_printf(seq, ",inode%d", (sbinfo->full_inums ? 64 : 32)); #ifdef CONFIG_TRANSPARENT_HUGEPAGE /* Rightly or wrongly, show huge mount option unmasked by shmem_huge */ if (sbinfo->huge) seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge)); #endif mpol = shmem_get_sbmpol(sbinfo); shmem_show_mpol(seq, mpol); mpol_put(mpol); if (sbinfo->noswap) seq_printf(seq, ",noswap"); #ifdef CONFIG_TMPFS_QUOTA if (sb_has_quota_active(root->d_sb, USRQUOTA)) seq_printf(seq, ",usrquota"); if (sb_has_quota_active(root->d_sb, GRPQUOTA)) seq_printf(seq, ",grpquota"); if (sbinfo->qlimits.usrquota_bhardlimit) seq_printf(seq, ",usrquota_block_hardlimit=%lld", sbinfo->qlimits.usrquota_bhardlimit); if (sbinfo->qlimits.grpquota_bhardlimit) seq_printf(seq, ",grpquota_block_hardlimit=%lld", sbinfo->qlimits.grpquota_bhardlimit); if (sbinfo->qlimits.usrquota_ihardlimit) seq_printf(seq, ",usrquota_inode_hardlimit=%lld", sbinfo->qlimits.usrquota_ihardlimit); if (sbinfo->qlimits.grpquota_ihardlimit) seq_printf(seq, ",grpquota_inode_hardlimit=%lld", sbinfo->qlimits.grpquota_ihardlimit); #endif return 0; } #endif /* CONFIG_TMPFS */ static void shmem_put_super(struct super_block *sb) { struct shmem_sb_info *sbinfo = SHMEM_SB(sb); #if IS_ENABLED(CONFIG_UNICODE) if (sb->s_encoding) utf8_unload(sb->s_encoding); #endif #ifdef CONFIG_TMPFS_QUOTA shmem_disable_quotas(sb); #endif free_percpu(sbinfo->ino_batch); percpu_counter_destroy(&sbinfo->used_blocks); mpol_put(sbinfo->mpol); kfree(sbinfo); sb->s_fs_info = NULL; } #if IS_ENABLED(CONFIG_UNICODE) && defined(CONFIG_TMPFS) static const struct dentry_operations shmem_ci_dentry_ops = { .d_hash = generic_ci_d_hash, .d_compare = generic_ci_d_compare, }; #endif static int shmem_fill_super(struct super_block *sb, struct fs_context *fc) { struct shmem_options *ctx = fc->fs_private; struct inode *inode; struct shmem_sb_info *sbinfo; int error = -ENOMEM; /* Round up to L1_CACHE_BYTES to resist false sharing */ sbinfo = kzalloc(max((int)sizeof(struct shmem_sb_info), L1_CACHE_BYTES), GFP_KERNEL); if (!sbinfo) return error; sb->s_fs_info = sbinfo; #ifdef CONFIG_TMPFS /* * Per default we only allow half of the physical ram per * tmpfs instance, limiting inodes to one per page of lowmem; * but the internal instance is left unlimited. */ if (!(sb->s_flags & SB_KERNMOUNT)) { if (!(ctx->seen & SHMEM_SEEN_BLOCKS)) ctx->blocks = shmem_default_max_blocks(); if (!(ctx->seen & SHMEM_SEEN_INODES)) ctx->inodes = shmem_default_max_inodes(); if (!(ctx->seen & SHMEM_SEEN_INUMS)) ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64); sbinfo->noswap = ctx->noswap; } else { sb->s_flags |= SB_NOUSER; } sb->s_export_op = &shmem_export_ops; sb->s_flags |= SB_NOSEC; #if IS_ENABLED(CONFIG_UNICODE) if (!ctx->encoding && ctx->strict_encoding) { pr_err("tmpfs: strict_encoding option without encoding is forbidden\n"); error = -EINVAL; goto failed; } if (ctx->encoding) { sb->s_encoding = ctx->encoding; set_default_d_op(sb, &shmem_ci_dentry_ops); if (ctx->strict_encoding) sb->s_encoding_flags = SB_ENC_STRICT_MODE_FL; } #endif #else sb->s_flags |= SB_NOUSER; #endif /* CONFIG_TMPFS */ sb->s_d_flags |= DCACHE_DONTCACHE; sbinfo->max_blocks = ctx->blocks; sbinfo->max_inodes = ctx->inodes; sbinfo->free_ispace = sbinfo->max_inodes * BOGO_INODE_SIZE; if (sb->s_flags & SB_KERNMOUNT) { sbinfo->ino_batch = alloc_percpu(ino_t); if (!sbinfo->ino_batch) goto failed; } sbinfo->uid = ctx->uid; sbinfo->gid = ctx->gid; sbinfo->full_inums = ctx->full_inums; sbinfo->mode = ctx->mode; #ifdef CONFIG_TRANSPARENT_HUGEPAGE if (ctx->seen & SHMEM_SEEN_HUGE) sbinfo->huge = ctx->huge; else sbinfo->huge = tmpfs_huge; #endif sbinfo->mpol = ctx->mpol; ctx->mpol = NULL; raw_spin_lock_init(&sbinfo->stat_lock); if (percpu_counter_init(&sbinfo->used_blocks, 0, GFP_KERNEL)) goto failed; spin_lock_init(&sbinfo->shrinklist_lock); INIT_LIST_HEAD(&sbinfo->shrinklist); sb->s_maxbytes = MAX_LFS_FILESIZE; sb->s_blocksize = PAGE_SIZE; sb->s_blocksize_bits = PAGE_SHIFT; sb->s_magic = TMPFS_MAGIC; sb->s_op = &shmem_ops; sb->s_time_gran = 1; #ifdef CONFIG_TMPFS_XATTR sb->s_xattr = shmem_xattr_handlers; #endif #ifdef CONFIG_TMPFS_POSIX_ACL sb->s_flags |= SB_POSIXACL; #endif uuid_t uuid; uuid_gen(&uuid); super_set_uuid(sb, uuid.b, sizeof(uuid)); #ifdef CONFIG_TMPFS_QUOTA if (ctx->seen & SHMEM_SEEN_QUOTA) { sb->dq_op = &shmem_quota_operations; sb->s_qcop = &dquot_quotactl_sysfile_ops; sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP; /* Copy the default limits from ctx into sbinfo */ memcpy(&sbinfo->qlimits, &ctx->qlimits, sizeof(struct shmem_quota_limits)); if (shmem_enable_quotas(sb, ctx->quota_types)) goto failed; } #endif /* CONFIG_TMPFS_QUOTA */ inode = shmem_get_inode(&nop_mnt_idmap, sb, NULL, S_IFDIR | sbinfo->mode, 0, VM_NORESERVE); if (IS_ERR(inode)) { error = PTR_ERR(inode); goto failed; } inode->i_uid = sbinfo->uid; inode->i_gid = sbinfo->gid; sb->s_root = d_make_root(inode); if (!sb->s_root) goto failed; return 0; failed: shmem_put_super(sb); return error; } static int shmem_get_tree(struct fs_context *fc) { return get_tree_nodev(fc, shmem_fill_super); } static void shmem_free_fc(struct fs_context *fc) { struct shmem_options *ctx = fc->fs_private; if (ctx) { mpol_put(ctx->mpol); kfree(ctx); } } static const struct fs_context_operations shmem_fs_context_ops = { .free = shmem_free_fc, .get_tree = shmem_get_tree, #ifdef CONFIG_TMPFS .parse_monolithic = shmem_parse_monolithic, .parse_param = shmem_parse_one, .reconfigure = shmem_reconfigure, #endif }; static struct kmem_cache *shmem_inode_cachep __ro_after_init; static struct inode *shmem_alloc_inode(struct super_block *sb) { struct shmem_inode_info *info; info = alloc_inode_sb(sb, shmem_inode_cachep, GFP_KERNEL); if (!info) return NULL; return &info->vfs_inode; } static void shmem_free_in_core_inode(struct inode *inode) { if (S_ISLNK(inode->i_mode)) kfree(inode->i_link); kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode)); } static void shmem_destroy_inode(struct inode *inode) { if (S_ISREG(inode->i_mode)) mpol_free_shared_policy(&SHMEM_I(inode)->policy); if (S_ISDIR(inode->i_mode)) simple_offset_destroy(shmem_get_offset_ctx(inode)); } static void shmem_init_inode(void *foo) { struct shmem_inode_info *info = foo; inode_init_once(&info->vfs_inode); } static void __init shmem_init_inodecache(void) { shmem_inode_cachep = kmem_cache_create("shmem_inode_cache", sizeof(struct shmem_inode_info), 0, SLAB_PANIC|SLAB_ACCOUNT, shmem_init_inode); } static void __init shmem_destroy_inodecache(void) { kmem_cache_destroy(shmem_inode_cachep); } /* Keep the page in page cache instead of truncating it */ static int shmem_error_remove_folio(struct address_space *mapping, struct folio *folio) { return 0; } static const struct address_space_operations shmem_aops = { .dirty_folio = noop_dirty_folio, #ifdef CONFIG_TMPFS .write_begin = shmem_write_begin, .write_end = shmem_write_end, #endif #ifdef CONFIG_MIGRATION .migrate_folio = migrate_folio, #endif .error_remove_folio = shmem_error_remove_folio, }; static const struct file_operations shmem_file_operations = { .mmap_prepare = shmem_mmap_prepare, .open = shmem_file_open, .get_unmapped_area = shmem_get_unmapped_area, #ifdef CONFIG_TMPFS .llseek = shmem_file_llseek, .read_iter = shmem_file_read_iter, .write_iter = shmem_file_write_iter, .fsync = noop_fsync, .splice_read = shmem_file_splice_read, .splice_write = iter_file_splice_write, .fallocate = shmem_fallocate, #endif }; static const struct inode_operations shmem_inode_operations = { .getattr = shmem_getattr, .setattr = shmem_setattr, #ifdef CONFIG_TMPFS_XATTR .listxattr = shmem_listxattr, .set_acl = simple_set_acl, .fileattr_get = shmem_fileattr_get, .fileattr_set = shmem_fileattr_set, #endif }; static const struct inode_operations shmem_dir_inode_operations = { #ifdef CONFIG_TMPFS .getattr = shmem_getattr, .create = shmem_create, .lookup = simple_lookup, .link = shmem_link, .unlink = shmem_unlink, .symlink = shmem_symlink, .mkdir = shmem_mkdir, .rmdir = shmem_rmdir, .mknod = shmem_mknod, .rename = shmem_rename2, .tmpfile = shmem_tmpfile, .get_offset_ctx = shmem_get_offset_ctx, #endif #ifdef CONFIG_TMPFS_XATTR .listxattr = shmem_listxattr, .fileattr_get = shmem_fileattr_get, .fileattr_set = shmem_fileattr_set, #endif #ifdef CONFIG_TMPFS_POSIX_ACL .setattr = shmem_setattr, .set_acl = simple_set_acl, #endif }; static const struct inode_operations shmem_special_inode_operations = { .getattr = shmem_getattr, #ifdef CONFIG_TMPFS_XATTR .listxattr = shmem_listxattr, #endif #ifdef CONFIG_TMPFS_POSIX_ACL .setattr = shmem_setattr, .set_acl = simple_set_acl, #endif }; static const struct super_operations shmem_ops = { .alloc_inode = shmem_alloc_inode, .free_inode = shmem_free_in_core_inode, .destroy_inode = shmem_destroy_inode, #ifdef CONFIG_TMPFS .statfs = shmem_statfs, .show_options = shmem_show_options, #endif #ifdef CONFIG_TMPFS_QUOTA .get_dquots = shmem_get_dquots, #endif .evict_inode = shmem_evict_inode, .drop_inode = inode_just_drop, .put_super = shmem_put_super, #ifdef CONFIG_TRANSPARENT_HUGEPAGE .nr_cached_objects = shmem_unused_huge_count, .free_cached_objects = shmem_unused_huge_scan, #endif }; static const struct vm_operations_struct shmem_vm_ops = { .fault = shmem_fault, .map_pages = filemap_map_pages, #ifdef CONFIG_NUMA .set_policy = shmem_set_policy, .get_policy = shmem_get_policy, #endif }; static const struct vm_operations_struct shmem_anon_vm_ops = { .fault = shmem_fault, .map_pages = filemap_map_pages, #ifdef CONFIG_NUMA .set_policy = shmem_set_policy, .get_policy = shmem_get_policy, #endif }; int shmem_init_fs_context(struct fs_context *fc) { struct shmem_options *ctx; ctx = kzalloc(sizeof(struct shmem_options), GFP_KERNEL); if (!ctx) return -ENOMEM; ctx->mode = 0777 | S_ISVTX; ctx->uid = current_fsuid(); ctx->gid = current_fsgid(); #if IS_ENABLED(CONFIG_UNICODE) ctx->encoding = NULL; #endif fc->fs_private = ctx; fc->ops = &shmem_fs_context_ops; #ifdef CONFIG_TMPFS fc->sb_flags |= SB_I_VERSION; #endif return 0; } static struct file_system_type shmem_fs_type = { .owner = THIS_MODULE, .name = "tmpfs", .init_fs_context = shmem_init_fs_context, #ifdef CONFIG_TMPFS .parameters = shmem_fs_parameters, #endif .kill_sb = kill_anon_super, .fs_flags = FS_USERNS_MOUNT | FS_ALLOW_IDMAP | FS_MGTIME, }; #if defined(CONFIG_SYSFS) && defined(CONFIG_TMPFS) #define __INIT_KOBJ_ATTR(_name, _mode, _show, _store) \ { \ .attr = { .name = __stringify(_name), .mode = _mode }, \ .show = _show, \ .store = _store, \ } #define TMPFS_ATTR_W(_name, _store) \ static struct kobj_attribute tmpfs_attr_##_name = \ __INIT_KOBJ_ATTR(_name, 0200, NULL, _store) #define TMPFS_ATTR_RW(_name, _show, _store) \ static struct kobj_attribute tmpfs_attr_##_name = \ __INIT_KOBJ_ATTR(_name, 0644, _show, _store) #define TMPFS_ATTR_RO(_name, _show) \ static struct kobj_attribute tmpfs_attr_##_name = \ __INIT_KOBJ_ATTR(_name, 0444, _show, NULL) #if IS_ENABLED(CONFIG_UNICODE) static ssize_t casefold_show(struct kobject *kobj, struct kobj_attribute *a, char *buf) { return sysfs_emit(buf, "supported\n"); } TMPFS_ATTR_RO(casefold, casefold_show); #endif static struct attribute *tmpfs_attributes[] = { #if IS_ENABLED(CONFIG_UNICODE) &tmpfs_attr_casefold.attr, #endif NULL }; static const struct attribute_group tmpfs_attribute_group = { .attrs = tmpfs_attributes, .name = "features" }; static struct kobject *tmpfs_kobj; static int __init tmpfs_sysfs_init(void) { int ret; tmpfs_kobj = kobject_create_and_add("tmpfs", fs_kobj); if (!tmpfs_kobj) return -ENOMEM; ret = sysfs_create_group(tmpfs_kobj, &tmpfs_attribute_group); if (ret) kobject_put(tmpfs_kobj); return ret; } #endif /* CONFIG_SYSFS && CONFIG_TMPFS */ void __init shmem_init(void) { int error; shmem_init_inodecache(); #ifdef CONFIG_TMPFS_QUOTA register_quota_format(&shmem_quota_format); #endif error = register_filesystem(&shmem_fs_type); if (error) { pr_err("Could not register tmpfs\n"); goto out2; } shm_mnt = kern_mount(&shmem_fs_type); if (IS_ERR(shm_mnt)) { error = PTR_ERR(shm_mnt); pr_err("Could not kern_mount tmpfs\n"); goto out1; } #if defined(CONFIG_SYSFS) && defined(CONFIG_TMPFS) error = tmpfs_sysfs_init(); if (error) { pr_err("Could not init tmpfs sysfs\n"); goto out1; } #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY) SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; else shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */ /* * Default to setting PMD-sized THP to inherit the global setting and * disable all other multi-size THPs. */ if (!shmem_orders_configured) huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER); #endif return; out1: unregister_filesystem(&shmem_fs_type); out2: #ifdef CONFIG_TMPFS_QUOTA unregister_quota_format(&shmem_quota_format); #endif shmem_destroy_inodecache(); shm_mnt = ERR_PTR(error); } #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSFS) static ssize_t shmem_enabled_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { static const int values[] = { SHMEM_HUGE_ALWAYS, SHMEM_HUGE_WITHIN_SIZE, SHMEM_HUGE_ADVISE, SHMEM_HUGE_NEVER, SHMEM_HUGE_DENY, SHMEM_HUGE_FORCE, }; int len = 0; int i; for (i = 0; i < ARRAY_SIZE(values); i++) { len += sysfs_emit_at(buf, len, shmem_huge == values[i] ? "%s[%s]" : "%s%s", i ? " " : "", shmem_format_huge(values[i])); } len += sysfs_emit_at(buf, len, "\n"); return len; } static ssize_t shmem_enabled_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { char tmp[16]; int huge, err; if (count + 1 > sizeof(tmp)) return -EINVAL; memcpy(tmp, buf, count); tmp[count] = '\0'; if (count && tmp[count - 1] == '\n') tmp[count - 1] = '\0'; huge = shmem_parse_huge(tmp); if (huge == -EINVAL) return huge; shmem_huge = huge; if (shmem_huge > SHMEM_HUGE_DENY) SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; err = start_stop_khugepaged(); return err ? err : count; } struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled); static DEFINE_SPINLOCK(huge_shmem_orders_lock); static ssize_t thpsize_shmem_enabled_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { int order = to_thpsize(kobj)->order; const char *output; if (test_bit(order, &huge_shmem_orders_always)) output = "[always] inherit within_size advise never"; else if (test_bit(order, &huge_shmem_orders_inherit)) output = "always [inherit] within_size advise never"; else if (test_bit(order, &huge_shmem_orders_within_size)) output = "always inherit [within_size] advise never"; else if (test_bit(order, &huge_shmem_orders_madvise)) output = "always inherit within_size [advise] never"; else output = "always inherit within_size advise [never]"; return sysfs_emit(buf, "%s\n", output); } static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int order = to_thpsize(kobj)->order; ssize_t ret = count; if (sysfs_streq(buf, "always")) { spin_lock(&huge_shmem_orders_lock); clear_bit(order, &huge_shmem_orders_inherit); clear_bit(order, &huge_shmem_orders_madvise); clear_bit(order, &huge_shmem_orders_within_size); set_bit(order, &huge_shmem_orders_always); spin_unlock(&huge_shmem_orders_lock); } else if (sysfs_streq(buf, "inherit")) { /* Do not override huge allocation policy with non-PMD sized mTHP */ if (shmem_huge == SHMEM_HUGE_FORCE && order != HPAGE_PMD_ORDER) return -EINVAL; spin_lock(&huge_shmem_orders_lock); clear_bit(order, &huge_shmem_orders_always); clear_bit(order, &huge_shmem_orders_madvise); clear_bit(order, &huge_shmem_orders_within_size); set_bit(order, &huge_shmem_orders_inherit); spin_unlock(&huge_shmem_orders_lock); } else if (sysfs_streq(buf, "within_size")) { spin_lock(&huge_shmem_orders_lock); clear_bit(order, &huge_shmem_orders_always); clear_bit(order, &huge_shmem_orders_inherit); clear_bit(order, &huge_shmem_orders_madvise); set_bit(order, &huge_shmem_orders_within_size); spin_unlock(&huge_shmem_orders_lock); } else if (sysfs_streq(buf, "advise")) { spin_lock(&huge_shmem_orders_lock); clear_bit(order, &huge_shmem_orders_always); clear_bit(order, &huge_shmem_orders_inherit); clear_bit(order, &huge_shmem_orders_within_size); set_bit(order, &huge_shmem_orders_madvise); spin_unlock(&huge_shmem_orders_lock); } else if (sysfs_streq(buf, "never")) { spin_lock(&huge_shmem_orders_lock); clear_bit(order, &huge_shmem_orders_always); clear_bit(order, &huge_shmem_orders_inherit); clear_bit(order, &huge_shmem_orders_within_size); clear_bit(order, &huge_shmem_orders_madvise); spin_unlock(&huge_shmem_orders_lock); } else { ret = -EINVAL; } if (ret > 0) { int err = start_stop_khugepaged(); if (err) ret = err; } return ret; } struct kobj_attribute thpsize_shmem_enabled_attr = __ATTR(shmem_enabled, 0644, thpsize_shmem_enabled_show, thpsize_shmem_enabled_store); #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */ #if defined(CONFIG_TRANSPARENT_HUGEPAGE) static int __init setup_transparent_hugepage_shmem(char *str) { int huge; huge = shmem_parse_huge(str); if (huge == -EINVAL) { pr_warn("transparent_hugepage_shmem= cannot parse, ignored\n"); return huge; } shmem_huge = huge; return 1; } __setup("transparent_hugepage_shmem=", setup_transparent_hugepage_shmem); static int __init setup_transparent_hugepage_tmpfs(char *str) { int huge; huge = shmem_parse_huge(str); if (huge < 0) { pr_warn("transparent_hugepage_tmpfs= cannot parse, ignored\n"); return huge; } tmpfs_huge = huge; return 1; } __setup("transparent_hugepage_tmpfs=", setup_transparent_hugepage_tmpfs); static char str_dup[PAGE_SIZE] __initdata; static int __init setup_thp_shmem(char *str) { char *token, *range, *policy, *subtoken; unsigned long always, inherit, madvise, within_size; char *start_size, *end_size; int start, end, nr; char *p; if (!str || strlen(str) + 1 > PAGE_SIZE) goto err; strscpy(str_dup, str); always = huge_shmem_orders_always; inherit = huge_shmem_orders_inherit; madvise = huge_shmem_orders_madvise; within_size = huge_shmem_orders_within_size; p = str_dup; while ((token = strsep(&p, ";")) != NULL) { range = strsep(&token, ":"); policy = token; if (!policy) goto err; while ((subtoken = strsep(&range, ",")) != NULL) { if (strchr(subtoken, '-')) { start_size = strsep(&subtoken, "-"); end_size = subtoken; start = get_order_from_str(start_size, THP_ORDERS_ALL_FILE_DEFAULT); end = get_order_from_str(end_size, THP_ORDERS_ALL_FILE_DEFAULT); } else { start_size = end_size = subtoken; start = end = get_order_from_str(subtoken, THP_ORDERS_ALL_FILE_DEFAULT); } if (start < 0) { pr_err("invalid size %s in thp_shmem boot parameter\n", start_size); goto err; } if (end < 0) { pr_err("invalid size %s in thp_shmem boot parameter\n", end_size); goto err; } if (start > end) goto err; nr = end - start + 1; if (!strcmp(policy, "always")) { bitmap_set(&always, start, nr); bitmap_clear(&inherit, start, nr); bitmap_clear(&madvise, start, nr); bitmap_clear(&within_size, start, nr); } else if (!strcmp(policy, "advise")) { bitmap_set(&madvise, start, nr); bitmap_clear(&inherit, start, nr); bitmap_clear(&always, start, nr); bitmap_clear(&within_size, start, nr); } else if (!strcmp(policy, "inherit")) { bitmap_set(&inherit, start, nr); bitmap_clear(&madvise, start, nr); bitmap_clear(&always, start, nr); bitmap_clear(&within_size, start, nr); } else if (!strcmp(policy, "within_size")) { bitmap_set(&within_size, start, nr); bitmap_clear(&inherit, start, nr); bitmap_clear(&madvise, start, nr); bitmap_clear(&always, start, nr); } else if (!strcmp(policy, "never")) { bitmap_clear(&inherit, start, nr); bitmap_clear(&madvise, start, nr); bitmap_clear(&always, start, nr); bitmap_clear(&within_size, start, nr); } else { pr_err("invalid policy %s in thp_shmem boot parameter\n", policy); goto err; } } } huge_shmem_orders_always = always; huge_shmem_orders_madvise = madvise; huge_shmem_orders_inherit = inherit; huge_shmem_orders_within_size = within_size; shmem_orders_configured = true; return 1; err: pr_warn("thp_shmem=%s: error parsing string, ignoring setting\n", str); return 0; } __setup("thp_shmem=", setup_thp_shmem); #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #else /* !CONFIG_SHMEM */ /* * tiny-shmem: simple shmemfs and tmpfs using ramfs code * * This is intended for small system where the benefits of the full * shmem code (swap-backed and resource-limited) are outweighed by * their complexity. On systems without swap this code should be * effectively equivalent, but much lighter weight. */ static struct file_system_type shmem_fs_type = { .name = "tmpfs", .init_fs_context = ramfs_init_fs_context, .parameters = ramfs_fs_parameters, .kill_sb = ramfs_kill_sb, .fs_flags = FS_USERNS_MOUNT, }; void __init shmem_init(void) { BUG_ON(register_filesystem(&shmem_fs_type) != 0); shm_mnt = kern_mount(&shmem_fs_type); BUG_ON(IS_ERR(shm_mnt)); } int shmem_unuse(unsigned int type) { return 0; } int shmem_lock(struct file *file, int lock, struct ucounts *ucounts) { return 0; } void shmem_unlock_mapping(struct address_space *mapping) { } #ifdef CONFIG_MMU unsigned long shmem_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) { return mm_get_unmapped_area(file, addr, len, pgoff, flags); } #endif void shmem_truncate_range(struct inode *inode, loff_t lstart, uoff_t lend) { truncate_inode_pages_range(inode->i_mapping, lstart, lend); } EXPORT_SYMBOL_GPL(shmem_truncate_range); #define shmem_vm_ops generic_file_vm_ops #define shmem_anon_vm_ops generic_file_vm_ops #define shmem_file_operations ramfs_file_operations static inline int shmem_acct_size(unsigned long flags, loff_t size) { return 0; } static inline void shmem_unacct_size(unsigned long flags, loff_t size) { } static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block *sb, struct inode *dir, umode_t mode, dev_t dev, unsigned long flags) { struct inode *inode = ramfs_get_inode(sb, dir, mode, dev); return inode ? inode : ERR_PTR(-ENOSPC); } #endif /* CONFIG_SHMEM */ /* common code */ static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name, loff_t size, unsigned long vm_flags, unsigned int i_flags) { unsigned long flags = (vm_flags & VM_NORESERVE) ? SHMEM_F_NORESERVE : 0; struct inode *inode; struct file *res; if (IS_ERR(mnt)) return ERR_CAST(mnt); if (size < 0 || size > MAX_LFS_FILESIZE) return ERR_PTR(-EINVAL); if (is_idmapped_mnt(mnt)) return ERR_PTR(-EINVAL); if (shmem_acct_size(flags, size)) return ERR_PTR(-ENOMEM); inode = shmem_get_inode(&nop_mnt_idmap, mnt->mnt_sb, NULL, S_IFREG | S_IRWXUGO, 0, vm_flags); if (IS_ERR(inode)) { shmem_unacct_size(flags, size); return ERR_CAST(inode); } inode->i_flags |= i_flags; inode->i_size = size; clear_nlink(inode); /* It is unlinked */ res = ERR_PTR(ramfs_nommu_expand_for_mapping(inode, size)); if (!IS_ERR(res)) res = alloc_file_pseudo(inode, mnt, name, O_RDWR, &shmem_file_operations); if (IS_ERR(res)) iput(inode); return res; } /** * shmem_kernel_file_setup - get an unlinked file living in tmpfs which must be * kernel internal. There will be NO LSM permission checks against the * underlying inode. So users of this interface must do LSM checks at a * higher layer. The users are the big_key and shm implementations. LSM * checks are provided at the key or shm level rather than the inode. * @name: name for dentry (to be seen in /proc/<pid>/maps) * @size: size to be set for the file * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size */ struct file *shmem_kernel_file_setup(const char *name, loff_t size, unsigned long flags) { return __shmem_file_setup(shm_mnt, name, size, flags, S_PRIVATE); } EXPORT_SYMBOL_GPL(shmem_kernel_file_setup); /** * shmem_file_setup - get an unlinked file living in tmpfs * @name: name for dentry (to be seen in /proc/<pid>/maps) * @size: size to be set for the file * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size */ struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags) { return __shmem_file_setup(shm_mnt, name, size, flags, 0); } EXPORT_SYMBOL_GPL(shmem_file_setup); /** * shmem_file_setup_with_mnt - get an unlinked file living in tmpfs * @mnt: the tmpfs mount where the file will be created * @name: name for dentry (to be seen in /proc/<pid>/maps) * @size: size to be set for the file * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size */ struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt, const char *name, loff_t size, unsigned long flags) { return __shmem_file_setup(mnt, name, size, flags, 0); } EXPORT_SYMBOL_GPL(shmem_file_setup_with_mnt); static struct file *__shmem_zero_setup(unsigned long start, unsigned long end, vm_flags_t vm_flags) { loff_t size = end - start; /* * Cloning a new file under mmap_lock leads to a lock ordering conflict * between XFS directory reading and selinux: since this file is only * accessible to the user through its mapping, use S_PRIVATE flag to * bypass file security, in the same way as shmem_kernel_file_setup(). */ return shmem_kernel_file_setup("dev/zero", size, vm_flags); } /** * shmem_zero_setup - setup a shared anonymous mapping * @vma: the vma to be mmapped is prepared by do_mmap * Returns: 0 on success, or error */ int shmem_zero_setup(struct vm_area_struct *vma) { struct file *file = __shmem_zero_setup(vma->vm_start, vma->vm_end, vma->vm_flags); if (IS_ERR(file)) return PTR_ERR(file); if (vma->vm_file) fput(vma->vm_file); vma->vm_file = file; vma->vm_ops = &shmem_anon_vm_ops; return 0; } /** * shmem_zero_setup_desc - same as shmem_zero_setup, but determined by VMA * descriptor for convenience. * @desc: Describes VMA * Returns: 0 on success, or error */ int shmem_zero_setup_desc(struct vm_area_desc *desc) { struct file *file = __shmem_zero_setup(desc->start, desc->end, desc->vm_flags); if (IS_ERR(file)) return PTR_ERR(file); desc->vm_file = file; desc->vm_ops = &shmem_anon_vm_ops; return 0; } /** * shmem_read_folio_gfp - read into page cache, using specified page allocation flags. * @mapping: the folio's address_space * @index: the folio index * @gfp: the page allocator flags to use if allocating * * This behaves as a tmpfs "read_cache_page_gfp(mapping, index, gfp)", * with any new page allocations done using the specified allocation flags. * But read_cache_page_gfp() uses the ->read_folio() method: which does not * suit tmpfs, since it may have pages in swapcache, and needs to find those * for itself; although drivers/gpu/drm i915 and ttm rely upon this support. * * i915_gem_object_get_pages_gtt() mixes __GFP_NORETRY | __GFP_NOWARN in * with the mapping_gfp_mask(), to avoid OOMing the machine unnecessarily. */ struct folio *shmem_read_folio_gfp(struct address_space *mapping, pgoff_t index, gfp_t gfp) { #ifdef CONFIG_SHMEM struct inode *inode = mapping->host; struct folio *folio; int error; error = shmem_get_folio_gfp(inode, index, i_size_read(inode), &folio, SGP_CACHE, gfp, NULL, NULL); if (error) return ERR_PTR(error); folio_unlock(folio); return folio; #else /* * The tiny !SHMEM case uses ramfs without swap */ return mapping_read_folio_gfp(mapping, index, gfp); #endif } EXPORT_SYMBOL_GPL(shmem_read_folio_gfp); struct page *shmem_read_mapping_page_gfp(struct address_space *mapping, pgoff_t index, gfp_t gfp) { struct folio *folio = shmem_read_folio_gfp(mapping, index, gfp); struct page *page; if (IS_ERR(folio)) return &folio->page; page = folio_file_page(folio, index); if (PageHWPoison(page)) { folio_put(folio); return ERR_PTR(-EIO); } return page; } EXPORT_SYMBOL_GPL(shmem_read_mapping_page_gfp); |
| 4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 | /* SPDX-License-Identifier: GPL-2.0-only */ /* * AppArmor security module * * This file contains AppArmor lib definitions * * 2017 Canonical Ltd. */ #ifndef __AA_LIB_H #define __AA_LIB_H #include <linux/slab.h> #include <linux/fs.h> #include <linux/lsm_hooks.h> #include "match.h" extern struct aa_dfa *stacksplitdfa; /* * split individual debug cases out in preparation for finer grained * debug controls in the future. */ #define dbg_printk(__fmt, __args...) pr_debug(__fmt, ##__args) #define DEBUG_NONE 0 #define DEBUG_LABEL_ABS_ROOT 1 #define DEBUG_LABEL 2 #define DEBUG_DOMAIN 4 #define DEBUG_POLICY 8 #define DEBUG_INTERFACE 0x10 #define DEBUG_ALL 0x1f /* update if new DEBUG_X added */ #define DEBUG_PARSE_ERROR (-1) #define DEBUG_ON (aa_g_debug != DEBUG_NONE) #define DEBUG_ABS_ROOT (aa_g_debug & DEBUG_LABEL_ABS_ROOT) #define AA_DEBUG(opt, fmt, args...) \ do { \ if (aa_g_debug & opt) \ pr_warn_ratelimited("%s: " fmt, __func__, ##args); \ } while (0) #define AA_DEBUG_LABEL(LAB, X, fmt, args...) \ do { \ if ((LAB)->flags & FLAG_DEBUG1) \ AA_DEBUG(X, fmt, args); \ } while (0) #define AA_WARN(X) WARN((X), "APPARMOR WARN %s: %s\n", __func__, #X) #define AA_BUG(X, args...) \ do { \ _Pragma("GCC diagnostic ignored \"-Wformat-zero-length\""); \ AA_BUG_FMT((X), "" args); \ _Pragma("GCC diagnostic warning \"-Wformat-zero-length\""); \ } while (0) #ifdef CONFIG_SECURITY_APPARMOR_DEBUG_ASSERTS #define AA_BUG_FMT(X, fmt, args...) \ WARN((X), "AppArmor WARN %s: (" #X "): " fmt, __func__, ##args) #else #define AA_BUG_FMT(X, fmt, args...) \ do { \ BUILD_BUG_ON_INVALID(X); \ no_printk(fmt, ##args); \ } while (0) #endif int aa_parse_debug_params(const char *str); int aa_print_debug_params(char *buffer); #define AA_ERROR(fmt, args...) \ pr_err_ratelimited("AppArmor: " fmt, ##args) /* Flag indicating whether initialization completed */ extern int apparmor_initialized; /* fn's in lib */ const char *skipn_spaces(const char *str, size_t n); const char *aa_splitn_fqname(const char *fqname, size_t n, const char **ns_name, size_t *ns_len); void aa_info_message(const char *str); /* Security blob offsets */ extern struct lsm_blob_sizes apparmor_blob_sizes; /** * aa_strneq - compare null terminated @str to a non null terminated substring * @str: a null terminated string * @sub: a substring, not necessarily null terminated * @len: length of @sub to compare * * The @str string must be full consumed for this to be considered a match */ static inline bool aa_strneq(const char *str, const char *sub, int len) { return !strncmp(str, sub, len) && !str[len]; } /** * aa_dfa_null_transition - step to next state after null character * @dfa: the dfa to match against * @start: the state of the dfa to start matching in * * aa_dfa_null_transition transitions to the next state after a null * character which is not used in standard matching and is only * used to separate pairs. */ static inline aa_state_t aa_dfa_null_transition(struct aa_dfa *dfa, aa_state_t start) { /* the null transition only needs the string's null terminator byte */ return aa_dfa_next(dfa, start, 0); } static inline bool path_mediated_fs(struct dentry *dentry) { return !(dentry->d_sb->s_flags & SB_NOUSER); } struct aa_str_table { int size; char **table; }; void aa_free_str_table(struct aa_str_table *table); bool aa_resize_str_table(struct aa_str_table *t, int newsize, gfp_t gfp); struct counted_str { struct kref count; char name[]; }; #define str_to_counted(str) \ ((struct counted_str *)(str - offsetof(struct counted_str, name))) #define __counted /* atm just a notation */ void aa_str_kref(struct kref *kref); char *aa_str_alloc(int size, gfp_t gfp); static inline __counted char *aa_get_str(__counted char *str) { if (str) kref_get(&(str_to_counted(str)->count)); return str; } static inline void aa_put_str(__counted char *str) { if (str) kref_put(&str_to_counted(str)->count, aa_str_kref); } /* struct aa_policy - common part of both namespaces and profiles * @name: name of the object * @hname - The hierarchical name * @list: list policy object is on * @profiles: head of the profiles list contained in the object */ struct aa_policy { const char *name; __counted char *hname; struct list_head list; struct list_head profiles; }; /** * basename - find the last component of an hname * @hname: hname to find the base profile name component of (NOT NULL) * * Returns: the tail (base profile name) name component of an hname */ static inline const char *basename(const char *hname) { char *split; hname = strim((char *)hname); for (split = strstr(hname, "//"); split; split = strstr(hname, "//")) hname = split + 2; return hname; } /** * __policy_find - find a policy by @name on a policy list * @head: list to search (NOT NULL) * @name: name to search for (NOT NULL) * * Requires: rcu_read_lock be held * * Returns: unrefcounted policy that match @name or NULL if not found */ static inline struct aa_policy *__policy_find(struct list_head *head, const char *name) { struct aa_policy *policy; list_for_each_entry_rcu(policy, head, list) { if (!strcmp(policy->name, name)) return policy; } return NULL; } /** * __policy_strn_find - find a policy that's name matches @len chars of @str * @head: list to search (NOT NULL) * @str: string to search for (NOT NULL) * @len: length of match required * * Requires: rcu_read_lock be held * * Returns: unrefcounted policy that match @str or NULL if not found * * if @len == strlen(@strlen) then this is equiv to __policy_find * other wise it allows searching for policy by a partial match of name */ static inline struct aa_policy *__policy_strn_find(struct list_head *head, const char *str, int len) { struct aa_policy *policy; list_for_each_entry_rcu(policy, head, list) { if (aa_strneq(policy->name, str, len)) return policy; } return NULL; } bool aa_policy_init(struct aa_policy *policy, const char *prefix, const char *name, gfp_t gfp); void aa_policy_destroy(struct aa_policy *policy); /* * fn_label_build - abstract out the build of a label transition * @L: label the transition is being computed for * @P: profile parameter derived from L by this macro, can be passed to FN * @GFP: memory allocation type to use * @FN: fn to call for each profile transition. @P is set to the profile * * Returns: new label on success * ERR_PTR if build @FN fails * NULL if label_build fails due to low memory conditions * * @FN must return a label or ERR_PTR on failure. NULL is not allowed */ #define fn_label_build(L, P, GFP, FN) \ ({ \ __label__ __do_cleanup, __done; \ struct aa_label *__new_; \ \ if ((L)->size > 1) { \ /* TODO: add cache of transitions already done */ \ struct label_it __i; \ int __j, __k, __count; \ DEFINE_VEC(label, __lvec); \ DEFINE_VEC(profile, __pvec); \ if (vec_setup(label, __lvec, (L)->size, (GFP))) { \ __new_ = NULL; \ goto __done; \ } \ __j = 0; \ label_for_each(__i, (L), (P)) { \ __new_ = (FN); \ AA_BUG(!__new_); \ if (IS_ERR(__new_)) \ goto __do_cleanup; \ __lvec[__j++] = __new_; \ } \ for (__j = __count = 0; __j < (L)->size; __j++) \ __count += __lvec[__j]->size; \ if (!vec_setup(profile, __pvec, __count, (GFP))) { \ for (__j = __k = 0; __j < (L)->size; __j++) { \ label_for_each(__i, __lvec[__j], (P)) \ __pvec[__k++] = aa_get_profile(P); \ } \ __count -= aa_vec_unique(__pvec, __count, 0); \ if (__count > 1) { \ __new_ = aa_vec_find_or_create_label(__pvec,\ __count, (GFP)); \ /* only fails if out of Mem */ \ if (!__new_) \ __new_ = NULL; \ } else \ __new_ = aa_get_label(&__pvec[0]->label); \ vec_cleanup(profile, __pvec, __count); \ } else \ __new_ = NULL; \ __do_cleanup: \ vec_cleanup(label, __lvec, (L)->size); \ } else { \ (P) = labels_profile(L); \ __new_ = (FN); \ } \ __done: \ if (!__new_) \ AA_DEBUG(DEBUG_LABEL, "label build failed\n"); \ (__new_); \ }) #define __fn_build_in_ns(NS, P, NS_FN, OTHER_FN) \ ({ \ struct aa_label *__new; \ if ((P)->ns != (NS)) \ __new = (OTHER_FN); \ else \ __new = (NS_FN); \ (__new); \ }) #define fn_label_build_in_ns(L, P, GFP, NS_FN, OTHER_FN) \ ({ \ fn_label_build((L), (P), (GFP), \ __fn_build_in_ns(labels_ns(L), (P), (NS_FN), (OTHER_FN))); \ }) #endif /* __AA_LIB_H */ |
| 3 3 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_TRACE_EVENT_H #define _LINUX_TRACE_EVENT_H #include <linux/ring_buffer.h> #include <linux/trace_seq.h> #include <linux/percpu.h> #include <linux/hardirq.h> #include <linux/perf_event.h> #include <linux/tracepoint.h> struct trace_array; struct array_buffer; struct tracer; struct dentry; struct bpf_prog; union bpf_attr; /* Used for event string fields when they are NULL */ #define EVENT_NULL_STR "(null)" const char *trace_print_flags_seq(struct trace_seq *p, const char *delim, unsigned long flags, const struct trace_print_flags *flag_array); const char *trace_print_symbols_seq(struct trace_seq *p, unsigned long val, const struct trace_print_flags *symbol_array); #if BITS_PER_LONG == 32 const char *trace_print_flags_seq_u64(struct trace_seq *p, const char *delim, unsigned long long flags, const struct trace_print_flags_u64 *flag_array); const char *trace_print_symbols_seq_u64(struct trace_seq *p, unsigned long long val, const struct trace_print_flags_u64 *symbol_array); #endif const char *trace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr, unsigned int bitmask_size); const char *trace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int len, bool concatenate); const char *trace_print_array_seq(struct trace_seq *p, const void *buf, int count, size_t el_size); const char * trace_print_hex_dump_seq(struct trace_seq *p, const char *prefix_str, int prefix_type, int rowsize, int groupsize, const void *buf, size_t len, bool ascii); struct trace_iterator; struct trace_event; int trace_raw_output_prep(struct trace_iterator *iter, struct trace_event *event); extern __printf(2, 3) void trace_event_printf(struct trace_iterator *iter, const char *fmt, ...); /* Used to find the offset and length of dynamic fields in trace events */ struct trace_dynamic_info { #ifdef CONFIG_CPU_BIG_ENDIAN u16 len; u16 offset; #else u16 offset; u16 len; #endif } __packed; /* * The trace entry - the most basic unit of tracing. This is what * is printed in the end as a single line in the trace output, such as: * * bash-15816 [01] 235.197585: idle_cpu <- irq_enter */ struct trace_entry { unsigned short type; unsigned char flags; unsigned char preempt_count; int pid; }; #define TRACE_EVENT_TYPE_MAX \ ((1 << (sizeof(((struct trace_entry *)0)->type) * 8)) - 1) /* * Trace iterator - used by printout routines who present trace * results to users and which routines might sleep, etc: */ struct trace_iterator { struct trace_array *tr; struct tracer *trace; struct array_buffer *array_buffer; void *private; int cpu_file; struct mutex mutex; struct ring_buffer_iter **buffer_iter; unsigned long iter_flags; void *temp; /* temp holder */ unsigned int temp_size; char *fmt; /* modified format holder */ unsigned int fmt_size; atomic_t wait_index; /* trace_seq for __print_flags() and __print_symbolic() etc. */ struct trace_seq tmp_seq; cpumask_var_t started; /* Set when the file is closed to prevent new waiters */ bool closed; /* it's true when current open file is snapshot */ bool snapshot; /* The below is zeroed out in pipe_read */ struct trace_seq seq; struct trace_entry *ent; unsigned long lost_events; int leftover; int ent_size; int cpu; u64 ts; loff_t pos; long idx; /* All new field here will be zeroed out in pipe_read */ }; enum trace_iter_flags { TRACE_FILE_LAT_FMT = 1, TRACE_FILE_ANNOTATE = 2, TRACE_FILE_TIME_IN_NS = 4, TRACE_FILE_PAUSE = 8, }; typedef enum print_line_t (*trace_print_func)(struct trace_iterator *iter, int flags, struct trace_event *event); struct trace_event_functions { trace_print_func trace; trace_print_func raw; trace_print_func hex; trace_print_func binary; }; struct trace_event { struct hlist_node node; int type; struct trace_event_functions *funcs; }; extern int register_trace_event(struct trace_event *event); extern int unregister_trace_event(struct trace_event *event); /* Return values for print_line callback */ enum print_line_t { TRACE_TYPE_PARTIAL_LINE = 0, /* Retry after flushing the seq */ TRACE_TYPE_HANDLED = 1, TRACE_TYPE_UNHANDLED = 2, /* Relay to other output functions */ TRACE_TYPE_NO_CONSUME = 3 /* Handled but ask to not consume */ }; enum print_line_t trace_handle_return(struct trace_seq *s); static inline void tracing_generic_entry_update(struct trace_entry *entry, unsigned short type, unsigned int trace_ctx) { entry->preempt_count = trace_ctx & 0xff; entry->pid = current->pid; entry->type = type; entry->flags = trace_ctx >> 16; } unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status); enum trace_flag_type { TRACE_FLAG_IRQS_OFF = 0x01, TRACE_FLAG_NEED_RESCHED_LAZY = 0x02, TRACE_FLAG_NEED_RESCHED = 0x04, TRACE_FLAG_HARDIRQ = 0x08, TRACE_FLAG_SOFTIRQ = 0x10, TRACE_FLAG_PREEMPT_RESCHED = 0x20, TRACE_FLAG_NMI = 0x40, TRACE_FLAG_BH_OFF = 0x80, }; static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags) { unsigned int irq_status = irqs_disabled_flags(irqflags) ? TRACE_FLAG_IRQS_OFF : 0; return tracing_gen_ctx_irq_test(irq_status); } static inline unsigned int tracing_gen_ctx(void) { unsigned long irqflags; local_save_flags(irqflags); return tracing_gen_ctx_flags(irqflags); } static inline unsigned int tracing_gen_ctx_dec(void) { unsigned int trace_ctx; trace_ctx = tracing_gen_ctx(); /* * Subtract one from the preemption counter if preemption is enabled, * see trace_event_buffer_reserve()for details. */ if (IS_ENABLED(CONFIG_PREEMPTION)) trace_ctx--; return trace_ctx; } struct trace_event_file; struct ring_buffer_event * trace_event_buffer_lock_reserve(struct trace_buffer **current_buffer, struct trace_event_file *trace_file, int type, unsigned long len, unsigned int trace_ctx); #define TRACE_RECORD_CMDLINE BIT(0) #define TRACE_RECORD_TGID BIT(1) void tracing_record_taskinfo(struct task_struct *task, int flags); void tracing_record_taskinfo_sched_switch(struct task_struct *prev, struct task_struct *next, int flags); void tracing_record_cmdline(struct task_struct *task); void tracing_record_tgid(struct task_struct *task); int trace_output_call(struct trace_iterator *iter, char *name, char *fmt, ...) __printf(3, 4); struct event_filter; enum trace_reg { TRACE_REG_REGISTER, TRACE_REG_UNREGISTER, #ifdef CONFIG_PERF_EVENTS TRACE_REG_PERF_REGISTER, TRACE_REG_PERF_UNREGISTER, TRACE_REG_PERF_OPEN, TRACE_REG_PERF_CLOSE, /* * These (ADD/DEL) use a 'boolean' return value, where 1 (true) means a * custom action was taken and the default action is not to be * performed. */ TRACE_REG_PERF_ADD, TRACE_REG_PERF_DEL, #endif }; struct trace_event_call; #define TRACE_FUNCTION_TYPE ((const char *)~0UL) struct trace_event_fields { const char *type; union { struct { const char *name; const int size; const int align; const unsigned int is_signed:1; unsigned int needs_test:1; const int filter_type; const int len; }; int (*define_fields)(struct trace_event_call *); }; }; struct trace_event_class { const char *system; void *probe; #ifdef CONFIG_PERF_EVENTS void *perf_probe; #endif int (*reg)(struct trace_event_call *event, enum trace_reg type, void *data); struct trace_event_fields *fields_array; struct list_head *(*get_fields)(struct trace_event_call *); struct list_head fields; int (*raw_init)(struct trace_event_call *); }; extern int trace_event_reg(struct trace_event_call *event, enum trace_reg type, void *data); struct trace_event_buffer { struct trace_buffer *buffer; struct ring_buffer_event *event; struct trace_event_file *trace_file; void *entry; unsigned int trace_ctx; struct pt_regs *regs; }; void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer, struct trace_event_file *trace_file, unsigned long len); void trace_event_buffer_commit(struct trace_event_buffer *fbuffer); enum { TRACE_EVENT_FL_CAP_ANY_BIT, TRACE_EVENT_FL_NO_SET_FILTER_BIT, TRACE_EVENT_FL_IGNORE_ENABLE_BIT, TRACE_EVENT_FL_TRACEPOINT_BIT, TRACE_EVENT_FL_DYNAMIC_BIT, TRACE_EVENT_FL_KPROBE_BIT, TRACE_EVENT_FL_UPROBE_BIT, TRACE_EVENT_FL_EPROBE_BIT, TRACE_EVENT_FL_FPROBE_BIT, TRACE_EVENT_FL_CUSTOM_BIT, TRACE_EVENT_FL_TEST_STR_BIT, }; /* * Event flags: * CAP_ANY - Any user can enable for perf * NO_SET_FILTER - Set when filter has error and is to be ignored * IGNORE_ENABLE - For trace internal events, do not enable with debugfs file * TRACEPOINT - Event is a tracepoint * DYNAMIC - Event is a dynamic event (created at run time) * KPROBE - Event is a kprobe * UPROBE - Event is a uprobe * EPROBE - Event is an event probe * FPROBE - Event is an function probe * CUSTOM - Event is a custom event (to be attached to an exsiting tracepoint) * This is set when the custom event has not been attached * to a tracepoint yet, then it is cleared when it is. * TEST_STR - The event has a "%s" that points to a string outside the event */ enum { TRACE_EVENT_FL_CAP_ANY = (1 << TRACE_EVENT_FL_CAP_ANY_BIT), TRACE_EVENT_FL_NO_SET_FILTER = (1 << TRACE_EVENT_FL_NO_SET_FILTER_BIT), TRACE_EVENT_FL_IGNORE_ENABLE = (1 << TRACE_EVENT_FL_IGNORE_ENABLE_BIT), TRACE_EVENT_FL_TRACEPOINT = (1 << TRACE_EVENT_FL_TRACEPOINT_BIT), TRACE_EVENT_FL_DYNAMIC = (1 << TRACE_EVENT_FL_DYNAMIC_BIT), TRACE_EVENT_FL_KPROBE = (1 << TRACE_EVENT_FL_KPROBE_BIT), TRACE_EVENT_FL_UPROBE = (1 << TRACE_EVENT_FL_UPROBE_BIT), TRACE_EVENT_FL_EPROBE = (1 << TRACE_EVENT_FL_EPROBE_BIT), TRACE_EVENT_FL_FPROBE = (1 << TRACE_EVENT_FL_FPROBE_BIT), TRACE_EVENT_FL_CUSTOM = (1 << TRACE_EVENT_FL_CUSTOM_BIT), TRACE_EVENT_FL_TEST_STR = (1 << TRACE_EVENT_FL_TEST_STR_BIT), }; #define TRACE_EVENT_FL_UKPROBE (TRACE_EVENT_FL_KPROBE | TRACE_EVENT_FL_UPROBE) struct trace_event_call { struct list_head list; struct trace_event_class *class; union { const char *name; /* Set TRACE_EVENT_FL_TRACEPOINT flag when using "tp" */ struct tracepoint *tp; }; struct trace_event event; char *print_fmt; /* * Static events can disappear with modules, * where as dynamic ones need their own ref count. */ union { void *module; atomic_t refcnt; }; void *data; /* See the TRACE_EVENT_FL_* flags above */ int flags; /* static flags of different events */ #ifdef CONFIG_PERF_EVENTS int perf_refcount; struct hlist_head __percpu *perf_events; struct bpf_prog_array __rcu *prog_array; int (*perf_perm)(struct trace_event_call *, struct perf_event *); #endif }; #ifdef CONFIG_DYNAMIC_EVENTS bool trace_event_dyn_try_get_ref(struct trace_event_call *call); void trace_event_dyn_put_ref(struct trace_event_call *call); bool trace_event_dyn_busy(struct trace_event_call *call); #else static inline bool trace_event_dyn_try_get_ref(struct trace_event_call *call) { /* Without DYNAMIC_EVENTS configured, nothing should be calling this */ return false; } static inline void trace_event_dyn_put_ref(struct trace_event_call *call) { } static inline bool trace_event_dyn_busy(struct trace_event_call *call) { /* Nothing should call this without DYNAIMIC_EVENTS configured. */ return true; } #endif static inline bool trace_event_try_get_ref(struct trace_event_call *call) { if (call->flags & TRACE_EVENT_FL_DYNAMIC) return trace_event_dyn_try_get_ref(call); else return try_module_get(call->module); } static inline void trace_event_put_ref(struct trace_event_call *call) { if (call->flags & TRACE_EVENT_FL_DYNAMIC) trace_event_dyn_put_ref(call); else module_put(call->module); } #ifdef CONFIG_PERF_EVENTS static inline bool bpf_prog_array_valid(struct trace_event_call *call) { /* * This inline function checks whether call->prog_array * is valid or not. The function is called in various places, * outside rcu_read_lock/unlock, as a heuristic to speed up execution. * * If this function returns true, and later call->prog_array * becomes false inside rcu_read_lock/unlock region, * we bail out then. If this function return false, * there is a risk that we might miss a few events if the checking * were delayed until inside rcu_read_lock/unlock region and * call->prog_array happened to become non-NULL then. * * Here, READ_ONCE() is used instead of rcu_access_pointer(). * rcu_access_pointer() requires the actual definition of * "struct bpf_prog_array" while READ_ONCE() only needs * a declaration of the same type. */ return !!READ_ONCE(call->prog_array); } #endif static inline const char * trace_event_name(struct trace_event_call *call) { if (call->flags & TRACE_EVENT_FL_CUSTOM) return call->name; else if (call->flags & TRACE_EVENT_FL_TRACEPOINT) return call->tp ? call->tp->name : NULL; else return call->name; } static inline struct list_head * trace_get_fields(struct trace_event_call *event_call) { if (!event_call->class->get_fields) return &event_call->class->fields; return event_call->class->get_fields(event_call); } struct trace_subsystem_dir; enum { EVENT_FILE_FL_ENABLED_BIT, EVENT_FILE_FL_RECORDED_CMD_BIT, EVENT_FILE_FL_RECORDED_TGID_BIT, EVENT_FILE_FL_FILTERED_BIT, EVENT_FILE_FL_NO_SET_FILTER_BIT, EVENT_FILE_FL_SOFT_DISABLED_BIT, EVENT_FILE_FL_TRIGGER_MODE_BIT, EVENT_FILE_FL_TRIGGER_COND_BIT, EVENT_FILE_FL_PID_FILTER_BIT, EVENT_FILE_FL_WAS_ENABLED_BIT, EVENT_FILE_FL_FREED_BIT, }; extern struct trace_event_file *trace_get_event_file(const char *instance, const char *system, const char *event); extern void trace_put_event_file(struct trace_event_file *file); #define MAX_DYNEVENT_CMD_LEN (2048) enum dynevent_type { DYNEVENT_TYPE_SYNTH = 1, DYNEVENT_TYPE_KPROBE, DYNEVENT_TYPE_NONE, }; struct dynevent_cmd; typedef int (*dynevent_create_fn_t)(struct dynevent_cmd *cmd); struct dynevent_cmd { struct seq_buf seq; const char *event_name; unsigned int n_fields; enum dynevent_type type; dynevent_create_fn_t run_command; void *private_data; }; extern int dynevent_create(struct dynevent_cmd *cmd); extern int synth_event_delete(const char *name); extern void synth_event_cmd_init(struct dynevent_cmd *cmd, char *buf, int maxlen); extern int __synth_event_gen_cmd_start(struct dynevent_cmd *cmd, const char *name, struct module *mod, ...); #define synth_event_gen_cmd_start(cmd, name, mod, ...) \ __synth_event_gen_cmd_start(cmd, name, mod, ## __VA_ARGS__, NULL) struct synth_field_desc { const char *type; const char *name; }; extern int synth_event_gen_cmd_array_start(struct dynevent_cmd *cmd, const char *name, struct module *mod, struct synth_field_desc *fields, unsigned int n_fields); extern int synth_event_create(const char *name, struct synth_field_desc *fields, unsigned int n_fields, struct module *mod); extern int synth_event_add_field(struct dynevent_cmd *cmd, const char *type, const char *name); extern int synth_event_add_field_str(struct dynevent_cmd *cmd, const char *type_name); extern int synth_event_add_fields(struct dynevent_cmd *cmd, struct synth_field_desc *fields, unsigned int n_fields); #define synth_event_gen_cmd_end(cmd) \ dynevent_create(cmd) struct synth_event; struct synth_event_trace_state { struct trace_event_buffer fbuffer; struct synth_trace_event *entry; struct trace_buffer *buffer; struct synth_event *event; unsigned int cur_field; unsigned int n_u64; bool disabled; bool add_next; bool add_name; }; extern int synth_event_trace(struct trace_event_file *file, unsigned int n_vals, ...); extern int synth_event_trace_array(struct trace_event_file *file, u64 *vals, unsigned int n_vals); extern int synth_event_trace_start(struct trace_event_file *file, struct synth_event_trace_state *trace_state); extern int synth_event_add_next_val(u64 val, struct synth_event_trace_state *trace_state); extern int synth_event_add_val(const char *field_name, u64 val, struct synth_event_trace_state *trace_state); extern int synth_event_trace_end(struct synth_event_trace_state *trace_state); extern int kprobe_event_delete(const char *name); extern void kprobe_event_cmd_init(struct dynevent_cmd *cmd, char *buf, int maxlen); #define kprobe_event_gen_cmd_start(cmd, name, loc, ...) \ __kprobe_event_gen_cmd_start(cmd, false, name, loc, ## __VA_ARGS__, NULL) #define kretprobe_event_gen_cmd_start(cmd, name, loc, ...) \ __kprobe_event_gen_cmd_start(cmd, true, name, loc, ## __VA_ARGS__, NULL) extern int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe, const char *name, const char *loc, ...); #define kprobe_event_add_fields(cmd, ...) \ __kprobe_event_add_fields(cmd, ## __VA_ARGS__, NULL) #define kprobe_event_add_field(cmd, field) \ __kprobe_event_add_fields(cmd, field, NULL) extern int __kprobe_event_add_fields(struct dynevent_cmd *cmd, ...); #define kprobe_event_gen_cmd_end(cmd) \ dynevent_create(cmd) #define kretprobe_event_gen_cmd_end(cmd) \ dynevent_create(cmd) /* * Event file flags: * ENABLED - The event is enabled * RECORDED_CMD - The comms should be recorded at sched_switch * RECORDED_TGID - The tgids should be recorded at sched_switch * FILTERED - The event has a filter attached * NO_SET_FILTER - Set when filter has error and is to be ignored * SOFT_DISABLED - When set, do not trace the event (even though its * tracepoint may be enabled) * TRIGGER_MODE - When set, invoke the triggers associated with the event * TRIGGER_COND - When set, one or more triggers has an associated filter * PID_FILTER - When set, the event is filtered based on pid * WAS_ENABLED - Set when enabled to know to clear trace on module removal * FREED - File descriptor is freed, all fields should be considered invalid */ enum { EVENT_FILE_FL_ENABLED = (1 << EVENT_FILE_FL_ENABLED_BIT), EVENT_FILE_FL_RECORDED_CMD = (1 << EVENT_FILE_FL_RECORDED_CMD_BIT), EVENT_FILE_FL_RECORDED_TGID = (1 << EVENT_FILE_FL_RECORDED_TGID_BIT), EVENT_FILE_FL_FILTERED = (1 << EVENT_FILE_FL_FILTERED_BIT), EVENT_FILE_FL_NO_SET_FILTER = (1 << EVENT_FILE_FL_NO_SET_FILTER_BIT), EVENT_FILE_FL_SOFT_DISABLED = (1 << EVENT_FILE_FL_SOFT_DISABLED_BIT), EVENT_FILE_FL_TRIGGER_MODE = (1 << EVENT_FILE_FL_TRIGGER_MODE_BIT), EVENT_FILE_FL_TRIGGER_COND = (1 << EVENT_FILE_FL_TRIGGER_COND_BIT), EVENT_FILE_FL_PID_FILTER = (1 << EVENT_FILE_FL_PID_FILTER_BIT), EVENT_FILE_FL_WAS_ENABLED = (1 << EVENT_FILE_FL_WAS_ENABLED_BIT), EVENT_FILE_FL_FREED = (1 << EVENT_FILE_FL_FREED_BIT), }; struct trace_event_file { struct list_head list; struct trace_event_call *event_call; struct event_filter __rcu *filter; struct eventfs_inode *ei; struct trace_array *tr; struct trace_subsystem_dir *system; struct list_head triggers; /* * 32 bit flags: * bit 0: enabled * bit 1: enabled cmd record * bit 2: enable/disable with the soft disable bit * bit 3: soft disabled * bit 4: trigger enabled * * Note: The bits must be set atomically to prevent races * from other writers. Reads of flags do not need to be in * sync as they occur in critical sections. But the way flags * is currently used, these changes do not affect the code * except that when a change is made, it may have a slight * delay in propagating the changes to other CPUs due to * caching and such. Which is mostly OK ;-) */ unsigned long flags; refcount_t ref; /* ref count for opened files */ atomic_t sm_ref; /* soft-mode reference counter */ atomic_t tm_ref; /* trigger-mode reference counter */ }; #ifdef CONFIG_HIST_TRIGGERS extern struct irq_work hist_poll_work; extern wait_queue_head_t hist_poll_wq; static inline void hist_poll_wakeup(void) { if (wq_has_sleeper(&hist_poll_wq)) irq_work_queue(&hist_poll_work); } #define hist_poll_wait(file, wait) \ poll_wait(file, &hist_poll_wq, wait) #endif #define __TRACE_EVENT_FLAGS(name, value) \ static int __init trace_init_flags_##name(void) \ { \ event_##name.flags |= value; \ return 0; \ } \ early_initcall(trace_init_flags_##name); #define __TRACE_EVENT_PERF_PERM(name, expr...) \ static int perf_perm_##name(struct trace_event_call *tp_event, \ struct perf_event *p_event) \ { \ return ({ expr; }); \ } \ static int __init trace_init_perf_perm_##name(void) \ { \ event_##name.perf_perm = &perf_perm_##name; \ return 0; \ } \ early_initcall(trace_init_perf_perm_##name); #define PERF_MAX_TRACE_SIZE 8192 #define MAX_FILTER_STR_VAL 256U /* Should handle KSYM_SYMBOL_LEN */ enum event_trigger_type { ETT_NONE = (0), ETT_TRACE_ONOFF = (1 << 0), ETT_SNAPSHOT = (1 << 1), ETT_STACKTRACE = (1 << 2), ETT_EVENT_ENABLE = (1 << 3), ETT_EVENT_HIST = (1 << 4), ETT_HIST_ENABLE = (1 << 5), ETT_EVENT_EPROBE = (1 << 6), }; extern int filter_match_preds(struct event_filter *filter, void *rec); extern enum event_trigger_type event_triggers_call(struct trace_event_file *file, struct trace_buffer *buffer, void *rec, struct ring_buffer_event *event); extern void event_triggers_post_call(struct trace_event_file *file, enum event_trigger_type tt); bool trace_event_ignore_this_pid(struct trace_event_file *trace_file); bool __trace_trigger_soft_disabled(struct trace_event_file *file); /** * trace_trigger_soft_disabled - do triggers and test if soft disabled * @file: The file pointer of the event to test * * If any triggers without filters are attached to this event, they * will be called here. If the event is soft disabled and has no * triggers that require testing the fields, it will return true, * otherwise false. */ static __always_inline bool trace_trigger_soft_disabled(struct trace_event_file *file) { unsigned long eflags = file->flags; if (likely(!(eflags & (EVENT_FILE_FL_TRIGGER_MODE | EVENT_FILE_FL_SOFT_DISABLED | EVENT_FILE_FL_PID_FILTER)))) return false; if (likely(eflags & EVENT_FILE_FL_TRIGGER_COND)) return false; return __trace_trigger_soft_disabled(file); } #ifdef CONFIG_BPF_EVENTS unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx); int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog, u64 bpf_cookie); void perf_event_detach_bpf_prog(struct perf_event *event); int perf_event_query_prog_array(struct perf_event *event, void __user *info); struct bpf_raw_tp_link; int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *link); int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *link); struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name); void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp); int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id, u32 *fd_type, const char **buf, u64 *probe_offset, u64 *probe_addr, unsigned long *missed); int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *prog); int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *prog); #else static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx) { return 1; } static inline int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog, u64 bpf_cookie) { return -EOPNOTSUPP; } static inline void perf_event_detach_bpf_prog(struct perf_event *event) { } static inline int perf_event_query_prog_array(struct perf_event *event, void __user *info) { return -EOPNOTSUPP; } struct bpf_raw_tp_link; static inline int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *link) { return -EOPNOTSUPP; } static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *link) { return -EOPNOTSUPP; } static inline struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name) { return NULL; } static inline void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp) { } static inline int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id, u32 *fd_type, const char **buf, u64 *probe_offset, u64 *probe_addr, unsigned long *missed) { return -EOPNOTSUPP; } static inline int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *prog) { return -EOPNOTSUPP; } static inline int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *prog) { return -EOPNOTSUPP; } #endif enum { FILTER_OTHER = 0, FILTER_STATIC_STRING, FILTER_DYN_STRING, FILTER_RDYN_STRING, FILTER_PTR_STRING, FILTER_TRACE_FN, FILTER_CPUMASK, FILTER_COMM, FILTER_CPU, FILTER_STACKTRACE, }; extern int trace_event_raw_init(struct trace_event_call *call); extern int trace_define_field(struct trace_event_call *call, const char *type, const char *name, int offset, int size, int is_signed, int filter_type); extern int trace_add_event_call(struct trace_event_call *call); extern int trace_remove_event_call(struct trace_event_call *call); extern int trace_event_get_offsets(struct trace_event_call *call); int ftrace_set_clr_event(struct trace_array *tr, char *buf, int set); int trace_set_clr_event(const char *system, const char *event, int set); int trace_array_set_clr_event(struct trace_array *tr, const char *system, const char *event, bool enable); #ifdef CONFIG_PERF_EVENTS struct perf_event; DECLARE_PER_CPU(struct pt_regs, perf_trace_regs); extern int perf_trace_init(struct perf_event *event); extern void perf_trace_destroy(struct perf_event *event); extern int perf_trace_add(struct perf_event *event, int flags); extern void perf_trace_del(struct perf_event *event, int flags); #ifdef CONFIG_KPROBE_EVENTS extern int perf_kprobe_init(struct perf_event *event, bool is_retprobe); extern void perf_kprobe_destroy(struct perf_event *event); extern int bpf_get_kprobe_info(const struct perf_event *event, u32 *fd_type, const char **symbol, u64 *probe_offset, u64 *probe_addr, unsigned long *missed, bool perf_type_tracepoint); #endif #ifdef CONFIG_UPROBE_EVENTS extern int perf_uprobe_init(struct perf_event *event, unsigned long ref_ctr_offset, bool is_retprobe); extern void perf_uprobe_destroy(struct perf_event *event); extern int bpf_get_uprobe_info(const struct perf_event *event, u32 *fd_type, const char **filename, u64 *probe_offset, u64 *probe_addr, bool perf_type_tracepoint); #endif extern int ftrace_profile_set_filter(struct perf_event *event, int event_id, char *filter_str); extern void ftrace_profile_free_filter(struct perf_event *event); void perf_trace_buf_update(void *record, u16 type); void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp); int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog, u64 bpf_cookie); void perf_event_free_bpf_prog(struct perf_event *event); void bpf_trace_run1(struct bpf_raw_tp_link *link, u64 arg1); void bpf_trace_run2(struct bpf_raw_tp_link *link, u64 arg1, u64 arg2); void bpf_trace_run3(struct bpf_raw_tp_link *link, u64 arg1, u64 arg2, u64 arg3); void bpf_trace_run4(struct bpf_raw_tp_link *link, u64 arg1, u64 arg2, u64 arg3, u64 arg4); void bpf_trace_run5(struct bpf_raw_tp_link *link, u64 arg1, u64 arg2, u64 arg3, u64 arg4, u64 arg5); void bpf_trace_run6(struct bpf_raw_tp_link *link, u64 arg1, u64 arg2, u64 arg3, u64 arg4, u64 arg5, u64 arg6); void bpf_trace_run7(struct bpf_raw_tp_link *link, u64 arg1, u64 arg2, u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7); void bpf_trace_run8(struct bpf_raw_tp_link *link, u64 arg1, u64 arg2, u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, u64 arg8); void bpf_trace_run9(struct bpf_raw_tp_link *link, u64 arg1, u64 arg2, u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, u64 arg8, u64 arg9); void bpf_trace_run10(struct bpf_raw_tp_link *link, u64 arg1, u64 arg2, u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, u64 arg8, u64 arg9, u64 arg10); void bpf_trace_run11(struct bpf_raw_tp_link *link, u64 arg1, u64 arg2, u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, u64 arg8, u64 arg9, u64 arg10, u64 arg11); void bpf_trace_run12(struct bpf_raw_tp_link *link, u64 arg1, u64 arg2, u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12); void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx, struct trace_event_call *call, u64 count, struct pt_regs *regs, struct hlist_head *head, struct task_struct *task); static inline void perf_trace_buf_submit(void *raw_data, int size, int rctx, u16 type, u64 count, struct pt_regs *regs, void *head, struct task_struct *task) { perf_tp_event(type, count, raw_data, size, regs, head, rctx, task); } #endif #define TRACE_EVENT_STR_MAX 512 /* * gcc warns that you can not use a va_list in an inlined * function. But lets me make it into a macro :-/ */ #define __trace_event_vstr_len(fmt, va) \ ({ \ va_list __ap; \ int __ret; \ \ va_copy(__ap, *(va)); \ __ret = vsnprintf(NULL, 0, fmt, __ap) + 1; \ va_end(__ap); \ \ min(__ret, TRACE_EVENT_STR_MAX); \ }) #endif /* _LINUX_TRACE_EVENT_H */ /* * Note: we keep the TRACE_CUSTOM_EVENT outside the include file ifdef protection. * This is due to the way trace custom events work. If a file includes two * trace event headers under one "CREATE_CUSTOM_TRACE_EVENTS" the first include * will override the TRACE_CUSTOM_EVENT and break the second include. */ #ifndef TRACE_CUSTOM_EVENT #define DECLARE_CUSTOM_EVENT_CLASS(name, proto, args, tstruct, assign, print) #define DEFINE_CUSTOM_EVENT(template, name, proto, args) #define TRACE_CUSTOM_EVENT(name, proto, args, struct, assign, print) #endif /* ifdef TRACE_CUSTOM_EVENT (see note above) */ |
| 1 1 1 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 | /* * Copyright (c) 2006 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. */ #include <linux/completion.h> #include <linux/dma-mapping.h> #include <linux/err.h> #include <linux/interrupt.h> #include <linux/export.h> #include <linux/slab.h> #include <linux/bitops.h> #include <linux/random.h> #include <rdma/ib_cache.h> #include "sa.h" static int mcast_add_one(struct ib_device *device); static void mcast_remove_one(struct ib_device *device, void *client_data); static struct ib_client mcast_client = { .name = "ib_multicast", .add = mcast_add_one, .remove = mcast_remove_one }; static struct ib_sa_client sa_client; static struct workqueue_struct *mcast_wq; static union ib_gid mgid0; struct mcast_device; struct mcast_port { struct mcast_device *dev; spinlock_t lock; struct rb_root table; refcount_t refcount; struct completion comp; u32 port_num; }; struct mcast_device { struct ib_device *device; struct ib_event_handler event_handler; int start_port; int end_port; struct mcast_port port[]; }; enum mcast_state { MCAST_JOINING, MCAST_MEMBER, MCAST_ERROR, }; enum mcast_group_state { MCAST_IDLE, MCAST_BUSY, MCAST_GROUP_ERROR, MCAST_PKEY_EVENT }; enum { MCAST_INVALID_PKEY_INDEX = 0xFFFF }; struct mcast_member; struct mcast_group { struct ib_sa_mcmember_rec rec; struct rb_node node; struct mcast_port *port; spinlock_t lock; struct work_struct work; struct list_head pending_list; struct list_head active_list; struct mcast_member *last_join; int members[NUM_JOIN_MEMBERSHIP_TYPES]; atomic_t refcount; enum mcast_group_state state; struct ib_sa_query *query; u16 pkey_index; u8 leave_state; int retries; }; struct mcast_member { struct ib_sa_multicast multicast; struct ib_sa_client *client; struct mcast_group *group; struct list_head list; enum mcast_state state; refcount_t refcount; struct completion comp; }; static void join_handler(int status, struct ib_sa_mcmember_rec *rec, void *context); static void leave_handler(int status, struct ib_sa_mcmember_rec *rec, void *context); static struct mcast_group *mcast_find(struct mcast_port *port, union ib_gid *mgid) { struct rb_node *node = port->table.rb_node; struct mcast_group *group; int ret; while (node) { group = rb_entry(node, struct mcast_group, node); ret = memcmp(mgid->raw, group->rec.mgid.raw, sizeof *mgid); if (!ret) return group; if (ret < 0) node = node->rb_left; else node = node->rb_right; } return NULL; } static struct mcast_group *mcast_insert(struct mcast_port *port, struct mcast_group *group, int allow_duplicates) { struct rb_node **link = &port->table.rb_node; struct rb_node *parent = NULL; struct mcast_group *cur_group; int ret; while (*link) { parent = *link; cur_group = rb_entry(parent, struct mcast_group, node); ret = memcmp(group->rec.mgid.raw, cur_group->rec.mgid.raw, sizeof group->rec.mgid); if (ret < 0) link = &(*link)->rb_left; else if (ret > 0) link = &(*link)->rb_right; else if (allow_duplicates) link = &(*link)->rb_left; else return cur_group; } rb_link_node(&group->node, parent, link); rb_insert_color(&group->node, &port->table); return NULL; } static void deref_port(struct mcast_port *port) { if (refcount_dec_and_test(&port->refcount)) complete(&port->comp); } static void release_group(struct mcast_group *group) { struct mcast_port *port = group->port; unsigned long flags; spin_lock_irqsave(&port->lock, flags); if (atomic_dec_and_test(&group->refcount)) { rb_erase(&group->node, &port->table); spin_unlock_irqrestore(&port->lock, flags); kfree(group); deref_port(port); } else spin_unlock_irqrestore(&port->lock, flags); } static void deref_member(struct mcast_member *member) { if (refcount_dec_and_test(&member->refcount)) complete(&member->comp); } static void queue_join(struct mcast_member *member) { struct mcast_group *group = member->group; unsigned long flags; spin_lock_irqsave(&group->lock, flags); list_add_tail(&member->list, &group->pending_list); if (group->state == MCAST_IDLE) { group->state = MCAST_BUSY; atomic_inc(&group->refcount); queue_work(mcast_wq, &group->work); } spin_unlock_irqrestore(&group->lock, flags); } /* * A multicast group has four types of members: full member, non member, * sendonly non member and sendonly full member. * We need to keep track of the number of members of each * type based on their join state. Adjust the number of members the belong to * the specified join states. */ static void adjust_membership(struct mcast_group *group, u8 join_state, int inc) { int i; for (i = 0; i < NUM_JOIN_MEMBERSHIP_TYPES; i++, join_state >>= 1) if (join_state & 0x1) group->members[i] += inc; } /* * If a multicast group has zero members left for a particular join state, but * the group is still a member with the SA, we need to leave that join state. * Determine which join states we still belong to, but that do not have any * active members. */ static u8 get_leave_state(struct mcast_group *group) { u8 leave_state = 0; int i; for (i = 0; i < NUM_JOIN_MEMBERSHIP_TYPES; i++) if (!group->members[i]) leave_state |= (0x1 << i); return leave_state & group->rec.join_state; } static int check_selector(ib_sa_comp_mask comp_mask, ib_sa_comp_mask selector_mask, ib_sa_comp_mask value_mask, u8 selector, u8 src_value, u8 dst_value) { int err; if (!(comp_mask & selector_mask) || !(comp_mask & value_mask)) return 0; switch (selector) { case IB_SA_GT: err = (src_value <= dst_value); break; case IB_SA_LT: err = (src_value >= dst_value); break; case IB_SA_EQ: err = (src_value != dst_value); break; default: err = 0; break; } return err; } static int cmp_rec(struct ib_sa_mcmember_rec *src, struct ib_sa_mcmember_rec *dst, ib_sa_comp_mask comp_mask) { /* MGID must already match */ if (comp_mask & IB_SA_MCMEMBER_REC_PORT_GID && memcmp(&src->port_gid, &dst->port_gid, sizeof src->port_gid)) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_QKEY && src->qkey != dst->qkey) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_MLID && src->mlid != dst->mlid) return -EINVAL; if (check_selector(comp_mask, IB_SA_MCMEMBER_REC_MTU_SELECTOR, IB_SA_MCMEMBER_REC_MTU, dst->mtu_selector, src->mtu, dst->mtu)) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_TRAFFIC_CLASS && src->traffic_class != dst->traffic_class) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_PKEY && src->pkey != dst->pkey) return -EINVAL; if (check_selector(comp_mask, IB_SA_MCMEMBER_REC_RATE_SELECTOR, IB_SA_MCMEMBER_REC_RATE, dst->rate_selector, src->rate, dst->rate)) return -EINVAL; if (check_selector(comp_mask, IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR, IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME, dst->packet_life_time_selector, src->packet_life_time, dst->packet_life_time)) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_SL && src->sl != dst->sl) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_FLOW_LABEL && src->flow_label != dst->flow_label) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_HOP_LIMIT && src->hop_limit != dst->hop_limit) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_SCOPE && src->scope != dst->scope) return -EINVAL; /* join_state checked separately, proxy_join ignored */ return 0; } static int send_join(struct mcast_group *group, struct mcast_member *member) { struct mcast_port *port = group->port; int ret; group->last_join = member; ret = ib_sa_mcmember_rec_query(&sa_client, port->dev->device, port->port_num, IB_MGMT_METHOD_SET, &member->multicast.rec, member->multicast.comp_mask, 3000, GFP_KERNEL, join_handler, group, &group->query); return (ret > 0) ? 0 : ret; } static int send_leave(struct mcast_group *group, u8 leave_state) { struct mcast_port *port = group->port; struct ib_sa_mcmember_rec rec; int ret; rec = group->rec; rec.join_state = leave_state; group->leave_state = leave_state; ret = ib_sa_mcmember_rec_query(&sa_client, port->dev->device, port->port_num, IB_SA_METHOD_DELETE, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_JOIN_STATE, 3000, GFP_KERNEL, leave_handler, group, &group->query); return (ret > 0) ? 0 : ret; } static void join_group(struct mcast_group *group, struct mcast_member *member, u8 join_state) { member->state = MCAST_MEMBER; adjust_membership(group, join_state, 1); group->rec.join_state |= join_state; member->multicast.rec = group->rec; member->multicast.rec.join_state = join_state; list_move(&member->list, &group->active_list); } static int fail_join(struct mcast_group *group, struct mcast_member *member, int status) { spin_lock_irq(&group->lock); list_del_init(&member->list); spin_unlock_irq(&group->lock); return member->multicast.callback(status, &member->multicast); } static void process_group_error(struct mcast_group *group) { struct mcast_member *member; int ret = 0; u16 pkey_index; if (group->state == MCAST_PKEY_EVENT) ret = ib_find_pkey(group->port->dev->device, group->port->port_num, be16_to_cpu(group->rec.pkey), &pkey_index); spin_lock_irq(&group->lock); if (group->state == MCAST_PKEY_EVENT && !ret && group->pkey_index == pkey_index) goto out; while (!list_empty(&group->active_list)) { member = list_entry(group->active_list.next, struct mcast_member, list); refcount_inc(&member->refcount); list_del_init(&member->list); adjust_membership(group, member->multicast.rec.join_state, -1); member->state = MCAST_ERROR; spin_unlock_irq(&group->lock); ret = member->multicast.callback(-ENETRESET, &member->multicast); deref_member(member); if (ret) ib_sa_free_multicast(&member->multicast); spin_lock_irq(&group->lock); } group->rec.join_state = 0; out: group->state = MCAST_BUSY; spin_unlock_irq(&group->lock); } static void mcast_work_handler(struct work_struct *work) { struct mcast_group *group; struct mcast_member *member; struct ib_sa_multicast *multicast; int status, ret; u8 join_state; group = container_of(work, typeof(*group), work); retest: spin_lock_irq(&group->lock); while (!list_empty(&group->pending_list) || (group->state != MCAST_BUSY)) { if (group->state != MCAST_BUSY) { spin_unlock_irq(&group->lock); process_group_error(group); goto retest; } member = list_entry(group->pending_list.next, struct mcast_member, list); multicast = &member->multicast; join_state = multicast->rec.join_state; refcount_inc(&member->refcount); if (join_state == (group->rec.join_state & join_state)) { status = cmp_rec(&group->rec, &multicast->rec, multicast->comp_mask); if (!status) join_group(group, member, join_state); else list_del_init(&member->list); spin_unlock_irq(&group->lock); ret = multicast->callback(status, multicast); } else { spin_unlock_irq(&group->lock); status = send_join(group, member); if (!status) { deref_member(member); return; } ret = fail_join(group, member, status); } deref_member(member); if (ret) ib_sa_free_multicast(&member->multicast); spin_lock_irq(&group->lock); } join_state = get_leave_state(group); if (join_state) { group->rec.join_state &= ~join_state; spin_unlock_irq(&group->lock); if (send_leave(group, join_state)) goto retest; } else { group->state = MCAST_IDLE; spin_unlock_irq(&group->lock); release_group(group); } } /* * Fail a join request if it is still active - at the head of the pending queue. */ static void process_join_error(struct mcast_group *group, int status) { struct mcast_member *member; int ret; spin_lock_irq(&group->lock); member = list_entry(group->pending_list.next, struct mcast_member, list); if (group->last_join == member) { refcount_inc(&member->refcount); list_del_init(&member->list); spin_unlock_irq(&group->lock); ret = member->multicast.callback(status, &member->multicast); deref_member(member); if (ret) ib_sa_free_multicast(&member->multicast); } else spin_unlock_irq(&group->lock); } static void join_handler(int status, struct ib_sa_mcmember_rec *rec, void *context) { struct mcast_group *group = context; u16 pkey_index = MCAST_INVALID_PKEY_INDEX; if (status) process_join_error(group, status); else { int mgids_changed, is_mgid0; if (ib_find_pkey(group->port->dev->device, group->port->port_num, be16_to_cpu(rec->pkey), &pkey_index)) pkey_index = MCAST_INVALID_PKEY_INDEX; spin_lock_irq(&group->port->lock); if (group->state == MCAST_BUSY && group->pkey_index == MCAST_INVALID_PKEY_INDEX) group->pkey_index = pkey_index; mgids_changed = memcmp(&rec->mgid, &group->rec.mgid, sizeof(group->rec.mgid)); group->rec = *rec; if (mgids_changed) { rb_erase(&group->node, &group->port->table); is_mgid0 = !memcmp(&mgid0, &group->rec.mgid, sizeof(mgid0)); mcast_insert(group->port, group, is_mgid0); } spin_unlock_irq(&group->port->lock); } mcast_work_handler(&group->work); } static void leave_handler(int status, struct ib_sa_mcmember_rec *rec, void *context) { struct mcast_group *group = context; if (status && group->retries > 0 && !send_leave(group, group->leave_state)) group->retries--; else mcast_work_handler(&group->work); } static struct mcast_group *acquire_group(struct mcast_port *port, union ib_gid *mgid, gfp_t gfp_mask) { struct mcast_group *group, *cur_group; unsigned long flags; int is_mgid0; is_mgid0 = !memcmp(&mgid0, mgid, sizeof mgid0); if (!is_mgid0) { spin_lock_irqsave(&port->lock, flags); group = mcast_find(port, mgid); if (group) goto found; spin_unlock_irqrestore(&port->lock, flags); } group = kzalloc(sizeof *group, gfp_mask); if (!group) return NULL; group->retries = 3; group->port = port; group->rec.mgid = *mgid; group->pkey_index = MCAST_INVALID_PKEY_INDEX; INIT_LIST_HEAD(&group->pending_list); INIT_LIST_HEAD(&group->active_list); INIT_WORK(&group->work, mcast_work_handler); spin_lock_init(&group->lock); spin_lock_irqsave(&port->lock, flags); cur_group = mcast_insert(port, group, is_mgid0); if (cur_group) { kfree(group); group = cur_group; } else refcount_inc(&port->refcount); found: atomic_inc(&group->refcount); spin_unlock_irqrestore(&port->lock, flags); return group; } /* * We serialize all join requests to a single group to make our lives much * easier. Otherwise, two users could try to join the same group * simultaneously, with different configurations, one could leave while the * join is in progress, etc., which makes locking around error recovery * difficult. */ struct ib_sa_multicast * ib_sa_join_multicast(struct ib_sa_client *client, struct ib_device *device, u32 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, gfp_t gfp_mask, int (*callback)(int status, struct ib_sa_multicast *multicast), void *context) { struct mcast_device *dev; struct mcast_member *member; struct ib_sa_multicast *multicast; int ret; dev = ib_get_client_data(device, &mcast_client); if (!dev) return ERR_PTR(-ENODEV); member = kmalloc(sizeof *member, gfp_mask); if (!member) return ERR_PTR(-ENOMEM); ib_sa_client_get(client); member->client = client; member->multicast.rec = *rec; member->multicast.comp_mask = comp_mask; member->multicast.callback = callback; member->multicast.context = context; init_completion(&member->comp); refcount_set(&member->refcount, 1); member->state = MCAST_JOINING; member->group = acquire_group(&dev->port[port_num - dev->start_port], &rec->mgid, gfp_mask); if (!member->group) { ret = -ENOMEM; goto err; } /* * The user will get the multicast structure in their callback. They * could then free the multicast structure before we can return from * this routine. So we save the pointer to return before queuing * any callback. */ multicast = &member->multicast; queue_join(member); return multicast; err: ib_sa_client_put(client); kfree(member); return ERR_PTR(ret); } EXPORT_SYMBOL(ib_sa_join_multicast); void ib_sa_free_multicast(struct ib_sa_multicast *multicast) { struct mcast_member *member; struct mcast_group *group; member = container_of(multicast, struct mcast_member, multicast); group = member->group; spin_lock_irq(&group->lock); if (member->state == MCAST_MEMBER) adjust_membership(group, multicast->rec.join_state, -1); list_del_init(&member->list); if (group->state == MCAST_IDLE) { group->state = MCAST_BUSY; spin_unlock_irq(&group->lock); /* Continue to hold reference on group until callback */ queue_work(mcast_wq, &group->work); } else { spin_unlock_irq(&group->lock); release_group(group); } deref_member(member); wait_for_completion(&member->comp); ib_sa_client_put(member->client); kfree(member); } EXPORT_SYMBOL(ib_sa_free_multicast); int ib_sa_get_mcmember_rec(struct ib_device *device, u32 port_num, union ib_gid *mgid, struct ib_sa_mcmember_rec *rec) { struct mcast_device *dev; struct mcast_port *port; struct mcast_group *group; unsigned long flags; int ret = 0; dev = ib_get_client_data(device, &mcast_client); if (!dev) return -ENODEV; port = &dev->port[port_num - dev->start_port]; spin_lock_irqsave(&port->lock, flags); group = mcast_find(port, mgid); if (group) *rec = group->rec; else ret = -EADDRNOTAVAIL; spin_unlock_irqrestore(&port->lock, flags); return ret; } EXPORT_SYMBOL(ib_sa_get_mcmember_rec); /** * ib_init_ah_from_mcmember - Initialize AH attribute from multicast * member record and gid of the device. * @device: RDMA device * @port_num: Port of the rdma device to consider * @rec: Multicast member record to use * @ndev: Optional netdevice, applicable only for RoCE * @gid_type: GID type to consider * @ah_attr: AH attribute to fillup on successful completion * * ib_init_ah_from_mcmember() initializes AH attribute based on multicast * member record and other device properties. On success the caller is * responsible to call rdma_destroy_ah_attr on the ah_attr. Returns 0 on * success or appropriate error code. * */ int ib_init_ah_from_mcmember(struct ib_device *device, u32 port_num, struct ib_sa_mcmember_rec *rec, struct net_device *ndev, enum ib_gid_type gid_type, struct rdma_ah_attr *ah_attr) { const struct ib_gid_attr *sgid_attr; /* GID table is not based on the netdevice for IB link layer, * so ignore ndev during search. */ if (rdma_protocol_ib(device, port_num)) ndev = NULL; else if (!rdma_protocol_roce(device, port_num)) return -EINVAL; sgid_attr = rdma_find_gid_by_port(device, &rec->port_gid, gid_type, port_num, ndev); if (IS_ERR(sgid_attr)) return PTR_ERR(sgid_attr); memset(ah_attr, 0, sizeof(*ah_attr)); ah_attr->type = rdma_ah_find_type(device, port_num); rdma_ah_set_dlid(ah_attr, be16_to_cpu(rec->mlid)); rdma_ah_set_sl(ah_attr, rec->sl); rdma_ah_set_port_num(ah_attr, port_num); rdma_ah_set_static_rate(ah_attr, rec->rate); rdma_move_grh_sgid_attr(ah_attr, &rec->mgid, be32_to_cpu(rec->flow_label), rec->hop_limit, rec->traffic_class, sgid_attr); return 0; } EXPORT_SYMBOL(ib_init_ah_from_mcmember); static void mcast_groups_event(struct mcast_port *port, enum mcast_group_state state) { struct mcast_group *group; struct rb_node *node; unsigned long flags; spin_lock_irqsave(&port->lock, flags); for (node = rb_first(&port->table); node; node = rb_next(node)) { group = rb_entry(node, struct mcast_group, node); spin_lock(&group->lock); if (group->state == MCAST_IDLE) { atomic_inc(&group->refcount); queue_work(mcast_wq, &group->work); } if (group->state != MCAST_GROUP_ERROR) group->state = state; spin_unlock(&group->lock); } spin_unlock_irqrestore(&port->lock, flags); } static void mcast_event_handler(struct ib_event_handler *handler, struct ib_event *event) { struct mcast_device *dev; int index; dev = container_of(handler, struct mcast_device, event_handler); if (!rdma_cap_ib_mcast(dev->device, event->element.port_num)) return; index = event->element.port_num - dev->start_port; switch (event->event) { case IB_EVENT_PORT_ERR: case IB_EVENT_LID_CHANGE: case IB_EVENT_CLIENT_REREGISTER: mcast_groups_event(&dev->port[index], MCAST_GROUP_ERROR); break; case IB_EVENT_PKEY_CHANGE: mcast_groups_event(&dev->port[index], MCAST_PKEY_EVENT); break; default: break; } } static int mcast_add_one(struct ib_device *device) { struct mcast_device *dev; struct mcast_port *port; int i; int count = 0; dev = kmalloc(struct_size(dev, port, device->phys_port_cnt), GFP_KERNEL); if (!dev) return -ENOMEM; dev->start_port = rdma_start_port(device); dev->end_port = rdma_end_port(device); for (i = 0; i <= dev->end_port - dev->start_port; i++) { if (!rdma_cap_ib_mcast(device, dev->start_port + i)) continue; port = &dev->port[i]; port->dev = dev; port->port_num = dev->start_port + i; spin_lock_init(&port->lock); port->table = RB_ROOT; init_completion(&port->comp); refcount_set(&port->refcount, 1); ++count; } if (!count) { kfree(dev); return -EOPNOTSUPP; } dev->device = device; ib_set_client_data(device, &mcast_client, dev); INIT_IB_EVENT_HANDLER(&dev->event_handler, device, mcast_event_handler); ib_register_event_handler(&dev->event_handler); return 0; } static void mcast_remove_one(struct ib_device *device, void *client_data) { struct mcast_device *dev = client_data; struct mcast_port *port; int i; ib_unregister_event_handler(&dev->event_handler); flush_workqueue(mcast_wq); for (i = 0; i <= dev->end_port - dev->start_port; i++) { if (rdma_cap_ib_mcast(device, dev->start_port + i)) { port = &dev->port[i]; deref_port(port); wait_for_completion(&port->comp); } } kfree(dev); } int mcast_init(void) { int ret; mcast_wq = alloc_ordered_workqueue("ib_mcast", WQ_MEM_RECLAIM); if (!mcast_wq) return -ENOMEM; ib_sa_register_client(&sa_client); ret = ib_register_client(&mcast_client); if (ret) goto err; return 0; err: ib_sa_unregister_client(&sa_client); destroy_workqueue(mcast_wq); return ret; } void mcast_cleanup(void) { ib_unregister_client(&mcast_client); ib_sa_unregister_client(&sa_client); destroy_workqueue(mcast_wq); } |
| 20 20 19 19 4 2 12 12 12 12 1 12 12 12 12 12 12 12 12 12 12 12 12 1 1 1 47 76 47 78 78 25 11 23 2 25 25 25 25 25 1 76 73 76 4 4 2 2 10 10 4 4 2 2 1 1 1 1 4 1 1 4 3 3 4 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 | // SPDX-License-Identifier: GPL-2.0-only /* * fs/kernfs/inode.c - kernfs inode implementation * * Copyright (c) 2001-3 Patrick Mochel * Copyright (c) 2007 SUSE Linux Products GmbH * Copyright (c) 2007, 2013 Tejun Heo <tj@kernel.org> */ #include <linux/pagemap.h> #include <linux/backing-dev.h> #include <linux/capability.h> #include <linux/errno.h> #include <linux/slab.h> #include <linux/xattr.h> #include <linux/security.h> #include "kernfs-internal.h" static const struct inode_operations kernfs_iops = { .permission = kernfs_iop_permission, .setattr = kernfs_iop_setattr, .getattr = kernfs_iop_getattr, .listxattr = kernfs_iop_listxattr, }; static struct kernfs_iattrs *__kernfs_iattrs(struct kernfs_node *kn, bool alloc) { struct kernfs_iattrs *ret __free(kfree) = NULL; struct kernfs_iattrs *attr; attr = READ_ONCE(kn->iattr); if (attr || !alloc) return attr; ret = kmem_cache_zalloc(kernfs_iattrs_cache, GFP_KERNEL); if (!ret) return NULL; /* assign default attributes */ ret->ia_uid = GLOBAL_ROOT_UID; ret->ia_gid = GLOBAL_ROOT_GID; ktime_get_real_ts64(&ret->ia_atime); ret->ia_mtime = ret->ia_atime; ret->ia_ctime = ret->ia_atime; simple_xattrs_init(&ret->xattrs); atomic_set(&ret->nr_user_xattrs, 0); atomic_set(&ret->user_xattr_size, 0); /* If someone raced us, recognize it. */ if (!try_cmpxchg(&kn->iattr, &attr, ret)) return READ_ONCE(kn->iattr); return no_free_ptr(ret); } static struct kernfs_iattrs *kernfs_iattrs(struct kernfs_node *kn) { return __kernfs_iattrs(kn, true); } static struct kernfs_iattrs *kernfs_iattrs_noalloc(struct kernfs_node *kn) { return __kernfs_iattrs(kn, false); } int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr) { struct kernfs_iattrs *attrs; unsigned int ia_valid = iattr->ia_valid; attrs = kernfs_iattrs(kn); if (!attrs) return -ENOMEM; if (ia_valid & ATTR_UID) attrs->ia_uid = iattr->ia_uid; if (ia_valid & ATTR_GID) attrs->ia_gid = iattr->ia_gid; if (ia_valid & ATTR_ATIME) attrs->ia_atime = iattr->ia_atime; if (ia_valid & ATTR_MTIME) attrs->ia_mtime = iattr->ia_mtime; if (ia_valid & ATTR_CTIME) attrs->ia_ctime = iattr->ia_ctime; if (ia_valid & ATTR_MODE) kn->mode = iattr->ia_mode; return 0; } /** * kernfs_setattr - set iattr on a node * @kn: target node * @iattr: iattr to set * * Return: %0 on success, -errno on failure. */ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr) { int ret; struct kernfs_root *root = kernfs_root(kn); down_write(&root->kernfs_iattr_rwsem); ret = __kernfs_setattr(kn, iattr); up_write(&root->kernfs_iattr_rwsem); return ret; } int kernfs_iop_setattr(struct mnt_idmap *idmap, struct dentry *dentry, struct iattr *iattr) { struct inode *inode = d_inode(dentry); struct kernfs_node *kn = inode->i_private; struct kernfs_root *root; int error; if (!kn) return -EINVAL; root = kernfs_root(kn); down_write(&root->kernfs_iattr_rwsem); error = setattr_prepare(&nop_mnt_idmap, dentry, iattr); if (error) goto out; error = __kernfs_setattr(kn, iattr); if (error) goto out; /* this ignores size changes */ setattr_copy(&nop_mnt_idmap, inode, iattr); out: up_write(&root->kernfs_iattr_rwsem); return error; } ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size) { struct kernfs_node *kn = kernfs_dentry_node(dentry); struct kernfs_iattrs *attrs; attrs = kernfs_iattrs(kn); if (!attrs) return -ENOMEM; return simple_xattr_list(d_inode(dentry), &attrs->xattrs, buf, size); } static inline void set_default_inode_attr(struct inode *inode, umode_t mode) { inode->i_mode = mode; simple_inode_init_ts(inode); } static inline void set_inode_attr(struct inode *inode, struct kernfs_iattrs *attrs) { inode->i_uid = attrs->ia_uid; inode->i_gid = attrs->ia_gid; inode_set_atime_to_ts(inode, attrs->ia_atime); inode_set_mtime_to_ts(inode, attrs->ia_mtime); inode_set_ctime_to_ts(inode, attrs->ia_ctime); } static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode) { struct kernfs_iattrs *attrs; inode->i_mode = kn->mode; attrs = kernfs_iattrs_noalloc(kn); if (attrs) /* * kernfs_node has non-default attributes get them from * persistent copy in kernfs_node. */ set_inode_attr(inode, attrs); if (kernfs_type(kn) == KERNFS_DIR) set_nlink(inode, kn->dir.subdirs + 2); } int kernfs_iop_getattr(struct mnt_idmap *idmap, const struct path *path, struct kstat *stat, u32 request_mask, unsigned int query_flags) { struct inode *inode = d_inode(path->dentry); struct kernfs_node *kn = inode->i_private; struct kernfs_root *root = kernfs_root(kn); down_read(&root->kernfs_iattr_rwsem); kernfs_refresh_inode(kn, inode); generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat); up_read(&root->kernfs_iattr_rwsem); return 0; } static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode) { kernfs_get(kn); inode->i_private = kn; inode->i_mapping->a_ops = &ram_aops; inode->i_op = &kernfs_iops; inode->i_generation = kernfs_gen(kn); set_default_inode_attr(inode, kn->mode); kernfs_refresh_inode(kn, inode); /* initialize inode according to type */ switch (kernfs_type(kn)) { case KERNFS_DIR: inode->i_op = &kernfs_dir_iops; inode->i_fop = &kernfs_dir_fops; if (kn->flags & KERNFS_EMPTY_DIR) make_empty_dir_inode(inode); break; case KERNFS_FILE: inode->i_size = kn->attr.size; inode->i_fop = &kernfs_file_fops; break; case KERNFS_LINK: inode->i_op = &kernfs_symlink_iops; break; default: BUG(); } unlock_new_inode(inode); } /** * kernfs_get_inode - get inode for kernfs_node * @sb: super block * @kn: kernfs_node to allocate inode for * * Get inode for @kn. If such inode doesn't exist, a new inode is * allocated and basics are initialized. New inode is returned * locked. * * Locking: * Kernel thread context (may sleep). * * Return: * Pointer to allocated inode on success, %NULL on failure. */ struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn) { struct inode *inode; inode = iget_locked(sb, kernfs_ino(kn)); if (inode && (inode_state_read_once(inode) & I_NEW)) kernfs_init_inode(kn, inode); return inode; } /* * The kernfs_node serves as both an inode and a directory entry for * kernfs. To prevent the kernfs inode numbers from being freed * prematurely we take a reference to kernfs_node from the kernfs inode. A * super_operations.evict_inode() implementation is needed to drop that * reference upon inode destruction. */ void kernfs_evict_inode(struct inode *inode) { struct kernfs_node *kn = inode->i_private; truncate_inode_pages_final(&inode->i_data); clear_inode(inode); kernfs_put(kn); } int kernfs_iop_permission(struct mnt_idmap *idmap, struct inode *inode, int mask) { struct kernfs_node *kn; struct kernfs_root *root; int ret; if (mask & MAY_NOT_BLOCK) return -ECHILD; kn = inode->i_private; root = kernfs_root(kn); down_read(&root->kernfs_iattr_rwsem); kernfs_refresh_inode(kn, inode); ret = generic_permission(&nop_mnt_idmap, inode, mask); up_read(&root->kernfs_iattr_rwsem); return ret; } int kernfs_xattr_get(struct kernfs_node *kn, const char *name, void *value, size_t size) { struct kernfs_iattrs *attrs = kernfs_iattrs_noalloc(kn); if (!attrs) return -ENODATA; return simple_xattr_get(&attrs->xattrs, name, value, size); } int kernfs_xattr_set(struct kernfs_node *kn, const char *name, const void *value, size_t size, int flags) { struct simple_xattr *old_xattr; struct kernfs_iattrs *attrs; attrs = kernfs_iattrs(kn); if (!attrs) return -ENOMEM; old_xattr = simple_xattr_set(&attrs->xattrs, name, value, size, flags); if (IS_ERR(old_xattr)) return PTR_ERR(old_xattr); simple_xattr_free(old_xattr); return 0; } static int kernfs_vfs_xattr_get(const struct xattr_handler *handler, struct dentry *unused, struct inode *inode, const char *suffix, void *value, size_t size) { const char *name = xattr_full_name(handler, suffix); struct kernfs_node *kn = inode->i_private; return kernfs_xattr_get(kn, name, value, size); } static int kernfs_vfs_xattr_set(const struct xattr_handler *handler, struct mnt_idmap *idmap, struct dentry *unused, struct inode *inode, const char *suffix, const void *value, size_t size, int flags) { const char *name = xattr_full_name(handler, suffix); struct kernfs_node *kn = inode->i_private; return kernfs_xattr_set(kn, name, value, size, flags); } static int kernfs_vfs_user_xattr_add(struct kernfs_node *kn, const char *full_name, struct simple_xattrs *xattrs, const void *value, size_t size, int flags) { struct kernfs_iattrs *attr = kernfs_iattrs_noalloc(kn); atomic_t *sz = &attr->user_xattr_size; atomic_t *nr = &attr->nr_user_xattrs; struct simple_xattr *old_xattr; int ret; if (atomic_inc_return(nr) > KERNFS_MAX_USER_XATTRS) { ret = -ENOSPC; goto dec_count_out; } if (atomic_add_return(size, sz) > KERNFS_USER_XATTR_SIZE_LIMIT) { ret = -ENOSPC; goto dec_size_out; } old_xattr = simple_xattr_set(xattrs, full_name, value, size, flags); if (!old_xattr) return 0; if (IS_ERR(old_xattr)) { ret = PTR_ERR(old_xattr); goto dec_size_out; } ret = 0; size = old_xattr->size; simple_xattr_free(old_xattr); dec_size_out: atomic_sub(size, sz); dec_count_out: atomic_dec(nr); return ret; } static int kernfs_vfs_user_xattr_rm(struct kernfs_node *kn, const char *full_name, struct simple_xattrs *xattrs, const void *value, size_t size, int flags) { struct kernfs_iattrs *attr = kernfs_iattrs_noalloc(kn); atomic_t *sz = &attr->user_xattr_size; atomic_t *nr = &attr->nr_user_xattrs; struct simple_xattr *old_xattr; old_xattr = simple_xattr_set(xattrs, full_name, value, size, flags); if (!old_xattr) return 0; if (IS_ERR(old_xattr)) return PTR_ERR(old_xattr); atomic_sub(old_xattr->size, sz); atomic_dec(nr); simple_xattr_free(old_xattr); return 0; } static int kernfs_vfs_user_xattr_set(const struct xattr_handler *handler, struct mnt_idmap *idmap, struct dentry *unused, struct inode *inode, const char *suffix, const void *value, size_t size, int flags) { const char *full_name = xattr_full_name(handler, suffix); struct kernfs_node *kn = inode->i_private; struct kernfs_iattrs *attrs; if (!(kernfs_root(kn)->flags & KERNFS_ROOT_SUPPORT_USER_XATTR)) return -EOPNOTSUPP; attrs = kernfs_iattrs(kn); if (!attrs) return -ENOMEM; if (value) return kernfs_vfs_user_xattr_add(kn, full_name, &attrs->xattrs, value, size, flags); else return kernfs_vfs_user_xattr_rm(kn, full_name, &attrs->xattrs, value, size, flags); } static const struct xattr_handler kernfs_trusted_xattr_handler = { .prefix = XATTR_TRUSTED_PREFIX, .get = kernfs_vfs_xattr_get, .set = kernfs_vfs_xattr_set, }; static const struct xattr_handler kernfs_security_xattr_handler = { .prefix = XATTR_SECURITY_PREFIX, .get = kernfs_vfs_xattr_get, .set = kernfs_vfs_xattr_set, }; static const struct xattr_handler kernfs_user_xattr_handler = { .prefix = XATTR_USER_PREFIX, .get = kernfs_vfs_xattr_get, .set = kernfs_vfs_user_xattr_set, }; const struct xattr_handler * const kernfs_xattr_handlers[] = { &kernfs_trusted_xattr_handler, &kernfs_security_xattr_handler, &kernfs_user_xattr_handler, NULL }; |
| 24 8 9 25 1 8 1 1 1 1 1 170 1 5 66 4 5 1 13 1 48 49 48 47 46 6 43 42 22 22 3 21 2 1 1 21 42 5 41 37 7 6 2 2 2 1 2 40 49 3 3 3 2 1 54 405 24 391 25 25 190 7 241 15 15 113 6 6 6 6 3 1783 1779 1794 1772 1791 1777 1792 1781 1778 1771 1788 1773 1794 1772 1794 1779 1795 1774 1777 199 198 192 193 9 191 191 11 5 241 243 239 22 241 8 5 5 241 414 41 42 42 42 41 42 5 42 3 3 3 3 3 3 4 3 2 2 2 2 1 1 1 1 2 1 1 13 44 1 1 1 2 2 2 14 415 48 49 45 45 44 42 14 15 13 7 6 106 5 54 170 5 4 66 418 411 415 79 42 1795 432 1778 32 1461 1465 4 3 4 140 37 138 11 1466 1476 32 33 119 1325 1327 32 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 | // SPDX-License-Identifier: GPL-2.0-only #include <linux/kernel.h> #include <linux/skbuff.h> #include <linux/export.h> #include <linux/ip.h> #include <linux/ipv6.h> #include <linux/if_vlan.h> #include <linux/filter.h> #include <net/dsa.h> #include <net/dst_metadata.h> #include <net/ip.h> #include <net/ipv6.h> #include <net/gre.h> #include <net/pptp.h> #include <net/tipc.h> #include <linux/igmp.h> #include <linux/icmp.h> #include <linux/sctp.h> #include <linux/dccp.h> #include <linux/if_tunnel.h> #include <linux/if_pppox.h> #include <linux/ppp_defs.h> #include <linux/stddef.h> #include <linux/if_ether.h> #include <linux/if_hsr.h> #include <linux/mpls.h> #include <linux/tcp.h> #include <linux/ptp_classify.h> #include <net/flow_dissector.h> #include <net/pkt_cls.h> #include <scsi/fc/fc_fcoe.h> #include <uapi/linux/batadv_packet.h> #include <linux/bpf.h> #if IS_ENABLED(CONFIG_NF_CONNTRACK) #include <net/netfilter/nf_conntrack_core.h> #include <net/netfilter/nf_conntrack_labels.h> #endif #include <linux/bpf-netns.h> static void dissector_set_key(struct flow_dissector *flow_dissector, enum flow_dissector_key_id key_id) { flow_dissector->used_keys |= (1ULL << key_id); } void skb_flow_dissector_init(struct flow_dissector *flow_dissector, const struct flow_dissector_key *key, unsigned int key_count) { unsigned int i; memset(flow_dissector, 0, sizeof(*flow_dissector)); for (i = 0; i < key_count; i++, key++) { /* User should make sure that every key target offset is within * boundaries of unsigned short. */ BUG_ON(key->offset > USHRT_MAX); BUG_ON(dissector_uses_key(flow_dissector, key->key_id)); dissector_set_key(flow_dissector, key->key_id); flow_dissector->offset[key->key_id] = key->offset; } /* Ensure that the dissector always includes control and basic key. * That way we are able to avoid handling lack of these in fast path. */ BUG_ON(!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_CONTROL)); BUG_ON(!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_BASIC)); } EXPORT_SYMBOL(skb_flow_dissector_init); #ifdef CONFIG_BPF_SYSCALL int flow_dissector_bpf_prog_attach_check(struct net *net, struct bpf_prog *prog) { enum netns_bpf_attach_type type = NETNS_BPF_FLOW_DISSECTOR; if (net == &init_net) { /* BPF flow dissector in the root namespace overrides * any per-net-namespace one. When attaching to root, * make sure we don't have any BPF program attached * to the non-root namespaces. */ struct net *ns; for_each_net(ns) { if (ns == &init_net) continue; if (rcu_access_pointer(ns->bpf.run_array[type])) return -EEXIST; } } else { /* Make sure root flow dissector is not attached * when attaching to the non-root namespace. */ if (rcu_access_pointer(init_net.bpf.run_array[type])) return -EEXIST; } return 0; } #endif /* CONFIG_BPF_SYSCALL */ /** * skb_flow_get_ports - extract the upper layer ports and return them * @skb: sk_buff to extract the ports from * @thoff: transport header offset * @ip_proto: protocol for which to get port offset * @data: raw buffer pointer to the packet, if NULL use skb->data * @hlen: packet header length, if @data is NULL use skb_headlen(skb) * * The function will try to retrieve the ports at offset thoff + poff where poff * is the protocol port offset returned from proto_ports_offset */ __be32 skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto, const void *data, int hlen) { int poff = proto_ports_offset(ip_proto); if (!data) { data = skb->data; hlen = skb_headlen(skb); } if (poff >= 0) { __be32 *ports, _ports; ports = __skb_header_pointer(skb, thoff + poff, sizeof(_ports), data, hlen, &_ports); if (ports) return *ports; } return 0; } EXPORT_SYMBOL(skb_flow_get_ports); static bool icmp_has_id(u8 type) { switch (type) { case ICMP_ECHO: case ICMP_ECHOREPLY: case ICMP_TIMESTAMP: case ICMP_TIMESTAMPREPLY: case ICMPV6_ECHO_REQUEST: case ICMPV6_ECHO_REPLY: return true; } return false; } /** * skb_flow_get_icmp_tci - extract ICMP(6) Type, Code and Identifier fields * @skb: sk_buff to extract from * @key_icmp: struct flow_dissector_key_icmp to fill * @data: raw buffer pointer to the packet * @thoff: offset to extract at * @hlen: packet header length */ void skb_flow_get_icmp_tci(const struct sk_buff *skb, struct flow_dissector_key_icmp *key_icmp, const void *data, int thoff, int hlen) { struct icmphdr *ih, _ih; ih = __skb_header_pointer(skb, thoff, sizeof(_ih), data, hlen, &_ih); if (!ih) return; key_icmp->type = ih->type; key_icmp->code = ih->code; /* As we use 0 to signal that the Id field is not present, * avoid confusion with packets without such field */ if (icmp_has_id(ih->type)) key_icmp->id = ih->un.echo.id ? ntohs(ih->un.echo.id) : 1; else key_icmp->id = 0; } EXPORT_SYMBOL(skb_flow_get_icmp_tci); /* If FLOW_DISSECTOR_KEY_ICMP is set, dissect an ICMP packet * using skb_flow_get_icmp_tci(). */ static void __skb_flow_dissect_icmp(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, const void *data, int thoff, int hlen) { struct flow_dissector_key_icmp *key_icmp; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ICMP)) return; key_icmp = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_ICMP, target_container); skb_flow_get_icmp_tci(skb, key_icmp, data, thoff, hlen); } static void __skb_flow_dissect_ah(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, const void *data, int nhoff, int hlen) { struct flow_dissector_key_ipsec *key_ah; struct ip_auth_hdr _hdr, *hdr; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_IPSEC)) return; hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data, hlen, &_hdr); if (!hdr) return; key_ah = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_IPSEC, target_container); key_ah->spi = hdr->spi; } static void __skb_flow_dissect_esp(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, const void *data, int nhoff, int hlen) { struct flow_dissector_key_ipsec *key_esp; struct ip_esp_hdr _hdr, *hdr; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_IPSEC)) return; hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data, hlen, &_hdr); if (!hdr) return; key_esp = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_IPSEC, target_container); key_esp->spi = hdr->spi; } static void __skb_flow_dissect_l2tpv3(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, const void *data, int nhoff, int hlen) { struct flow_dissector_key_l2tpv3 *key_l2tpv3; struct { __be32 session_id; } *hdr, _hdr; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_L2TPV3)) return; hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data, hlen, &_hdr); if (!hdr) return; key_l2tpv3 = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_L2TPV3, target_container); key_l2tpv3->session_id = hdr->session_id; } void skb_flow_dissect_meta(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container) { struct flow_dissector_key_meta *meta; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_META)) return; meta = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_META, target_container); meta->ingress_ifindex = skb->skb_iif; #if IS_ENABLED(CONFIG_NET_TC_SKB_EXT) if (tc_skb_ext_tc_enabled()) { struct tc_skb_ext *ext; ext = skb_ext_find(skb, TC_SKB_EXT); if (ext) meta->l2_miss = ext->l2_miss; } #endif } EXPORT_SYMBOL(skb_flow_dissect_meta); static void skb_flow_dissect_set_enc_control(enum flow_dissector_key_id type, u32 ctrl_flags, struct flow_dissector *flow_dissector, void *target_container) { struct flow_dissector_key_control *ctrl; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_CONTROL)) return; ctrl = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_ENC_CONTROL, target_container); ctrl->addr_type = type; ctrl->flags = ctrl_flags; } void skb_flow_dissect_ct(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, u16 *ctinfo_map, size_t mapsize, bool post_ct, u16 zone) { #if IS_ENABLED(CONFIG_NF_CONNTRACK) struct flow_dissector_key_ct *key; enum ip_conntrack_info ctinfo; struct nf_conn_labels *cl; struct nf_conn *ct; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_CT)) return; ct = nf_ct_get(skb, &ctinfo); if (!ct && !post_ct) return; key = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_CT, target_container); if (!ct) { key->ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED | TCA_FLOWER_KEY_CT_FLAGS_INVALID; key->ct_zone = zone; return; } if (ctinfo < mapsize) key->ct_state = ctinfo_map[ctinfo]; #if IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) key->ct_zone = ct->zone.id; #endif #if IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) key->ct_mark = READ_ONCE(ct->mark); #endif cl = nf_ct_labels_find(ct); if (cl) memcpy(key->ct_labels, cl->bits, sizeof(key->ct_labels)); #endif /* CONFIG_NF_CONNTRACK */ } EXPORT_SYMBOL(skb_flow_dissect_ct); void skb_flow_dissect_tunnel_info(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container) { struct ip_tunnel_info *info; struct ip_tunnel_key *key; u32 ctrl_flags = 0; /* A quick check to see if there might be something to do. */ if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_KEYID) && !dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS) && !dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS) && !dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_CONTROL) && !dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_PORTS) && !dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_IP) && !dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_OPTS)) return; info = skb_tunnel_info(skb); if (!info) return; key = &info->key; if (test_bit(IP_TUNNEL_CSUM_BIT, key->tun_flags)) ctrl_flags |= FLOW_DIS_F_TUNNEL_CSUM; if (test_bit(IP_TUNNEL_DONT_FRAGMENT_BIT, key->tun_flags)) ctrl_flags |= FLOW_DIS_F_TUNNEL_DONT_FRAGMENT; if (test_bit(IP_TUNNEL_OAM_BIT, key->tun_flags)) ctrl_flags |= FLOW_DIS_F_TUNNEL_OAM; if (test_bit(IP_TUNNEL_CRIT_OPT_BIT, key->tun_flags)) ctrl_flags |= FLOW_DIS_F_TUNNEL_CRIT_OPT; switch (ip_tunnel_info_af(info)) { case AF_INET: skb_flow_dissect_set_enc_control(FLOW_DISSECTOR_KEY_IPV4_ADDRS, ctrl_flags, flow_dissector, target_container); if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS)) { struct flow_dissector_key_ipv4_addrs *ipv4; ipv4 = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS, target_container); ipv4->src = key->u.ipv4.src; ipv4->dst = key->u.ipv4.dst; } break; case AF_INET6: skb_flow_dissect_set_enc_control(FLOW_DISSECTOR_KEY_IPV6_ADDRS, ctrl_flags, flow_dissector, target_container); if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS)) { struct flow_dissector_key_ipv6_addrs *ipv6; ipv6 = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS, target_container); ipv6->src = key->u.ipv6.src; ipv6->dst = key->u.ipv6.dst; } break; default: skb_flow_dissect_set_enc_control(0, ctrl_flags, flow_dissector, target_container); break; } if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_KEYID)) { struct flow_dissector_key_keyid *keyid; keyid = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_ENC_KEYID, target_container); keyid->keyid = tunnel_id_to_key32(key->tun_id); } if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_PORTS)) { struct flow_dissector_key_ports *tp; tp = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_ENC_PORTS, target_container); tp->src = key->tp_src; tp->dst = key->tp_dst; } if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_IP)) { struct flow_dissector_key_ip *ip; ip = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_ENC_IP, target_container); ip->tos = key->tos; ip->ttl = key->ttl; } if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_OPTS)) { struct flow_dissector_key_enc_opts *enc_opt; IP_TUNNEL_DECLARE_FLAGS(flags) = { }; u32 val; enc_opt = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_ENC_OPTS, target_container); if (!info->options_len) return; enc_opt->len = info->options_len; ip_tunnel_info_opts_get(enc_opt->data, info); ip_tunnel_set_options_present(flags); ip_tunnel_flags_and(flags, info->key.tun_flags, flags); val = find_next_bit(flags, __IP_TUNNEL_FLAG_NUM, IP_TUNNEL_GENEVE_OPT_BIT); enc_opt->dst_opt_type = val < __IP_TUNNEL_FLAG_NUM ? val : 0; } } EXPORT_SYMBOL(skb_flow_dissect_tunnel_info); void skb_flow_dissect_hash(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container) { struct flow_dissector_key_hash *key; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_HASH)) return; key = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_HASH, target_container); key->hash = skb_get_hash_raw(skb); } EXPORT_SYMBOL(skb_flow_dissect_hash); static enum flow_dissect_ret __skb_flow_dissect_mpls(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, const void *data, int nhoff, int hlen, int lse_index, bool *entropy_label) { struct mpls_label *hdr, _hdr; u32 entry, label, bos; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_MPLS_ENTROPY) && !dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_MPLS)) return FLOW_DISSECT_RET_OUT_GOOD; if (lse_index >= FLOW_DIS_MPLS_MAX) return FLOW_DISSECT_RET_OUT_GOOD; hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data, hlen, &_hdr); if (!hdr) return FLOW_DISSECT_RET_OUT_BAD; entry = ntohl(hdr->entry); label = (entry & MPLS_LS_LABEL_MASK) >> MPLS_LS_LABEL_SHIFT; bos = (entry & MPLS_LS_S_MASK) >> MPLS_LS_S_SHIFT; if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_MPLS)) { struct flow_dissector_key_mpls *key_mpls; struct flow_dissector_mpls_lse *lse; key_mpls = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_MPLS, target_container); lse = &key_mpls->ls[lse_index]; lse->mpls_ttl = (entry & MPLS_LS_TTL_MASK) >> MPLS_LS_TTL_SHIFT; lse->mpls_bos = bos; lse->mpls_tc = (entry & MPLS_LS_TC_MASK) >> MPLS_LS_TC_SHIFT; lse->mpls_label = label; dissector_set_mpls_lse(key_mpls, lse_index); } if (*entropy_label && dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_MPLS_ENTROPY)) { struct flow_dissector_key_keyid *key_keyid; key_keyid = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_MPLS_ENTROPY, target_container); key_keyid->keyid = cpu_to_be32(label); } *entropy_label = label == MPLS_LABEL_ENTROPY; return bos ? FLOW_DISSECT_RET_OUT_GOOD : FLOW_DISSECT_RET_PROTO_AGAIN; } static enum flow_dissect_ret __skb_flow_dissect_arp(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, const void *data, int nhoff, int hlen) { struct flow_dissector_key_arp *key_arp; struct { unsigned char ar_sha[ETH_ALEN]; unsigned char ar_sip[4]; unsigned char ar_tha[ETH_ALEN]; unsigned char ar_tip[4]; } *arp_eth, _arp_eth; const struct arphdr *arp; struct arphdr _arp; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ARP)) return FLOW_DISSECT_RET_OUT_GOOD; arp = __skb_header_pointer(skb, nhoff, sizeof(_arp), data, hlen, &_arp); if (!arp) return FLOW_DISSECT_RET_OUT_BAD; if (arp->ar_hrd != htons(ARPHRD_ETHER) || arp->ar_pro != htons(ETH_P_IP) || arp->ar_hln != ETH_ALEN || arp->ar_pln != 4 || (arp->ar_op != htons(ARPOP_REPLY) && arp->ar_op != htons(ARPOP_REQUEST))) return FLOW_DISSECT_RET_OUT_BAD; arp_eth = __skb_header_pointer(skb, nhoff + sizeof(_arp), sizeof(_arp_eth), data, hlen, &_arp_eth); if (!arp_eth) return FLOW_DISSECT_RET_OUT_BAD; key_arp = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_ARP, target_container); memcpy(&key_arp->sip, arp_eth->ar_sip, sizeof(key_arp->sip)); memcpy(&key_arp->tip, arp_eth->ar_tip, sizeof(key_arp->tip)); /* Only store the lower byte of the opcode; * this covers ARPOP_REPLY and ARPOP_REQUEST. */ key_arp->op = ntohs(arp->ar_op) & 0xff; ether_addr_copy(key_arp->sha, arp_eth->ar_sha); ether_addr_copy(key_arp->tha, arp_eth->ar_tha); return FLOW_DISSECT_RET_OUT_GOOD; } static enum flow_dissect_ret __skb_flow_dissect_cfm(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, const void *data, int nhoff, int hlen) { struct flow_dissector_key_cfm *key, *hdr, _hdr; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_CFM)) return FLOW_DISSECT_RET_OUT_GOOD; hdr = __skb_header_pointer(skb, nhoff, sizeof(*key), data, hlen, &_hdr); if (!hdr) return FLOW_DISSECT_RET_OUT_BAD; key = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_CFM, target_container); key->mdl_ver = hdr->mdl_ver; key->opcode = hdr->opcode; return FLOW_DISSECT_RET_OUT_GOOD; } static enum flow_dissect_ret __skb_flow_dissect_gre(const struct sk_buff *skb, struct flow_dissector_key_control *key_control, struct flow_dissector *flow_dissector, void *target_container, const void *data, __be16 *p_proto, int *p_nhoff, int *p_hlen, unsigned int flags) { struct flow_dissector_key_keyid *key_keyid; struct gre_base_hdr *hdr, _hdr; int offset = 0; u16 gre_ver; hdr = __skb_header_pointer(skb, *p_nhoff, sizeof(_hdr), data, *p_hlen, &_hdr); if (!hdr) return FLOW_DISSECT_RET_OUT_BAD; /* Only look inside GRE without routing */ if (hdr->flags & GRE_ROUTING) return FLOW_DISSECT_RET_OUT_GOOD; /* Only look inside GRE for version 0 and 1 */ gre_ver = ntohs(hdr->flags & GRE_VERSION); if (gre_ver > 1) return FLOW_DISSECT_RET_OUT_GOOD; *p_proto = hdr->protocol; if (gre_ver) { /* Version1 must be PPTP, and check the flags */ if (!(*p_proto == GRE_PROTO_PPP && (hdr->flags & GRE_KEY))) return FLOW_DISSECT_RET_OUT_GOOD; } offset += sizeof(struct gre_base_hdr); if (hdr->flags & GRE_CSUM) offset += sizeof_field(struct gre_full_hdr, csum) + sizeof_field(struct gre_full_hdr, reserved1); if (hdr->flags & GRE_KEY) { const __be32 *keyid; __be32 _keyid; keyid = __skb_header_pointer(skb, *p_nhoff + offset, sizeof(_keyid), data, *p_hlen, &_keyid); if (!keyid) return FLOW_DISSECT_RET_OUT_BAD; if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_GRE_KEYID)) { key_keyid = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_GRE_KEYID, target_container); if (gre_ver == 0) key_keyid->keyid = *keyid; else key_keyid->keyid = *keyid & GRE_PPTP_KEY_MASK; } offset += sizeof_field(struct gre_full_hdr, key); } if (hdr->flags & GRE_SEQ) offset += sizeof_field(struct pptp_gre_header, seq); if (gre_ver == 0) { if (*p_proto == htons(ETH_P_TEB)) { const struct ethhdr *eth; struct ethhdr _eth; eth = __skb_header_pointer(skb, *p_nhoff + offset, sizeof(_eth), data, *p_hlen, &_eth); if (!eth) return FLOW_DISSECT_RET_OUT_BAD; *p_proto = eth->h_proto; offset += sizeof(*eth); /* Cap headers that we access via pointers at the * end of the Ethernet header as our maximum alignment * at that point is only 2 bytes. */ if (NET_IP_ALIGN) *p_hlen = *p_nhoff + offset; } } else { /* version 1, must be PPTP */ u8 _ppp_hdr[PPP_HDRLEN]; u8 *ppp_hdr; if (hdr->flags & GRE_ACK) offset += sizeof_field(struct pptp_gre_header, ack); ppp_hdr = __skb_header_pointer(skb, *p_nhoff + offset, sizeof(_ppp_hdr), data, *p_hlen, _ppp_hdr); if (!ppp_hdr) return FLOW_DISSECT_RET_OUT_BAD; switch (PPP_PROTOCOL(ppp_hdr)) { case PPP_IP: *p_proto = htons(ETH_P_IP); break; case PPP_IPV6: *p_proto = htons(ETH_P_IPV6); break; default: /* Could probably catch some more like MPLS */ break; } offset += PPP_HDRLEN; } *p_nhoff += offset; key_control->flags |= FLOW_DIS_ENCAPSULATION; if (flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP) return FLOW_DISSECT_RET_OUT_GOOD; return FLOW_DISSECT_RET_PROTO_AGAIN; } /** * __skb_flow_dissect_batadv() - dissect batman-adv header * @skb: sk_buff to with the batman-adv header * @key_control: flow dissectors control key * @data: raw buffer pointer to the packet, if NULL use skb->data * @p_proto: pointer used to update the protocol to process next * @p_nhoff: pointer used to update inner network header offset * @hlen: packet header length * @flags: any combination of FLOW_DISSECTOR_F_* * * ETH_P_BATMAN packets are tried to be dissected. Only * &struct batadv_unicast packets are actually processed because they contain an * inner ethernet header and are usually followed by actual network header. This * allows the flow dissector to continue processing the packet. * * Return: FLOW_DISSECT_RET_PROTO_AGAIN when &struct batadv_unicast was found, * FLOW_DISSECT_RET_OUT_GOOD when dissector should stop after encapsulation, * otherwise FLOW_DISSECT_RET_OUT_BAD */ static enum flow_dissect_ret __skb_flow_dissect_batadv(const struct sk_buff *skb, struct flow_dissector_key_control *key_control, const void *data, __be16 *p_proto, int *p_nhoff, int hlen, unsigned int flags) { struct { struct batadv_unicast_packet batadv_unicast; struct ethhdr eth; } *hdr, _hdr; hdr = __skb_header_pointer(skb, *p_nhoff, sizeof(_hdr), data, hlen, &_hdr); if (!hdr) return FLOW_DISSECT_RET_OUT_BAD; if (hdr->batadv_unicast.version != BATADV_COMPAT_VERSION) return FLOW_DISSECT_RET_OUT_BAD; if (hdr->batadv_unicast.packet_type != BATADV_UNICAST) return FLOW_DISSECT_RET_OUT_BAD; *p_proto = hdr->eth.h_proto; *p_nhoff += sizeof(*hdr); key_control->flags |= FLOW_DIS_ENCAPSULATION; if (flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP) return FLOW_DISSECT_RET_OUT_GOOD; return FLOW_DISSECT_RET_PROTO_AGAIN; } static void __skb_flow_dissect_tcp(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, const void *data, int thoff, int hlen) { struct flow_dissector_key_tcp *key_tcp; struct tcphdr *th, _th; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_TCP)) return; th = __skb_header_pointer(skb, thoff, sizeof(_th), data, hlen, &_th); if (!th) return; if (unlikely(__tcp_hdrlen(th) < sizeof(_th))) return; key_tcp = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_TCP, target_container); key_tcp->flags = (*(__be16 *) &tcp_flag_word(th) & htons(0x0FFF)); } static void __skb_flow_dissect_ports(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, const void *data, int nhoff, u8 ip_proto, int hlen) { struct flow_dissector_key_ports_range *key_ports_range = NULL; struct flow_dissector_key_ports *key_ports = NULL; __be32 ports; if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_PORTS)) key_ports = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_PORTS, target_container); if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_PORTS_RANGE)) key_ports_range = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_PORTS_RANGE, target_container); if (!key_ports && !key_ports_range) return; ports = skb_flow_get_ports(skb, nhoff, ip_proto, data, hlen); if (key_ports) key_ports->ports = ports; if (key_ports_range) key_ports_range->tp.ports = ports; } static void __skb_flow_dissect_ipv4(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, const void *data, const struct iphdr *iph) { struct flow_dissector_key_ip *key_ip; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_IP)) return; key_ip = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_IP, target_container); key_ip->tos = iph->tos; key_ip->ttl = iph->ttl; } static void __skb_flow_dissect_ipv6(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, const void *data, const struct ipv6hdr *iph) { struct flow_dissector_key_ip *key_ip; if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_IP)) return; key_ip = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_IP, target_container); key_ip->tos = ipv6_get_dsfield(iph); key_ip->ttl = iph->hop_limit; } /* Maximum number of protocol headers that can be parsed in * __skb_flow_dissect */ #define MAX_FLOW_DISSECT_HDRS 15 static bool skb_flow_dissect_allowed(int *num_hdrs) { ++*num_hdrs; return (*num_hdrs <= MAX_FLOW_DISSECT_HDRS); } static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys, struct flow_dissector *flow_dissector, void *target_container) { struct flow_dissector_key_ports_range *key_ports_range = NULL; struct flow_dissector_key_ports *key_ports = NULL; struct flow_dissector_key_control *key_control; struct flow_dissector_key_basic *key_basic; struct flow_dissector_key_addrs *key_addrs; struct flow_dissector_key_tags *key_tags; key_control = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_CONTROL, target_container); key_control->thoff = flow_keys->thoff; if (flow_keys->is_frag) key_control->flags |= FLOW_DIS_IS_FRAGMENT; if (flow_keys->is_first_frag) key_control->flags |= FLOW_DIS_FIRST_FRAG; if (flow_keys->is_encap) key_control->flags |= FLOW_DIS_ENCAPSULATION; key_basic = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_BASIC, target_container); key_basic->n_proto = flow_keys->n_proto; key_basic->ip_proto = flow_keys->ip_proto; if (flow_keys->addr_proto == ETH_P_IP && dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_IPV4_ADDRS)) { key_addrs = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_IPV4_ADDRS, target_container); key_addrs->v4addrs.src = flow_keys->ipv4_src; key_addrs->v4addrs.dst = flow_keys->ipv4_dst; key_control->addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS; } else if (flow_keys->addr_proto == ETH_P_IPV6 && dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_IPV6_ADDRS)) { key_addrs = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_IPV6_ADDRS, target_container); memcpy(&key_addrs->v6addrs.src, &flow_keys->ipv6_src, sizeof(key_addrs->v6addrs.src)); memcpy(&key_addrs->v6addrs.dst, &flow_keys->ipv6_dst, sizeof(key_addrs->v6addrs.dst)); key_control->addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS; } if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_PORTS)) { key_ports = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_PORTS, target_container); key_ports->src = flow_keys->sport; key_ports->dst = flow_keys->dport; } if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_PORTS_RANGE)) { key_ports_range = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_PORTS_RANGE, target_container); key_ports_range->tp.src = flow_keys->sport; key_ports_range->tp.dst = flow_keys->dport; } if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_FLOW_LABEL)) { key_tags = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_FLOW_LABEL, target_container); key_tags->flow_label = ntohl(flow_keys->flow_label); } } u32 bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx, __be16 proto, int nhoff, int hlen, unsigned int flags) { struct bpf_flow_keys *flow_keys = ctx->flow_keys; u32 result; /* Pass parameters to the BPF program */ memset(flow_keys, 0, sizeof(*flow_keys)); flow_keys->n_proto = proto; flow_keys->nhoff = nhoff; flow_keys->thoff = flow_keys->nhoff; BUILD_BUG_ON((int)BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG != (int)FLOW_DISSECTOR_F_PARSE_1ST_FRAG); BUILD_BUG_ON((int)BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL != (int)FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL); BUILD_BUG_ON((int)BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP != (int)FLOW_DISSECTOR_F_STOP_AT_ENCAP); flow_keys->flags = flags; result = bpf_prog_run_pin_on_cpu(prog, ctx); flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, nhoff, hlen); flow_keys->thoff = clamp_t(u16, flow_keys->thoff, flow_keys->nhoff, hlen); return result; } static bool is_pppoe_ses_hdr_valid(const struct pppoe_hdr *hdr) { return hdr->ver == 1 && hdr->type == 1 && hdr->code == 0; } /** * __skb_flow_dissect - extract the flow_keys struct and return it * @net: associated network namespace, derived from @skb if NULL * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified * @flow_dissector: list of keys to dissect * @target_container: target structure to put dissected values into * @data: raw buffer pointer to the packet, if NULL use skb->data * @proto: protocol for which to get the flow, if @data is NULL use skb->protocol * @nhoff: network header offset, if @data is NULL use skb_network_offset(skb) * @hlen: packet header length, if @data is NULL use skb_headlen(skb) * @flags: flags that control the dissection process, e.g. * FLOW_DISSECTOR_F_STOP_AT_ENCAP. * * The function will try to retrieve individual keys into target specified * by flow_dissector from either the skbuff or a raw buffer specified by the * rest parameters. * * Caller must take care of zeroing target container memory. */ bool __skb_flow_dissect(const struct net *net, const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, const void *data, __be16 proto, int nhoff, int hlen, unsigned int flags) { struct flow_dissector_key_control *key_control; struct flow_dissector_key_basic *key_basic; struct flow_dissector_key_addrs *key_addrs; struct flow_dissector_key_tags *key_tags; struct flow_dissector_key_vlan *key_vlan; enum flow_dissect_ret fdret; enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX; bool mpls_el = false; int mpls_lse = 0; int num_hdrs = 0; u8 ip_proto = 0; bool ret; if (!data) { data = skb->data; proto = skb_vlan_tag_present(skb) ? skb->vlan_proto : skb->protocol; nhoff = skb_network_offset(skb); hlen = skb_headlen(skb); #if IS_ENABLED(CONFIG_NET_DSA) if (unlikely(skb->dev && netdev_uses_dsa(skb->dev) && proto == htons(ETH_P_XDSA))) { struct metadata_dst *md_dst = skb_metadata_dst(skb); const struct dsa_device_ops *ops; int offset = 0; ops = skb->dev->dsa_ptr->tag_ops; /* Only DSA header taggers break flow dissection */ if (ops->needed_headroom && (!md_dst || md_dst->type != METADATA_HW_PORT_MUX)) { if (ops->flow_dissect) ops->flow_dissect(skb, &proto, &offset); else dsa_tag_generic_flow_dissect(skb, &proto, &offset); hlen -= offset; nhoff += offset; } } #endif } /* It is ensured by skb_flow_dissector_init() that control key will * be always present. */ key_control = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_CONTROL, target_container); /* It is ensured by skb_flow_dissector_init() that basic key will * be always present. */ key_basic = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_BASIC, target_container); rcu_read_lock(); if (skb) { if (!net) { if (skb->dev) net = dev_net_rcu(skb->dev); else if (skb->sk) net = sock_net(skb->sk); } } DEBUG_NET_WARN_ON_ONCE(!net); if (net) { enum netns_bpf_attach_type type = NETNS_BPF_FLOW_DISSECTOR; struct bpf_prog_array *run_array; run_array = rcu_dereference(init_net.bpf.run_array[type]); if (!run_array) run_array = rcu_dereference(net->bpf.run_array[type]); if (run_array) { struct bpf_flow_keys flow_keys; struct bpf_flow_dissector ctx = { .flow_keys = &flow_keys, .data = data, .data_end = data + hlen, }; __be16 n_proto = proto; struct bpf_prog *prog; u32 result; if (skb) { ctx.skb = skb; /* we can't use 'proto' in the skb case * because it might be set to skb->vlan_proto * which has been pulled from the data */ n_proto = skb->protocol; } prog = READ_ONCE(run_array->items[0].prog); result = bpf_flow_dissect(prog, &ctx, n_proto, nhoff, hlen, flags); if (result != BPF_FLOW_DISSECTOR_CONTINUE) { __skb_flow_bpf_to_target(&flow_keys, flow_dissector, target_container); rcu_read_unlock(); return result == BPF_OK; } } } rcu_read_unlock(); if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) { struct ethhdr *eth = eth_hdr(skb); struct flow_dissector_key_eth_addrs *key_eth_addrs; key_eth_addrs = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS, target_container); memcpy(key_eth_addrs, eth, sizeof(*key_eth_addrs)); } if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_NUM_OF_VLANS)) { struct flow_dissector_key_num_of_vlans *key_num_of_vlans; key_num_of_vlans = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_NUM_OF_VLANS, target_container); key_num_of_vlans->num_of_vlans = 0; } proto_again: fdret = FLOW_DISSECT_RET_CONTINUE; switch (proto) { case htons(ETH_P_IP): { const struct iphdr *iph; struct iphdr _iph; iph = __skb_header_pointer(skb, nhoff, sizeof(_iph), data, hlen, &_iph); if (!iph || iph->ihl < 5) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } nhoff += iph->ihl * 4; ip_proto = iph->protocol; if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_IPV4_ADDRS)) { key_addrs = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_IPV4_ADDRS, target_container); memcpy(&key_addrs->v4addrs.src, &iph->saddr, sizeof(key_addrs->v4addrs.src)); memcpy(&key_addrs->v4addrs.dst, &iph->daddr, sizeof(key_addrs->v4addrs.dst)); key_control->addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS; } __skb_flow_dissect_ipv4(skb, flow_dissector, target_container, data, iph); if (ip_is_fragment(iph)) { key_control->flags |= FLOW_DIS_IS_FRAGMENT; if (iph->frag_off & htons(IP_OFFSET)) { fdret = FLOW_DISSECT_RET_OUT_GOOD; break; } else { key_control->flags |= FLOW_DIS_FIRST_FRAG; if (!(flags & FLOW_DISSECTOR_F_PARSE_1ST_FRAG)) { fdret = FLOW_DISSECT_RET_OUT_GOOD; break; } } } break; } case htons(ETH_P_IPV6): { const struct ipv6hdr *iph; struct ipv6hdr _iph; iph = __skb_header_pointer(skb, nhoff, sizeof(_iph), data, hlen, &_iph); if (!iph) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } ip_proto = iph->nexthdr; nhoff += sizeof(struct ipv6hdr); if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_IPV6_ADDRS)) { key_addrs = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_IPV6_ADDRS, target_container); memcpy(&key_addrs->v6addrs.src, &iph->saddr, sizeof(key_addrs->v6addrs.src)); memcpy(&key_addrs->v6addrs.dst, &iph->daddr, sizeof(key_addrs->v6addrs.dst)); key_control->addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS; } if ((dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_FLOW_LABEL) || (flags & FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL)) && ip6_flowlabel(iph)) { __be32 flow_label = ip6_flowlabel(iph); if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_FLOW_LABEL)) { key_tags = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_FLOW_LABEL, target_container); key_tags->flow_label = ntohl(flow_label); } if (flags & FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL) { fdret = FLOW_DISSECT_RET_OUT_GOOD; break; } } __skb_flow_dissect_ipv6(skb, flow_dissector, target_container, data, iph); break; } case htons(ETH_P_8021AD): case htons(ETH_P_8021Q): { const struct vlan_hdr *vlan = NULL; struct vlan_hdr _vlan; __be16 saved_vlan_tpid = proto; if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX && skb && skb_vlan_tag_present(skb)) { proto = skb->protocol; } else { vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan), data, hlen, &_vlan); if (!vlan) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } proto = vlan->h_vlan_encapsulated_proto; nhoff += sizeof(*vlan); } if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_NUM_OF_VLANS) && !(key_control->flags & FLOW_DIS_ENCAPSULATION)) { struct flow_dissector_key_num_of_vlans *key_nvs; key_nvs = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_NUM_OF_VLANS, target_container); key_nvs->num_of_vlans++; } if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX) { dissector_vlan = FLOW_DISSECTOR_KEY_VLAN; } else if (dissector_vlan == FLOW_DISSECTOR_KEY_VLAN) { dissector_vlan = FLOW_DISSECTOR_KEY_CVLAN; } else { fdret = FLOW_DISSECT_RET_PROTO_AGAIN; break; } if (dissector_uses_key(flow_dissector, dissector_vlan)) { key_vlan = skb_flow_dissector_target(flow_dissector, dissector_vlan, target_container); if (!vlan) { key_vlan->vlan_id = skb_vlan_tag_get_id(skb); key_vlan->vlan_priority = skb_vlan_tag_get_prio(skb); } else { key_vlan->vlan_id = ntohs(vlan->h_vlan_TCI) & VLAN_VID_MASK; key_vlan->vlan_priority = (ntohs(vlan->h_vlan_TCI) & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT; } key_vlan->vlan_tpid = saved_vlan_tpid; key_vlan->vlan_eth_type = proto; } fdret = FLOW_DISSECT_RET_PROTO_AGAIN; break; } case htons(ETH_P_PPP_SES): { struct { struct pppoe_hdr hdr; __be16 proto; } *hdr, _hdr; u16 ppp_proto; hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data, hlen, &_hdr); if (!hdr) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } if (!is_pppoe_ses_hdr_valid(&hdr->hdr)) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } /* least significant bit of the most significant octet * indicates if protocol field was compressed */ ppp_proto = ntohs(hdr->proto); if (ppp_proto & 0x0100) { ppp_proto = ppp_proto >> 8; nhoff += PPPOE_SES_HLEN - 1; } else { nhoff += PPPOE_SES_HLEN; } if (ppp_proto == PPP_IP) { proto = htons(ETH_P_IP); fdret = FLOW_DISSECT_RET_PROTO_AGAIN; } else if (ppp_proto == PPP_IPV6) { proto = htons(ETH_P_IPV6); fdret = FLOW_DISSECT_RET_PROTO_AGAIN; } else if (ppp_proto == PPP_MPLS_UC) { proto = htons(ETH_P_MPLS_UC); fdret = FLOW_DISSECT_RET_PROTO_AGAIN; } else if (ppp_proto == PPP_MPLS_MC) { proto = htons(ETH_P_MPLS_MC); fdret = FLOW_DISSECT_RET_PROTO_AGAIN; } else if (ppp_proto_is_valid(ppp_proto)) { fdret = FLOW_DISSECT_RET_OUT_GOOD; } else { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_PPPOE)) { struct flow_dissector_key_pppoe *key_pppoe; key_pppoe = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_PPPOE, target_container); key_pppoe->session_id = hdr->hdr.sid; key_pppoe->ppp_proto = htons(ppp_proto); key_pppoe->type = htons(ETH_P_PPP_SES); } break; } case htons(ETH_P_TIPC): { struct tipc_basic_hdr *hdr, _hdr; hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data, hlen, &_hdr); if (!hdr) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_TIPC)) { key_addrs = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_TIPC, target_container); key_addrs->tipckey.key = tipc_hdr_rps_key(hdr); key_control->addr_type = FLOW_DISSECTOR_KEY_TIPC; } fdret = FLOW_DISSECT_RET_OUT_GOOD; break; } case htons(ETH_P_MPLS_UC): case htons(ETH_P_MPLS_MC): fdret = __skb_flow_dissect_mpls(skb, flow_dissector, target_container, data, nhoff, hlen, mpls_lse, &mpls_el); nhoff += sizeof(struct mpls_label); mpls_lse++; break; case htons(ETH_P_FCOE): if ((hlen - nhoff) < FCOE_HEADER_LEN) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } nhoff += FCOE_HEADER_LEN; fdret = FLOW_DISSECT_RET_OUT_GOOD; break; case htons(ETH_P_ARP): case htons(ETH_P_RARP): fdret = __skb_flow_dissect_arp(skb, flow_dissector, target_container, data, nhoff, hlen); break; case htons(ETH_P_BATMAN): fdret = __skb_flow_dissect_batadv(skb, key_control, data, &proto, &nhoff, hlen, flags); break; case htons(ETH_P_1588): { struct ptp_header *hdr, _hdr; hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data, hlen, &_hdr); if (!hdr) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } nhoff += sizeof(struct ptp_header); fdret = FLOW_DISSECT_RET_OUT_GOOD; break; } case htons(ETH_P_PRP): case htons(ETH_P_HSR): { struct hsr_tag *hdr, _hdr; hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data, hlen, &_hdr); if (!hdr) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } proto = hdr->encap_proto; nhoff += HSR_HLEN; fdret = FLOW_DISSECT_RET_PROTO_AGAIN; break; } case htons(ETH_P_CFM): fdret = __skb_flow_dissect_cfm(skb, flow_dissector, target_container, data, nhoff, hlen); break; default: fdret = FLOW_DISSECT_RET_OUT_BAD; break; } /* Process result of proto processing */ switch (fdret) { case FLOW_DISSECT_RET_OUT_GOOD: goto out_good; case FLOW_DISSECT_RET_PROTO_AGAIN: if (skb_flow_dissect_allowed(&num_hdrs)) goto proto_again; goto out_good; case FLOW_DISSECT_RET_CONTINUE: case FLOW_DISSECT_RET_IPPROTO_AGAIN: break; case FLOW_DISSECT_RET_OUT_BAD: default: goto out_bad; } ip_proto_again: fdret = FLOW_DISSECT_RET_CONTINUE; switch (ip_proto) { case IPPROTO_GRE: if (flags & FLOW_DISSECTOR_F_STOP_BEFORE_ENCAP) { fdret = FLOW_DISSECT_RET_OUT_GOOD; break; } fdret = __skb_flow_dissect_gre(skb, key_control, flow_dissector, target_container, data, &proto, &nhoff, &hlen, flags); break; case NEXTHDR_HOP: case NEXTHDR_ROUTING: case NEXTHDR_DEST: { u8 _opthdr[2], *opthdr; if (proto != htons(ETH_P_IPV6)) break; opthdr = __skb_header_pointer(skb, nhoff, sizeof(_opthdr), data, hlen, &_opthdr); if (!opthdr) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } ip_proto = opthdr[0]; nhoff += (opthdr[1] + 1) << 3; fdret = FLOW_DISSECT_RET_IPPROTO_AGAIN; break; } case NEXTHDR_FRAGMENT: { struct frag_hdr _fh, *fh; if (proto != htons(ETH_P_IPV6)) break; fh = __skb_header_pointer(skb, nhoff, sizeof(_fh), data, hlen, &_fh); if (!fh) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } key_control->flags |= FLOW_DIS_IS_FRAGMENT; nhoff += sizeof(_fh); ip_proto = fh->nexthdr; if (!(fh->frag_off & htons(IP6_OFFSET))) { key_control->flags |= FLOW_DIS_FIRST_FRAG; if (flags & FLOW_DISSECTOR_F_PARSE_1ST_FRAG) { fdret = FLOW_DISSECT_RET_IPPROTO_AGAIN; break; } } fdret = FLOW_DISSECT_RET_OUT_GOOD; break; } case IPPROTO_IPIP: if (flags & FLOW_DISSECTOR_F_STOP_BEFORE_ENCAP) { fdret = FLOW_DISSECT_RET_OUT_GOOD; break; } proto = htons(ETH_P_IP); key_control->flags |= FLOW_DIS_ENCAPSULATION; if (flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP) { fdret = FLOW_DISSECT_RET_OUT_GOOD; break; } fdret = FLOW_DISSECT_RET_PROTO_AGAIN; break; case IPPROTO_IPV6: if (flags & FLOW_DISSECTOR_F_STOP_BEFORE_ENCAP) { fdret = FLOW_DISSECT_RET_OUT_GOOD; break; } proto = htons(ETH_P_IPV6); key_control->flags |= FLOW_DIS_ENCAPSULATION; if (flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP) { fdret = FLOW_DISSECT_RET_OUT_GOOD; break; } fdret = FLOW_DISSECT_RET_PROTO_AGAIN; break; case IPPROTO_MPLS: proto = htons(ETH_P_MPLS_UC); fdret = FLOW_DISSECT_RET_PROTO_AGAIN; break; case IPPROTO_TCP: __skb_flow_dissect_tcp(skb, flow_dissector, target_container, data, nhoff, hlen); break; case IPPROTO_ICMP: case IPPROTO_ICMPV6: __skb_flow_dissect_icmp(skb, flow_dissector, target_container, data, nhoff, hlen); break; case IPPROTO_L2TP: __skb_flow_dissect_l2tpv3(skb, flow_dissector, target_container, data, nhoff, hlen); break; case IPPROTO_ESP: __skb_flow_dissect_esp(skb, flow_dissector, target_container, data, nhoff, hlen); break; case IPPROTO_AH: __skb_flow_dissect_ah(skb, flow_dissector, target_container, data, nhoff, hlen); break; default: break; } if (!(key_control->flags & FLOW_DIS_IS_FRAGMENT)) __skb_flow_dissect_ports(skb, flow_dissector, target_container, data, nhoff, ip_proto, hlen); /* Process result of IP proto processing */ switch (fdret) { case FLOW_DISSECT_RET_PROTO_AGAIN: if (skb_flow_dissect_allowed(&num_hdrs)) goto proto_again; break; case FLOW_DISSECT_RET_IPPROTO_AGAIN: if (skb_flow_dissect_allowed(&num_hdrs)) goto ip_proto_again; break; case FLOW_DISSECT_RET_OUT_GOOD: case FLOW_DISSECT_RET_CONTINUE: break; case FLOW_DISSECT_RET_OUT_BAD: default: goto out_bad; } out_good: ret = true; out: key_control->thoff = min_t(u16, nhoff, skb ? skb->len : hlen); key_basic->n_proto = proto; key_basic->ip_proto = ip_proto; return ret; out_bad: ret = false; goto out; } EXPORT_SYMBOL(__skb_flow_dissect); static siphash_aligned_key_t hashrnd; static __always_inline void __flow_hash_secret_init(void) { net_get_random_once(&hashrnd, sizeof(hashrnd)); } static const void *flow_keys_hash_start(const struct flow_keys *flow) { BUILD_BUG_ON(FLOW_KEYS_HASH_OFFSET % SIPHASH_ALIGNMENT); return &flow->FLOW_KEYS_HASH_START_FIELD; } static inline size_t flow_keys_hash_length(const struct flow_keys *flow) { size_t diff = FLOW_KEYS_HASH_OFFSET + sizeof(flow->addrs); BUILD_BUG_ON((sizeof(*flow) - FLOW_KEYS_HASH_OFFSET) % sizeof(u32)); switch (flow->control.addr_type) { case FLOW_DISSECTOR_KEY_IPV4_ADDRS: diff -= sizeof(flow->addrs.v4addrs); break; case FLOW_DISSECTOR_KEY_IPV6_ADDRS: diff -= sizeof(flow->addrs.v6addrs); break; case FLOW_DISSECTOR_KEY_TIPC: diff -= sizeof(flow->addrs.tipckey); break; } return sizeof(*flow) - diff; } __be32 flow_get_u32_src(const struct flow_keys *flow) { switch (flow->control.addr_type) { case FLOW_DISSECTOR_KEY_IPV4_ADDRS: return flow->addrs.v4addrs.src; case FLOW_DISSECTOR_KEY_IPV6_ADDRS: return (__force __be32)ipv6_addr_hash( &flow->addrs.v6addrs.src); case FLOW_DISSECTOR_KEY_TIPC: return flow->addrs.tipckey.key; default: return 0; } } EXPORT_SYMBOL(flow_get_u32_src); __be32 flow_get_u32_dst(const struct flow_keys *flow) { switch (flow->control.addr_type) { case FLOW_DISSECTOR_KEY_IPV4_ADDRS: return flow->addrs.v4addrs.dst; case FLOW_DISSECTOR_KEY_IPV6_ADDRS: return (__force __be32)ipv6_addr_hash( &flow->addrs.v6addrs.dst); default: return 0; } } EXPORT_SYMBOL(flow_get_u32_dst); /* Sort the source and destination IP and the ports, * to have consistent hash within the two directions */ static inline void __flow_hash_consistentify(struct flow_keys *keys) { int addr_diff, i; switch (keys->control.addr_type) { case FLOW_DISSECTOR_KEY_IPV4_ADDRS: if ((__force u32)keys->addrs.v4addrs.dst < (__force u32)keys->addrs.v4addrs.src) swap(keys->addrs.v4addrs.src, keys->addrs.v4addrs.dst); if ((__force u16)keys->ports.dst < (__force u16)keys->ports.src) { swap(keys->ports.src, keys->ports.dst); } break; case FLOW_DISSECTOR_KEY_IPV6_ADDRS: addr_diff = memcmp(&keys->addrs.v6addrs.dst, &keys->addrs.v6addrs.src, sizeof(keys->addrs.v6addrs.dst)); if (addr_diff < 0) { for (i = 0; i < 4; i++) swap(keys->addrs.v6addrs.src.s6_addr32[i], keys->addrs.v6addrs.dst.s6_addr32[i]); } if ((__force u16)keys->ports.dst < (__force u16)keys->ports.src) { swap(keys->ports.src, keys->ports.dst); } break; } } static inline u32 __flow_hash_from_keys(struct flow_keys *keys, const siphash_key_t *keyval) { u32 hash; __flow_hash_consistentify(keys); hash = siphash(flow_keys_hash_start(keys), flow_keys_hash_length(keys), keyval); if (!hash) hash = 1; return hash; } u32 flow_hash_from_keys(struct flow_keys *keys) { __flow_hash_secret_init(); return __flow_hash_from_keys(keys, &hashrnd); } EXPORT_SYMBOL(flow_hash_from_keys); u32 flow_hash_from_keys_seed(struct flow_keys *keys, const siphash_key_t *keyval) { return __flow_hash_from_keys(keys, keyval); } EXPORT_SYMBOL(flow_hash_from_keys_seed); static inline u32 ___skb_get_hash(const struct sk_buff *skb, struct flow_keys *keys, const siphash_key_t *keyval) { skb_flow_dissect_flow_keys(skb, keys, FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL); return __flow_hash_from_keys(keys, keyval); } struct _flow_keys_digest_data { __be16 n_proto; u8 ip_proto; u8 padding; __be32 ports; __be32 src; __be32 dst; }; void make_flow_keys_digest(struct flow_keys_digest *digest, const struct flow_keys *flow) { struct _flow_keys_digest_data *data = (struct _flow_keys_digest_data *)digest; BUILD_BUG_ON(sizeof(*data) > sizeof(*digest)); memset(digest, 0, sizeof(*digest)); data->n_proto = flow->basic.n_proto; data->ip_proto = flow->basic.ip_proto; data->ports = flow->ports.ports; data->src = flow->addrs.v4addrs.src; data->dst = flow->addrs.v4addrs.dst; } EXPORT_SYMBOL(make_flow_keys_digest); static struct flow_dissector flow_keys_dissector_symmetric __read_mostly; u32 __skb_get_hash_symmetric_net(const struct net *net, const struct sk_buff *skb) { struct flow_keys keys; __flow_hash_secret_init(); memset(&keys, 0, sizeof(keys)); __skb_flow_dissect(net, skb, &flow_keys_dissector_symmetric, &keys, NULL, 0, 0, 0, 0); return __flow_hash_from_keys(&keys, &hashrnd); } EXPORT_SYMBOL_GPL(__skb_get_hash_symmetric_net); /** * __skb_get_hash_net: calculate a flow hash * @net: associated network namespace, derived from @skb if NULL * @skb: sk_buff to calculate flow hash from * * This function calculates a flow hash based on src/dst addresses * and src/dst port numbers. Sets hash in skb to non-zero hash value * on success, zero indicates no valid hash. Also, sets l4_hash in skb * if hash is a canonical 4-tuple hash over transport ports. */ void __skb_get_hash_net(const struct net *net, struct sk_buff *skb) { struct flow_keys keys; u32 hash; memset(&keys, 0, sizeof(keys)); __skb_flow_dissect(net, skb, &flow_keys_dissector, &keys, NULL, 0, 0, 0, FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL); __flow_hash_secret_init(); hash = __flow_hash_from_keys(&keys, &hashrnd); __skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys)); } EXPORT_SYMBOL(__skb_get_hash_net); __u32 skb_get_hash_perturb(const struct sk_buff *skb, const siphash_key_t *perturb) { struct flow_keys keys; return ___skb_get_hash(skb, &keys, perturb); } EXPORT_SYMBOL(skb_get_hash_perturb); u32 __skb_get_poff(const struct sk_buff *skb, const void *data, const struct flow_keys_basic *keys, int hlen) { u32 poff = keys->control.thoff; /* skip L4 headers for fragments after the first */ if ((keys->control.flags & FLOW_DIS_IS_FRAGMENT) && !(keys->control.flags & FLOW_DIS_FIRST_FRAG)) return poff; switch (keys->basic.ip_proto) { case IPPROTO_TCP: { /* access doff as u8 to avoid unaligned access */ const u8 *doff; u8 _doff; doff = __skb_header_pointer(skb, poff + 12, sizeof(_doff), data, hlen, &_doff); if (!doff) return poff; poff += max_t(u32, sizeof(struct tcphdr), (*doff & 0xF0) >> 2); break; } case IPPROTO_UDP: case IPPROTO_UDPLITE: poff += sizeof(struct udphdr); break; /* For the rest, we do not really care about header * extensions at this point for now. */ case IPPROTO_ICMP: poff += sizeof(struct icmphdr); break; case IPPROTO_ICMPV6: poff += sizeof(struct icmp6hdr); break; case IPPROTO_IGMP: poff += sizeof(struct igmphdr); break; case IPPROTO_DCCP: poff += sizeof(struct dccp_hdr); break; case IPPROTO_SCTP: poff += sizeof(struct sctphdr); break; } return poff; } /** * skb_get_poff - get the offset to the payload * @skb: sk_buff to get the payload offset from * * The function will get the offset to the payload as far as it could * be dissected. The main user is currently BPF, so that we can dynamically * truncate packets without needing to push actual payload to the user * space and can analyze headers only, instead. */ u32 skb_get_poff(const struct sk_buff *skb) { struct flow_keys_basic keys; if (!skb_flow_dissect_flow_keys_basic(NULL, skb, &keys, NULL, 0, 0, 0, 0)) return 0; return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb)); } __u32 __get_hash_from_flowi6(const struct flowi6 *fl6, struct flow_keys *keys) { memset(keys, 0, sizeof(*keys)); memcpy(&keys->addrs.v6addrs.src, &fl6->saddr, sizeof(keys->addrs.v6addrs.src)); memcpy(&keys->addrs.v6addrs.dst, &fl6->daddr, sizeof(keys->addrs.v6addrs.dst)); keys->control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS; keys->ports.src = fl6->fl6_sport; keys->ports.dst = fl6->fl6_dport; keys->keyid.keyid = fl6->fl6_gre_key; keys->tags.flow_label = (__force u32)flowi6_get_flowlabel(fl6); keys->basic.ip_proto = fl6->flowi6_proto; return flow_hash_from_keys(keys); } EXPORT_SYMBOL(__get_hash_from_flowi6); static const struct flow_dissector_key flow_keys_dissector_keys[] = { { .key_id = FLOW_DISSECTOR_KEY_CONTROL, .offset = offsetof(struct flow_keys, control), }, { .key_id = FLOW_DISSECTOR_KEY_BASIC, .offset = offsetof(struct flow_keys, basic), }, { .key_id = FLOW_DISSECTOR_KEY_IPV4_ADDRS, .offset = offsetof(struct flow_keys, addrs.v4addrs), }, { .key_id = FLOW_DISSECTOR_KEY_IPV6_ADDRS, .offset = offsetof(struct flow_keys, addrs.v6addrs), }, { .key_id = FLOW_DISSECTOR_KEY_TIPC, .offset = offsetof(struct flow_keys, addrs.tipckey), }, { .key_id = FLOW_DISSECTOR_KEY_PORTS, .offset = offsetof(struct flow_keys, ports), }, { .key_id = FLOW_DISSECTOR_KEY_VLAN, .offset = offsetof(struct flow_keys, vlan), }, { .key_id = FLOW_DISSECTOR_KEY_FLOW_LABEL, .offset = offsetof(struct flow_keys, tags), }, { .key_id = FLOW_DISSECTOR_KEY_GRE_KEYID, .offset = offsetof(struct flow_keys, keyid), }, }; static const struct flow_dissector_key flow_keys_dissector_symmetric_keys[] = { { .key_id = FLOW_DISSECTOR_KEY_CONTROL, .offset = offsetof(struct flow_keys, control), }, { .key_id = FLOW_DISSECTOR_KEY_BASIC, .offset = offsetof(struct flow_keys, basic), }, { .key_id = FLOW_DISSECTOR_KEY_IPV4_ADDRS, .offset = offsetof(struct flow_keys, addrs.v4addrs), }, { .key_id = FLOW_DISSECTOR_KEY_IPV6_ADDRS, .offset = offsetof(struct flow_keys, addrs.v6addrs), }, { .key_id = FLOW_DISSECTOR_KEY_PORTS, .offset = offsetof(struct flow_keys, ports), }, }; static const struct flow_dissector_key flow_keys_basic_dissector_keys[] = { { .key_id = FLOW_DISSECTOR_KEY_CONTROL, .offset = offsetof(struct flow_keys, control), }, { .key_id = FLOW_DISSECTOR_KEY_BASIC, .offset = offsetof(struct flow_keys, basic), }, }; struct flow_dissector flow_keys_dissector __read_mostly; EXPORT_SYMBOL(flow_keys_dissector); struct flow_dissector flow_keys_basic_dissector __read_mostly; EXPORT_SYMBOL(flow_keys_basic_dissector); static int __init init_default_flow_dissectors(void) { skb_flow_dissector_init(&flow_keys_dissector, flow_keys_dissector_keys, ARRAY_SIZE(flow_keys_dissector_keys)); skb_flow_dissector_init(&flow_keys_dissector_symmetric, flow_keys_dissector_symmetric_keys, ARRAY_SIZE(flow_keys_dissector_symmetric_keys)); skb_flow_dissector_init(&flow_keys_basic_dissector, flow_keys_basic_dissector_keys, ARRAY_SIZE(flow_keys_basic_dissector_keys)); return 0; } core_initcall(init_default_flow_dissectors); |
| 3 3 3 4 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 | // SPDX-License-Identifier: GPL-2.0 /* * transport_class.c - implementation of generic transport classes * using attribute_containers * * Copyright (c) 2005 - James Bottomley <James.Bottomley@steeleye.com> * * The basic idea here is to allow any "device controller" (which * would most often be a Host Bus Adapter to use the services of one * or more tranport classes for performing transport specific * services. Transport specific services are things that the generic * command layer doesn't want to know about (speed settings, line * condidtioning, etc), but which the user might be interested in. * Thus, the HBA's use the routines exported by the transport classes * to perform these functions. The transport classes export certain * values to the user via sysfs using attribute containers. * * Note: because not every HBA will care about every transport * attribute, there's a many to one relationship that goes like this: * * transport class<-----attribute container<----class device * * Usually the attribute container is per-HBA, but the design doesn't * mandate that. Although most of the services will be specific to * the actual external storage connection used by the HBA, the generic * transport class is framed entirely in terms of generic devices to * allow it to be used by any physical HBA in the system. */ #include <linux/export.h> #include <linux/attribute_container.h> #include <linux/transport_class.h> static int transport_remove_classdev(struct attribute_container *cont, struct device *dev, struct device *classdev); /** * transport_class_register - register an initial transport class * * @tclass: a pointer to the transport class structure to be initialised * * The transport class contains an embedded class which is used to * identify it. The caller should initialise this structure with * zeros and then generic class must have been initialised with the * actual transport class unique name. There's a macro * DECLARE_TRANSPORT_CLASS() to do this (declared classes still must * be registered). * * Returns 0 on success or error on failure. */ int transport_class_register(struct transport_class *tclass) { return class_register(&tclass->class); } EXPORT_SYMBOL_GPL(transport_class_register); /** * transport_class_unregister - unregister a previously registered class * * @tclass: The transport class to unregister * * Must be called prior to deallocating the memory for the transport * class. */ void transport_class_unregister(struct transport_class *tclass) { class_unregister(&tclass->class); } EXPORT_SYMBOL_GPL(transport_class_unregister); static int anon_transport_dummy_function(struct transport_container *tc, struct device *dev, struct device *cdev) { /* do nothing */ return 0; } /** * anon_transport_class_register - register an anonymous class * * @atc: The anon transport class to register * * The anonymous transport class contains both a transport class and a * container. The idea of an anonymous class is that it never * actually has any device attributes associated with it (and thus * saves on container storage). So it can only be used for triggering * events. Use prezero and then use DECLARE_ANON_TRANSPORT_CLASS() to * initialise the anon transport class storage. */ int anon_transport_class_register(struct anon_transport_class *atc) { int error; atc->container.class = &atc->tclass.class; attribute_container_set_no_classdevs(&atc->container); error = attribute_container_register(&atc->container); if (error) return error; atc->tclass.setup = anon_transport_dummy_function; atc->tclass.remove = anon_transport_dummy_function; return 0; } EXPORT_SYMBOL_GPL(anon_transport_class_register); /** * anon_transport_class_unregister - unregister an anon class * * @atc: Pointer to the anon transport class to unregister * * Must be called prior to deallocating the memory for the anon * transport class. */ void anon_transport_class_unregister(struct anon_transport_class *atc) { if (unlikely(attribute_container_unregister(&atc->container))) BUG(); } EXPORT_SYMBOL_GPL(anon_transport_class_unregister); static int transport_setup_classdev(struct attribute_container *cont, struct device *dev, struct device *classdev) { struct transport_class *tclass = class_to_transport_class(cont->class); struct transport_container *tcont = attribute_container_to_transport_container(cont); if (tclass->setup) tclass->setup(tcont, dev, classdev); return 0; } /** * transport_setup_device - declare a new dev for transport class association but don't make it visible yet. * @dev: the generic device representing the entity being added * * Usually, dev represents some component in the HBA system (either * the HBA itself or a device remote across the HBA bus). This * routine is simply a trigger point to see if any set of transport * classes wishes to associate with the added device. This allocates * storage for the class device and initialises it, but does not yet * add it to the system or add attributes to it (you do this with * transport_add_device). If you have no need for a separate setup * and add operations, use transport_register_device (see * transport_class.h). */ void transport_setup_device(struct device *dev) { attribute_container_add_device(dev, transport_setup_classdev); } EXPORT_SYMBOL_GPL(transport_setup_device); static int transport_add_class_device(struct attribute_container *cont, struct device *dev, struct device *classdev) { struct transport_class *tclass = class_to_transport_class(cont->class); int error = attribute_container_add_class_device(classdev); struct transport_container *tcont = attribute_container_to_transport_container(cont); if (error) goto err_remove; if (tcont->statistics) { error = sysfs_create_group(&classdev->kobj, tcont->statistics); if (error) goto err_del; } return 0; err_del: attribute_container_class_device_del(classdev); err_remove: if (tclass->remove) tclass->remove(tcont, dev, classdev); return error; } /** * transport_add_device - declare a new dev for transport class association * * @dev: the generic device representing the entity being added * * Usually, dev represents some component in the HBA system (either * the HBA itself or a device remote across the HBA bus). This * routine is simply a trigger point used to add the device to the * system and register attributes for it. */ int transport_add_device(struct device *dev) { return attribute_container_device_trigger_safe(dev, transport_add_class_device, transport_remove_classdev); } EXPORT_SYMBOL_GPL(transport_add_device); static int transport_configure(struct attribute_container *cont, struct device *dev, struct device *cdev) { struct transport_class *tclass = class_to_transport_class(cont->class); struct transport_container *tcont = attribute_container_to_transport_container(cont); if (tclass->configure) tclass->configure(tcont, dev, cdev); return 0; } /** * transport_configure_device - configure an already set up device * * @dev: generic device representing device to be configured * * The idea of configure is simply to provide a point within the setup * process to allow the transport class to extract information from a * device after it has been setup. This is used in SCSI because we * have to have a setup device to begin using the HBA, but after we * send the initial inquiry, we use configure to extract the device * parameters. The device need not have been added to be configured. */ void transport_configure_device(struct device *dev) { attribute_container_device_trigger(dev, transport_configure); } EXPORT_SYMBOL_GPL(transport_configure_device); static int transport_remove_classdev(struct attribute_container *cont, struct device *dev, struct device *classdev) { struct transport_container *tcont = attribute_container_to_transport_container(cont); struct transport_class *tclass = class_to_transport_class(cont->class); if (tclass->remove) tclass->remove(tcont, dev, classdev); if (tclass->remove != anon_transport_dummy_function) { if (tcont->statistics) sysfs_remove_group(&classdev->kobj, tcont->statistics); attribute_container_class_device_del(classdev); } return 0; } /** * transport_remove_device - remove the visibility of a device * * @dev: generic device to remove * * This call removes the visibility of the device (to the user from * sysfs), but does not destroy it. To eliminate a device entirely * you must also call transport_destroy_device. If you don't need to * do remove and destroy as separate operations, use * transport_unregister_device() (see transport_class.h) which will * perform both calls for you. */ void transport_remove_device(struct device *dev) { attribute_container_device_trigger(dev, transport_remove_classdev); } EXPORT_SYMBOL_GPL(transport_remove_device); static void transport_destroy_classdev(struct attribute_container *cont, struct device *dev, struct device *classdev) { struct transport_class *tclass = class_to_transport_class(cont->class); if (tclass->remove != anon_transport_dummy_function) put_device(classdev); } /** * transport_destroy_device - destroy a removed device * * @dev: device to eliminate from the transport class. * * This call triggers the elimination of storage associated with the * transport classdev. Note: all it really does is relinquish a * reference to the classdev. The memory will not be freed until the * last reference goes to zero. Note also that the classdev retains a * reference count on dev, so dev too will remain for as long as the * transport class device remains around. */ void transport_destroy_device(struct device *dev) { attribute_container_remove_device(dev, transport_destroy_classdev); } EXPORT_SYMBOL_GPL(transport_destroy_device); |
| 5 2 2 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* * linux/include/linux/jbd2.h * * Written by Stephen C. Tweedie <sct@redhat.com> * * Copyright 1998-2000 Red Hat, Inc --- All Rights Reserved * * Definitions for transaction data structures for the buffer cache * filesystem journaling support. */ #ifndef _LINUX_JBD2_H #define _LINUX_JBD2_H /* Allow this file to be included directly into e2fsprogs */ #ifndef __KERNEL__ #include "jfs_compat.h" #define JBD2_DEBUG #else #include <linux/types.h> #include <linux/buffer_head.h> #include <linux/journal-head.h> #include <linux/stddef.h> #include <linux/mutex.h> #include <linux/timer.h> #include <linux/slab.h> #include <linux/bit_spinlock.h> #include <linux/blkdev.h> #include <linux/crc32c.h> #endif #define journal_oom_retry 1 /* * Define JBD2_PARANIOD_IOFAIL to cause a kernel BUG() if ext4 finds * certain classes of error which can occur due to failed IOs. Under * normal use we want ext4 to continue after such errors, because * hardware _can_ fail, but for debugging purposes when running tests on * known-good hardware we may want to trap these errors. */ #undef JBD2_PARANOID_IOFAIL /* * The default maximum commit age, in seconds. */ #define JBD2_DEFAULT_MAX_COMMIT_AGE 5 #ifdef CONFIG_JBD2_DEBUG /* * Define JBD2_EXPENSIVE_CHECKING to enable more expensive internal * consistency checks. By default we don't do this unless * CONFIG_JBD2_DEBUG is on. */ #define JBD2_EXPENSIVE_CHECKING void __jbd2_debug(int level, const char *file, const char *func, unsigned int line, const char *fmt, ...); #define jbd2_debug(n, fmt, a...) \ __jbd2_debug((n), __FILE__, __func__, __LINE__, (fmt), ##a) #else #define jbd2_debug(n, fmt, a...) no_printk(fmt, ##a) #endif extern void *jbd2_alloc(size_t size, gfp_t flags); extern void jbd2_free(void *ptr, size_t size); #define JBD2_MIN_JOURNAL_BLOCKS 1024 #define JBD2_DEFAULT_FAST_COMMIT_BLOCKS 256 #ifdef __KERNEL__ /** * typedef handle_t - The handle_t type represents a single atomic update being performed by some process. * * All filesystem modifications made by the process go * through this handle. Recursive operations (such as quota operations) * are gathered into a single update. * * The buffer credits field is used to account for journaled buffers * being modified by the running process. To ensure that there is * enough log space for all outstanding operations, we need to limit the * number of outstanding buffers possible at any time. When the * operation completes, any buffer credits not used are credited back to * the transaction, so that at all times we know how many buffers the * outstanding updates on a transaction might possibly touch. * * This is an opaque datatype. **/ typedef struct jbd2_journal_handle handle_t; /* Atomic operation type */ /** * typedef journal_t - The journal_t maintains all of the journaling state information for a single filesystem. * * journal_t is linked to from the fs superblock structure. * * We use the journal_t to keep track of all outstanding transaction * activity on the filesystem, and to manage the state of the log * writing process. * * This is an opaque datatype. **/ typedef struct journal_s journal_t; /* Journal control structure */ #endif /* * Internal structures used by the logging mechanism: */ #define JBD2_MAGIC_NUMBER 0xc03b3998U /* The first 4 bytes of /dev/random! */ /* * On-disk structures */ /* * Descriptor block types: */ #define JBD2_DESCRIPTOR_BLOCK 1 #define JBD2_COMMIT_BLOCK 2 #define JBD2_SUPERBLOCK_V1 3 #define JBD2_SUPERBLOCK_V2 4 #define JBD2_REVOKE_BLOCK 5 /* * Standard header for all descriptor blocks: */ typedef struct journal_header_s { __be32 h_magic; __be32 h_blocktype; __be32 h_sequence; } journal_header_t; /* * Checksum types. */ #define JBD2_CRC32_CHKSUM 1 #define JBD2_MD5_CHKSUM 2 #define JBD2_SHA1_CHKSUM 3 #define JBD2_CRC32C_CHKSUM 4 #define JBD2_CRC32_CHKSUM_SIZE 4 #define JBD2_CHECKSUM_BYTES (32 / sizeof(u32)) /* * Commit block header for storing transactional checksums: * * NOTE: If FEATURE_COMPAT_CHECKSUM (checksum v1) is set, the h_chksum* * fields are used to store a checksum of the descriptor and data blocks. * * If FEATURE_INCOMPAT_CSUM_V2 (checksum v2) is set, then the h_chksum * field is used to store crc32c(uuid+commit_block). Each journal metadata * block gets its own checksum, and data block checksums are stored in * journal_block_tag (in the descriptor). The other h_chksum* fields are * not used. * * If FEATURE_INCOMPAT_CSUM_V3 is set, the descriptor block uses * journal_block_tag3_t to store a full 32-bit checksum. Everything else * is the same as v2. * * Checksum v1, v2, and v3 are mutually exclusive features. */ struct commit_header { __be32 h_magic; __be32 h_blocktype; __be32 h_sequence; unsigned char h_chksum_type; unsigned char h_chksum_size; unsigned char h_padding[2]; __be32 h_chksum[JBD2_CHECKSUM_BYTES]; __be64 h_commit_sec; __be32 h_commit_nsec; }; /* * The block tag: used to describe a single buffer in the journal. * t_blocknr_high is only used if INCOMPAT_64BIT is set, so this * raw struct shouldn't be used for pointer math or sizeof() - use * journal_tag_bytes(journal) instead to compute this. */ typedef struct journal_block_tag3_s { __be32 t_blocknr; /* The on-disk block number */ __be32 t_flags; /* See below */ __be32 t_blocknr_high; /* most-significant high 32bits. */ __be32 t_checksum; /* crc32c(uuid+seq+block) */ } journal_block_tag3_t; typedef struct journal_block_tag_s { __be32 t_blocknr; /* The on-disk block number */ __be16 t_checksum; /* truncated crc32c(uuid+seq+block) */ __be16 t_flags; /* See below */ __be32 t_blocknr_high; /* most-significant high 32bits. */ } journal_block_tag_t; /* Tail of descriptor or revoke block, for checksumming */ struct jbd2_journal_block_tail { __be32 t_checksum; /* crc32c(uuid+descr_block) */ }; /* * The revoke descriptor: used on disk to describe a series of blocks to * be revoked from the log */ typedef struct jbd2_journal_revoke_header_s { journal_header_t r_header; __be32 r_count; /* Count of bytes used in the block */ } jbd2_journal_revoke_header_t; /* Definitions for the journal tag flags word: */ #define JBD2_FLAG_ESCAPE 1 /* on-disk block is escaped */ #define JBD2_FLAG_SAME_UUID 2 /* block has same uuid as previous */ #define JBD2_FLAG_DELETED 4 /* block deleted by this transaction */ #define JBD2_FLAG_LAST_TAG 8 /* last tag in this descriptor block */ /* * The journal superblock. All fields are in big-endian byte order. */ typedef struct journal_superblock_s { /* 0x0000 */ journal_header_t s_header; /* 0x000C */ /* Static information describing the journal */ __be32 s_blocksize; /* journal device blocksize */ __be32 s_maxlen; /* total blocks in journal file */ __be32 s_first; /* first block of log information */ /* 0x0018 */ /* Dynamic information describing the current state of the log */ __be32 s_sequence; /* first commit ID expected in log */ __be32 s_start; /* blocknr of start of log */ /* 0x0020 */ /* Error value, as set by jbd2_journal_abort(). */ __be32 s_errno; /* 0x0024 */ /* Remaining fields are only valid in a version-2 superblock */ __be32 s_feature_compat; /* compatible feature set */ __be32 s_feature_incompat; /* incompatible feature set */ __be32 s_feature_ro_compat; /* readonly-compatible feature set */ /* 0x0030 */ __u8 s_uuid[16]; /* 128-bit uuid for journal */ /* 0x0040 */ __be32 s_nr_users; /* Nr of filesystems sharing log */ __be32 s_dynsuper; /* Blocknr of dynamic superblock copy*/ /* 0x0048 */ __be32 s_max_transaction; /* Limit of journal blocks per trans.*/ __be32 s_max_trans_data; /* Limit of data blocks per trans. */ /* 0x0050 */ __u8 s_checksum_type; /* checksum type */ __u8 s_padding2[3]; /* 0x0054 */ __be32 s_num_fc_blks; /* Number of fast commit blocks */ __be32 s_head; /* blocknr of head of log, only uptodate * while the filesystem is clean */ /* 0x005C */ __u32 s_padding[40]; __be32 s_checksum; /* crc32c(superblock) */ /* 0x0100 */ __u8 s_users[16*48]; /* ids of all fs'es sharing the log */ /* 0x0400 */ } journal_superblock_t; #define JBD2_FEATURE_COMPAT_CHECKSUM 0x00000001 #define JBD2_FEATURE_INCOMPAT_REVOKE 0x00000001 #define JBD2_FEATURE_INCOMPAT_64BIT 0x00000002 #define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT 0x00000004 #define JBD2_FEATURE_INCOMPAT_CSUM_V2 0x00000008 #define JBD2_FEATURE_INCOMPAT_CSUM_V3 0x00000010 #define JBD2_FEATURE_INCOMPAT_FAST_COMMIT 0x00000020 /* See "journal feature predicate functions" below */ /* Features known to this kernel version: */ #define JBD2_KNOWN_COMPAT_FEATURES JBD2_FEATURE_COMPAT_CHECKSUM #define JBD2_KNOWN_ROCOMPAT_FEATURES 0 #define JBD2_KNOWN_INCOMPAT_FEATURES (JBD2_FEATURE_INCOMPAT_REVOKE | \ JBD2_FEATURE_INCOMPAT_64BIT | \ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT | \ JBD2_FEATURE_INCOMPAT_CSUM_V2 | \ JBD2_FEATURE_INCOMPAT_CSUM_V3 | \ JBD2_FEATURE_INCOMPAT_FAST_COMMIT) #ifdef __KERNEL__ #include <linux/fs.h> #include <linux/sched.h> enum jbd_state_bits { BH_JBD /* Has an attached ext3 journal_head */ = BH_PrivateStart, BH_JWrite, /* Being written to log (@@@ DEBUGGING) */ BH_Freed, /* Has been freed (truncated) */ BH_Revoked, /* Has been revoked from the log */ BH_RevokeValid, /* Revoked flag is valid */ BH_JBDDirty, /* Is dirty but journaled */ BH_JournalHead, /* Pins bh->b_private and jh->b_bh */ BH_Shadow, /* IO on shadow buffer is running */ BH_Verified, /* Metadata block has been verified ok */ BH_JBDPrivateStart, /* First bit available for private use by FS */ }; BUFFER_FNS(JBD, jbd) BUFFER_FNS(JWrite, jwrite) BUFFER_FNS(JBDDirty, jbddirty) TAS_BUFFER_FNS(JBDDirty, jbddirty) BUFFER_FNS(Revoked, revoked) TAS_BUFFER_FNS(Revoked, revoked) BUFFER_FNS(RevokeValid, revokevalid) TAS_BUFFER_FNS(RevokeValid, revokevalid) BUFFER_FNS(Freed, freed) BUFFER_FNS(Shadow, shadow) BUFFER_FNS(Verified, verified) static inline struct buffer_head *jh2bh(struct journal_head *jh) { return jh->b_bh; } static inline struct journal_head *bh2jh(struct buffer_head *bh) { return bh->b_private; } static inline void jbd_lock_bh_journal_head(struct buffer_head *bh) { bit_spin_lock(BH_JournalHead, &bh->b_state); } static inline void jbd_unlock_bh_journal_head(struct buffer_head *bh) { bit_spin_unlock(BH_JournalHead, &bh->b_state); } #define J_ASSERT(assert) BUG_ON(!(assert)) #define J_ASSERT_BH(bh, expr) J_ASSERT(expr) #define J_ASSERT_JH(jh, expr) J_ASSERT(expr) #if defined(JBD2_PARANOID_IOFAIL) #define J_EXPECT(expr, why...) J_ASSERT(expr) #define J_EXPECT_BH(bh, expr, why...) J_ASSERT_BH(bh, expr) #define J_EXPECT_JH(jh, expr, why...) J_ASSERT_JH(jh, expr) #else #define __journal_expect(expr, why...) \ ({ \ int val = (expr); \ if (!val) { \ printk(KERN_ERR \ "JBD2 unexpected failure: %s: %s;\n", \ __func__, #expr); \ printk(KERN_ERR why "\n"); \ } \ val; \ }) #define J_EXPECT(expr, why...) __journal_expect(expr, ## why) #define J_EXPECT_BH(bh, expr, why...) __journal_expect(expr, ## why) #define J_EXPECT_JH(jh, expr, why...) __journal_expect(expr, ## why) #endif /* Flags in jbd_inode->i_flags */ #define __JI_COMMIT_RUNNING 0 #define __JI_WRITE_DATA 1 #define __JI_WAIT_DATA 2 /* * Commit of the inode data in progress. We use this flag to protect us from * concurrent deletion of inode. We cannot use reference to inode for this * since we cannot afford doing last iput() on behalf of kjournald */ #define JI_COMMIT_RUNNING (1 << __JI_COMMIT_RUNNING) /* Write allocated dirty buffers in this inode before commit */ #define JI_WRITE_DATA (1 << __JI_WRITE_DATA) /* Wait for outstanding data writes for this inode before commit */ #define JI_WAIT_DATA (1 << __JI_WAIT_DATA) /** * struct jbd2_inode - The jbd_inode type is the structure linking inodes in * ordered mode present in a transaction so that we can sync them during commit. */ struct jbd2_inode { /** * @i_transaction: * * Which transaction does this inode belong to? Either the running * transaction or the committing one. [j_list_lock] */ transaction_t *i_transaction; /** * @i_next_transaction: * * Pointer to the running transaction modifying inode's data in case * there is already a committing transaction touching it. [j_list_lock] */ transaction_t *i_next_transaction; /** * @i_list: List of inodes in the i_transaction [j_list_lock] */ struct list_head i_list; /** * @i_vfs_inode: * * VFS inode this inode belongs to [constant for lifetime of structure] */ struct inode *i_vfs_inode; /** * @i_flags: Flags of inode [j_list_lock] */ unsigned long i_flags; /** * @i_dirty_start: * * Offset in bytes where the dirty range for this inode starts. * [j_list_lock] */ loff_t i_dirty_start; /** * @i_dirty_end: * * Inclusive offset in bytes where the dirty range for this inode * ends. [j_list_lock] */ loff_t i_dirty_end; }; struct jbd2_revoke_table_s; /** * struct jbd2_journal_handle - The jbd2_journal_handle type is the concrete * type associated with handle_t. * @h_transaction: Which compound transaction is this update a part of? * @h_journal: Which journal handle belongs to - used iff h_reserved set. * @h_rsv_handle: Handle reserved for finishing the logical operation. * @h_total_credits: Number of remaining buffers we are allowed to add to * journal. These are dirty buffers and revoke descriptor blocks. * @h_revoke_credits: Number of remaining revoke records available for handle * @h_ref: Reference count on this handle. * @h_err: Field for caller's use to track errors through large fs operations. * @h_sync: Flag for sync-on-close. * @h_reserved: Flag for handle for reserved credits. * @h_aborted: Flag indicating fatal error on handle. * @h_type: For handle statistics. * @h_line_no: For handle statistics. * @h_start_jiffies: Handle Start time. * @h_requested_credits: Holds @h_total_credits after handle is started. * @h_revoke_credits_requested: Holds @h_revoke_credits after handle is started. * @saved_alloc_context: Saved context while transaction is open. **/ /* Docbook can't yet cope with the bit fields, but will leave the documentation * in so it can be fixed later. */ struct jbd2_journal_handle { union { transaction_t *h_transaction; /* Which journal handle belongs to - used iff h_reserved set */ journal_t *h_journal; }; handle_t *h_rsv_handle; int h_total_credits; int h_revoke_credits; int h_revoke_credits_requested; int h_ref; int h_err; /* Flags [no locking] */ unsigned int h_sync: 1; unsigned int h_reserved: 1; unsigned int h_aborted: 1; unsigned int h_type: 8; unsigned int h_line_no: 16; unsigned long h_start_jiffies; unsigned int h_requested_credits; unsigned int saved_alloc_context; }; /* * Some stats for checkpoint phase */ struct transaction_chp_stats_s { unsigned long cs_chp_time; __u32 cs_forced_to_close; __u32 cs_written; __u32 cs_dropped; }; /* The transaction_t type is the guts of the journaling mechanism. It * tracks a compound transaction through its various states: * * RUNNING: accepting new updates * LOCKED: Updates still running but we don't accept new ones * RUNDOWN: Updates are tidying up but have finished requesting * new buffers to modify (state not used for now) * FLUSH: All updates complete, but we are still writing to disk * COMMIT: All data on disk, writing commit record * FINISHED: We still have to keep the transaction for checkpointing. * * The transaction keeps track of all of the buffers modified by a * running transaction, and all of the buffers committed but not yet * flushed to home for finished transactions. * (Locking Documentation improved by LockDoc) */ /* * Lock ranking: * * j_list_lock * ->jbd_lock_bh_journal_head() (This is "innermost") * * j_state_lock * ->b_state_lock * * b_state_lock * ->j_list_lock * * j_state_lock * ->j_list_lock (journal_unmap_buffer) * */ struct transaction_s { /* Pointer to the journal for this transaction. [no locking] */ journal_t *t_journal; /* Sequence number for this transaction [no locking] */ tid_t t_tid; /* * Transaction's current state * [no locking - only kjournald2 alters this] * [j_list_lock] guards transition of a transaction into T_FINISHED * state and subsequent call of __jbd2_journal_drop_transaction() * FIXME: needs barriers * KLUDGE: [use j_state_lock] */ enum { T_RUNNING, T_LOCKED, T_SWITCH, T_FLUSH, T_COMMIT, T_COMMIT_DFLUSH, T_COMMIT_JFLUSH, T_COMMIT_CALLBACK, T_FINISHED } t_state; /* * Where in the log does this transaction's commit start? [no locking] */ unsigned long t_log_start; /* * Number of buffers on the t_buffers list [j_list_lock, no locks * needed for jbd2 thread] */ int t_nr_buffers; /* * Doubly-linked circular list of all buffers reserved but not yet * modified by this transaction [j_list_lock, no locks needed fo * jbd2 thread] */ struct journal_head *t_reserved_list; /* * Doubly-linked circular list of all metadata buffers owned by this * transaction [j_list_lock, no locks needed for jbd2 thread] */ struct journal_head *t_buffers; /* * Doubly-linked circular list of all forget buffers (superseded * buffers which we can un-checkpoint once this transaction commits) * [j_list_lock] */ struct journal_head *t_forget; /* * Doubly-linked circular list of all buffers still to be flushed before * this transaction can be checkpointed. [j_list_lock] */ struct journal_head *t_checkpoint_list; /* * Doubly-linked circular list of metadata buffers being * shadowed by log IO. The IO buffers on the iobuf list and * the shadow buffers on this list match each other one for * one at all times. [j_list_lock, no locks needed for jbd2 * thread] */ struct journal_head *t_shadow_list; /* * List of inodes associated with the transaction; e.g., ext4 uses * this to track inodes in data=ordered and data=journal mode that * need special handling on transaction commit; also used by ocfs2. * [j_list_lock] */ struct list_head t_inode_list; /* * Longest time some handle had to wait for running transaction */ unsigned long t_max_wait; /* * When transaction started */ unsigned long t_start; /* * When commit was requested [j_state_lock] */ unsigned long t_requested; /* * Checkpointing stats [j_list_lock] */ struct transaction_chp_stats_s t_chp_stats; /* * Number of outstanding updates running on this transaction * [none] */ atomic_t t_updates; /* * Number of blocks reserved for this transaction in the journal. * This is including all credits reserved when starting transaction * handles as well as all journal descriptor blocks needed for this * transaction. [none] */ atomic_t t_outstanding_credits; /* * Number of revoke records for this transaction added by already * stopped handles. [none] */ atomic_t t_outstanding_revokes; /* * How many handles used this transaction? [none] */ atomic_t t_handle_count; /* * Forward and backward links for the circular list of all transactions * awaiting checkpoint. [j_list_lock] */ transaction_t *t_cpnext, *t_cpprev; /* * When will the transaction expire (become due for commit), in jiffies? * [no locking] */ unsigned long t_expires; /* * When this transaction started, in nanoseconds [no locking] */ ktime_t t_start_time; /* * This transaction is being forced and some process is * waiting for it to finish. */ unsigned int t_synchronous_commit:1; /* Disk flush needs to be sent to fs partition [no locking] */ int t_need_data_flush; }; struct transaction_run_stats_s { unsigned long rs_wait; unsigned long rs_request_delay; unsigned long rs_running; unsigned long rs_locked; unsigned long rs_flushing; unsigned long rs_logging; __u32 rs_handle_count; __u32 rs_blocks; __u32 rs_blocks_logged; }; struct transaction_stats_s { unsigned long ts_tid; unsigned long ts_requested; struct transaction_run_stats_s run; }; static inline unsigned long jbd2_time_diff(unsigned long start, unsigned long end) { if (end >= start) return end - start; return end + (MAX_JIFFY_OFFSET - start); } #define JBD2_NR_BATCH 64 enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY}; #define JBD2_FC_REPLAY_STOP 0 #define JBD2_FC_REPLAY_CONTINUE 1 /** * struct journal_s - The journal_s type is the concrete type associated with * journal_t. */ struct journal_s { /** * @j_flags: General journaling state flags [j_state_lock, * no lock for quick racy checks] */ unsigned long j_flags; /** * @j_errno: * * Is there an outstanding uncleared error on the journal (from a prior * abort)? [j_state_lock] */ int j_errno; /** * @j_abort_mutex: Lock the whole aborting procedure. */ struct mutex j_abort_mutex; /** * @j_sb_buffer: The first part of the superblock buffer. */ struct buffer_head *j_sb_buffer; /** * @j_superblock: The second part of the superblock buffer. */ journal_superblock_t *j_superblock; /** * @j_state_lock: Protect the various scalars in the journal. */ rwlock_t j_state_lock; /** * @j_barrier_count: * * Number of processes waiting to create a barrier lock [j_state_lock, * no lock for quick racy checks] */ int j_barrier_count; /** * @j_barrier: The barrier lock itself. */ struct mutex j_barrier; /** * @j_running_transaction: * * Transactions: The current running transaction... * [j_state_lock, no lock for quick racy checks] [caller holding * open handle] */ transaction_t *j_running_transaction; /** * @j_committing_transaction: * * the transaction we are pushing to disk * [j_state_lock] [caller holding open handle] */ transaction_t *j_committing_transaction; /** * @j_checkpoint_transactions: * * ... and a linked circular list of all transactions waiting for * checkpointing. [j_list_lock] */ transaction_t *j_checkpoint_transactions; /** * @j_wait_transaction_locked: * * Wait queue for waiting for a locked transaction to start committing, * or for a barrier lock to be released. */ wait_queue_head_t j_wait_transaction_locked; /** * @j_wait_done_commit: Wait queue for waiting for commit to complete. */ wait_queue_head_t j_wait_done_commit; /** * @j_wait_commit: Wait queue to trigger commit. */ wait_queue_head_t j_wait_commit; /** * @j_wait_updates: Wait queue to wait for updates to complete. */ wait_queue_head_t j_wait_updates; /** * @j_wait_reserved: * * Wait queue to wait for reserved buffer credits to drop. */ wait_queue_head_t j_wait_reserved; /** * @j_fc_wait: * * Wait queue to wait for completion of async fast commits. */ wait_queue_head_t j_fc_wait; /** * @j_checkpoint_mutex: * * Semaphore for locking against concurrent checkpoints. */ struct mutex j_checkpoint_mutex; /** * @j_chkpt_bhs: * * List of buffer heads used by the checkpoint routine. This * was moved from jbd2_log_do_checkpoint() to reduce stack * usage. Access to this array is controlled by the * @j_checkpoint_mutex. [j_checkpoint_mutex] */ struct buffer_head *j_chkpt_bhs[JBD2_NR_BATCH]; /** * @j_shrinker: * * Journal head shrinker, reclaim buffer's journal head which * has been written back. */ struct shrinker *j_shrinker; /** * @j_checkpoint_jh_count: * * Number of journal buffers on the checkpoint list. [j_list_lock] */ struct percpu_counter j_checkpoint_jh_count; /** * @j_shrink_transaction: * * Record next transaction will shrink on the checkpoint list. * [j_list_lock] */ transaction_t *j_shrink_transaction; /** * @j_head: * * Journal head: identifies the first unused block in the journal. * [j_state_lock] */ unsigned long j_head; /** * @j_tail: * * Journal tail: identifies the oldest still-used block in the journal. * [j_state_lock] */ unsigned long j_tail; /** * @j_free: * * Journal free: how many free blocks are there in the journal? * [j_state_lock] */ unsigned long j_free; /** * @j_first: * * The block number of the first usable block in the journal * [j_state_lock]. */ unsigned long j_first; /** * @j_last: * * The block number one beyond the last usable block in the journal * [j_state_lock]. */ unsigned long j_last; /** * @j_fc_first: * * The block number of the first fast commit block in the journal * [j_state_lock]. */ unsigned long j_fc_first; /** * @j_fc_off: * * Number of fast commit blocks currently allocated. Accessed only * during fast commit. Currently only process can do fast commit, so * this field is not protected by any lock. */ unsigned long j_fc_off; /** * @j_fc_last: * * The block number one beyond the last fast commit block in the journal * [j_state_lock]. */ unsigned long j_fc_last; /** * @j_dev: Device where we store the journal. */ struct block_device *j_dev; /** * @j_blocksize: Block size for the location where we store the journal. */ int j_blocksize; /** * @j_blk_offset: * * Starting block offset into the device where we store the journal. */ unsigned long long j_blk_offset; /** * @j_devname: Journal device name. */ char j_devname[BDEVNAME_SIZE+24]; /** * @j_fs_dev: * * Device which holds the client fs. For internal journal this will be * equal to j_dev. */ struct block_device *j_fs_dev; /** * @j_fs_dev_wb_err: * * Records the errseq of the client fs's backing block device. */ errseq_t j_fs_dev_wb_err; /** * @j_total_len: Total maximum capacity of the journal region on disk. */ unsigned int j_total_len; /** * @j_reserved_credits: * * Number of buffers reserved from the running transaction. */ atomic_t j_reserved_credits; /** * @j_list_lock: Protects the buffer lists and internal buffer state. */ spinlock_t j_list_lock; /** * @j_inode: * * Optional inode where we store the journal. If present, all * journal block numbers are mapped into this inode via bmap(). */ struct inode *j_inode; /** * @j_tail_sequence: * * Sequence number of the oldest transaction in the log [j_state_lock] */ tid_t j_tail_sequence; /** * @j_transaction_sequence: * * Sequence number of the next transaction to grant [j_state_lock] */ tid_t j_transaction_sequence; /** * @j_commit_sequence: * * Sequence number of the most recently committed transaction * [j_state_lock, no lock for quick racy checks] */ tid_t j_commit_sequence; /** * @j_commit_request: * * Sequence number of the most recent transaction wanting commit * [j_state_lock, no lock for quick racy checks] */ tid_t j_commit_request; /** * @j_uuid: * * Journal uuid: identifies the object (filesystem, LVM volume etc) * backed by this journal. This will eventually be replaced by an array * of uuids, allowing us to index multiple devices within a single * journal and to perform atomic updates across them. */ __u8 j_uuid[16]; /** * @j_task: Pointer to the current commit thread for this journal. */ struct task_struct *j_task; /** * @j_max_transaction_buffers: * * Maximum number of metadata buffers to allow in a single compound * commit transaction. */ int j_max_transaction_buffers; /** * @j_revoke_records_per_block: * * Number of revoke records that fit in one descriptor block. */ int j_revoke_records_per_block; /** * @j_transaction_overhead_buffers: * * Number of blocks each transaction needs for its own bookkeeping */ int j_transaction_overhead_buffers; /** * @j_commit_interval: * * What is the maximum transaction lifetime before we begin a commit? */ unsigned long j_commit_interval; /** * @j_commit_timer: The timer used to wakeup the commit thread. */ struct timer_list j_commit_timer; /** * @j_revoke_lock: Protect the revoke table. */ spinlock_t j_revoke_lock; /** * @j_revoke: * * The revoke table - maintains the list of revoked blocks in the * current transaction. */ struct jbd2_revoke_table_s *j_revoke; /** * @j_revoke_table: Alternate revoke tables for j_revoke. */ struct jbd2_revoke_table_s *j_revoke_table[2]; /** * @j_wbuf: Array of bhs for jbd2_journal_commit_transaction. */ struct buffer_head **j_wbuf; /** * @j_fc_wbuf: Array of fast commit bhs for fast commit. Accessed only * during a fast commit. Currently only process can do fast commit, so * this field is not protected by any lock. */ struct buffer_head **j_fc_wbuf; /** * @j_wbufsize: * * Size of @j_wbuf array. */ int j_wbufsize; /** * @j_fc_wbufsize: * * Size of @j_fc_wbuf array. */ int j_fc_wbufsize; /** * @j_last_sync_writer: * * The pid of the last person to run a synchronous operation * through the journal. */ pid_t j_last_sync_writer; /** * @j_average_commit_time: * * The average amount of time in nanoseconds it takes to commit a * transaction to disk. [j_state_lock] */ u64 j_average_commit_time; /** * @j_min_batch_time: * * Minimum time that we should wait for additional filesystem operations * to get batched into a synchronous handle in microseconds. */ u32 j_min_batch_time; /** * @j_max_batch_time: * * Maximum time that we should wait for additional filesystem operations * to get batched into a synchronous handle in microseconds. */ u32 j_max_batch_time; /** * @j_commit_callback: * * This function is called when a transaction is closed. */ void (*j_commit_callback)(journal_t *, transaction_t *); /** * @j_submit_inode_data_buffers: * * This function is called for all inodes associated with the * committing transaction marked with JI_WRITE_DATA flag * before we start to write out the transaction to the journal. */ int (*j_submit_inode_data_buffers) (struct jbd2_inode *); /** * @j_finish_inode_data_buffers: * * This function is called for all inodes associated with the * committing transaction marked with JI_WAIT_DATA flag * after we have written the transaction to the journal * but before we write out the commit block. */ int (*j_finish_inode_data_buffers) (struct jbd2_inode *); /* * Journal statistics */ /** * @j_history_lock: Protect the transactions statistics history. */ spinlock_t j_history_lock; /** * @j_proc_entry: procfs entry for the jbd statistics directory. */ struct proc_dir_entry *j_proc_entry; /** * @j_stats: Overall statistics. */ struct transaction_stats_s j_stats; /** * @j_failed_commit: Failed journal commit ID. */ unsigned int j_failed_commit; /** * @j_private: * * An opaque pointer to fs-private information. ext3 puts its * superblock pointer here. */ void *j_private; /** * @j_csum_seed: * * Precomputed journal UUID checksum for seeding other checksums. */ __u32 j_csum_seed; #ifdef CONFIG_DEBUG_LOCK_ALLOC /** * @j_trans_commit_map: * * Lockdep entity to track transaction commit dependencies. Handles * hold this "lock" for read, when we wait for commit, we acquire the * "lock" for writing. This matches the properties of jbd2 journalling * where the running transaction has to wait for all handles to be * dropped to commit that transaction and also acquiring a handle may * require transaction commit to finish. */ struct lockdep_map j_trans_commit_map; #endif /** * @jbd2_trans_commit_key: * * "struct lock_class_key" for @j_trans_commit_map */ struct lock_class_key jbd2_trans_commit_key; /** * @j_fc_cleanup_callback: * * Clean-up after fast commit or full commit. JBD2 calls this function * after every commit operation. */ void (*j_fc_cleanup_callback)(struct journal_s *journal, int full, tid_t tid); /** * @j_fc_replay_callback: * * File-system specific function that performs replay of a fast * commit. JBD2 calls this function for each fast commit block found in * the journal. This function should return JBD2_FC_REPLAY_CONTINUE * to indicate that the block was processed correctly and more fast * commit replay should continue. Return value of JBD2_FC_REPLAY_STOP * indicates the end of replay (no more blocks remaining). A negative * return value indicates error. */ int (*j_fc_replay_callback)(struct journal_s *journal, struct buffer_head *bh, enum passtype pass, int off, tid_t expected_commit_id); /** * @j_bmap: * * Bmap function that should be used instead of the generic * VFS bmap function. */ int (*j_bmap)(struct journal_s *journal, sector_t *block); }; #define jbd2_might_wait_for_commit(j) \ do { \ rwsem_acquire(&j->j_trans_commit_map, 0, 0, _THIS_IP_); \ rwsem_release(&j->j_trans_commit_map, _THIS_IP_); \ } while (0) /* * We can support any known requested features iff the * superblock is not in version 1. Otherwise we fail to support any * extended sb features. */ static inline bool jbd2_format_support_feature(journal_t *j) { return j->j_superblock->s_header.h_blocktype != cpu_to_be32(JBD2_SUPERBLOCK_V1); } /* journal feature predicate functions */ #define JBD2_FEATURE_COMPAT_FUNCS(name, flagname) \ static inline bool jbd2_has_feature_##name(journal_t *j) \ { \ return (jbd2_format_support_feature(j) && \ ((j)->j_superblock->s_feature_compat & \ cpu_to_be32(JBD2_FEATURE_COMPAT_##flagname)) != 0); \ } \ static inline void jbd2_set_feature_##name(journal_t *j) \ { \ (j)->j_superblock->s_feature_compat |= \ cpu_to_be32(JBD2_FEATURE_COMPAT_##flagname); \ } \ static inline void jbd2_clear_feature_##name(journal_t *j) \ { \ (j)->j_superblock->s_feature_compat &= \ ~cpu_to_be32(JBD2_FEATURE_COMPAT_##flagname); \ } #define JBD2_FEATURE_RO_COMPAT_FUNCS(name, flagname) \ static inline bool jbd2_has_feature_##name(journal_t *j) \ { \ return (jbd2_format_support_feature(j) && \ ((j)->j_superblock->s_feature_ro_compat & \ cpu_to_be32(JBD2_FEATURE_RO_COMPAT_##flagname)) != 0); \ } \ static inline void jbd2_set_feature_##name(journal_t *j) \ { \ (j)->j_superblock->s_feature_ro_compat |= \ cpu_to_be32(JBD2_FEATURE_RO_COMPAT_##flagname); \ } \ static inline void jbd2_clear_feature_##name(journal_t *j) \ { \ (j)->j_superblock->s_feature_ro_compat &= \ ~cpu_to_be32(JBD2_FEATURE_RO_COMPAT_##flagname); \ } #define JBD2_FEATURE_INCOMPAT_FUNCS(name, flagname) \ static inline bool jbd2_has_feature_##name(journal_t *j) \ { \ return (jbd2_format_support_feature(j) && \ ((j)->j_superblock->s_feature_incompat & \ cpu_to_be32(JBD2_FEATURE_INCOMPAT_##flagname)) != 0); \ } \ static inline void jbd2_set_feature_##name(journal_t *j) \ { \ (j)->j_superblock->s_feature_incompat |= \ cpu_to_be32(JBD2_FEATURE_INCOMPAT_##flagname); \ } \ static inline void jbd2_clear_feature_##name(journal_t *j) \ { \ (j)->j_superblock->s_feature_incompat &= \ ~cpu_to_be32(JBD2_FEATURE_INCOMPAT_##flagname); \ } JBD2_FEATURE_COMPAT_FUNCS(checksum, CHECKSUM) JBD2_FEATURE_INCOMPAT_FUNCS(revoke, REVOKE) JBD2_FEATURE_INCOMPAT_FUNCS(64bit, 64BIT) JBD2_FEATURE_INCOMPAT_FUNCS(async_commit, ASYNC_COMMIT) JBD2_FEATURE_INCOMPAT_FUNCS(csum2, CSUM_V2) JBD2_FEATURE_INCOMPAT_FUNCS(csum3, CSUM_V3) JBD2_FEATURE_INCOMPAT_FUNCS(fast_commit, FAST_COMMIT) /* Journal high priority write IO operation flags */ #define JBD2_JOURNAL_REQ_FLAGS (REQ_META | REQ_SYNC | REQ_IDLE) /* * Journal flag definitions */ #define JBD2_UNMOUNT 0x001 /* Journal thread is being destroyed */ #define JBD2_ABORT 0x002 /* Journaling has been aborted for errors. */ #define JBD2_ACK_ERR 0x004 /* The errno in the sb has been acked */ #define JBD2_FLUSHED 0x008 /* The journal superblock has been flushed */ #define JBD2_LOADED 0x010 /* The journal superblock has been loaded */ #define JBD2_BARRIER 0x020 /* Use IDE barriers */ #define JBD2_CYCLE_RECORD 0x080 /* Journal cycled record log on * clean and empty filesystem * logging area */ #define JBD2_FAST_COMMIT_ONGOING 0x100 /* Fast commit is ongoing */ #define JBD2_FULL_COMMIT_ONGOING 0x200 /* Full commit is ongoing */ #define JBD2_JOURNAL_FLUSH_DISCARD 0x0001 #define JBD2_JOURNAL_FLUSH_ZEROOUT 0x0002 #define JBD2_JOURNAL_FLUSH_VALID (JBD2_JOURNAL_FLUSH_DISCARD | \ JBD2_JOURNAL_FLUSH_ZEROOUT) /* * Function declarations for the journaling transaction and buffer * management */ /* Filing buffers */ extern bool __jbd2_journal_refile_buffer(struct journal_head *); extern void jbd2_journal_refile_buffer(journal_t *, struct journal_head *); extern void __jbd2_journal_file_buffer(struct journal_head *, transaction_t *, int); extern void jbd2_journal_file_buffer(struct journal_head *, transaction_t *, int); static inline void jbd2_file_log_bh(struct list_head *head, struct buffer_head *bh) { list_add_tail(&bh->b_assoc_buffers, head); } static inline void jbd2_unfile_log_bh(struct buffer_head *bh) { list_del_init(&bh->b_assoc_buffers); } /* Log buffer allocation */ struct buffer_head *jbd2_journal_get_descriptor_buffer(transaction_t *, int); void jbd2_descriptor_block_csum_set(journal_t *, struct buffer_head *); int jbd2_journal_next_log_block(journal_t *, unsigned long long *); int jbd2_journal_get_log_tail(journal_t *journal, tid_t *tid, unsigned long *block); int __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block); void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block); /* Commit management */ extern void jbd2_journal_commit_transaction(journal_t *); /* Checkpoint list management */ enum jbd2_shrink_type {JBD2_SHRINK_DESTROY, JBD2_SHRINK_BUSY_STOP, JBD2_SHRINK_BUSY_SKIP}; void __jbd2_journal_clean_checkpoint_list(journal_t *journal, enum jbd2_shrink_type type); unsigned long jbd2_journal_shrink_checkpoint_list(journal_t *journal, unsigned long *nr_to_scan); int __jbd2_journal_remove_checkpoint(struct journal_head *); int jbd2_journal_try_remove_checkpoint(struct journal_head *jh); void jbd2_journal_destroy_checkpoint(journal_t *journal); void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *); /* * Triggers */ struct jbd2_buffer_trigger_type { /* * Fired a the moment data to write to the journal are known to be * stable - so either at the moment b_frozen_data is created or just * before a buffer is written to the journal. mapped_data is a mapped * buffer that is the frozen data for commit. */ void (*t_frozen)(struct jbd2_buffer_trigger_type *type, struct buffer_head *bh, void *mapped_data, size_t size); /* * Fired during journal abort for dirty buffers that will not be * committed. */ void (*t_abort)(struct jbd2_buffer_trigger_type *type, struct buffer_head *bh); }; extern void jbd2_buffer_frozen_trigger(struct journal_head *jh, void *mapped_data, struct jbd2_buffer_trigger_type *triggers); extern void jbd2_buffer_abort_trigger(struct journal_head *jh, struct jbd2_buffer_trigger_type *triggers); /* Buffer IO */ extern int jbd2_journal_write_metadata_buffer(transaction_t *transaction, struct journal_head *jh_in, struct buffer_head **bh_out, sector_t blocknr); /* Transaction cache support */ extern void jbd2_journal_destroy_transaction_cache(void); extern int __init jbd2_journal_init_transaction_cache(void); extern void jbd2_journal_free_transaction(transaction_t *); /* * Journal locking. * * We need to lock the journal during transaction state changes so that nobody * ever tries to take a handle on the running transaction while we are in the * middle of moving it to the commit phase. j_state_lock does this. * * Note that the locking is completely interrupt unsafe. We never touch * journal structures from interrupts. */ static inline handle_t *journal_current_handle(void) { return current->journal_info; } /* The journaling code user interface: * * Create and destroy handles * Register buffer modifications against the current transaction. */ extern handle_t *jbd2_journal_start(journal_t *, int nblocks); extern handle_t *jbd2__journal_start(journal_t *, int blocks, int rsv_blocks, int revoke_records, gfp_t gfp_mask, unsigned int type, unsigned int line_no); extern int jbd2_journal_restart(handle_t *, int nblocks); extern int jbd2__journal_restart(handle_t *, int nblocks, int revoke_records, gfp_t gfp_mask); extern int jbd2_journal_start_reserved(handle_t *handle, unsigned int type, unsigned int line_no); extern void jbd2_journal_free_reserved(handle_t *handle); extern int jbd2_journal_extend(handle_t *handle, int nblocks, int revoke_records); extern int jbd2_journal_get_write_access(handle_t *, struct buffer_head *); extern int jbd2_journal_get_create_access (handle_t *, struct buffer_head *); extern int jbd2_journal_get_undo_access(handle_t *, struct buffer_head *); void jbd2_journal_set_triggers(struct buffer_head *, struct jbd2_buffer_trigger_type *type); extern int jbd2_journal_dirty_metadata (handle_t *, struct buffer_head *); extern int jbd2_journal_forget (handle_t *, struct buffer_head *); int jbd2_journal_invalidate_folio(journal_t *, struct folio *, size_t offset, size_t length); bool jbd2_journal_try_to_free_buffers(journal_t *journal, struct folio *folio); extern int jbd2_journal_stop(handle_t *); extern int jbd2_journal_flush(journal_t *journal, unsigned int flags); extern void jbd2_journal_lock_updates (journal_t *); extern void jbd2_journal_unlock_updates (journal_t *); void jbd2_journal_wait_updates(journal_t *); extern journal_t * jbd2_journal_init_dev(struct block_device *bdev, struct block_device *fs_dev, unsigned long long start, int len, int bsize); extern journal_t * jbd2_journal_init_inode (struct inode *); extern int jbd2_journal_update_format (journal_t *); extern int jbd2_journal_check_used_features (journal_t *, unsigned long, unsigned long, unsigned long); extern int jbd2_journal_check_available_features (journal_t *, unsigned long, unsigned long, unsigned long); extern int jbd2_journal_set_features (journal_t *, unsigned long, unsigned long, unsigned long); extern void jbd2_journal_clear_features (journal_t *, unsigned long, unsigned long, unsigned long); extern int jbd2_journal_load (journal_t *journal); extern int jbd2_journal_destroy (journal_t *); extern int jbd2_journal_recover (journal_t *journal); extern int jbd2_journal_wipe (journal_t *, int); extern int jbd2_journal_skip_recovery (journal_t *); extern void jbd2_journal_update_sb_errno(journal_t *); extern int jbd2_journal_update_sb_log_tail (journal_t *, tid_t, unsigned long, blk_opf_t); extern void jbd2_journal_abort (journal_t *, int); extern int jbd2_journal_errno (journal_t *); extern void jbd2_journal_ack_err (journal_t *); extern int jbd2_journal_clear_err (journal_t *); extern int jbd2_journal_bmap(journal_t *, unsigned long, unsigned long long *); extern int jbd2_journal_force_commit(journal_t *); extern int jbd2_journal_force_commit_nested(journal_t *); extern int jbd2_journal_inode_ranged_write(handle_t *handle, struct jbd2_inode *inode, loff_t start_byte, loff_t length); extern int jbd2_journal_inode_ranged_wait(handle_t *handle, struct jbd2_inode *inode, loff_t start_byte, loff_t length); extern int jbd2_journal_finish_inode_data_buffers( struct jbd2_inode *jinode); extern int jbd2_journal_begin_ordered_truncate(journal_t *journal, struct jbd2_inode *inode, loff_t new_size); extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode); extern void jbd2_journal_release_jbd_inode(journal_t *journal, struct jbd2_inode *jinode); /* * journal_head management */ struct journal_head *jbd2_journal_add_journal_head(struct buffer_head *bh); struct journal_head *jbd2_journal_grab_journal_head(struct buffer_head *bh); void jbd2_journal_put_journal_head(struct journal_head *jh); /* * handle management */ extern struct kmem_cache *jbd2_handle_cache; /* * This specialized allocator has to be a macro for its allocations to be * accounted separately (to have a separate alloc_tag). The typecast is * intentional to enforce typesafety. */ #define jbd2_alloc_handle(_gfp_flags) \ ((handle_t *)kmem_cache_zalloc(jbd2_handle_cache, _gfp_flags)) static inline void jbd2_free_handle(handle_t *handle) { kmem_cache_free(jbd2_handle_cache, handle); } /* * jbd2_inode management (optional, for those file systems that want to use * dynamically allocated jbd2_inode structures) */ extern struct kmem_cache *jbd2_inode_cache; /* * This specialized allocator has to be a macro for its allocations to be * accounted separately (to have a separate alloc_tag). The typecast is * intentional to enforce typesafety. */ #define jbd2_alloc_inode(_gfp_flags) \ ((struct jbd2_inode *)kmem_cache_alloc(jbd2_inode_cache, _gfp_flags)) static inline void jbd2_free_inode(struct jbd2_inode *jinode) { kmem_cache_free(jbd2_inode_cache, jinode); } /* Primary revoke support */ #define JOURNAL_REVOKE_DEFAULT_HASH 256 extern int jbd2_journal_init_revoke(journal_t *, int); extern void jbd2_journal_destroy_revoke_record_cache(void); extern void jbd2_journal_destroy_revoke_table_cache(void); extern int __init jbd2_journal_init_revoke_record_cache(void); extern int __init jbd2_journal_init_revoke_table_cache(void); struct jbd2_revoke_table_s *jbd2_journal_init_revoke_table(int hash_size); void jbd2_journal_destroy_revoke_table(struct jbd2_revoke_table_s *table); extern void jbd2_journal_destroy_revoke(journal_t *); extern int jbd2_journal_revoke (handle_t *, unsigned long long, struct buffer_head *); extern void jbd2_journal_cancel_revoke(handle_t *, struct journal_head *); extern void jbd2_journal_write_revoke_records(transaction_t *transaction, struct list_head *log_bufs); /* Recovery revoke support */ extern int jbd2_journal_set_revoke(journal_t *, unsigned long long, tid_t); extern int jbd2_journal_test_revoke(journal_t *, unsigned long long, tid_t); extern void jbd2_journal_clear_revoke(journal_t *); extern void jbd2_journal_switch_revoke_table(journal_t *journal); extern void jbd2_clear_buffer_revoked_flags(journal_t *journal); /* * The log thread user interface: * * Request space in the current transaction, and force transaction commit * transitions on demand. */ int jbd2_log_start_commit(journal_t *journal, tid_t tid); int jbd2_journal_start_commit(journal_t *journal, tid_t *tid); int jbd2_log_wait_commit(journal_t *journal, tid_t tid); int jbd2_transaction_committed(journal_t *journal, tid_t tid); int jbd2_complete_transaction(journal_t *journal, tid_t tid); int jbd2_log_do_checkpoint(journal_t *journal); int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid); void __jbd2_log_wait_for_space(journal_t *journal); extern void __jbd2_journal_drop_transaction(journal_t *, transaction_t *); extern int jbd2_cleanup_journal_tail(journal_t *); /* Fast commit related APIs */ int jbd2_fc_begin_commit(journal_t *journal, tid_t tid); int jbd2_fc_end_commit(journal_t *journal); int jbd2_fc_end_commit_fallback(journal_t *journal); int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out); int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode); int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode); int jbd2_fc_wait_bufs(journal_t *journal, int num_blks); void jbd2_fc_release_bufs(journal_t *journal); /* * is_journal_abort * * Simple test wrapper function to test the JBD2_ABORT state flag. This * bit, when set, indicates that we have had a fatal error somewhere, * either inside the journaling layer or indicated to us by the client * (eg. ext3), and that we and should not commit any further * transactions. */ static inline int is_journal_aborted(journal_t *journal) { return journal->j_flags & JBD2_ABORT; } static inline int is_handle_aborted(handle_t *handle) { if (handle->h_aborted || !handle->h_transaction) return 1; return is_journal_aborted(handle->h_transaction->t_journal); } static inline void jbd2_journal_abort_handle(handle_t *handle) { handle->h_aborted = 1; } static inline void jbd2_init_fs_dev_write_error(journal_t *journal) { struct address_space *mapping = journal->j_fs_dev->bd_mapping; /* * Save the original wb_err value of client fs's bdev mapping which * could be used to detect the client fs's metadata async write error. */ errseq_check_and_advance(&mapping->wb_err, &journal->j_fs_dev_wb_err); } static inline int jbd2_check_fs_dev_write_error(journal_t *journal) { struct address_space *mapping = journal->j_fs_dev->bd_mapping; return errseq_check(&mapping->wb_err, READ_ONCE(journal->j_fs_dev_wb_err)); } #endif /* __KERNEL__ */ /* Comparison functions for transaction IDs: perform comparisons using * modulo arithmetic so that they work over sequence number wraps. */ static inline int tid_gt(tid_t x, tid_t y) { int difference = (x - y); return (difference > 0); } static inline int tid_geq(tid_t x, tid_t y) { int difference = (x - y); return (difference >= 0); } extern int jbd2_journal_blocks_per_folio(struct inode *inode); extern size_t journal_tag_bytes(journal_t *journal); static inline int jbd2_journal_has_csum_v2or3(journal_t *journal) { return jbd2_has_feature_csum2(journal) || jbd2_has_feature_csum3(journal); } static inline int jbd2_journal_get_num_fc_blks(journal_superblock_t *jsb) { int num_fc_blocks = be32_to_cpu(jsb->s_num_fc_blks); return num_fc_blocks ? num_fc_blocks : JBD2_DEFAULT_FAST_COMMIT_BLOCKS; } /* * Return number of free blocks in the log. Must be called under j_state_lock. */ static inline unsigned long jbd2_log_space_left(journal_t *journal) { /* Allow for rounding errors */ long free = journal->j_free - 32; if (journal->j_committing_transaction) { free -= atomic_read(&journal-> j_committing_transaction->t_outstanding_credits); } return max_t(long, free, 0); } /* * Definitions which augment the buffer_head layer */ /* journaling buffer types */ #define BJ_None 0 /* Not journaled */ #define BJ_Metadata 1 /* Normal journaled metadata */ #define BJ_Forget 2 /* Buffer superseded by this transaction */ #define BJ_Shadow 3 /* Buffer contents being shadowed to the log */ #define BJ_Reserved 4 /* Buffer is reserved for access by journal */ #define BJ_Types 5 static inline u32 jbd2_chksum(u32 crc, const void *address, unsigned int length) { return crc32c(crc, address, length); } /* Return most recent uncommitted transaction */ static inline tid_t jbd2_get_latest_transaction(journal_t *journal) { tid_t tid; read_lock(&journal->j_state_lock); tid = journal->j_commit_request; if (journal->j_running_transaction) tid = journal->j_running_transaction->t_tid; read_unlock(&journal->j_state_lock); return tid; } static inline int jbd2_handle_buffer_credits(handle_t *handle) { journal_t *journal; if (!handle->h_reserved) journal = handle->h_transaction->t_journal; else journal = handle->h_journal; return handle->h_total_credits - DIV_ROUND_UP(handle->h_revoke_credits_requested, journal->j_revoke_records_per_block); } #ifdef __KERNEL__ #define buffer_trace_init(bh) do {} while (0) #define print_buffer_fields(bh) do {} while (0) #define print_buffer_trace(bh) do {} while (0) #define BUFFER_TRACE(bh, info) do {} while (0) #define BUFFER_TRACE2(bh, bh2, info) do {} while (0) #define JBUFFER_TRACE(jh, info) do {} while (0) #endif /* __KERNEL__ */ #define EFSBADCRC EBADMSG /* Bad CRC detected */ #define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */ #endif /* _LINUX_JBD2_H */ |
| 9 4 3 1 1 5 4 1 1 3 2 2 1 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 | // SPDX-License-Identifier: GPL-2.0+ /* * Restartable sequences system call * * Copyright (C) 2015, Google, Inc., * Paul Turner <pjt@google.com> and Andrew Hunter <ahh@google.com> * Copyright (C) 2015-2018, EfficiOS Inc., * Mathieu Desnoyers <mathieu.desnoyers@efficios.com> */ /* * Restartable sequences are a lightweight interface that allows * user-level code to be executed atomically relative to scheduler * preemption and signal delivery. Typically used for implementing * per-cpu operations. * * It allows user-space to perform update operations on per-cpu data * without requiring heavy-weight atomic operations. * * Detailed algorithm of rseq user-space assembly sequences: * * init(rseq_cs) * cpu = TLS->rseq::cpu_id_start * [1] TLS->rseq::rseq_cs = rseq_cs * [start_ip] ---------------------------- * [2] if (cpu != TLS->rseq::cpu_id) * goto abort_ip; * [3] <last_instruction_in_cs> * [post_commit_ip] ---------------------------- * * The address of jump target abort_ip must be outside the critical * region, i.e.: * * [abort_ip] < [start_ip] || [abort_ip] >= [post_commit_ip] * * Steps [2]-[3] (inclusive) need to be a sequence of instructions in * userspace that can handle being interrupted between any of those * instructions, and then resumed to the abort_ip. * * 1. Userspace stores the address of the struct rseq_cs assembly * block descriptor into the rseq_cs field of the registered * struct rseq TLS area. This update is performed through a single * store within the inline assembly instruction sequence. * [start_ip] * * 2. Userspace tests to check whether the current cpu_id field match * the cpu number loaded before start_ip, branching to abort_ip * in case of a mismatch. * * If the sequence is preempted or interrupted by a signal * at or after start_ip and before post_commit_ip, then the kernel * clears TLS->__rseq_abi::rseq_cs, and sets the user-space return * ip to abort_ip before returning to user-space, so the preempted * execution resumes at abort_ip. * * 3. Userspace critical section final instruction before * post_commit_ip is the commit. The critical section is * self-terminating. * [post_commit_ip] * * 4. <success> * * On failure at [2], or if interrupted by preempt or signal delivery * between [1] and [3]: * * [abort_ip] * F1. <failure> */ /* Required to select the proper per_cpu ops for rseq_stats_inc() */ #define RSEQ_BUILD_SLOW_PATH #include <linux/debugfs.h> #include <linux/ratelimit.h> #include <linux/rseq_entry.h> #include <linux/sched.h> #include <linux/syscalls.h> #include <linux/uaccess.h> #include <linux/types.h> #include <asm/ptrace.h> #define CREATE_TRACE_POINTS #include <trace/events/rseq.h> DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled); static inline void rseq_control_debug(bool on) { if (on) static_branch_enable(&rseq_debug_enabled); else static_branch_disable(&rseq_debug_enabled); } static int __init rseq_setup_debug(char *str) { bool on; if (kstrtobool(str, &on)) return -EINVAL; rseq_control_debug(on); return 1; } __setup("rseq_debug=", rseq_setup_debug); #ifdef CONFIG_TRACEPOINTS /* * Out of line, so the actual update functions can be in a header to be * inlined into the exit to user code. */ void __rseq_trace_update(struct task_struct *t) { trace_rseq_update(t); } void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip, unsigned long offset, unsigned long abort_ip) { trace_rseq_ip_fixup(ip, start_ip, offset, abort_ip); } #endif /* CONFIG_TRACEPOINTS */ #ifdef CONFIG_DEBUG_FS #ifdef CONFIG_RSEQ_STATS DEFINE_PER_CPU(struct rseq_stats, rseq_stats); static int rseq_stats_show(struct seq_file *m, void *p) { struct rseq_stats stats = { }; unsigned int cpu; for_each_possible_cpu(cpu) { stats.exit += data_race(per_cpu(rseq_stats.exit, cpu)); stats.signal += data_race(per_cpu(rseq_stats.signal, cpu)); stats.slowpath += data_race(per_cpu(rseq_stats.slowpath, cpu)); stats.fastpath += data_race(per_cpu(rseq_stats.fastpath, cpu)); stats.ids += data_race(per_cpu(rseq_stats.ids, cpu)); stats.cs += data_race(per_cpu(rseq_stats.cs, cpu)); stats.clear += data_race(per_cpu(rseq_stats.clear, cpu)); stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu)); } seq_printf(m, "exit: %16lu\n", stats.exit); seq_printf(m, "signal: %16lu\n", stats.signal); seq_printf(m, "slowp: %16lu\n", stats.slowpath); seq_printf(m, "fastp: %16lu\n", stats.fastpath); seq_printf(m, "ids: %16lu\n", stats.ids); seq_printf(m, "cs: %16lu\n", stats.cs); seq_printf(m, "clear: %16lu\n", stats.clear); seq_printf(m, "fixup: %16lu\n", stats.fixup); return 0; } static int rseq_stats_open(struct inode *inode, struct file *file) { return single_open(file, rseq_stats_show, inode->i_private); } static const struct file_operations stat_ops = { .open = rseq_stats_open, .read = seq_read, .llseek = seq_lseek, .release = single_release, }; static int __init rseq_stats_init(struct dentry *root_dir) { debugfs_create_file("stats", 0444, root_dir, NULL, &stat_ops); return 0; } #else static inline void rseq_stats_init(struct dentry *root_dir) { } #endif /* CONFIG_RSEQ_STATS */ static int rseq_debug_show(struct seq_file *m, void *p) { bool on = static_branch_unlikely(&rseq_debug_enabled); seq_printf(m, "%d\n", on); return 0; } static ssize_t rseq_debug_write(struct file *file, const char __user *ubuf, size_t count, loff_t *ppos) { bool on; if (kstrtobool_from_user(ubuf, count, &on)) return -EINVAL; rseq_control_debug(on); return count; } static int rseq_debug_open(struct inode *inode, struct file *file) { return single_open(file, rseq_debug_show, inode->i_private); } static const struct file_operations debug_ops = { .open = rseq_debug_open, .read = seq_read, .write = rseq_debug_write, .llseek = seq_lseek, .release = single_release, }; static int __init rseq_debugfs_init(void) { struct dentry *root_dir = debugfs_create_dir("rseq", NULL); debugfs_create_file("debug", 0644, root_dir, NULL, &debug_ops); rseq_stats_init(root_dir); return 0; } __initcall(rseq_debugfs_init); #endif /* CONFIG_DEBUG_FS */ static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id) { return rseq_set_ids_get_csaddr(t, ids, node_id, NULL); } static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs) { struct rseq __user *urseq = t->rseq.usrptr; u64 csaddr; scoped_user_read_access(urseq, efault) unsafe_get_user(csaddr, &urseq->rseq_cs, efault); if (likely(!csaddr)) return true; return rseq_update_user_cs(t, regs, csaddr); efault: return false; } static void rseq_slowpath_update_usr(struct pt_regs *regs) { /* * Preserve rseq state and user_irq state. The generic entry code * clears user_irq on the way out, the non-generic entry * architectures are not having user_irq. */ const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, }; struct task_struct *t = current; struct rseq_ids ids; u32 node_id; bool event; if (unlikely(t->flags & PF_EXITING)) return; rseq_stat_inc(rseq_stats.slowpath); /* * Read and clear the event pending bit first. If the task * was not preempted or migrated or a signal is on the way, * there is no point in doing any of the heavy lifting here * on production kernels. In that case TIF_NOTIFY_RESUME * was raised by some other functionality. * * This is correct because the read/clear operation is * guarded against scheduler preemption, which makes it CPU * local atomic. If the task is preempted right after * re-enabling preemption then TIF_NOTIFY_RESUME is set * again and this function is invoked another time _before_ * the task is able to return to user mode. * * On a debug kernel, invoke the fixup code unconditionally * with the result handed in to allow the detection of * inconsistencies. */ scoped_guard(irq) { event = t->rseq.event.sched_switch; t->rseq.event.all &= evt_mask.all; ids.cpu_id = task_cpu(t); ids.mm_cid = task_mm_cid(t); } if (!event) return; node_id = cpu_to_node(ids.cpu_id); if (unlikely(!rseq_update_usr(t, regs, &ids, node_id))) { /* * Clear the errors just in case this might survive magically, but * leave the rest intact. */ t->rseq.event.error = 0; force_sig(SIGSEGV); } } void __rseq_handle_slowpath(struct pt_regs *regs) { /* * If invoked from hypervisors before entering the guest via * resume_user_mode_work(), then @regs is a NULL pointer. * * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises * it before returning from the ioctl() to user space when * rseq_event.sched_switch is set. * * So it's safe to ignore here instead of pointlessly updating it * in the vcpu_run() loop. */ if (!regs) return; rseq_slowpath_update_usr(regs); } void __rseq_signal_deliver(int sig, struct pt_regs *regs) { rseq_stat_inc(rseq_stats.signal); /* * Don't update IDs, they are handled on exit to user if * necessary. The important thing is to abort a critical section of * the interrupted context as after this point the instruction * pointer in @regs points to the signal handler. */ if (unlikely(!rseq_handle_cs(current, regs))) { /* * Clear the errors just in case this might survive * magically, but leave the rest intact. */ current->rseq.event.error = 0; force_sigsegv(sig); } } /* * Terminate the process if a syscall is issued within a restartable * sequence. */ void __rseq_debug_syscall_return(struct pt_regs *regs) { struct task_struct *t = current; u64 csaddr; if (!t->rseq.event.has_rseq) return; if (get_user(csaddr, &t->rseq.usrptr->rseq_cs)) goto fail; if (likely(!csaddr)) return; if (unlikely(csaddr >= TASK_SIZE)) goto fail; if (rseq_debug_update_user_cs(t, regs, csaddr)) return; fail: force_sig(SIGSEGV); } #ifdef CONFIG_DEBUG_RSEQ /* Kept around to keep GENERIC_ENTRY=n architectures supported. */ void rseq_syscall(struct pt_regs *regs) { __rseq_debug_syscall_return(regs); } #endif static bool rseq_reset_ids(void) { struct rseq_ids ids = { .cpu_id = RSEQ_CPU_ID_UNINITIALIZED, .mm_cid = 0, }; /* * If this fails, terminate it because this leaves the kernel in * stupid state as exit to user space will try to fixup the ids * again. */ if (rseq_set_ids(current, &ids, 0)) return true; force_sig(SIGSEGV); return false; } /* The original rseq structure size (including padding) is 32 bytes. */ #define ORIG_RSEQ_SIZE 32 /* * sys_rseq - setup restartable sequences for caller thread. */ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig) { if (flags & RSEQ_FLAG_UNREGISTER) { if (flags & ~RSEQ_FLAG_UNREGISTER) return -EINVAL; /* Unregister rseq for current thread. */ if (current->rseq.usrptr != rseq || !current->rseq.usrptr) return -EINVAL; if (rseq_len != current->rseq.len) return -EINVAL; if (current->rseq.sig != sig) return -EPERM; if (!rseq_reset_ids()) return -EFAULT; rseq_reset(current); return 0; } if (unlikely(flags)) return -EINVAL; if (current->rseq.usrptr) { /* * If rseq is already registered, check whether * the provided address differs from the prior * one. */ if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len) return -EINVAL; if (current->rseq.sig != sig) return -EPERM; /* Already registered. */ return -EBUSY; } /* * If there was no rseq previously registered, ensure the provided rseq * is properly aligned, as communcated to user-space through the ELF * auxiliary vector AT_RSEQ_ALIGN. If rseq_len is the original rseq * size, the required alignment is the original struct rseq alignment. * * In order to be valid, rseq_len is either the original rseq size, or * large enough to contain all supported fields, as communicated to * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE. */ if (rseq_len < ORIG_RSEQ_SIZE || (rseq_len == ORIG_RSEQ_SIZE && !IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE)) || (rseq_len != ORIG_RSEQ_SIZE && (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)) || rseq_len < offsetof(struct rseq, end)))) return -EINVAL; if (!access_ok(rseq, rseq_len)) return -EFAULT; scoped_user_write_access(rseq, efault) { /* * If the rseq_cs pointer is non-NULL on registration, clear it to * avoid a potential segfault on return to user-space. The proper thing * to do would have been to fail the registration but this would break * older libcs that reuse the rseq area for new threads without * clearing the fields. Don't bother reading it, just reset it. */ unsafe_put_user(0UL, &rseq->rseq_cs, efault); /* Initialize IDs in user space */ unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault); unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault); unsafe_put_user(0U, &rseq->node_id, efault); unsafe_put_user(0U, &rseq->mm_cid, efault); } /* * Activate the registration by setting the rseq area address, length * and signature in the task struct. */ current->rseq.usrptr = rseq; current->rseq.len = rseq_len; current->rseq.sig = sig; /* * If rseq was previously inactive, and has just been * registered, ensure the cpu_id_start and cpu_id fields * are updated before returning to user-space. */ current->rseq.event.has_rseq = true; rseq_force_update(); return 0; efault: return -EFAULT; } |
| 1759 1311 16 16 16 15 24 4 20 13 10 24 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef __LINUX_GFP_H #define __LINUX_GFP_H #include <linux/gfp_types.h> #include <linux/mmzone.h> #include <linux/topology.h> #include <linux/alloc_tag.h> #include <linux/cleanup.h> #include <linux/sched.h> struct vm_area_struct; struct mempolicy; /* Convert GFP flags to their corresponding migrate type */ #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE) #define GFP_MOVABLE_SHIFT 3 static inline int gfp_migratetype(const gfp_t gfp_flags) { VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK); BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE); BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE); BUILD_BUG_ON((___GFP_RECLAIMABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_RECLAIMABLE); BUILD_BUG_ON(((___GFP_MOVABLE | ___GFP_RECLAIMABLE) >> GFP_MOVABLE_SHIFT) != MIGRATE_HIGHATOMIC); if (unlikely(page_group_by_mobility_disabled)) return MIGRATE_UNMOVABLE; /* Group based on mobility */ return (__force unsigned long)(gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT; } #undef GFP_MOVABLE_MASK #undef GFP_MOVABLE_SHIFT static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags) { return !!(gfp_flags & __GFP_DIRECT_RECLAIM); } static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags) { /* * !__GFP_DIRECT_RECLAIM -> direct claim is not allowed. * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd. * All GFP_* flags including GFP_NOWAIT use one or both flags. * alloc_pages_nolock() is the only API that doesn't specify either flag. * * This is stronger than GFP_NOWAIT or GFP_ATOMIC because * those are guaranteed to never block on a sleeping lock. * Here we are enforcing that the allocation doesn't ever spin * on any locks (i.e. only trylocks). There is no high level * GFP_$FOO flag for this use in alloc_pages_nolock() as the * regular page allocator doesn't fully support this * allocation mode. */ return !!(gfp_flags & __GFP_RECLAIM); } #ifdef CONFIG_HIGHMEM #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM #else #define OPT_ZONE_HIGHMEM ZONE_NORMAL #endif #ifdef CONFIG_ZONE_DMA #define OPT_ZONE_DMA ZONE_DMA #else #define OPT_ZONE_DMA ZONE_NORMAL #endif #ifdef CONFIG_ZONE_DMA32 #define OPT_ZONE_DMA32 ZONE_DMA32 #else #define OPT_ZONE_DMA32 ZONE_NORMAL #endif /* * GFP_ZONE_TABLE is a word size bitstring that is used for looking up the * zone to use given the lowest 4 bits of gfp_t. Entries are GFP_ZONES_SHIFT * bits long and there are 16 of them to cover all possible combinations of * __GFP_DMA, __GFP_DMA32, __GFP_MOVABLE and __GFP_HIGHMEM. * * The zone fallback order is MOVABLE=>HIGHMEM=>NORMAL=>DMA32=>DMA. * But GFP_MOVABLE is not only a zone specifier but also an allocation * policy. Therefore __GFP_MOVABLE plus another zone selector is valid. * Only 1 bit of the lowest 3 bits (DMA,DMA32,HIGHMEM) can be set to "1". * * bit result * ================= * 0x0 => NORMAL * 0x1 => DMA or NORMAL * 0x2 => HIGHMEM or NORMAL * 0x3 => BAD (DMA+HIGHMEM) * 0x4 => DMA32 or NORMAL * 0x5 => BAD (DMA+DMA32) * 0x6 => BAD (HIGHMEM+DMA32) * 0x7 => BAD (HIGHMEM+DMA32+DMA) * 0x8 => NORMAL (MOVABLE+0) * 0x9 => DMA or NORMAL (MOVABLE+DMA) * 0xa => MOVABLE (Movable is valid only if HIGHMEM is set too) * 0xb => BAD (MOVABLE+HIGHMEM+DMA) * 0xc => DMA32 or NORMAL (MOVABLE+DMA32) * 0xd => BAD (MOVABLE+DMA32+DMA) * 0xe => BAD (MOVABLE+DMA32+HIGHMEM) * 0xf => BAD (MOVABLE+DMA32+HIGHMEM+DMA) * * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms. */ #if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4 /* ZONE_DEVICE is not a valid GFP zone specifier */ #define GFP_ZONES_SHIFT 2 #else #define GFP_ZONES_SHIFT ZONES_SHIFT #endif #if 16 * GFP_ZONES_SHIFT > BITS_PER_LONG #error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer #endif #define GFP_ZONE_TABLE ( \ (ZONE_NORMAL << 0 * GFP_ZONES_SHIFT) \ | (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT) \ | (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT) \ | (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT) \ | (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT) \ | (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT) \ | (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)\ | (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)\ ) /* * GFP_ZONE_BAD is a bitmap for all combinations of __GFP_DMA, __GFP_DMA32 * __GFP_HIGHMEM and __GFP_MOVABLE that are not permitted. One flag per * entry starting with bit 0. Bit is set if the combination is not * allowed. */ #define GFP_ZONE_BAD ( \ 1 << (___GFP_DMA | ___GFP_HIGHMEM) \ | 1 << (___GFP_DMA | ___GFP_DMA32) \ | 1 << (___GFP_DMA32 | ___GFP_HIGHMEM) \ | 1 << (___GFP_DMA | ___GFP_DMA32 | ___GFP_HIGHMEM) \ | 1 << (___GFP_MOVABLE | ___GFP_HIGHMEM | ___GFP_DMA) \ | 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_DMA) \ | 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_HIGHMEM) \ | 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_DMA | ___GFP_HIGHMEM) \ ) static inline enum zone_type gfp_zone(gfp_t flags) { enum zone_type z; int bit = (__force int) (flags & GFP_ZONEMASK); z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) & ((1 << GFP_ZONES_SHIFT) - 1); VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1); return z; } /* * There is only one page-allocator function, and two main namespaces to * it. The alloc_page*() variants return 'struct page *' and as such * can allocate highmem pages, the *get*page*() variants return * virtual kernel addresses to the allocated page(s). */ static inline int gfp_zonelist(gfp_t flags) { #ifdef CONFIG_NUMA if (unlikely(flags & __GFP_THISNODE)) return ZONELIST_NOFALLBACK; #endif return ZONELIST_FALLBACK; } /* * gfp flag masking for nested internal allocations. * * For code that needs to do allocations inside the public allocation API (e.g. * memory allocation tracking code) the allocations need to obey the caller * allocation context constrains to prevent allocation context mismatches (e.g. * GFP_KERNEL allocations in GFP_NOFS contexts) from potential deadlock * situations. * * It is also assumed that these nested allocations are for internal kernel * object storage purposes only and are not going to be used for DMA, etc. Hence * we strip out all the zone information and leave just the context information * intact. * * Further, internal allocations must fail before the higher level allocation * can fail, so we must make them fail faster and fail silently. We also don't * want them to deplete emergency reserves. Hence nested allocations must be * prepared for these allocations to fail. */ static inline gfp_t gfp_nested_mask(gfp_t flags) { return ((flags & (GFP_KERNEL | GFP_ATOMIC | __GFP_NOLOCKDEP)) | (__GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN)); } /* * We get the zone list from the current node and the gfp_mask. * This zone list contains a maximum of MAX_NUMNODES*MAX_NR_ZONES zones. * There are two zonelists per node, one for all zones with memory and * one containing just zones from the node the zonelist belongs to. * * For the case of non-NUMA systems the NODE_DATA() gets optimized to * &contig_page_data at compile-time. */ static inline struct zonelist *node_zonelist(int nid, gfp_t flags) { return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); } #ifndef HAVE_ARCH_FREE_PAGE static inline void arch_free_page(struct page *page, int order) { } #endif #ifndef HAVE_ARCH_ALLOC_PAGE static inline void arch_alloc_page(struct page *page, int order) { } #endif struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, int preferred_nid, nodemask_t *nodemask); #define __alloc_pages(...) alloc_hooks(__alloc_pages_noprof(__VA_ARGS__)) struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid, nodemask_t *nodemask); #define __folio_alloc(...) alloc_hooks(__folio_alloc_noprof(__VA_ARGS__)) unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid, nodemask_t *nodemask, int nr_pages, struct page **page_array); #define __alloc_pages_bulk(...) alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__)) unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp, unsigned long nr_pages, struct page **page_array); #define alloc_pages_bulk_mempolicy(...) \ alloc_hooks(alloc_pages_bulk_mempolicy_noprof(__VA_ARGS__)) /* Bulk allocate order-0 pages */ #define alloc_pages_bulk(_gfp, _nr_pages, _page_array) \ __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _page_array) static inline unsigned long alloc_pages_bulk_node_noprof(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array) { if (nid == NUMA_NO_NODE) nid = numa_mem_id(); return alloc_pages_bulk_noprof(gfp, nid, NULL, nr_pages, page_array); } #define alloc_pages_bulk_node(...) \ alloc_hooks(alloc_pages_bulk_node_noprof(__VA_ARGS__)) static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask) { gfp_t warn_gfp = gfp_mask & (__GFP_THISNODE|__GFP_NOWARN); if (warn_gfp != (__GFP_THISNODE|__GFP_NOWARN)) return; if (node_online(this_node)) return; pr_warn("%pGg allocation from offline node %d\n", &gfp_mask, this_node); dump_stack(); } /* * Allocate pages, preferring the node given as nid. The node must be valid and * online. For more general interface, see alloc_pages_node(). */ static inline struct page * __alloc_pages_node_noprof(int nid, gfp_t gfp_mask, unsigned int order) { VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES); warn_if_node_offline(nid, gfp_mask); return __alloc_pages_noprof(gfp_mask, order, nid, NULL); } #define __alloc_pages_node(...) alloc_hooks(__alloc_pages_node_noprof(__VA_ARGS__)) static inline struct folio *__folio_alloc_node_noprof(gfp_t gfp, unsigned int order, int nid) { VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES); warn_if_node_offline(nid, gfp); return __folio_alloc_noprof(gfp, order, nid, NULL); } #define __folio_alloc_node(...) alloc_hooks(__folio_alloc_node_noprof(__VA_ARGS__)) /* * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE, * prefer the current CPU's closest node. Otherwise node must be valid and * online. */ static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask, unsigned int order) { if (nid == NUMA_NO_NODE) nid = numa_mem_id(); return __alloc_pages_node_noprof(nid, gfp_mask, order); } #define alloc_pages_node(...) alloc_hooks(alloc_pages_node_noprof(__VA_ARGS__)) #ifdef CONFIG_NUMA struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order); struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order); struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order, struct mempolicy *mpol, pgoff_t ilx, int nid); struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma, unsigned long addr); #else static inline struct page *alloc_pages_noprof(gfp_t gfp_mask, unsigned int order) { return alloc_pages_node_noprof(numa_node_id(), gfp_mask, order); } static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order) { return __folio_alloc_node_noprof(gfp, order, numa_node_id()); } static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order, struct mempolicy *mpol, pgoff_t ilx, int nid) { return folio_alloc_noprof(gfp, order); } #define vma_alloc_folio_noprof(gfp, order, vma, addr) \ folio_alloc_noprof(gfp, order) #endif #define alloc_pages(...) alloc_hooks(alloc_pages_noprof(__VA_ARGS__)) #define folio_alloc(...) alloc_hooks(folio_alloc_noprof(__VA_ARGS__)) #define folio_alloc_mpol(...) alloc_hooks(folio_alloc_mpol_noprof(__VA_ARGS__)) #define vma_alloc_folio(...) alloc_hooks(vma_alloc_folio_noprof(__VA_ARGS__)) #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0) static inline struct page *alloc_page_vma_noprof(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr) { struct folio *folio = vma_alloc_folio_noprof(gfp, 0, vma, addr); return &folio->page; } #define alloc_page_vma(...) alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__)) struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order); #define alloc_pages_nolock(...) alloc_hooks(alloc_pages_nolock_noprof(__VA_ARGS__)) extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order); #define __get_free_pages(...) alloc_hooks(get_free_pages_noprof(__VA_ARGS__)) extern unsigned long get_zeroed_page_noprof(gfp_t gfp_mask); #define get_zeroed_page(...) alloc_hooks(get_zeroed_page_noprof(__VA_ARGS__)) void *alloc_pages_exact_noprof(size_t size, gfp_t gfp_mask) __alloc_size(1); #define alloc_pages_exact(...) alloc_hooks(alloc_pages_exact_noprof(__VA_ARGS__)) void free_pages_exact(void *virt, size_t size); __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2); #define alloc_pages_exact_nid(...) \ alloc_hooks(alloc_pages_exact_nid_noprof(__VA_ARGS__)) #define __get_free_page(gfp_mask) \ __get_free_pages((gfp_mask), 0) #define __get_dma_pages(gfp_mask, order) \ __get_free_pages((gfp_mask) | GFP_DMA, (order)) extern void __free_pages(struct page *page, unsigned int order); extern void free_pages_nolock(struct page *page, unsigned int order); extern void free_pages(unsigned long addr, unsigned int order); #define __free_page(page) __free_pages((page), 0) #define free_page(addr) free_pages((addr), 0) void page_alloc_init_cpuhp(void); bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp); void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp); void drain_all_pages(struct zone *zone); void drain_local_pages(struct zone *zone); void page_alloc_init_late(void); void setup_pcp_cacheinfo(unsigned int cpu); /* * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what * GFP flags are used before interrupts are enabled. Once interrupts are * enabled, it is set to __GFP_BITS_MASK while the system is running. During * hibernation, it is used by PM to avoid I/O during memory allocation while * devices are suspended. */ extern gfp_t gfp_allowed_mask; /* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask); static inline bool gfp_has_io_fs(gfp_t gfp) { return (gfp & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS); } /* * Check if the gfp flags allow compaction - GFP_NOIO is a really * tricky context because the migration might require IO. */ static inline bool gfp_compaction_allowed(gfp_t gfp_mask) { return IS_ENABLED(CONFIG_COMPACTION) && (gfp_mask & __GFP_IO); } extern gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma); #ifdef CONFIG_CONTIG_ALLOC typedef unsigned int __bitwise acr_flags_t; #define ACR_FLAGS_NONE ((__force acr_flags_t)0) // ordinary allocation request #define ACR_FLAGS_CMA ((__force acr_flags_t)BIT(0)) // allocate for CMA /* The below functions must be run on a range from a single zone. */ extern int alloc_contig_range_noprof(unsigned long start, unsigned long end, acr_flags_t alloc_flags, gfp_t gfp_mask); #define alloc_contig_range(...) alloc_hooks(alloc_contig_range_noprof(__VA_ARGS__)) extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask, int nid, nodemask_t *nodemask); #define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__)) #endif void free_contig_range(unsigned long pfn, unsigned long nr_pages); #ifdef CONFIG_CONTIG_ALLOC static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp, int nid, nodemask_t *node) { struct page *page; if (WARN_ON(!order || !(gfp & __GFP_COMP))) return NULL; page = alloc_contig_pages_noprof(1 << order, gfp, nid, node); return page ? page_folio(page) : NULL; } #else static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp, int nid, nodemask_t *node) { return NULL; } #endif /* This should be paired with folio_put() rather than free_contig_range(). */ #define folio_alloc_gigantic(...) alloc_hooks(folio_alloc_gigantic_noprof(__VA_ARGS__)) DEFINE_FREE(free_page, void *, free_page((unsigned long)_T)) #endif /* __LINUX_GFP_H */ |
| 2 2 2 2 2 131 2 131 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 | // SPDX-License-Identifier: GPL-2.0 #include <linux/debugfs.h> #include "netdevsim.h" #define NSIM_DEV_HWSTATS_TRAFFIC_MS 100 static struct list_head * nsim_dev_hwstats_get_list_head(struct nsim_dev_hwstats *hwstats, enum netdev_offload_xstats_type type) { switch (type) { case NETDEV_OFFLOAD_XSTATS_TYPE_L3: return &hwstats->l3_list; } WARN_ON_ONCE(1); return NULL; } static void nsim_dev_hwstats_traffic_bump(struct nsim_dev_hwstats *hwstats, enum netdev_offload_xstats_type type) { struct nsim_dev_hwstats_netdev *hwsdev; struct list_head *hwsdev_list; hwsdev_list = nsim_dev_hwstats_get_list_head(hwstats, type); if (WARN_ON(!hwsdev_list)) return; list_for_each_entry(hwsdev, hwsdev_list, list) { if (hwsdev->enabled) { hwsdev->stats.rx_packets += 1; hwsdev->stats.tx_packets += 2; hwsdev->stats.rx_bytes += 100; hwsdev->stats.tx_bytes += 300; } } } static void nsim_dev_hwstats_traffic_work(struct work_struct *work) { struct nsim_dev_hwstats *hwstats; hwstats = container_of(work, struct nsim_dev_hwstats, traffic_dw.work); mutex_lock(&hwstats->hwsdev_list_lock); nsim_dev_hwstats_traffic_bump(hwstats, NETDEV_OFFLOAD_XSTATS_TYPE_L3); mutex_unlock(&hwstats->hwsdev_list_lock); schedule_delayed_work(&hwstats->traffic_dw, msecs_to_jiffies(NSIM_DEV_HWSTATS_TRAFFIC_MS)); } static struct nsim_dev_hwstats_netdev * nsim_dev_hwslist_find_hwsdev(struct list_head *hwsdev_list, int ifindex) { struct nsim_dev_hwstats_netdev *hwsdev; list_for_each_entry(hwsdev, hwsdev_list, list) { if (hwsdev->netdev->ifindex == ifindex) return hwsdev; } return NULL; } static int nsim_dev_hwsdev_enable(struct nsim_dev_hwstats_netdev *hwsdev, struct netlink_ext_ack *extack) { if (hwsdev->fail_enable) { hwsdev->fail_enable = false; NL_SET_ERR_MSG_MOD(extack, "Stats enablement set to fail"); return -ECANCELED; } hwsdev->enabled = true; return 0; } static void nsim_dev_hwsdev_disable(struct nsim_dev_hwstats_netdev *hwsdev) { hwsdev->enabled = false; memset(&hwsdev->stats, 0, sizeof(hwsdev->stats)); } static int nsim_dev_hwsdev_report_delta(struct nsim_dev_hwstats_netdev *hwsdev, struct netdev_notifier_offload_xstats_info *info) { netdev_offload_xstats_report_delta(info->report_delta, &hwsdev->stats); memset(&hwsdev->stats, 0, sizeof(hwsdev->stats)); return 0; } static void nsim_dev_hwsdev_report_used(struct nsim_dev_hwstats_netdev *hwsdev, struct netdev_notifier_offload_xstats_info *info) { if (hwsdev->enabled) netdev_offload_xstats_report_used(info->report_used); } static int nsim_dev_hwstats_event_off_xstats(struct nsim_dev_hwstats *hwstats, struct net_device *dev, unsigned long event, void *ptr) { struct netdev_notifier_offload_xstats_info *info; struct nsim_dev_hwstats_netdev *hwsdev; struct list_head *hwsdev_list; int err = 0; info = ptr; hwsdev_list = nsim_dev_hwstats_get_list_head(hwstats, info->type); if (!hwsdev_list) return 0; mutex_lock(&hwstats->hwsdev_list_lock); hwsdev = nsim_dev_hwslist_find_hwsdev(hwsdev_list, dev->ifindex); if (!hwsdev) goto out; switch (event) { case NETDEV_OFFLOAD_XSTATS_ENABLE: err = nsim_dev_hwsdev_enable(hwsdev, info->info.extack); break; case NETDEV_OFFLOAD_XSTATS_DISABLE: nsim_dev_hwsdev_disable(hwsdev); break; case NETDEV_OFFLOAD_XSTATS_REPORT_USED: nsim_dev_hwsdev_report_used(hwsdev, info); break; case NETDEV_OFFLOAD_XSTATS_REPORT_DELTA: err = nsim_dev_hwsdev_report_delta(hwsdev, info); break; } out: mutex_unlock(&hwstats->hwsdev_list_lock); return err; } static void nsim_dev_hwsdev_fini(struct nsim_dev_hwstats_netdev *hwsdev) { dev_put(hwsdev->netdev); kfree(hwsdev); } static void __nsim_dev_hwstats_event_unregister(struct nsim_dev_hwstats *hwstats, struct net_device *dev, enum netdev_offload_xstats_type type) { struct nsim_dev_hwstats_netdev *hwsdev; struct list_head *hwsdev_list; hwsdev_list = nsim_dev_hwstats_get_list_head(hwstats, type); if (WARN_ON(!hwsdev_list)) return; hwsdev = nsim_dev_hwslist_find_hwsdev(hwsdev_list, dev->ifindex); if (!hwsdev) return; list_del(&hwsdev->list); nsim_dev_hwsdev_fini(hwsdev); } static void nsim_dev_hwstats_event_unregister(struct nsim_dev_hwstats *hwstats, struct net_device *dev) { mutex_lock(&hwstats->hwsdev_list_lock); __nsim_dev_hwstats_event_unregister(hwstats, dev, NETDEV_OFFLOAD_XSTATS_TYPE_L3); mutex_unlock(&hwstats->hwsdev_list_lock); } static int nsim_dev_hwstats_event(struct nsim_dev_hwstats *hwstats, struct net_device *dev, unsigned long event, void *ptr) { switch (event) { case NETDEV_OFFLOAD_XSTATS_ENABLE: case NETDEV_OFFLOAD_XSTATS_DISABLE: case NETDEV_OFFLOAD_XSTATS_REPORT_USED: case NETDEV_OFFLOAD_XSTATS_REPORT_DELTA: return nsim_dev_hwstats_event_off_xstats(hwstats, dev, event, ptr); case NETDEV_UNREGISTER: nsim_dev_hwstats_event_unregister(hwstats, dev); break; } return 0; } static int nsim_dev_netdevice_event(struct notifier_block *nb, unsigned long event, void *ptr) { struct net_device *dev = netdev_notifier_info_to_dev(ptr); struct nsim_dev_hwstats *hwstats; int err = 0; hwstats = container_of(nb, struct nsim_dev_hwstats, netdevice_nb); err = nsim_dev_hwstats_event(hwstats, dev, event, ptr); if (err) return notifier_from_errno(err); return NOTIFY_OK; } static int nsim_dev_hwstats_enable_ifindex(struct nsim_dev_hwstats *hwstats, int ifindex, enum netdev_offload_xstats_type type, struct list_head *hwsdev_list) { struct nsim_dev_hwstats_netdev *hwsdev; struct nsim_dev *nsim_dev; struct net_device *netdev; struct net *net; int err = 0; nsim_dev = container_of(hwstats, struct nsim_dev, hwstats); net = nsim_dev_net(nsim_dev); rtnl_lock(); mutex_lock(&hwstats->hwsdev_list_lock); hwsdev = nsim_dev_hwslist_find_hwsdev(hwsdev_list, ifindex); if (hwsdev) goto out_unlock_list; netdev = dev_get_by_index(net, ifindex); if (!netdev) { err = -ENODEV; goto out_unlock_list; } hwsdev = kzalloc(sizeof(*hwsdev), GFP_KERNEL); if (!hwsdev) { err = -ENOMEM; goto out_put_netdev; } hwsdev->netdev = netdev; list_add_tail(&hwsdev->list, hwsdev_list); mutex_unlock(&hwstats->hwsdev_list_lock); if (netdev_offload_xstats_enabled(netdev, type)) { nsim_dev_hwsdev_enable(hwsdev, NULL); rtnl_offload_xstats_notify(netdev); } rtnl_unlock(); return err; out_put_netdev: dev_put(netdev); out_unlock_list: mutex_unlock(&hwstats->hwsdev_list_lock); rtnl_unlock(); return err; } static int nsim_dev_hwstats_disable_ifindex(struct nsim_dev_hwstats *hwstats, int ifindex, enum netdev_offload_xstats_type type, struct list_head *hwsdev_list) { struct nsim_dev_hwstats_netdev *hwsdev; int err = 0; rtnl_lock(); mutex_lock(&hwstats->hwsdev_list_lock); hwsdev = nsim_dev_hwslist_find_hwsdev(hwsdev_list, ifindex); if (hwsdev) list_del(&hwsdev->list); mutex_unlock(&hwstats->hwsdev_list_lock); if (!hwsdev) { err = -ENOENT; goto unlock_out; } if (netdev_offload_xstats_enabled(hwsdev->netdev, type)) { netdev_offload_xstats_push_delta(hwsdev->netdev, type, &hwsdev->stats); rtnl_offload_xstats_notify(hwsdev->netdev); } nsim_dev_hwsdev_fini(hwsdev); unlock_out: rtnl_unlock(); return err; } static int nsim_dev_hwstats_fail_ifindex(struct nsim_dev_hwstats *hwstats, int ifindex, enum netdev_offload_xstats_type type, struct list_head *hwsdev_list) { struct nsim_dev_hwstats_netdev *hwsdev; int err = 0; mutex_lock(&hwstats->hwsdev_list_lock); hwsdev = nsim_dev_hwslist_find_hwsdev(hwsdev_list, ifindex); if (!hwsdev) { err = -ENOENT; goto err_hwsdev_list_unlock; } hwsdev->fail_enable = true; err_hwsdev_list_unlock: mutex_unlock(&hwstats->hwsdev_list_lock); return err; } enum nsim_dev_hwstats_do { NSIM_DEV_HWSTATS_DO_DISABLE, NSIM_DEV_HWSTATS_DO_ENABLE, NSIM_DEV_HWSTATS_DO_FAIL, }; struct nsim_dev_hwstats_fops { enum nsim_dev_hwstats_do action; enum netdev_offload_xstats_type type; }; static ssize_t nsim_dev_hwstats_do_write(struct file *file, const char __user *data, size_t count, loff_t *ppos) { struct nsim_dev_hwstats *hwstats = file->private_data; const struct nsim_dev_hwstats_fops *hwsfops; struct list_head *hwsdev_list; int ifindex; int err; hwsfops = debugfs_get_aux(file); err = kstrtoint_from_user(data, count, 0, &ifindex); if (err) return err; hwsdev_list = nsim_dev_hwstats_get_list_head(hwstats, hwsfops->type); if (WARN_ON(!hwsdev_list)) return -EINVAL; switch (hwsfops->action) { case NSIM_DEV_HWSTATS_DO_DISABLE: err = nsim_dev_hwstats_disable_ifindex(hwstats, ifindex, hwsfops->type, hwsdev_list); break; case NSIM_DEV_HWSTATS_DO_ENABLE: err = nsim_dev_hwstats_enable_ifindex(hwstats, ifindex, hwsfops->type, hwsdev_list); break; case NSIM_DEV_HWSTATS_DO_FAIL: err = nsim_dev_hwstats_fail_ifindex(hwstats, ifindex, hwsfops->type, hwsdev_list); break; } if (err) return err; return count; } static struct debugfs_short_fops debugfs_ops = { .write = nsim_dev_hwstats_do_write, .llseek = generic_file_llseek, }; #define NSIM_DEV_HWSTATS_FOPS(ACTION, TYPE) \ { \ .action = ACTION, \ .type = TYPE, \ } static const struct nsim_dev_hwstats_fops nsim_dev_hwstats_l3_disable_fops = NSIM_DEV_HWSTATS_FOPS(NSIM_DEV_HWSTATS_DO_DISABLE, NETDEV_OFFLOAD_XSTATS_TYPE_L3); static const struct nsim_dev_hwstats_fops nsim_dev_hwstats_l3_enable_fops = NSIM_DEV_HWSTATS_FOPS(NSIM_DEV_HWSTATS_DO_ENABLE, NETDEV_OFFLOAD_XSTATS_TYPE_L3); static const struct nsim_dev_hwstats_fops nsim_dev_hwstats_l3_fail_fops = NSIM_DEV_HWSTATS_FOPS(NSIM_DEV_HWSTATS_DO_FAIL, NETDEV_OFFLOAD_XSTATS_TYPE_L3); #undef NSIM_DEV_HWSTATS_FOPS int nsim_dev_hwstats_init(struct nsim_dev *nsim_dev) { struct nsim_dev_hwstats *hwstats = &nsim_dev->hwstats; struct net *net = nsim_dev_net(nsim_dev); int err; mutex_init(&hwstats->hwsdev_list_lock); INIT_LIST_HEAD(&hwstats->l3_list); hwstats->netdevice_nb.notifier_call = nsim_dev_netdevice_event; err = register_netdevice_notifier_net(net, &hwstats->netdevice_nb); if (err) goto err_mutex_destroy; hwstats->ddir = debugfs_create_dir("hwstats", nsim_dev->ddir); if (IS_ERR(hwstats->ddir)) { err = PTR_ERR(hwstats->ddir); goto err_unregister_notifier; } hwstats->l3_ddir = debugfs_create_dir("l3", hwstats->ddir); if (IS_ERR(hwstats->l3_ddir)) { err = PTR_ERR(hwstats->l3_ddir); goto err_remove_hwstats_recursive; } debugfs_create_file_aux("enable_ifindex", 0200, hwstats->l3_ddir, hwstats, &nsim_dev_hwstats_l3_enable_fops, &debugfs_ops); debugfs_create_file_aux("disable_ifindex", 0200, hwstats->l3_ddir, hwstats, &nsim_dev_hwstats_l3_disable_fops, &debugfs_ops); debugfs_create_file_aux("fail_next_enable", 0200, hwstats->l3_ddir, hwstats, &nsim_dev_hwstats_l3_fail_fops, &debugfs_ops); INIT_DELAYED_WORK(&hwstats->traffic_dw, &nsim_dev_hwstats_traffic_work); schedule_delayed_work(&hwstats->traffic_dw, msecs_to_jiffies(NSIM_DEV_HWSTATS_TRAFFIC_MS)); return 0; err_remove_hwstats_recursive: debugfs_remove_recursive(hwstats->ddir); err_unregister_notifier: unregister_netdevice_notifier_net(net, &hwstats->netdevice_nb); err_mutex_destroy: mutex_destroy(&hwstats->hwsdev_list_lock); return err; } static void nsim_dev_hwsdev_list_wipe(struct nsim_dev_hwstats *hwstats, enum netdev_offload_xstats_type type) { struct nsim_dev_hwstats_netdev *hwsdev, *tmp; struct list_head *hwsdev_list; hwsdev_list = nsim_dev_hwstats_get_list_head(hwstats, type); if (WARN_ON(!hwsdev_list)) return; mutex_lock(&hwstats->hwsdev_list_lock); list_for_each_entry_safe(hwsdev, tmp, hwsdev_list, list) { list_del(&hwsdev->list); nsim_dev_hwsdev_fini(hwsdev); } mutex_unlock(&hwstats->hwsdev_list_lock); } void nsim_dev_hwstats_exit(struct nsim_dev *nsim_dev) { struct nsim_dev_hwstats *hwstats = &nsim_dev->hwstats; struct net *net = nsim_dev_net(nsim_dev); cancel_delayed_work_sync(&hwstats->traffic_dw); debugfs_remove_recursive(hwstats->ddir); unregister_netdevice_notifier_net(net, &hwstats->netdevice_nb); nsim_dev_hwsdev_list_wipe(hwstats, NETDEV_OFFLOAD_XSTATS_TYPE_L3); mutex_destroy(&hwstats->hwsdev_list_lock); } |
| 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 | /* SPDX-License-Identifier: GPL-2.0 */ #include <linux/pagemap.h> #include <linux/blkdev.h> #include "../blk.h" /* * add_gd_partition adds a partitions details to the devices partition * description. */ struct parsed_partitions { struct gendisk *disk; char name[BDEVNAME_SIZE]; struct { sector_t from; sector_t size; int flags; bool has_info; struct partition_meta_info info; } *parts; int next; int limit; bool access_beyond_eod; char *pp_buf; }; typedef struct { struct folio *v; } Sector; void *read_part_sector(struct parsed_partitions *state, sector_t n, Sector *p); static inline void put_dev_sector(Sector p) { folio_put(p.v); } static inline void put_partition(struct parsed_partitions *p, int n, sector_t from, sector_t size) { if (n < p->limit) { char tmp[1 + BDEVNAME_SIZE + 10 + 1]; p->parts[n].from = from; p->parts[n].size = size; snprintf(tmp, sizeof(tmp), " %s%d", p->name, n); strlcat(p->pp_buf, tmp, PAGE_SIZE); } } /* detection routines go here in alphabetical order: */ int adfspart_check_ADFS(struct parsed_partitions *state); int adfspart_check_CUMANA(struct parsed_partitions *state); int adfspart_check_EESOX(struct parsed_partitions *state); int adfspart_check_ICS(struct parsed_partitions *state); int adfspart_check_POWERTEC(struct parsed_partitions *state); int aix_partition(struct parsed_partitions *state); int amiga_partition(struct parsed_partitions *state); int atari_partition(struct parsed_partitions *state); int cmdline_partition(struct parsed_partitions *state); int efi_partition(struct parsed_partitions *state); int ibm_partition(struct parsed_partitions *); int karma_partition(struct parsed_partitions *state); int ldm_partition(struct parsed_partitions *state); int mac_partition(struct parsed_partitions *state); int msdos_partition(struct parsed_partitions *state); int of_partition(struct parsed_partitions *state); int osf_partition(struct parsed_partitions *state); int sgi_partition(struct parsed_partitions *state); int sun_partition(struct parsed_partitions *state); int sysv68_partition(struct parsed_partitions *state); int ultrix_partition(struct parsed_partitions *state); |
| 2 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2008 Red Hat, Inc., Eric Paris <eparis@redhat.com> */ /* * Basic idea behind the notification queue: An fsnotify group (like inotify) * sends the userspace notification about events asynchronously some time after * the event happened. When inotify gets an event it will need to add that * event to the group notify queue. Since a single event might need to be on * multiple group's notification queues we can't add the event directly to each * queue and instead add a small "event_holder" to each queue. This event_holder * has a pointer back to the original event. Since the majority of events are * going to end up on one, and only one, notification queue we embed one * event_holder into each event. This means we have a single allocation instead * of always needing two. If the embedded event_holder is already in use by * another group a new event_holder (from fsnotify_event_holder_cachep) will be * allocated and used. */ #include <linux/fs.h> #include <linux/init.h> #include <linux/kernel.h> #include <linux/list.h> #include <linux/module.h> #include <linux/mount.h> #include <linux/mutex.h> #include <linux/namei.h> #include <linux/path.h> #include <linux/slab.h> #include <linux/spinlock.h> #include <linux/atomic.h> #include <linux/fsnotify_backend.h> #include "fsnotify.h" static atomic_t fsnotify_sync_cookie = ATOMIC_INIT(0); /** * fsnotify_get_cookie - return a unique cookie for use in synchronizing events. * Called from fsnotify_move, which is inlined into filesystem modules. */ u32 fsnotify_get_cookie(void) { return atomic_inc_return(&fsnotify_sync_cookie); } EXPORT_SYMBOL_GPL(fsnotify_get_cookie); void fsnotify_destroy_event(struct fsnotify_group *group, struct fsnotify_event *event) { /* Overflow events are per-group and we don't want to free them */ if (!event || event == group->overflow_event) return; /* * If the event is still queued, we have a problem... Do an unreliable * lockless check first to avoid locking in the common case. The * locking may be necessary for permission events which got removed * from the list by a different CPU than the one freeing the event. */ if (!list_empty(&event->list)) { spin_lock(&group->notification_lock); WARN_ON(!list_empty(&event->list)); spin_unlock(&group->notification_lock); } group->ops->free_event(group, event); } /* * Try to add an event to the notification queue. * The group can later pull this event off the queue to deal with. * The group can use the @merge hook to merge the event with a queued event. * The group can use the @insert hook to insert the event into hash table. * The function returns: * 0 if the event was added to a queue * 1 if the event was merged with some other queued event * 2 if the event was not queued - either the queue of events has overflown * or the group is shutting down. */ int fsnotify_insert_event(struct fsnotify_group *group, struct fsnotify_event *event, int (*merge)(struct fsnotify_group *, struct fsnotify_event *), void (*insert)(struct fsnotify_group *, struct fsnotify_event *)) { int ret = 0; struct list_head *list = &group->notification_list; pr_debug("%s: group=%p event=%p\n", __func__, group, event); spin_lock(&group->notification_lock); if (group->shutdown) { spin_unlock(&group->notification_lock); return 2; } if (event == group->overflow_event || group->q_len >= group->max_events) { ret = 2; /* Queue overflow event only if it isn't already queued */ if (!list_empty(&group->overflow_event->list)) { spin_unlock(&group->notification_lock); return ret; } event = group->overflow_event; goto queue; } if (!list_empty(list) && merge) { ret = merge(group, event); if (ret) { spin_unlock(&group->notification_lock); return ret; } } queue: group->q_len++; list_add_tail(&event->list, list); if (insert) insert(group, event); spin_unlock(&group->notification_lock); wake_up(&group->notification_waitq); kill_fasync(&group->fsn_fa, SIGIO, POLL_IN); return ret; } void fsnotify_remove_queued_event(struct fsnotify_group *group, struct fsnotify_event *event) { assert_spin_locked(&group->notification_lock); /* * We need to init list head for the case of overflow event so that * check in fsnotify_add_event() works */ list_del_init(&event->list); group->q_len--; } /* * Return the first event on the notification list without removing it. * Returns NULL if the list is empty. */ struct fsnotify_event *fsnotify_peek_first_event(struct fsnotify_group *group) { assert_spin_locked(&group->notification_lock); if (fsnotify_notify_queue_is_empty(group)) return NULL; return list_first_entry(&group->notification_list, struct fsnotify_event, list); } /* * Remove and return the first event from the notification list. It is the * responsibility of the caller to destroy the obtained event */ struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group) { struct fsnotify_event *event = fsnotify_peek_first_event(group); if (!event) return NULL; pr_debug("%s: group=%p event=%p\n", __func__, group, event); fsnotify_remove_queued_event(group, event); return event; } /* * Called when a group is being torn down to clean up any outstanding * event notifications. */ void fsnotify_flush_notify(struct fsnotify_group *group) { struct fsnotify_event *event; spin_lock(&group->notification_lock); while (!fsnotify_notify_queue_is_empty(group)) { event = fsnotify_remove_first_event(group); spin_unlock(&group->notification_lock); fsnotify_destroy_event(group, event); spin_lock(&group->notification_lock); } spin_unlock(&group->notification_lock); } |
| 229 226 227 228 243 246 246 247 247 245 229 235 237 229 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 | // SPDX-License-Identifier: GPL-2.0 /* * This file contains functions which manage clock event devices. * * Copyright(C) 2005-2006, Linutronix GmbH, Thomas Gleixner <tglx@kernel.org> * Copyright(C) 2005-2007, Red Hat, Inc., Ingo Molnar * Copyright(C) 2006-2007, Timesys Corp., Thomas Gleixner */ #include <linux/clockchips.h> #include <linux/hrtimer.h> #include <linux/init.h> #include <linux/module.h> #include <linux/smp.h> #include <linux/device.h> #include "tick-internal.h" /* The registered clock event devices */ static LIST_HEAD(clockevent_devices); static LIST_HEAD(clockevents_released); /* Protection for the above */ static DEFINE_RAW_SPINLOCK(clockevents_lock); /* Protection for unbind operations */ static DEFINE_MUTEX(clockevents_mutex); struct ce_unbind { struct clock_event_device *ce; int res; }; static u64 cev_delta2ns(unsigned long latch, struct clock_event_device *evt, bool ismax) { u64 clc = (u64) latch << evt->shift; u64 rnd; if (WARN_ON(!evt->mult)) evt->mult = 1; rnd = (u64) evt->mult - 1; /* * Upper bound sanity check. If the backwards conversion is * not equal latch, we know that the above shift overflowed. */ if ((clc >> evt->shift) != (u64)latch) clc = ~0ULL; /* * Scaled math oddities: * * For mult <= (1 << shift) we can safely add mult - 1 to * prevent integer rounding loss. So the backwards conversion * from nsec to device ticks will be correct. * * For mult > (1 << shift), i.e. device frequency is > 1GHz we * need to be careful. Adding mult - 1 will result in a value * which when converted back to device ticks can be larger * than latch by up to (mult - 1) >> shift. For the min_delta * calculation we still want to apply this in order to stay * above the minimum device ticks limit. For the upper limit * we would end up with a latch value larger than the upper * limit of the device, so we omit the add to stay below the * device upper boundary. * * Also omit the add if it would overflow the u64 boundary. */ if ((~0ULL - clc > rnd) && (!ismax || evt->mult <= (1ULL << evt->shift))) clc += rnd; do_div(clc, evt->mult); /* Deltas less than 1usec are pointless noise */ return clc > 1000 ? clc : 1000; } /** * clockevent_delta2ns - Convert a latch value (device ticks) to nanoseconds * @latch: value to convert * @evt: pointer to clock event device descriptor * * Math helper, returns latch value converted to nanoseconds (bound checked) */ u64 clockevent_delta2ns(unsigned long latch, struct clock_event_device *evt) { return cev_delta2ns(latch, evt, false); } EXPORT_SYMBOL_GPL(clockevent_delta2ns); static int __clockevents_switch_state(struct clock_event_device *dev, enum clock_event_state state) { if (dev->features & CLOCK_EVT_FEAT_DUMMY) return 0; /* Transition with new state-specific callbacks */ switch (state) { case CLOCK_EVT_STATE_DETACHED: /* The clockevent device is getting replaced. Shut it down. */ case CLOCK_EVT_STATE_SHUTDOWN: if (dev->set_state_shutdown) return dev->set_state_shutdown(dev); return 0; case CLOCK_EVT_STATE_PERIODIC: /* Core internal bug */ if (!(dev->features & CLOCK_EVT_FEAT_PERIODIC)) return -ENOSYS; if (dev->set_state_periodic) return dev->set_state_periodic(dev); return 0; case CLOCK_EVT_STATE_ONESHOT: /* Core internal bug */ if (!(dev->features & CLOCK_EVT_FEAT_ONESHOT)) return -ENOSYS; if (dev->set_state_oneshot) return dev->set_state_oneshot(dev); return 0; case CLOCK_EVT_STATE_ONESHOT_STOPPED: /* Core internal bug */ if (WARN_ONCE(!clockevent_state_oneshot(dev), "Current state: %d\n", clockevent_get_state(dev))) return -EINVAL; if (dev->set_state_oneshot_stopped) return dev->set_state_oneshot_stopped(dev); else return -ENOSYS; default: return -ENOSYS; } } /** * clockevents_switch_state - set the operating state of a clock event device * @dev: device to modify * @state: new state * * Must be called with interrupts disabled ! */ void clockevents_switch_state(struct clock_event_device *dev, enum clock_event_state state) { if (clockevent_get_state(dev) != state) { if (__clockevents_switch_state(dev, state)) return; clockevent_set_state(dev, state); /* * A nsec2cyc multiplicator of 0 is invalid and we'd crash * on it, so fix it up and emit a warning: */ if (clockevent_state_oneshot(dev)) { if (WARN_ON(!dev->mult)) dev->mult = 1; } } } /** * clockevents_shutdown - shutdown the device and clear next_event * @dev: device to shutdown */ void clockevents_shutdown(struct clock_event_device *dev) { clockevents_switch_state(dev, CLOCK_EVT_STATE_SHUTDOWN); dev->next_event = KTIME_MAX; } /** * clockevents_tick_resume - Resume the tick device before using it again * @dev: device to resume */ int clockevents_tick_resume(struct clock_event_device *dev) { int ret = 0; if (dev->tick_resume) ret = dev->tick_resume(dev); return ret; } #ifdef CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST /* Limit min_delta to a jiffy */ #define MIN_DELTA_LIMIT (NSEC_PER_SEC / HZ) /** * clockevents_increase_min_delta - raise minimum delta of a clock event device * @dev: device to increase the minimum delta * * Returns 0 on success, -ETIME when the minimum delta reached the limit. */ static int clockevents_increase_min_delta(struct clock_event_device *dev) { /* Nothing to do if we already reached the limit */ if (dev->min_delta_ns >= MIN_DELTA_LIMIT) { printk_deferred(KERN_WARNING "CE: Reprogramming failure. Giving up\n"); dev->next_event = KTIME_MAX; return -ETIME; } if (dev->min_delta_ns < 5000) dev->min_delta_ns = 5000; else dev->min_delta_ns += dev->min_delta_ns >> 1; if (dev->min_delta_ns > MIN_DELTA_LIMIT) dev->min_delta_ns = MIN_DELTA_LIMIT; printk_deferred(KERN_WARNING "CE: %s increased min_delta_ns to %llu nsec\n", dev->name ? dev->name : "?", (unsigned long long) dev->min_delta_ns); return 0; } /** * clockevents_program_min_delta - Set clock event device to the minimum delay. * @dev: device to program * * Returns 0 on success, -ETIME when the retry loop failed. */ static int clockevents_program_min_delta(struct clock_event_device *dev) { unsigned long long clc; int64_t delta; int i; for (i = 0;;) { delta = dev->min_delta_ns; dev->next_event = ktime_add_ns(ktime_get(), delta); if (clockevent_state_shutdown(dev)) return 0; dev->retries++; clc = ((unsigned long long) delta * dev->mult) >> dev->shift; if (dev->set_next_event((unsigned long) clc, dev) == 0) return 0; if (++i > 2) { /* * We tried 3 times to program the device with the * given min_delta_ns. Try to increase the minimum * delta, if that fails as well get out of here. */ if (clockevents_increase_min_delta(dev)) return -ETIME; i = 0; } } } #else /* CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST */ /** * clockevents_program_min_delta - Set clock event device to the minimum delay. * @dev: device to program * * Returns 0 on success, -ETIME when the retry loop failed. */ static int clockevents_program_min_delta(struct clock_event_device *dev) { unsigned long long clc; int64_t delta = 0; int i; for (i = 0; i < 10; i++) { delta += dev->min_delta_ns; dev->next_event = ktime_add_ns(ktime_get(), delta); if (clockevent_state_shutdown(dev)) return 0; dev->retries++; clc = ((unsigned long long) delta * dev->mult) >> dev->shift; if (dev->set_next_event((unsigned long) clc, dev) == 0) return 0; } return -ETIME; } #endif /* CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST */ /** * clockevents_program_event - Reprogram the clock event device. * @dev: device to program * @expires: absolute expiry time (monotonic clock) * @force: program minimum delay if expires can not be set * * Returns 0 on success, -ETIME when the event is in the past. */ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, bool force) { unsigned long long clc; int64_t delta; int rc; if (WARN_ON_ONCE(expires < 0)) return -ETIME; dev->next_event = expires; if (clockevent_state_shutdown(dev)) return 0; /* We must be in ONESHOT state here */ WARN_ONCE(!clockevent_state_oneshot(dev), "Current state: %d\n", clockevent_get_state(dev)); /* Shortcut for clockevent devices that can deal with ktime. */ if (dev->features & CLOCK_EVT_FEAT_KTIME) return dev->set_next_ktime(expires, dev); delta = ktime_to_ns(ktime_sub(expires, ktime_get())); if (delta <= 0) return force ? clockevents_program_min_delta(dev) : -ETIME; delta = min(delta, (int64_t) dev->max_delta_ns); delta = max(delta, (int64_t) dev->min_delta_ns); clc = ((unsigned long long) delta * dev->mult) >> dev->shift; rc = dev->set_next_event((unsigned long) clc, dev); return (rc && force) ? clockevents_program_min_delta(dev) : rc; } /* * Called after a clockevent has been added which might * have replaced a current regular or broadcast device. A * released normal device might be a suitable replacement * for the current broadcast device. Similarly a released * broadcast device might be a suitable replacement for a * normal device. */ static void clockevents_notify_released(void) { struct clock_event_device *dev; /* * Keep iterating as long as tick_check_new_device() * replaces a device. */ while (!list_empty(&clockevents_released)) { dev = list_entry(clockevents_released.next, struct clock_event_device, list); list_move(&dev->list, &clockevent_devices); tick_check_new_device(dev); } } /* * Try to install a replacement clock event device */ static int clockevents_replace(struct clock_event_device *ced) { struct clock_event_device *dev, *newdev = NULL; list_for_each_entry(dev, &clockevent_devices, list) { if (dev == ced || !clockevent_state_detached(dev)) continue; if (!tick_check_replacement(newdev, dev)) continue; if (!try_module_get(dev->owner)) continue; if (newdev) module_put(newdev->owner); newdev = dev; } if (newdev) { tick_install_replacement(newdev); list_del_init(&ced->list); } return newdev ? 0 : -EBUSY; } /* * Called with clockevents_mutex and clockevents_lock held */ static int __clockevents_try_unbind(struct clock_event_device *ced, int cpu) { /* Fast track. Device is unused */ if (clockevent_state_detached(ced)) { list_del_init(&ced->list); return 0; } return ced == per_cpu(tick_cpu_device, cpu).evtdev ? -EAGAIN : -EBUSY; } /* * SMP function call to unbind a device */ static void __clockevents_unbind(void *arg) { struct ce_unbind *cu = arg; int res; raw_spin_lock(&clockevents_lock); res = __clockevents_try_unbind(cu->ce, smp_processor_id()); if (res == -EAGAIN) res = clockevents_replace(cu->ce); cu->res = res; raw_spin_unlock(&clockevents_lock); } /* * Issues smp function call to unbind a per cpu device. Called with * clockevents_mutex held. */ static int clockevents_unbind(struct clock_event_device *ced, int cpu) { struct ce_unbind cu = { .ce = ced, .res = -ENODEV }; smp_call_function_single(cpu, __clockevents_unbind, &cu, 1); return cu.res; } /* * Unbind a clockevents device. */ int clockevents_unbind_device(struct clock_event_device *ced, int cpu) { int ret; mutex_lock(&clockevents_mutex); ret = clockevents_unbind(ced, cpu); mutex_unlock(&clockevents_mutex); return ret; } EXPORT_SYMBOL_GPL(clockevents_unbind_device); /** * clockevents_register_device - register a clock event device * @dev: device to register */ void clockevents_register_device(struct clock_event_device *dev) { unsigned long flags; /* Initialize state to DETACHED */ clockevent_set_state(dev, CLOCK_EVT_STATE_DETACHED); if (!dev->cpumask) { WARN_ON(num_possible_cpus() > 1); dev->cpumask = cpumask_of(smp_processor_id()); } if (dev->cpumask == cpu_all_mask) { WARN(1, "%s cpumask == cpu_all_mask, using cpu_possible_mask instead\n", dev->name); dev->cpumask = cpu_possible_mask; } raw_spin_lock_irqsave(&clockevents_lock, flags); list_add(&dev->list, &clockevent_devices); tick_check_new_device(dev); clockevents_notify_released(); raw_spin_unlock_irqrestore(&clockevents_lock, flags); } EXPORT_SYMBOL_GPL(clockevents_register_device); static void clockevents_config(struct clock_event_device *dev, u32 freq) { u64 sec; if (!(dev->features & CLOCK_EVT_FEAT_ONESHOT)) return; /* * Calculate the maximum number of seconds we can sleep. Limit * to 10 minutes for hardware which can program more than * 32bit ticks so we still get reasonable conversion values. */ sec = dev->max_delta_ticks; do_div(sec, freq); if (!sec) sec = 1; else if (sec > 600 && dev->max_delta_ticks > UINT_MAX) sec = 600; clockevents_calc_mult_shift(dev, freq, sec); dev->min_delta_ns = cev_delta2ns(dev->min_delta_ticks, dev, false); dev->max_delta_ns = cev_delta2ns(dev->max_delta_ticks, dev, true); } /** * clockevents_config_and_register - Configure and register a clock event device * @dev: device to register * @freq: The clock frequency * @min_delta: The minimum clock ticks to program in oneshot mode * @max_delta: The maximum clock ticks to program in oneshot mode * * min/max_delta can be 0 for devices which do not support oneshot mode. */ void clockevents_config_and_register(struct clock_event_device *dev, u32 freq, unsigned long min_delta, unsigned long max_delta) { dev->min_delta_ticks = min_delta; dev->max_delta_ticks = max_delta; clockevents_config(dev, freq); clockevents_register_device(dev); } EXPORT_SYMBOL_GPL(clockevents_config_and_register); int __clockevents_update_freq(struct clock_event_device *dev, u32 freq) { clockevents_config(dev, freq); if (clockevent_state_oneshot(dev)) return clockevents_program_event(dev, dev->next_event, false); if (clockevent_state_periodic(dev)) return __clockevents_switch_state(dev, CLOCK_EVT_STATE_PERIODIC); return 0; } /** * clockevents_update_freq - Update frequency and reprogram a clock event device. * @dev: device to modify * @freq: new device frequency * * Reconfigure and reprogram a clock event device in oneshot * mode. Must be called on the cpu for which the device delivers per * cpu timer events. If called for the broadcast device the core takes * care of serialization. * * Returns 0 on success, -ETIME when the event is in the past. */ int clockevents_update_freq(struct clock_event_device *dev, u32 freq) { unsigned long flags; int ret; local_irq_save(flags); ret = tick_broadcast_update_freq(dev, freq); if (ret == -ENODEV) ret = __clockevents_update_freq(dev, freq); local_irq_restore(flags); return ret; } /* * Noop handler when we shut down an event device */ void clockevents_handle_noop(struct clock_event_device *dev) { } /** * clockevents_exchange_device - release and request clock devices * @old: device to release (can be NULL) * @new: device to request (can be NULL) * * Called from various tick functions with clockevents_lock held and * interrupts disabled. */ void clockevents_exchange_device(struct clock_event_device *old, struct clock_event_device *new) { /* * Caller releases a clock event device. We queue it into the * released list and do a notify add later. */ if (old) { module_put(old->owner); clockevents_switch_state(old, CLOCK_EVT_STATE_DETACHED); list_move(&old->list, &clockevents_released); } if (new) { BUG_ON(!clockevent_state_detached(new)); clockevents_shutdown(new); } } /** * clockevents_suspend - suspend clock devices */ void clockevents_suspend(void) { struct clock_event_device *dev; list_for_each_entry_reverse(dev, &clockevent_devices, list) if (dev->suspend && !clockevent_state_detached(dev)) dev->suspend(dev); } /** * clockevents_resume - resume clock devices */ void clockevents_resume(void) { struct clock_event_device *dev; list_for_each_entry(dev, &clockevent_devices, list) if (dev->resume && !clockevent_state_detached(dev)) dev->resume(dev); } #ifdef CONFIG_HOTPLUG_CPU /** * tick_offline_cpu - Shutdown all clock events related * to this CPU and take it out of the * broadcast mechanism. * @cpu: The outgoing CPU * * Called by the dying CPU during teardown. */ void tick_offline_cpu(unsigned int cpu) { struct clock_event_device *dev, *tmp; raw_spin_lock(&clockevents_lock); tick_broadcast_offline(cpu); tick_shutdown(); /* * Unregister the clock event devices which were * released above. */ list_for_each_entry_safe(dev, tmp, &clockevents_released, list) list_del(&dev->list); /* * Now check whether the CPU has left unused per cpu devices */ list_for_each_entry_safe(dev, tmp, &clockevent_devices, list) { if (cpumask_test_cpu(cpu, dev->cpumask) && cpumask_weight(dev->cpumask) == 1 && !tick_is_broadcast_device(dev)) { BUG_ON(!clockevent_state_detached(dev)); list_del(&dev->list); } } raw_spin_unlock(&clockevents_lock); } #endif #ifdef CONFIG_SYSFS static const struct bus_type clockevents_subsys = { .name = "clockevents", .dev_name = "clockevent", }; static DEFINE_PER_CPU(struct device, tick_percpu_dev); static struct tick_device *tick_get_tick_dev(struct device *dev); static ssize_t current_device_show(struct device *dev, struct device_attribute *attr, char *buf) { struct tick_device *td; ssize_t count = 0; raw_spin_lock_irq(&clockevents_lock); td = tick_get_tick_dev(dev); if (td && td->evtdev) count = sysfs_emit(buf, "%s\n", td->evtdev->name); raw_spin_unlock_irq(&clockevents_lock); return count; } static DEVICE_ATTR_RO(current_device); /* We don't support the abomination of removable broadcast devices */ static ssize_t unbind_device_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { char name[CS_NAME_LEN]; ssize_t ret = sysfs_get_uname(buf, name, count); struct clock_event_device *ce = NULL, *iter; if (ret < 0) return ret; ret = -ENODEV; mutex_lock(&clockevents_mutex); raw_spin_lock_irq(&clockevents_lock); list_for_each_entry(iter, &clockevent_devices, list) { if (!strcmp(iter->name, name)) { ret = __clockevents_try_unbind(iter, dev->id); ce = iter; break; } } raw_spin_unlock_irq(&clockevents_lock); /* * We hold clockevents_mutex, so ce can't go away */ if (ret == -EAGAIN) ret = clockevents_unbind(ce, dev->id); mutex_unlock(&clockevents_mutex); return ret ? ret : count; } static DEVICE_ATTR_WO(unbind_device); #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST static struct device tick_bc_dev = { .init_name = "broadcast", .id = 0, .bus = &clockevents_subsys, }; static struct tick_device *tick_get_tick_dev(struct device *dev) { return dev == &tick_bc_dev ? tick_get_broadcast_device() : &per_cpu(tick_cpu_device, dev->id); } static __init int tick_broadcast_init_sysfs(void) { int err = device_register(&tick_bc_dev); if (!err) err = device_create_file(&tick_bc_dev, &dev_attr_current_device); return err; } #else static struct tick_device *tick_get_tick_dev(struct device *dev) { return &per_cpu(tick_cpu_device, dev->id); } static inline int tick_broadcast_init_sysfs(void) { return 0; } #endif static int __init tick_init_sysfs(void) { int cpu; for_each_possible_cpu(cpu) { struct device *dev = &per_cpu(tick_percpu_dev, cpu); int err; dev->id = cpu; dev->bus = &clockevents_subsys; err = device_register(dev); if (!err) err = device_create_file(dev, &dev_attr_current_device); if (!err) err = device_create_file(dev, &dev_attr_unbind_device); if (err) return err; } return tick_broadcast_init_sysfs(); } static int __init clockevents_init_sysfs(void) { int err = subsys_system_register(&clockevents_subsys, NULL); if (!err) err = tick_init_sysfs(); return err; } device_initcall(clockevents_init_sysfs); #endif /* SYSFS */ |
| 2974 2974 3020 2977 2984 2984 2990 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Multiplex several virtual IPIs over a single HW IPI. * * Copyright The Asahi Linux Contributors * Copyright (c) 2022 Ventana Micro Systems Inc. */ #define pr_fmt(fmt) "ipi-mux: " fmt #include <linux/cpu.h> #include <linux/init.h> #include <linux/irq.h> #include <linux/irqchip.h> #include <linux/irqchip/chained_irq.h> #include <linux/irqdomain.h> #include <linux/jump_label.h> #include <linux/percpu.h> #include <linux/smp.h> struct ipi_mux_cpu { atomic_t enable; atomic_t bits; }; static struct ipi_mux_cpu __percpu *ipi_mux_pcpu; static struct irq_domain *ipi_mux_domain; static void (*ipi_mux_send)(unsigned int cpu); static void ipi_mux_mask(struct irq_data *d) { struct ipi_mux_cpu *icpu = this_cpu_ptr(ipi_mux_pcpu); atomic_andnot(BIT(irqd_to_hwirq(d)), &icpu->enable); } static void ipi_mux_unmask(struct irq_data *d) { struct ipi_mux_cpu *icpu = this_cpu_ptr(ipi_mux_pcpu); u32 ibit = BIT(irqd_to_hwirq(d)); atomic_or(ibit, &icpu->enable); /* * The atomic_or() above must complete before the atomic_read() * below to avoid racing ipi_mux_send_mask(). */ smp_mb__after_atomic(); /* If a pending IPI was unmasked, raise a parent IPI immediately. */ if (atomic_read(&icpu->bits) & ibit) ipi_mux_send(smp_processor_id()); } static void ipi_mux_send_mask(struct irq_data *d, const struct cpumask *mask) { struct ipi_mux_cpu *icpu = this_cpu_ptr(ipi_mux_pcpu); u32 ibit = BIT(irqd_to_hwirq(d)); unsigned long pending; int cpu; for_each_cpu(cpu, mask) { icpu = per_cpu_ptr(ipi_mux_pcpu, cpu); /* * This sequence is the mirror of the one in ipi_mux_unmask(); * see the comment there. Additionally, release semantics * ensure that the vIPI flag set is ordered after any shared * memory accesses that precede it. This therefore also pairs * with the atomic_fetch_andnot in ipi_mux_process(). */ pending = atomic_fetch_or_release(ibit, &icpu->bits); /* * The atomic_fetch_or_release() above must complete * before the atomic_read() below to avoid racing with * ipi_mux_unmask(). */ smp_mb__after_atomic(); /* * The flag writes must complete before the physical IPI is * issued to another CPU. This is implied by the control * dependency on the result of atomic_read() below, which is * itself already ordered after the vIPI flag write. */ if (!(pending & ibit) && (atomic_read(&icpu->enable) & ibit)) ipi_mux_send(cpu); } } static const struct irq_chip ipi_mux_chip = { .name = "IPI Mux", .irq_mask = ipi_mux_mask, .irq_unmask = ipi_mux_unmask, .ipi_send_mask = ipi_mux_send_mask, }; static int ipi_mux_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs, void *arg) { int i; for (i = 0; i < nr_irqs; i++) { irq_set_percpu_devid(virq + i); irq_domain_set_info(d, virq + i, i, &ipi_mux_chip, NULL, handle_percpu_devid_irq, NULL, NULL); } return 0; } static const struct irq_domain_ops ipi_mux_domain_ops = { .alloc = ipi_mux_domain_alloc, .free = irq_domain_free_irqs_top, }; /** * ipi_mux_process - Process multiplexed virtual IPIs */ void ipi_mux_process(void) { struct ipi_mux_cpu *icpu = this_cpu_ptr(ipi_mux_pcpu); irq_hw_number_t hwirq; unsigned long ipis; unsigned int en; /* * Reading enable mask does not need to be ordered as long as * this function is called from interrupt handler because only * the CPU itself can change it's own enable mask. */ en = atomic_read(&icpu->enable); /* * Clear the IPIs we are about to handle. This pairs with the * atomic_fetch_or_release() in ipi_mux_send_mask(). */ ipis = atomic_fetch_andnot(en, &icpu->bits) & en; for_each_set_bit(hwirq, &ipis, BITS_PER_TYPE(int)) generic_handle_domain_irq(ipi_mux_domain, hwirq); } /** * ipi_mux_create - Create virtual IPIs multiplexed on top of a single * parent IPI. * @nr_ipi: number of virtual IPIs to create. This should * be <= BITS_PER_TYPE(int) * @mux_send: callback to trigger parent IPI for a particular CPU * * Returns first virq of the newly created virtual IPIs upon success * or <=0 upon failure */ int ipi_mux_create(unsigned int nr_ipi, void (*mux_send)(unsigned int cpu)) { struct fwnode_handle *fwnode; struct irq_domain *domain; int rc; if (ipi_mux_domain) return -EEXIST; if (BITS_PER_TYPE(int) < nr_ipi || !mux_send) return -EINVAL; ipi_mux_pcpu = alloc_percpu(typeof(*ipi_mux_pcpu)); if (!ipi_mux_pcpu) return -ENOMEM; fwnode = irq_domain_alloc_named_fwnode("IPI-Mux"); if (!fwnode) { pr_err("unable to create IPI Mux fwnode\n"); rc = -ENOMEM; goto fail_free_cpu; } domain = irq_domain_create_linear(fwnode, nr_ipi, &ipi_mux_domain_ops, NULL); if (!domain) { pr_err("unable to add IPI Mux domain\n"); rc = -ENOMEM; goto fail_free_fwnode; } domain->flags |= IRQ_DOMAIN_FLAG_IPI_SINGLE; irq_domain_update_bus_token(domain, DOMAIN_BUS_IPI); rc = irq_domain_alloc_irqs(domain, nr_ipi, NUMA_NO_NODE, NULL); if (rc <= 0) { pr_err("unable to alloc IRQs from IPI Mux domain\n"); goto fail_free_domain; } ipi_mux_domain = domain; ipi_mux_send = mux_send; return rc; fail_free_domain: irq_domain_remove(domain); fail_free_fwnode: irq_domain_free_fwnode(fwnode); fail_free_cpu: free_percpu(ipi_mux_pcpu); return rc; } |
| 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 | // SPDX-License-Identifier: GPL-2.0+ /* * Driver for USB Mass Storage compliant devices * SCSI layer glue code * * Current development and maintenance by: * (c) 1999-2002 Matthew Dharm (mdharm-usb@one-eyed-alien.net) * * Developed with the assistance of: * (c) 2000 David L. Brown, Jr. (usb-storage@davidb.org) * (c) 2000 Stephen J. Gowdy (SGowdy@lbl.gov) * * Initial work by: * (c) 1999 Michael Gee (michael@linuxspecific.com) * * This driver is based on the 'USB Mass Storage Class' document. This * describes in detail the protocol used to communicate with such * devices. Clearly, the designers had SCSI and ATAPI commands in * mind when they created this document. The commands are all very * similar to commands in the SCSI-II and ATAPI specifications. * * It is important to note that in a number of cases this class * exhibits class-specific exemptions from the USB specification. * Notably the usage of NAK, STALL and ACK differs from the norm, in * that they are used to communicate wait, failed and OK on commands. * * Also, for certain devices, the interrupt endpoint is used to convey * status of a command. */ #include <linux/blkdev.h> #include <linux/dma-mapping.h> #include <linux/module.h> #include <linux/mutex.h> #include <scsi/scsi.h> #include <scsi/scsi_cmnd.h> #include <scsi/scsi_devinfo.h> #include <scsi/scsi_device.h> #include <scsi/scsi_eh.h> #include "usb.h" #include "scsiglue.h" #include "debug.h" #include "transport.h" #include "protocol.h" /* * Vendor IDs for companies that seem to include the READ CAPACITY bug * in all their devices */ #define VENDOR_ID_NOKIA 0x0421 #define VENDOR_ID_NIKON 0x04b0 #define VENDOR_ID_PENTAX 0x0a17 #define VENDOR_ID_MOTOROLA 0x22b8 /*********************************************************************** * Host functions ***********************************************************************/ static const char* host_info(struct Scsi_Host *host) { struct us_data *us = host_to_us(host); return us->scsi_name; } static int sdev_init (struct scsi_device *sdev) { struct us_data *us = host_to_us(sdev->host); /* * Set the INQUIRY transfer length to 36. We don't use any of * the extra data and many devices choke if asked for more or * less than 36 bytes. */ sdev->inquiry_len = 36; /* Tell the SCSI layer if we know there is more than one LUN */ if (us->protocol == USB_PR_BULK && us->max_lun > 0) sdev->sdev_bflags |= BLIST_FORCELUN; /* * Some USB storage devices reset if the IO advice hints grouping mode * page is queried. Hence skip that mode page. */ sdev->sdev_bflags |= BLIST_SKIP_IO_HINTS; return 0; } static int sdev_configure(struct scsi_device *sdev, struct queue_limits *lim) { struct us_data *us = host_to_us(sdev->host); struct device *dev = us->pusb_dev->bus->sysdev; /* * Many devices have trouble transferring more than 32KB at a time, * while others have trouble with more than 64K. At this time we * are limiting both to 32K (64 sectores). */ if (us->fflags & (US_FL_MAX_SECTORS_64 | US_FL_MAX_SECTORS_MIN)) { unsigned int max_sectors = 64; if (us->fflags & US_FL_MAX_SECTORS_MIN) max_sectors = PAGE_SIZE >> 9; lim->max_hw_sectors = min(lim->max_hw_sectors, max_sectors); } else if (sdev->type == TYPE_TAPE) { /* * Tapes need much higher max_sector limits, so just * raise it to the maximum possible (4 GB / 512) and * let the queue segment size sort out the real limit. */ lim->max_hw_sectors = 0x7FFFFF; } else if (us->pusb_dev->speed >= USB_SPEED_SUPER) { /* * USB3 devices will be limited to 2048 sectors. This gives us * better throughput on most devices. */ lim->max_hw_sectors = 2048; } /* * The max_hw_sectors should be up to maximum size of a mapping for * the device. Otherwise, a DMA API might fail on swiotlb environment. */ lim->max_hw_sectors = min_t(size_t, lim->max_hw_sectors, dma_max_mapping_size(dev) >> SECTOR_SHIFT); /* * We can't put these settings in sdev_init() because that gets * called before the device type is known. Consequently these * settings can't be overridden via the scsi devinfo mechanism. */ if (sdev->type == TYPE_DISK) { /* * Some vendors seem to put the READ CAPACITY bug into * all their devices -- primarily makers of cell phones * and digital cameras. Since these devices always use * flash media and can be expected to have an even number * of sectors, we will always enable the CAPACITY_HEURISTICS * flag unless told otherwise. */ switch (le16_to_cpu(us->pusb_dev->descriptor.idVendor)) { case VENDOR_ID_NOKIA: case VENDOR_ID_NIKON: case VENDOR_ID_PENTAX: case VENDOR_ID_MOTOROLA: if (!(us->fflags & (US_FL_FIX_CAPACITY | US_FL_CAPACITY_OK))) us->fflags |= US_FL_CAPACITY_HEURISTICS; break; } /* * Disk-type devices use MODE SENSE(6) if the protocol * (SubClass) is Transparent SCSI, otherwise they use * MODE SENSE(10). */ if (us->subclass != USB_SC_SCSI && us->subclass != USB_SC_CYP_ATACB) sdev->use_10_for_ms = 1; /* *Many disks only accept MODE SENSE transfer lengths of * 192 bytes (that's what Windows uses). */ sdev->use_192_bytes_for_3f = 1; /* * Some devices report generic values until the media has been * accessed. Force a READ(10) prior to querying device * characteristics. */ sdev->read_before_ms = 1; /* * Some devices don't like MODE SENSE with page=0x3f, * which is the command used for checking if a device * is write-protected. Now that we tell the sd driver * to do a 192-byte transfer with this command the * majority of devices work fine, but a few still can't * handle it. The sd driver will simply assume those * devices are write-enabled. */ if (us->fflags & US_FL_NO_WP_DETECT) sdev->skip_ms_page_3f = 1; /* * A number of devices have problems with MODE SENSE for * page x08, so we will skip it. */ sdev->skip_ms_page_8 = 1; /* * Some devices don't handle VPD pages correctly, so skip vpd * pages if not forced by SCSI layer. */ sdev->skip_vpd_pages = !sdev->try_vpd_pages; /* Do not attempt to use REPORT SUPPORTED OPERATION CODES */ sdev->no_report_opcodes = 1; /* Do not attempt to use WRITE SAME */ sdev->no_write_same = 1; /* * Some disks return the total number of blocks in response * to READ CAPACITY rather than the highest block number. * If this device makes that mistake, tell the sd driver. */ if (us->fflags & US_FL_FIX_CAPACITY) sdev->fix_capacity = 1; /* * A few disks have two indistinguishable version, one of * which reports the correct capacity and the other does not. * The sd driver has to guess which is the case. */ if (us->fflags & US_FL_CAPACITY_HEURISTICS) sdev->guess_capacity = 1; /* Some devices cannot handle READ_CAPACITY_16 */ if (us->fflags & US_FL_NO_READ_CAPACITY_16) sdev->no_read_capacity_16 = 1; /* * Many devices do not respond properly to READ_CAPACITY_16. * Tell the SCSI layer to try READ_CAPACITY_10 first. * However some USB 3.0 drive enclosures return capacity * modulo 2TB. Those must use READ_CAPACITY_16 */ if (!(us->fflags & US_FL_NEEDS_CAP16)) sdev->try_rc_10_first = 1; /* * assume SPC3 or latter devices support sense size > 18 * unless US_FL_BAD_SENSE quirk is specified. */ if (sdev->scsi_level > SCSI_SPC_2 && !(us->fflags & US_FL_BAD_SENSE)) us->fflags |= US_FL_SANE_SENSE; /* * USB-IDE bridges tend to report SK = 0x04 (Non-recoverable * Hardware Error) when any low-level error occurs, * recoverable or not. Setting this flag tells the SCSI * midlayer to retry such commands, which frequently will * succeed and fix the error. The worst this can lead to * is an occasional series of retries that will all fail. */ sdev->retry_hwerror = 1; /* * USB disks should allow restart. Some drives spin down * automatically, requiring a START-STOP UNIT command. */ sdev->allow_restart = 1; /* * Some USB cardreaders have trouble reading an sdcard's last * sector in a larger then 1 sector read, since the performance * impact is negligible we set this flag for all USB disks */ sdev->last_sector_bug = 1; /* * Enable last-sector hacks for single-target devices using * the Bulk-only transport, unless we already know the * capacity will be decremented or is correct. */ if (!(us->fflags & (US_FL_FIX_CAPACITY | US_FL_CAPACITY_OK | US_FL_SCM_MULT_TARG)) && us->protocol == USB_PR_BULK) us->use_last_sector_hacks = 1; /* Check if write cache default on flag is set or not */ if (us->fflags & US_FL_WRITE_CACHE) sdev->wce_default_on = 1; /* A few buggy USB-ATA bridges don't understand FUA */ if (us->fflags & US_FL_BROKEN_FUA) sdev->broken_fua = 1; /* Some even totally fail to indicate a cache */ if (us->fflags & US_FL_ALWAYS_SYNC) { /* don't read caching information */ sdev->skip_ms_page_8 = 1; sdev->skip_ms_page_3f = 1; /* assume sync is needed */ sdev->wce_default_on = 1; } } else { /* * Non-disk-type devices don't need to ignore any pages * or to force 192-byte transfer lengths for MODE SENSE. * But they do need to use MODE SENSE(10). */ sdev->use_10_for_ms = 1; /* Some (fake) usb cdrom devices don't like READ_DISC_INFO */ if (us->fflags & US_FL_NO_READ_DISC_INFO) sdev->no_read_disc_info = 1; } /* * The CB and CBI transports have no way to pass LUN values * other than the bits in the second byte of a CDB. But those * bits don't get set to the LUN value if the device reports * scsi_level == 0 (UNKNOWN). Hence such devices must necessarily * be single-LUN. */ if ((us->protocol == USB_PR_CB || us->protocol == USB_PR_CBI) && sdev->scsi_level == SCSI_UNKNOWN) us->max_lun = 0; /* * Some devices choke when they receive a PREVENT-ALLOW MEDIUM * REMOVAL command, so suppress those commands. */ if (us->fflags & US_FL_NOT_LOCKABLE) sdev->lockable = 0; /* * this is to satisfy the compiler, tho I don't think the * return code is ever checked anywhere. */ return 0; } static int target_alloc(struct scsi_target *starget) { struct us_data *us = host_to_us(dev_to_shost(starget->dev.parent)); /* * Some USB drives don't support REPORT LUNS, even though they * report a SCSI revision level above 2. Tell the SCSI layer * not to issue that command; it will perform a normal sequential * scan instead. */ starget->no_report_luns = 1; /* * The UFI spec treats the Peripheral Qualifier bits in an * INQUIRY result as reserved and requires devices to set them * to 0. However the SCSI spec requires these bits to be set * to 3 to indicate when a LUN is not present. * * Let the scanning code know if this target merely sets * Peripheral Device Type to 0x1f to indicate no LUN. */ if (us->subclass == USB_SC_UFI) starget->pdt_1f_for_no_lun = 1; return 0; } /* queue a command */ /* This is always called with scsi_lock(host) held */ static int queuecommand_lck(struct scsi_cmnd *srb) { void (*done)(struct scsi_cmnd *) = scsi_done; struct us_data *us = host_to_us(srb->device->host); /* check for state-transition errors */ if (us->srb != NULL) { dev_err(&us->pusb_intf->dev, "Error in %s: us->srb = %p\n", __func__, us->srb); return SCSI_MLQUEUE_HOST_BUSY; } /* fail the command if we are disconnecting */ if (test_bit(US_FLIDX_DISCONNECTING, &us->dflags)) { usb_stor_dbg(us, "Fail command during disconnect\n"); srb->result = DID_NO_CONNECT << 16; done(srb); return 0; } if ((us->fflags & US_FL_NO_ATA_1X) && (srb->cmnd[0] == ATA_12 || srb->cmnd[0] == ATA_16)) { memcpy(srb->sense_buffer, usb_stor_sense_invalidCDB, sizeof(usb_stor_sense_invalidCDB)); srb->result = SAM_STAT_CHECK_CONDITION; done(srb); return 0; } /* enqueue the command and wake up the control thread */ us->srb = srb; complete(&us->cmnd_ready); return 0; } static DEF_SCSI_QCMD(queuecommand) /*********************************************************************** * Error handling functions ***********************************************************************/ /* Command timeout and abort */ static int command_abort_matching(struct us_data *us, struct scsi_cmnd *srb_match) { /* * us->srb together with the TIMED_OUT, RESETTING, and ABORTING * bits are protected by the host lock. */ scsi_lock(us_to_host(us)); /* is there any active pending command to abort ? */ if (!us->srb) { scsi_unlock(us_to_host(us)); usb_stor_dbg(us, "-- nothing to abort\n"); return SUCCESS; } /* Does the command match the passed srb if any ? */ if (srb_match && us->srb != srb_match) { scsi_unlock(us_to_host(us)); usb_stor_dbg(us, "-- pending command mismatch\n"); return FAILED; } /* * Set the TIMED_OUT bit. Also set the ABORTING bit, but only if * a device reset isn't already in progress (to avoid interfering * with the reset). Note that we must retain the host lock while * calling usb_stor_stop_transport(); otherwise it might interfere * with an auto-reset that begins as soon as we release the lock. */ set_bit(US_FLIDX_TIMED_OUT, &us->dflags); if (!test_bit(US_FLIDX_RESETTING, &us->dflags)) { set_bit(US_FLIDX_ABORTING, &us->dflags); usb_stor_stop_transport(us); } scsi_unlock(us_to_host(us)); /* Wait for the aborted command to finish */ wait_for_completion(&us->notify); return SUCCESS; } static int command_abort(struct scsi_cmnd *srb) { struct us_data *us = host_to_us(srb->device->host); usb_stor_dbg(us, "%s called\n", __func__); return command_abort_matching(us, srb); } /* * This invokes the transport reset mechanism to reset the state of the * device */ static int device_reset(struct scsi_cmnd *srb) { struct us_data *us = host_to_us(srb->device->host); int result; usb_stor_dbg(us, "%s called\n", __func__); /* abort any pending command before reset */ command_abort_matching(us, NULL); /* lock the device pointers and do the reset */ mutex_lock(&(us->dev_mutex)); result = us->transport_reset(us); mutex_unlock(&us->dev_mutex); return result < 0 ? FAILED : SUCCESS; } /* Simulate a SCSI bus reset by resetting the device's USB port. */ static int bus_reset(struct scsi_cmnd *srb) { struct us_data *us = host_to_us(srb->device->host); int result; usb_stor_dbg(us, "%s called\n", __func__); result = usb_stor_port_reset(us); return result < 0 ? FAILED : SUCCESS; } /* * Report a driver-initiated device reset to the SCSI layer. * Calling this for a SCSI-initiated reset is unnecessary but harmless. * The caller must own the SCSI host lock. */ void usb_stor_report_device_reset(struct us_data *us) { int i; struct Scsi_Host *host = us_to_host(us); scsi_report_device_reset(host, 0, 0); if (us->fflags & US_FL_SCM_MULT_TARG) { for (i = 1; i < host->max_id; ++i) scsi_report_device_reset(host, 0, i); } } /* * Report a driver-initiated bus reset to the SCSI layer. * Calling this for a SCSI-initiated reset is unnecessary but harmless. * The caller must not own the SCSI host lock. */ void usb_stor_report_bus_reset(struct us_data *us) { struct Scsi_Host *host = us_to_host(us); scsi_lock(host); scsi_report_bus_reset(host, 0); scsi_unlock(host); } /*********************************************************************** * /proc/scsi/ functions ***********************************************************************/ static int write_info(struct Scsi_Host *host, char *buffer, int length) { /* if someone is sending us data, just throw it away */ return length; } static int show_info (struct seq_file *m, struct Scsi_Host *host) { struct us_data *us = host_to_us(host); const char *string; /* print the controller name */ seq_printf(m, " Host scsi%d: usb-storage\n", host->host_no); /* print product, vendor, and serial number strings */ if (us->pusb_dev->manufacturer) string = us->pusb_dev->manufacturer; else if (us->unusual_dev->vendorName) string = us->unusual_dev->vendorName; else string = "Unknown"; seq_printf(m, " Vendor: %s\n", string); if (us->pusb_dev->product) string = us->pusb_dev->product; else if (us->unusual_dev->productName) string = us->unusual_dev->productName; else string = "Unknown"; seq_printf(m, " Product: %s\n", string); if (us->pusb_dev->serial) string = us->pusb_dev->serial; else string = "None"; seq_printf(m, "Serial Number: %s\n", string); /* show the protocol and transport */ seq_printf(m, " Protocol: %s\n", us->protocol_name); seq_printf(m, " Transport: %s\n", us->transport_name); /* show the device flags */ seq_printf(m, " Quirks:"); #define US_FLAG(name, value) \ if (us->fflags & value) seq_printf(m, " " #name); US_DO_ALL_FLAGS #undef US_FLAG seq_putc(m, '\n'); return 0; } /*********************************************************************** * Sysfs interface ***********************************************************************/ /* Output routine for the sysfs max_sectors file */ static ssize_t max_sectors_show(struct device *dev, struct device_attribute *attr, char *buf) { struct scsi_device *sdev = to_scsi_device(dev); return sprintf(buf, "%u\n", queue_max_hw_sectors(sdev->request_queue)); } /* Input routine for the sysfs max_sectors file */ static ssize_t max_sectors_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct scsi_device *sdev = to_scsi_device(dev); struct queue_limits lim; unsigned short ms; int ret; if (sscanf(buf, "%hu", &ms) <= 0) return -EINVAL; lim = queue_limits_start_update(sdev->request_queue); lim.max_hw_sectors = ms; ret = queue_limits_commit_update_frozen(sdev->request_queue, &lim); if (ret) return ret; return count; } static DEVICE_ATTR_RW(max_sectors); static struct attribute *usb_sdev_attrs[] = { &dev_attr_max_sectors.attr, NULL, }; ATTRIBUTE_GROUPS(usb_sdev); /* * this defines our host template, with which we'll allocate hosts */ static const struct scsi_host_template usb_stor_host_template = { /* basic userland interface stuff */ .name = "usb-storage", .proc_name = "usb-storage", .show_info = show_info, .write_info = write_info, .info = host_info, /* command interface -- queued only */ .queuecommand = queuecommand, /* error and abort handlers */ .eh_abort_handler = command_abort, .eh_device_reset_handler = device_reset, .eh_bus_reset_handler = bus_reset, /* queue commands only, only one command per LUN */ .can_queue = 1, /* unknown initiator id */ .this_id = -1, .sdev_init = sdev_init, .sdev_configure = sdev_configure, .target_alloc = target_alloc, /* lots of sg segments can be handled */ .sg_tablesize = SG_MAX_SEGMENTS, /* * Some host controllers may have alignment requirements. * We'll play it safe by requiring 512-byte alignment always. */ .dma_alignment = 511, /* * Limit the total size of a transfer to 120 KB. * * Some devices are known to choke with anything larger. It seems like * the problem stems from the fact that original IDE controllers had * only an 8-bit register to hold the number of sectors in one transfer * and even those couldn't handle a full 256 sectors. * * Because we want to make sure we interoperate with as many devices as * possible, we will maintain a 240 sector transfer size limit for USB * Mass Storage devices. * * Tests show that other operating have similar limits with Microsoft * Windows 7 limiting transfers to 128 sectors for both USB2 and USB3 * and Apple Mac OS X 10.11 limiting transfers to 256 sectors for USB2 * and 2048 for USB3 devices. */ .max_sectors = 240, /* emulated HBA */ .emulated = 1, /* we do our own delay after a device or bus reset */ .skip_settle_delay = 1, /* sysfs device attributes */ .sdev_groups = usb_sdev_groups, /* module management */ .module = THIS_MODULE }; void usb_stor_host_template_init(struct scsi_host_template *sht, const char *name, struct module *owner) { *sht = usb_stor_host_template; sht->name = name; sht->proc_name = name; sht->module = owner; } EXPORT_SYMBOL_GPL(usb_stor_host_template_init); /* To Report "Illegal Request: Invalid Field in CDB */ unsigned char usb_stor_sense_invalidCDB[18] = { [0] = 0x70, /* current error */ [2] = ILLEGAL_REQUEST, /* Illegal Request = 0x05 */ [7] = 0x0a, /* additional length */ [12] = 0x24 /* Invalid Field in CDB */ }; EXPORT_SYMBOL_GPL(usb_stor_sense_invalidCDB); |
| 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 | // SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 2010 Red Hat, Inc. * Copyright (C) 2016-2023 Christoph Hellwig. */ #include <linux/iomap.h> #include <linux/buffer_head.h> #include <linux/writeback.h> #include <linux/swap.h> #include <linux/migrate.h> #include "internal.h" #include "trace.h" #include "../internal.h" /* * Structure allocated for each folio to track per-block uptodate, dirty state * and I/O completions. */ struct iomap_folio_state { spinlock_t state_lock; unsigned int read_bytes_pending; atomic_t write_bytes_pending; /* * Each block has two bits in this bitmap: * Bits [0..blocks_per_folio) has the uptodate status. * Bits [b_p_f...(2*b_p_f)) has the dirty status. */ unsigned long state[]; }; static inline bool ifs_is_fully_uptodate(struct folio *folio, struct iomap_folio_state *ifs) { struct inode *inode = folio->mapping->host; return bitmap_full(ifs->state, i_blocks_per_folio(inode, folio)); } /* * Find the next uptodate block in the folio. end_blk is inclusive. * If no uptodate block is found, this will return end_blk + 1. */ static unsigned ifs_next_uptodate_block(struct folio *folio, unsigned start_blk, unsigned end_blk) { struct iomap_folio_state *ifs = folio->private; return find_next_bit(ifs->state, end_blk + 1, start_blk); } /* * Find the next non-uptodate block in the folio. end_blk is inclusive. * If no non-uptodate block is found, this will return end_blk + 1. */ static unsigned ifs_next_nonuptodate_block(struct folio *folio, unsigned start_blk, unsigned end_blk) { struct iomap_folio_state *ifs = folio->private; return find_next_zero_bit(ifs->state, end_blk + 1, start_blk); } static bool ifs_set_range_uptodate(struct folio *folio, struct iomap_folio_state *ifs, size_t off, size_t len) { struct inode *inode = folio->mapping->host; unsigned int first_blk = off >> inode->i_blkbits; unsigned int last_blk = (off + len - 1) >> inode->i_blkbits; unsigned int nr_blks = last_blk - first_blk + 1; bitmap_set(ifs->state, first_blk, nr_blks); return ifs_is_fully_uptodate(folio, ifs); } static void iomap_set_range_uptodate(struct folio *folio, size_t off, size_t len) { struct iomap_folio_state *ifs = folio->private; unsigned long flags; bool uptodate = true; if (folio_test_uptodate(folio)) return; if (ifs) { spin_lock_irqsave(&ifs->state_lock, flags); uptodate = ifs_set_range_uptodate(folio, ifs, off, len); spin_unlock_irqrestore(&ifs->state_lock, flags); } if (uptodate) folio_mark_uptodate(folio); } /* * Find the next dirty block in the folio. end_blk is inclusive. * If no dirty block is found, this will return end_blk + 1. */ static unsigned ifs_next_dirty_block(struct folio *folio, unsigned start_blk, unsigned end_blk) { struct iomap_folio_state *ifs = folio->private; struct inode *inode = folio->mapping->host; unsigned int blks = i_blocks_per_folio(inode, folio); return find_next_bit(ifs->state, blks + end_blk + 1, blks + start_blk) - blks; } /* * Find the next clean block in the folio. end_blk is inclusive. * If no clean block is found, this will return end_blk + 1. */ static unsigned ifs_next_clean_block(struct folio *folio, unsigned start_blk, unsigned end_blk) { struct iomap_folio_state *ifs = folio->private; struct inode *inode = folio->mapping->host; unsigned int blks = i_blocks_per_folio(inode, folio); return find_next_zero_bit(ifs->state, blks + end_blk + 1, blks + start_blk) - blks; } static unsigned ifs_find_dirty_range(struct folio *folio, struct iomap_folio_state *ifs, u64 *range_start, u64 range_end) { struct inode *inode = folio->mapping->host; unsigned start_blk = offset_in_folio(folio, *range_start) >> inode->i_blkbits; unsigned end_blk = min_not_zero( offset_in_folio(folio, range_end) >> inode->i_blkbits, i_blocks_per_folio(inode, folio)) - 1; unsigned nblks; start_blk = ifs_next_dirty_block(folio, start_blk, end_blk); if (start_blk > end_blk) return 0; if (start_blk == end_blk) nblks = 1; else nblks = ifs_next_clean_block(folio, start_blk + 1, end_blk) - start_blk; *range_start = folio_pos(folio) + (start_blk << inode->i_blkbits); return nblks << inode->i_blkbits; } static unsigned iomap_find_dirty_range(struct folio *folio, u64 *range_start, u64 range_end) { struct iomap_folio_state *ifs = folio->private; if (*range_start >= range_end) return 0; if (ifs) return ifs_find_dirty_range(folio, ifs, range_start, range_end); return range_end - *range_start; } static void ifs_clear_range_dirty(struct folio *folio, struct iomap_folio_state *ifs, size_t off, size_t len) { struct inode *inode = folio->mapping->host; unsigned int blks_per_folio = i_blocks_per_folio(inode, folio); unsigned int first_blk = (off >> inode->i_blkbits); unsigned int last_blk = (off + len - 1) >> inode->i_blkbits; unsigned int nr_blks = last_blk - first_blk + 1; unsigned long flags; spin_lock_irqsave(&ifs->state_lock, flags); bitmap_clear(ifs->state, first_blk + blks_per_folio, nr_blks); spin_unlock_irqrestore(&ifs->state_lock, flags); } static void iomap_clear_range_dirty(struct folio *folio, size_t off, size_t len) { struct iomap_folio_state *ifs = folio->private; if (ifs) ifs_clear_range_dirty(folio, ifs, off, len); } static void ifs_set_range_dirty(struct folio *folio, struct iomap_folio_state *ifs, size_t off, size_t len) { struct inode *inode = folio->mapping->host; unsigned int blks_per_folio = i_blocks_per_folio(inode, folio); unsigned int first_blk = (off >> inode->i_blkbits); unsigned int last_blk = (off + len - 1) >> inode->i_blkbits; unsigned int nr_blks = last_blk - first_blk + 1; unsigned long flags; spin_lock_irqsave(&ifs->state_lock, flags); bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks); spin_unlock_irqrestore(&ifs->state_lock, flags); } static void iomap_set_range_dirty(struct folio *folio, size_t off, size_t len) { struct iomap_folio_state *ifs = folio->private; if (ifs) ifs_set_range_dirty(folio, ifs, off, len); } static struct iomap_folio_state *ifs_alloc(struct inode *inode, struct folio *folio, unsigned int flags) { struct iomap_folio_state *ifs = folio->private; unsigned int nr_blocks = i_blocks_per_folio(inode, folio); gfp_t gfp; if (ifs || nr_blocks <= 1) return ifs; if (flags & IOMAP_NOWAIT) gfp = GFP_NOWAIT; else gfp = GFP_NOFS | __GFP_NOFAIL; /* * ifs->state tracks two sets of state flags when the * filesystem block size is smaller than the folio size. * The first state tracks per-block uptodate and the * second tracks per-block dirty state. */ ifs = kzalloc(struct_size(ifs, state, BITS_TO_LONGS(2 * nr_blocks)), gfp); if (!ifs) return ifs; spin_lock_init(&ifs->state_lock); if (folio_test_uptodate(folio)) bitmap_set(ifs->state, 0, nr_blocks); if (folio_test_dirty(folio)) bitmap_set(ifs->state, nr_blocks, nr_blocks); folio_attach_private(folio, ifs); return ifs; } static void ifs_free(struct folio *folio) { struct iomap_folio_state *ifs = folio_detach_private(folio); if (!ifs) return; WARN_ON_ONCE(ifs->read_bytes_pending != 0); WARN_ON_ONCE(atomic_read(&ifs->write_bytes_pending)); WARN_ON_ONCE(ifs_is_fully_uptodate(folio, ifs) != folio_test_uptodate(folio)); kfree(ifs); } /* * Calculate how many bytes to truncate based off the number of blocks to * truncate and the end position to start truncating from. */ static size_t iomap_bytes_to_truncate(loff_t end_pos, unsigned block_bits, unsigned blocks_truncated) { unsigned block_size = 1 << block_bits; unsigned block_offset = end_pos & (block_size - 1); if (!block_offset) return blocks_truncated << block_bits; return ((blocks_truncated - 1) << block_bits) + block_offset; } /* * Calculate the range inside the folio that we actually need to read. */ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio, loff_t *pos, loff_t length, size_t *offp, size_t *lenp) { struct iomap_folio_state *ifs = folio->private; loff_t orig_pos = *pos; loff_t isize = i_size_read(inode); unsigned block_bits = inode->i_blkbits; unsigned block_size = (1 << block_bits); size_t poff = offset_in_folio(folio, *pos); size_t plen = min_t(loff_t, folio_size(folio) - poff, length); size_t orig_plen = plen; unsigned first = poff >> block_bits; unsigned last = (poff + plen - 1) >> block_bits; /* * If the block size is smaller than the page size, we need to check the * per-block uptodate status and adjust the offset and length if needed * to avoid reading in already uptodate ranges. */ if (ifs) { unsigned int next, blocks_skipped; next = ifs_next_nonuptodate_block(folio, first, last); blocks_skipped = next - first; if (blocks_skipped) { unsigned long block_offset = *pos & (block_size - 1); unsigned bytes_skipped = (blocks_skipped << block_bits) - block_offset; *pos += bytes_skipped; poff += bytes_skipped; plen -= bytes_skipped; } first = next; /* truncate len if we find any trailing uptodate block(s) */ if (++next <= last) { next = ifs_next_uptodate_block(folio, next, last); if (next <= last) { plen -= iomap_bytes_to_truncate(*pos + plen, block_bits, last - next + 1); last = next - 1; } } } /* * If the extent spans the block that contains the i_size, we need to * handle both halves separately so that we properly zero data in the * page cache for blocks that are entirely outside of i_size. */ if (orig_pos <= isize && orig_pos + orig_plen > isize) { unsigned end = offset_in_folio(folio, isize - 1) >> block_bits; if (first <= end && last > end) plen -= iomap_bytes_to_truncate(*pos + plen, block_bits, last - end); } *offp = poff; *lenp = plen; } static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter, loff_t pos) { const struct iomap *srcmap = iomap_iter_srcmap(iter); return srcmap->type != IOMAP_MAPPED || (srcmap->flags & IOMAP_F_NEW) || pos >= i_size_read(iter->inode); } /** * iomap_read_inline_data - copy inline data into the page cache * @iter: iteration structure * @folio: folio to copy to * * Copy the inline data in @iter into @folio and zero out the rest of the folio. * Only a single IOMAP_INLINE extent is allowed at the end of each file. * Returns zero for success to complete the read, or the usual negative errno. */ static int iomap_read_inline_data(const struct iomap_iter *iter, struct folio *folio) { const struct iomap *iomap = iomap_iter_srcmap(iter); size_t size = i_size_read(iter->inode) - iomap->offset; size_t offset = offset_in_folio(folio, iomap->offset); if (WARN_ON_ONCE(!iomap->inline_data)) return -EIO; if (folio_test_uptodate(folio)) return 0; if (WARN_ON_ONCE(size > iomap->length)) return -EIO; if (offset > 0) ifs_alloc(iter->inode, folio, iter->flags); folio_fill_tail(folio, offset, iomap->inline_data, size); iomap_set_range_uptodate(folio, offset, folio_size(folio) - offset); return 0; } void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len, int error) { struct iomap_folio_state *ifs = folio->private; bool uptodate = !error; bool finished = true; if (ifs) { unsigned long flags; spin_lock_irqsave(&ifs->state_lock, flags); if (!error) uptodate = ifs_set_range_uptodate(folio, ifs, off, len); ifs->read_bytes_pending -= len; finished = !ifs->read_bytes_pending; spin_unlock_irqrestore(&ifs->state_lock, flags); } if (finished) folio_end_read(folio, uptodate); } EXPORT_SYMBOL_GPL(iomap_finish_folio_read); static void iomap_read_init(struct folio *folio) { struct iomap_folio_state *ifs = folio->private; if (ifs) { size_t len = folio_size(folio); /* * ifs->read_bytes_pending is used to track how many bytes are * read in asynchronously by the IO helper. We need to track * this so that we can know when the IO helper has finished * reading in all the necessary ranges of the folio and can end * the read. * * Increase ->read_bytes_pending by the folio size to start, and * add a +1 bias. We'll subtract the bias and any uptodate / * zeroed ranges that did not require IO in iomap_read_end() * after we're done processing the folio. * * We do this because otherwise, we would have to increment * ifs->read_bytes_pending every time a range in the folio needs * to be read in, which can get expensive since the spinlock * needs to be held whenever modifying ifs->read_bytes_pending. * * We add the bias to ensure the read has not been ended on the * folio when iomap_read_end() is called, even if the IO helper * has already finished reading in the entire folio. */ spin_lock_irq(&ifs->state_lock); WARN_ON_ONCE(ifs->read_bytes_pending != 0); ifs->read_bytes_pending = len + 1; spin_unlock_irq(&ifs->state_lock); } } /* * This ends IO if no bytes were submitted to an IO helper. * * Otherwise, this calibrates ifs->read_bytes_pending to represent only the * submitted bytes (see comment in iomap_read_init()). If all bytes submitted * have already been completed by the IO helper, then this will end the read. * Else the IO helper will end the read after all submitted ranges have been * read. */ static void iomap_read_end(struct folio *folio, size_t bytes_submitted) { struct iomap_folio_state *ifs = folio->private; if (ifs) { bool end_read, uptodate; spin_lock_irq(&ifs->state_lock); if (!ifs->read_bytes_pending) { WARN_ON_ONCE(bytes_submitted); spin_unlock_irq(&ifs->state_lock); folio_unlock(folio); return; } /* * Subtract any bytes that were initially accounted to * read_bytes_pending but skipped for IO. The +1 accounts for * the bias we added in iomap_read_init(). */ ifs->read_bytes_pending -= (folio_size(folio) + 1 - bytes_submitted); /* * If !ifs->read_bytes_pending, this means all pending reads by * the IO helper have already completed, which means we need to * end the folio read here. If ifs->read_bytes_pending != 0, * the IO helper will end the folio read. */ end_read = !ifs->read_bytes_pending; if (end_read) uptodate = ifs_is_fully_uptodate(folio, ifs); spin_unlock_irq(&ifs->state_lock); if (end_read) folio_end_read(folio, uptodate); } else if (!bytes_submitted) { /* * If there were no bytes submitted, this means we are * responsible for unlocking the folio here, since no IO helper * has taken ownership of it. If there were bytes submitted, * then the IO helper will end the read via * iomap_finish_folio_read(). */ folio_unlock(folio); } } static int iomap_read_folio_iter(struct iomap_iter *iter, struct iomap_read_folio_ctx *ctx, size_t *bytes_submitted) { const struct iomap *iomap = &iter->iomap; loff_t pos = iter->pos; loff_t length = iomap_length(iter); struct folio *folio = ctx->cur_folio; size_t poff, plen; loff_t pos_diff; int ret; if (iomap->type == IOMAP_INLINE) { ret = iomap_read_inline_data(iter, folio); if (ret) return ret; return iomap_iter_advance(iter, length); } ifs_alloc(iter->inode, folio, iter->flags); length = min_t(loff_t, length, folio_size(folio) - offset_in_folio(folio, pos)); while (length) { iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen); pos_diff = pos - iter->pos; if (WARN_ON_ONCE(pos_diff + plen > length)) return -EIO; ret = iomap_iter_advance(iter, pos_diff); if (ret) return ret; if (plen == 0) return 0; /* zero post-eof blocks as the page may be mapped */ if (iomap_block_needs_zeroing(iter, pos)) { folio_zero_range(folio, poff, plen); iomap_set_range_uptodate(folio, poff, plen); } else { if (!*bytes_submitted) iomap_read_init(folio); ret = ctx->ops->read_folio_range(iter, ctx, plen); if (ret) return ret; *bytes_submitted += plen; } ret = iomap_iter_advance(iter, plen); if (ret) return ret; length -= pos_diff + plen; pos = iter->pos; } return 0; } void iomap_read_folio(const struct iomap_ops *ops, struct iomap_read_folio_ctx *ctx) { struct folio *folio = ctx->cur_folio; struct iomap_iter iter = { .inode = folio->mapping->host, .pos = folio_pos(folio), .len = folio_size(folio), }; size_t bytes_submitted = 0; int ret; trace_iomap_readpage(iter.inode, 1); while ((ret = iomap_iter(&iter, ops)) > 0) iter.status = iomap_read_folio_iter(&iter, ctx, &bytes_submitted); if (ctx->ops->submit_read) ctx->ops->submit_read(ctx); iomap_read_end(folio, bytes_submitted); } EXPORT_SYMBOL_GPL(iomap_read_folio); static int iomap_readahead_iter(struct iomap_iter *iter, struct iomap_read_folio_ctx *ctx, size_t *cur_bytes_submitted) { int ret; while (iomap_length(iter)) { if (ctx->cur_folio && offset_in_folio(ctx->cur_folio, iter->pos) == 0) { iomap_read_end(ctx->cur_folio, *cur_bytes_submitted); ctx->cur_folio = NULL; } if (!ctx->cur_folio) { ctx->cur_folio = readahead_folio(ctx->rac); if (WARN_ON_ONCE(!ctx->cur_folio)) return -EINVAL; *cur_bytes_submitted = 0; } ret = iomap_read_folio_iter(iter, ctx, cur_bytes_submitted); if (ret) return ret; } return 0; } /** * iomap_readahead - Attempt to read pages from a file. * @ops: The operations vector for the filesystem. * @ctx: The ctx used for issuing readahead. * * This function is for filesystems to call to implement their readahead * address_space operation. * * Context: The @ops callbacks may submit I/O (eg to read the addresses of * blocks from disc), and may wait for it. The caller may be trying to * access a different page, and so sleeping excessively should be avoided. * It may allocate memory, but should avoid costly allocations. This * function is called with memalloc_nofs set, so allocations will not cause * the filesystem to be reentered. */ void iomap_readahead(const struct iomap_ops *ops, struct iomap_read_folio_ctx *ctx) { struct readahead_control *rac = ctx->rac; struct iomap_iter iter = { .inode = rac->mapping->host, .pos = readahead_pos(rac), .len = readahead_length(rac), }; size_t cur_bytes_submitted; trace_iomap_readahead(rac->mapping->host, readahead_count(rac)); while (iomap_iter(&iter, ops) > 0) iter.status = iomap_readahead_iter(&iter, ctx, &cur_bytes_submitted); if (ctx->ops->submit_read) ctx->ops->submit_read(ctx); if (ctx->cur_folio) iomap_read_end(ctx->cur_folio, cur_bytes_submitted); } EXPORT_SYMBOL_GPL(iomap_readahead); /* * iomap_is_partially_uptodate checks whether blocks within a folio are * uptodate or not. * * Returns true if all blocks which correspond to the specified part * of the folio are uptodate. */ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count) { struct iomap_folio_state *ifs = folio->private; struct inode *inode = folio->mapping->host; unsigned first, last; if (!ifs) return false; /* Caller's range may extend past the end of this folio */ count = min(folio_size(folio) - from, count); /* First and last blocks in range within folio */ first = from >> inode->i_blkbits; last = (from + count - 1) >> inode->i_blkbits; return ifs_next_nonuptodate_block(folio, first, last) > last; } EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate); /** * iomap_get_folio - get a folio reference for writing * @iter: iteration structure * @pos: start offset of write * @len: Suggested size of folio to create. * * Returns a locked reference to the folio at @pos, or an error pointer if the * folio could not be obtained. */ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len) { fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS; if (iter->flags & IOMAP_NOWAIT) fgp |= FGP_NOWAIT; if (iter->flags & IOMAP_DONTCACHE) fgp |= FGP_DONTCACHE; fgp |= fgf_set_order(len); return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT, fgp, mapping_gfp_mask(iter->inode->i_mapping)); } EXPORT_SYMBOL_GPL(iomap_get_folio); bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags) { trace_iomap_release_folio(folio->mapping->host, folio_pos(folio), folio_size(folio)); /* * If the folio is dirty, we refuse to release our metadata because * it may be partially dirty. Once we track per-block dirty state, * we can release the metadata if every block is dirty. */ if (folio_test_dirty(folio)) return false; ifs_free(folio); return true; } EXPORT_SYMBOL_GPL(iomap_release_folio); void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len) { trace_iomap_invalidate_folio(folio->mapping->host, folio_pos(folio) + offset, len); /* * If we're invalidating the entire folio, clear the dirty state * from it and release it to avoid unnecessary buildup of the LRU. */ if (offset == 0 && len == folio_size(folio)) { WARN_ON_ONCE(folio_test_writeback(folio)); folio_cancel_dirty(folio); ifs_free(folio); } } EXPORT_SYMBOL_GPL(iomap_invalidate_folio); bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio) { struct inode *inode = mapping->host; size_t len = folio_size(folio); ifs_alloc(inode, folio, 0); iomap_set_range_dirty(folio, 0, len); return filemap_dirty_folio(mapping, folio); } EXPORT_SYMBOL_GPL(iomap_dirty_folio); static void iomap_write_failed(struct inode *inode, loff_t pos, unsigned len) { loff_t i_size = i_size_read(inode); /* * Only truncate newly allocated pages beyoned EOF, even if the * write started inside the existing inode size. */ if (pos + len > i_size) truncate_pagecache_range(inode, max(pos, i_size), pos + len - 1); } static int __iomap_write_begin(const struct iomap_iter *iter, const struct iomap_write_ops *write_ops, size_t len, struct folio *folio) { struct iomap_folio_state *ifs; loff_t pos = iter->pos; loff_t block_size = i_blocksize(iter->inode); loff_t block_start = round_down(pos, block_size); loff_t block_end = round_up(pos + len, block_size); unsigned int nr_blocks = i_blocks_per_folio(iter->inode, folio); size_t from = offset_in_folio(folio, pos), to = from + len; size_t poff, plen; /* * If the write or zeroing completely overlaps the current folio, then * entire folio will be dirtied so there is no need for * per-block state tracking structures to be attached to this folio. * For the unshare case, we must read in the ondisk contents because we * are not changing pagecache contents. */ if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) && pos + len >= folio_next_pos(folio)) return 0; ifs = ifs_alloc(iter->inode, folio, iter->flags); if ((iter->flags & IOMAP_NOWAIT) && !ifs && nr_blocks > 1) return -EAGAIN; if (folio_test_uptodate(folio)) return 0; do { iomap_adjust_read_range(iter->inode, folio, &block_start, block_end - block_start, &poff, &plen); if (plen == 0) break; /* * If the read range will be entirely overwritten by the write, * we can skip having to zero/read it in. */ if (!(iter->flags & IOMAP_UNSHARE) && from <= poff && to >= poff + plen) continue; if (iomap_block_needs_zeroing(iter, block_start)) { if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE)) return -EIO; folio_zero_segments(folio, poff, from, to, poff + plen); } else { int status; if (iter->flags & IOMAP_NOWAIT) return -EAGAIN; if (write_ops && write_ops->read_folio_range) status = write_ops->read_folio_range(iter, folio, block_start, plen); else status = iomap_bio_read_folio_range_sync(iter, folio, block_start, plen); if (status) return status; } iomap_set_range_uptodate(folio, poff, plen); } while ((block_start += plen) < block_end); return 0; } static struct folio *__iomap_get_folio(struct iomap_iter *iter, const struct iomap_write_ops *write_ops, size_t len) { loff_t pos = iter->pos; if (!mapping_large_folio_support(iter->inode->i_mapping)) len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos)); if (iter->iomap.flags & IOMAP_F_FOLIO_BATCH) { struct folio *folio = folio_batch_next(iter->fbatch); if (!folio) return NULL; /* * The folio mapping generally shouldn't have changed based on * fs locks, but be consistent with filemap lookup and retry * the iter if it does. */ folio_lock(folio); if (unlikely(folio->mapping != iter->inode->i_mapping)) { iter->iomap.flags |= IOMAP_F_STALE; folio_unlock(folio); return NULL; } folio_get(folio); return folio; } if (write_ops && write_ops->get_folio) return write_ops->get_folio(iter, pos, len); return iomap_get_folio(iter, pos, len); } static void __iomap_put_folio(struct iomap_iter *iter, const struct iomap_write_ops *write_ops, size_t ret, struct folio *folio) { loff_t pos = iter->pos; if (write_ops && write_ops->put_folio) { write_ops->put_folio(iter->inode, pos, ret, folio); } else { folio_unlock(folio); folio_put(folio); } } /* trim pos and bytes to within a given folio */ static loff_t iomap_trim_folio_range(struct iomap_iter *iter, struct folio *folio, size_t *offset, u64 *bytes) { loff_t pos = iter->pos; size_t fsize = folio_size(folio); WARN_ON_ONCE(pos < folio_pos(folio)); WARN_ON_ONCE(pos >= folio_pos(folio) + fsize); *offset = offset_in_folio(folio, pos); *bytes = min(*bytes, fsize - *offset); return pos; } static int iomap_write_begin_inline(const struct iomap_iter *iter, struct folio *folio) { /* needs more work for the tailpacking case; disable for now */ if (WARN_ON_ONCE(iomap_iter_srcmap(iter)->offset != 0)) return -EIO; return iomap_read_inline_data(iter, folio); } /* * Grab and prepare a folio for write based on iter state. Returns the folio, * offset, and length. Callers can optionally pass a max length *plen, * otherwise init to zero. */ static int iomap_write_begin(struct iomap_iter *iter, const struct iomap_write_ops *write_ops, struct folio **foliop, size_t *poffset, u64 *plen) { const struct iomap *srcmap = iomap_iter_srcmap(iter); loff_t pos; u64 len = min_t(u64, SIZE_MAX, iomap_length(iter)); struct folio *folio; int status = 0; len = min_not_zero(len, *plen); *foliop = NULL; *plen = 0; if (fatal_signal_pending(current)) return -EINTR; folio = __iomap_get_folio(iter, write_ops, len); if (IS_ERR(folio)) return PTR_ERR(folio); /* * No folio means we're done with a batch. We still have range to * process so return and let the caller iterate and refill the batch. */ if (!folio) { WARN_ON_ONCE(!(iter->iomap.flags & IOMAP_F_FOLIO_BATCH)); return 0; } /* * Now we have a locked folio, before we do anything with it we need to * check that the iomap we have cached is not stale. The inode extent * mapping can change due to concurrent IO in flight (e.g. * IOMAP_UNWRITTEN state can change and memory reclaim could have * reclaimed a previously partially written page at this index after IO * completion before this write reaches this file offset) and hence we * could do the wrong thing here (zero a page range incorrectly or fail * to zero) and corrupt data. */ if (write_ops && write_ops->iomap_valid) { bool iomap_valid = write_ops->iomap_valid(iter->inode, &iter->iomap); if (!iomap_valid) { iter->iomap.flags |= IOMAP_F_STALE; status = 0; goto out_unlock; } } /* * The folios in a batch may not be contiguous. If we've skipped * forward, advance the iter to the pos of the current folio. If the * folio starts beyond the end of the mapping, it may have been trimmed * since the lookup for whatever reason. Return a NULL folio to * terminate the op. */ if (folio_pos(folio) > iter->pos) { len = min_t(u64, folio_pos(folio) - iter->pos, iomap_length(iter)); status = iomap_iter_advance(iter, len); len = iomap_length(iter); if (status || !len) goto out_unlock; } pos = iomap_trim_folio_range(iter, folio, poffset, &len); if (srcmap->type == IOMAP_INLINE) status = iomap_write_begin_inline(iter, folio); else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) status = __block_write_begin_int(folio, pos, len, NULL, srcmap); else status = __iomap_write_begin(iter, write_ops, len, folio); if (unlikely(status)) goto out_unlock; *foliop = folio; *plen = len; return 0; out_unlock: __iomap_put_folio(iter, write_ops, 0, folio); return status; } static bool __iomap_write_end(struct inode *inode, loff_t pos, size_t len, size_t copied, struct folio *folio) { flush_dcache_folio(folio); /* * The blocks that were entirely written will now be uptodate, so we * don't have to worry about a read_folio reading them and overwriting a * partial write. However, if we've encountered a short write and only * partially written into a block, it will not be marked uptodate, so a * read_folio might come in and destroy our partial write. * * Do the simplest thing and just treat any short write to a * non-uptodate page as a zero-length write, and force the caller to * redo the whole thing. */ if (unlikely(copied < len && !folio_test_uptodate(folio))) return false; iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len); iomap_set_range_dirty(folio, offset_in_folio(folio, pos), copied); filemap_dirty_folio(inode->i_mapping, folio); return true; } static bool iomap_write_end_inline(const struct iomap_iter *iter, struct folio *folio, loff_t pos, size_t copied) { const struct iomap *iomap = &iter->iomap; void *addr; WARN_ON_ONCE(!folio_test_uptodate(folio)); BUG_ON(!iomap_inline_data_valid(iomap)); if (WARN_ON_ONCE(!iomap->inline_data)) return false; flush_dcache_folio(folio); addr = kmap_local_folio(folio, pos); memcpy(iomap_inline_data(iomap, pos), addr, copied); kunmap_local(addr); mark_inode_dirty(iter->inode); return true; } /* * Returns true if all copied bytes have been written to the pagecache, * otherwise return false. */ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied, struct folio *folio) { const struct iomap *srcmap = iomap_iter_srcmap(iter); loff_t pos = iter->pos; if (srcmap->type == IOMAP_INLINE) return iomap_write_end_inline(iter, folio, pos, copied); if (srcmap->flags & IOMAP_F_BUFFER_HEAD) { size_t bh_written; bh_written = block_write_end(pos, len, copied, folio); WARN_ON_ONCE(bh_written != copied && bh_written != 0); return bh_written == copied; } return __iomap_write_end(iter->inode, pos, len, copied, folio); } static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i, const struct iomap_write_ops *write_ops) { ssize_t total_written = 0; int status = 0; struct address_space *mapping = iter->inode->i_mapping; size_t chunk = mapping_max_folio_size(mapping); unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0; do { struct folio *folio; loff_t old_size; size_t offset; /* Offset into folio */ u64 bytes; /* Bytes to write to folio */ size_t copied; /* Bytes copied from user */ u64 written; /* Bytes have been written */ loff_t pos; bytes = iov_iter_count(i); retry: offset = iter->pos & (chunk - 1); bytes = min(chunk - offset, bytes); status = balance_dirty_pages_ratelimited_flags(mapping, bdp_flags); if (unlikely(status)) break; if (bytes > iomap_length(iter)) bytes = iomap_length(iter); /* * Bring in the user page that we'll copy from _first_. * Otherwise there's a nasty deadlock on copying from the * same page as we're writing to, without it being marked * up-to-date. * * For async buffered writes the assumption is that the user * page has already been faulted in. This can be optimized by * faulting the user page. */ if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) { status = -EFAULT; break; } status = iomap_write_begin(iter, write_ops, &folio, &offset, &bytes); if (unlikely(status)) { iomap_write_failed(iter->inode, iter->pos, bytes); break; } if (iter->iomap.flags & IOMAP_F_STALE) break; pos = iter->pos; if (mapping_writably_mapped(mapping)) flush_dcache_folio(folio); copied = copy_folio_from_iter_atomic(folio, offset, bytes, i); written = iomap_write_end(iter, bytes, copied, folio) ? copied : 0; /* * Update the in-memory inode size after copying the data into * the page cache. It's up to the file system to write the * updated size to disk, preferably after I/O completion so that * no stale data is exposed. Only once that's done can we * unlock and release the folio. */ old_size = iter->inode->i_size; if (pos + written > old_size) { i_size_write(iter->inode, pos + written); iter->iomap.flags |= IOMAP_F_SIZE_CHANGED; } __iomap_put_folio(iter, write_ops, written, folio); if (old_size < pos) pagecache_isize_extended(iter->inode, old_size, pos); cond_resched(); if (unlikely(written == 0)) { /* * A short copy made iomap_write_end() reject the * thing entirely. Might be memory poisoning * halfway through, might be a race with munmap, * might be severe memory pressure. */ iomap_write_failed(iter->inode, pos, bytes); iov_iter_revert(i, copied); if (chunk > PAGE_SIZE) chunk /= 2; if (copied) { bytes = copied; goto retry; } } else { total_written += written; iomap_iter_advance(iter, written); } } while (iov_iter_count(i) && iomap_length(iter)); return total_written ? 0 : status; } ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i, const struct iomap_ops *ops, const struct iomap_write_ops *write_ops, void *private) { struct iomap_iter iter = { .inode = iocb->ki_filp->f_mapping->host, .pos = iocb->ki_pos, .len = iov_iter_count(i), .flags = IOMAP_WRITE, .private = private, }; ssize_t ret; if (iocb->ki_flags & IOCB_NOWAIT) iter.flags |= IOMAP_NOWAIT; if (iocb->ki_flags & IOCB_DONTCACHE) iter.flags |= IOMAP_DONTCACHE; while ((ret = iomap_iter(&iter, ops)) > 0) iter.status = iomap_write_iter(&iter, i, write_ops); if (unlikely(iter.pos == iocb->ki_pos)) return ret; ret = iter.pos - iocb->ki_pos; iocb->ki_pos = iter.pos; return ret; } EXPORT_SYMBOL_GPL(iomap_file_buffered_write); static void iomap_write_delalloc_ifs_punch(struct inode *inode, struct folio *folio, loff_t start_byte, loff_t end_byte, struct iomap *iomap, iomap_punch_t punch) { unsigned int first_blk, last_blk; loff_t last_byte; u8 blkbits = inode->i_blkbits; struct iomap_folio_state *ifs; /* * When we have per-block dirty tracking, there can be * blocks within a folio which are marked uptodate * but not dirty. In that case it is necessary to punch * out such blocks to avoid leaking any delalloc blocks. */ ifs = folio->private; if (!ifs) return; last_byte = min_t(loff_t, end_byte - 1, folio_next_pos(folio) - 1); first_blk = offset_in_folio(folio, start_byte) >> blkbits; last_blk = offset_in_folio(folio, last_byte) >> blkbits; while ((first_blk = ifs_next_clean_block(folio, first_blk, last_blk)) <= last_blk) { punch(inode, folio_pos(folio) + (first_blk << blkbits), 1 << blkbits, iomap); first_blk++; } } static void iomap_write_delalloc_punch(struct inode *inode, struct folio *folio, loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte, struct iomap *iomap, iomap_punch_t punch) { if (!folio_test_dirty(folio)) return; /* if dirty, punch up to offset */ if (start_byte > *punch_start_byte) { punch(inode, *punch_start_byte, start_byte - *punch_start_byte, iomap); } /* Punch non-dirty blocks within folio */ iomap_write_delalloc_ifs_punch(inode, folio, start_byte, end_byte, iomap, punch); /* * Make sure the next punch start is correctly bound to * the end of this data range, not the end of the folio. */ *punch_start_byte = min_t(loff_t, end_byte, folio_next_pos(folio)); } /* * Scan the data range passed to us for dirty page cache folios. If we find a * dirty folio, punch out the preceding range and update the offset from which * the next punch will start from. * * We can punch out storage reservations under clean pages because they either * contain data that has been written back - in which case the delalloc punch * over that range is a no-op - or they have been read faults in which case they * contain zeroes and we can remove the delalloc backing range and any new * writes to those pages will do the normal hole filling operation... * * This makes the logic simple: we only need to keep the delalloc extents only * over the dirty ranges of the page cache. * * This function uses [start_byte, end_byte) intervals (i.e. open ended) to * simplify range iterations. */ static void iomap_write_delalloc_scan(struct inode *inode, loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte, struct iomap *iomap, iomap_punch_t punch) { while (start_byte < end_byte) { struct folio *folio; /* grab locked page */ folio = filemap_lock_folio(inode->i_mapping, start_byte >> PAGE_SHIFT); if (IS_ERR(folio)) { start_byte = ALIGN_DOWN(start_byte, PAGE_SIZE) + PAGE_SIZE; continue; } iomap_write_delalloc_punch(inode, folio, punch_start_byte, start_byte, end_byte, iomap, punch); /* move offset to start of next folio in range */ start_byte = folio_next_pos(folio); folio_unlock(folio); folio_put(folio); } } /* * When a short write occurs, the filesystem might need to use ->iomap_end * to remove space reservations created in ->iomap_begin. * * For filesystems that use delayed allocation, there can be dirty pages over * the delalloc extent outside the range of a short write but still within the * delalloc extent allocated for this iomap if the write raced with page * faults. * * Punch out all the delalloc blocks in the range given except for those that * have dirty data still pending in the page cache - those are going to be * written and so must still retain the delalloc backing for writeback. * * The punch() callback *must* only punch delalloc extents in the range passed * to it. It must skip over all other types of extents in the range and leave * them completely unchanged. It must do this punch atomically with respect to * other extent modifications. * * The punch() callback may be called with a folio locked to prevent writeback * extent allocation racing at the edge of the range we are currently punching. * The locked folio may or may not cover the range being punched, so it is not * safe for the punch() callback to lock folios itself. * * Lock order is: * * inode->i_rwsem (shared or exclusive) * inode->i_mapping->invalidate_lock (exclusive) * folio_lock() * ->punch * internal filesystem allocation lock * * As we are scanning the page cache for data, we don't need to reimplement the * wheel - mapping_seek_hole_data() does exactly what we need to identify the * start and end of data ranges correctly even for sub-folio block sizes. This * byte range based iteration is especially convenient because it means we * don't have to care about variable size folios, nor where the start or end of * the data range lies within a folio, if they lie within the same folio or even * if there are multiple discontiguous data ranges within the folio. * * It should be noted that mapping_seek_hole_data() is not aware of EOF, and so * can return data ranges that exist in the cache beyond EOF. e.g. a page fault * spanning EOF will initialise the post-EOF data to zeroes and mark it up to * date. A write page fault can then mark it dirty. If we then fail a write() * beyond EOF into that up to date cached range, we allocate a delalloc block * beyond EOF and then have to punch it out. Because the range is up to date, * mapping_seek_hole_data() will return it, and we will skip the punch because * the folio is dirty. THis is incorrect - we always need to punch out delalloc * beyond EOF in this case as writeback will never write back and covert that * delalloc block beyond EOF. Hence we limit the cached data scan range to EOF, * resulting in always punching out the range from the EOF to the end of the * range the iomap spans. * * Intervals are of the form [start_byte, end_byte) (i.e. open ended) because it * matches the intervals returned by mapping_seek_hole_data(). i.e. SEEK_DATA * returns the start of a data range (start_byte), and SEEK_HOLE(start_byte) * returns the end of the data range (data_end). Using closed intervals would * require sprinkling this code with magic "+ 1" and "- 1" arithmetic and expose * the code to subtle off-by-one bugs.... */ void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte, loff_t end_byte, unsigned flags, struct iomap *iomap, iomap_punch_t punch) { loff_t punch_start_byte = start_byte; loff_t scan_end_byte = min(i_size_read(inode), end_byte); /* * The caller must hold invalidate_lock to avoid races with page faults * re-instantiating folios and dirtying them via ->page_mkwrite whilst * we walk the cache and perform delalloc extent removal. Failing to do * this can leave dirty pages with no space reservation in the cache. */ lockdep_assert_held_write(&inode->i_mapping->invalidate_lock); while (start_byte < scan_end_byte) { loff_t data_end; start_byte = mapping_seek_hole_data(inode->i_mapping, start_byte, scan_end_byte, SEEK_DATA); /* * If there is no more data to scan, all that is left is to * punch out the remaining range. * * Note that mapping_seek_hole_data is only supposed to return * either an offset or -ENXIO, so WARN on any other error as * that would be an API change without updating the callers. */ if (start_byte == -ENXIO || start_byte == scan_end_byte) break; if (WARN_ON_ONCE(start_byte < 0)) return; WARN_ON_ONCE(start_byte < punch_start_byte); WARN_ON_ONCE(start_byte > scan_end_byte); /* * We find the end of this contiguous cached data range by * seeking from start_byte to the beginning of the next hole. */ data_end = mapping_seek_hole_data(inode->i_mapping, start_byte, scan_end_byte, SEEK_HOLE); if (WARN_ON_ONCE(data_end < 0)) return; /* * If we race with post-direct I/O invalidation of the page cache, * there might be no data left at start_byte. */ if (data_end == start_byte) continue; WARN_ON_ONCE(data_end < start_byte); WARN_ON_ONCE(data_end > scan_end_byte); iomap_write_delalloc_scan(inode, &punch_start_byte, start_byte, data_end, iomap, punch); /* The next data search starts at the end of this one. */ start_byte = data_end; } if (punch_start_byte < end_byte) punch(inode, punch_start_byte, end_byte - punch_start_byte, iomap); } EXPORT_SYMBOL_GPL(iomap_write_delalloc_release); static int iomap_unshare_iter(struct iomap_iter *iter, const struct iomap_write_ops *write_ops) { struct iomap *iomap = &iter->iomap; u64 bytes = iomap_length(iter); int status; if (!iomap_want_unshare_iter(iter)) return iomap_iter_advance(iter, bytes); do { struct folio *folio; size_t offset; bool ret; bytes = min_t(u64, SIZE_MAX, bytes); status = iomap_write_begin(iter, write_ops, &folio, &offset, &bytes); if (unlikely(status)) return status; if (iomap->flags & IOMAP_F_STALE) break; ret = iomap_write_end(iter, bytes, bytes, folio); __iomap_put_folio(iter, write_ops, bytes, folio); if (WARN_ON_ONCE(!ret)) return -EIO; cond_resched(); balance_dirty_pages_ratelimited(iter->inode->i_mapping); status = iomap_iter_advance(iter, bytes); if (status) break; } while ((bytes = iomap_length(iter)) > 0); return status; } int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len, const struct iomap_ops *ops, const struct iomap_write_ops *write_ops) { struct iomap_iter iter = { .inode = inode, .pos = pos, .flags = IOMAP_WRITE | IOMAP_UNSHARE, }; loff_t size = i_size_read(inode); int ret; if (pos < 0 || pos >= size) return 0; iter.len = min(len, size - pos); while ((ret = iomap_iter(&iter, ops)) > 0) iter.status = iomap_unshare_iter(&iter, write_ops); return ret; } EXPORT_SYMBOL_GPL(iomap_file_unshare); /* * Flush the remaining range of the iter and mark the current mapping stale. * This is used when zero range sees an unwritten mapping that may have had * dirty pagecache over it. */ static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i) { struct address_space *mapping = i->inode->i_mapping; loff_t end = i->pos + i->len - 1; i->iomap.flags |= IOMAP_F_STALE; return filemap_write_and_wait_range(mapping, i->pos, end); } static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero, const struct iomap_write_ops *write_ops) { u64 bytes = iomap_length(iter); int status; do { struct folio *folio; size_t offset; bool ret; bytes = min_t(u64, SIZE_MAX, bytes); status = iomap_write_begin(iter, write_ops, &folio, &offset, &bytes); if (status) return status; if (iter->iomap.flags & IOMAP_F_STALE) break; /* a NULL folio means we're done with a folio batch */ if (!folio) { status = iomap_iter_advance_full(iter); break; } /* warn about zeroing folios beyond eof that won't write back */ WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size); trace_iomap_zero_iter(iter->inode, folio_pos(folio) + offset, bytes); folio_zero_range(folio, offset, bytes); folio_mark_accessed(folio); ret = iomap_write_end(iter, bytes, bytes, folio); __iomap_put_folio(iter, write_ops, bytes, folio); if (WARN_ON_ONCE(!ret)) return -EIO; status = iomap_iter_advance(iter, bytes); if (status) break; } while ((bytes = iomap_length(iter)) > 0); if (did_zero) *did_zero = true; return status; } /** * iomap_fill_dirty_folios - fill a folio batch with dirty folios * @iter: Iteration structure * @start: Start offset of range. Updated based on lookup progress. * @end: End offset of range * @iomap_flags: Flags to set on the associated iomap to track the batch. * * Returns the folio count directly. Also returns the associated control flag if * the the batch lookup is performed and the expected offset of a subsequent * lookup via out params. The caller is responsible to set the flag on the * associated iomap. */ unsigned int iomap_fill_dirty_folios( struct iomap_iter *iter, loff_t *start, loff_t end, unsigned int *iomap_flags) { struct address_space *mapping = iter->inode->i_mapping; pgoff_t pstart = *start >> PAGE_SHIFT; pgoff_t pend = (end - 1) >> PAGE_SHIFT; unsigned int count; if (!iter->fbatch) { *start = end; return 0; } count = filemap_get_folios_dirty(mapping, &pstart, pend, iter->fbatch); *start = (pstart << PAGE_SHIFT); *iomap_flags |= IOMAP_F_FOLIO_BATCH; return count; } EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios); int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, const struct iomap_ops *ops, const struct iomap_write_ops *write_ops, void *private) { struct folio_batch fbatch; struct iomap_iter iter = { .inode = inode, .pos = pos, .len = len, .flags = IOMAP_ZERO, .private = private, .fbatch = &fbatch, }; struct address_space *mapping = inode->i_mapping; int ret; bool range_dirty; folio_batch_init(&fbatch); /* * To avoid an unconditional flush, check pagecache state and only flush * if dirty and the fs returns a mapping that might convert on * writeback. */ range_dirty = filemap_range_needs_writeback(mapping, iter.pos, iter.pos + iter.len - 1); while ((ret = iomap_iter(&iter, ops)) > 0) { const struct iomap *srcmap = iomap_iter_srcmap(&iter); if (WARN_ON_ONCE((iter.iomap.flags & IOMAP_F_FOLIO_BATCH) && srcmap->type != IOMAP_UNWRITTEN)) return -EIO; if (!(iter.iomap.flags & IOMAP_F_FOLIO_BATCH) && (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)) { s64 status; if (range_dirty) { range_dirty = false; status = iomap_zero_iter_flush_and_stale(&iter); } else { status = iomap_iter_advance_full(&iter); } iter.status = status; continue; } iter.status = iomap_zero_iter(&iter, did_zero, write_ops); } return ret; } EXPORT_SYMBOL_GPL(iomap_zero_range); int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero, const struct iomap_ops *ops, const struct iomap_write_ops *write_ops, void *private) { unsigned int blocksize = i_blocksize(inode); unsigned int off = pos & (blocksize - 1); /* Block boundary? Nothing to do */ if (!off) return 0; return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops, write_ops, private); } EXPORT_SYMBOL_GPL(iomap_truncate_page); static int iomap_folio_mkwrite_iter(struct iomap_iter *iter, struct folio *folio) { loff_t length = iomap_length(iter); int ret; if (iter->iomap.flags & IOMAP_F_BUFFER_HEAD) { ret = __block_write_begin_int(folio, iter->pos, length, NULL, &iter->iomap); if (ret) return ret; block_commit_write(folio, 0, length); } else { WARN_ON_ONCE(!folio_test_uptodate(folio)); folio_mark_dirty(folio); } return iomap_iter_advance(iter, length); } vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops, void *private) { struct iomap_iter iter = { .inode = file_inode(vmf->vma->vm_file), .flags = IOMAP_WRITE | IOMAP_FAULT, .private = private, }; struct folio *folio = page_folio(vmf->page); ssize_t ret; folio_lock(folio); ret = folio_mkwrite_check_truncate(folio, iter.inode); if (ret < 0) goto out_unlock; iter.pos = folio_pos(folio); iter.len = ret; while ((ret = iomap_iter(&iter, ops)) > 0) iter.status = iomap_folio_mkwrite_iter(&iter, folio); if (ret < 0) goto out_unlock; folio_wait_stable(folio); return VM_FAULT_LOCKED; out_unlock: folio_unlock(folio); return vmf_fs_error(ret); } EXPORT_SYMBOL_GPL(iomap_page_mkwrite); static void iomap_writeback_init(struct inode *inode, struct folio *folio) { struct iomap_folio_state *ifs = folio->private; WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !ifs); if (ifs) { WARN_ON_ONCE(atomic_read(&ifs->write_bytes_pending) != 0); /* * Set this to the folio size. After processing the folio for * writeback in iomap_writeback_folio(), we'll subtract any * ranges not written back. * * We do this because otherwise, we would have to atomically * increment ifs->write_bytes_pending every time a range in the * folio needs to be written back. */ atomic_set(&ifs->write_bytes_pending, folio_size(folio)); } } void iomap_finish_folio_write(struct inode *inode, struct folio *folio, size_t len) { struct iomap_folio_state *ifs = folio->private; WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !ifs); WARN_ON_ONCE(ifs && atomic_read(&ifs->write_bytes_pending) <= 0); if (!ifs || atomic_sub_and_test(len, &ifs->write_bytes_pending)) folio_end_writeback(folio); } EXPORT_SYMBOL_GPL(iomap_finish_folio_write); static int iomap_writeback_range(struct iomap_writepage_ctx *wpc, struct folio *folio, u64 pos, u32 rlen, u64 end_pos, size_t *bytes_submitted) { do { ssize_t ret; ret = wpc->ops->writeback_range(wpc, folio, pos, rlen, end_pos); if (WARN_ON_ONCE(ret == 0 || ret > rlen)) return -EIO; if (ret < 0) return ret; rlen -= ret; pos += ret; /* * Holes are not written back by ->writeback_range, so track * if we did handle anything that is not a hole here. */ if (wpc->iomap.type != IOMAP_HOLE) *bytes_submitted += ret; } while (rlen); return 0; } /* * Check interaction of the folio with the file end. * * If the folio is entirely beyond i_size, return false. If it straddles * i_size, adjust end_pos and zero all data beyond i_size. */ static bool iomap_writeback_handle_eof(struct folio *folio, struct inode *inode, u64 *end_pos) { u64 isize = i_size_read(inode); if (*end_pos > isize) { size_t poff = offset_in_folio(folio, isize); pgoff_t end_index = isize >> PAGE_SHIFT; /* * If the folio is entirely ouside of i_size, skip it. * * This can happen due to a truncate operation that is in * progress and in that case truncate will finish it off once * we've dropped the folio lock. * * Note that the pgoff_t used for end_index is an unsigned long. * If the given offset is greater than 16TB on a 32-bit system, * then if we checked if the folio is fully outside i_size with * "if (folio->index >= end_index + 1)", "end_index + 1" would * overflow and evaluate to 0. Hence this folio would be * redirtied and written out repeatedly, which would result in * an infinite loop; the user program performing this operation * would hang. Instead, we can detect this situation by * checking if the folio is totally beyond i_size or if its * offset is just equal to the EOF. */ if (folio->index > end_index || (folio->index == end_index && poff == 0)) return false; /* * The folio straddles i_size. * * It must be zeroed out on each and every writepage invocation * because it may be mmapped: * * A file is mapped in multiples of the page size. For a * file that is not a multiple of the page size, the * remaining memory is zeroed when mapped, and writes to that * region are not written out to the file. * * Also adjust the end_pos to the end of file and skip writeback * for all blocks entirely beyond i_size. */ folio_zero_segment(folio, poff, folio_size(folio)); *end_pos = isize; } return true; } int iomap_writeback_folio(struct iomap_writepage_ctx *wpc, struct folio *folio) { struct iomap_folio_state *ifs = folio->private; struct inode *inode = wpc->inode; u64 pos = folio_pos(folio); u64 end_pos = pos + folio_size(folio); u64 end_aligned = 0; size_t bytes_submitted = 0; int error = 0; u32 rlen; WARN_ON_ONCE(!folio_test_locked(folio)); WARN_ON_ONCE(folio_test_dirty(folio)); WARN_ON_ONCE(folio_test_writeback(folio)); trace_iomap_writeback_folio(inode, pos, folio_size(folio)); if (!iomap_writeback_handle_eof(folio, inode, &end_pos)) return 0; WARN_ON_ONCE(end_pos <= pos); if (i_blocks_per_folio(inode, folio) > 1) { if (!ifs) { ifs = ifs_alloc(inode, folio, 0); iomap_set_range_dirty(folio, 0, end_pos - pos); } iomap_writeback_init(inode, folio); } /* * Set the writeback bit ASAP, as the I/O completion for the single * block per folio case happen hit as soon as we're submitting the bio. */ folio_start_writeback(folio); /* * Walk through the folio to find dirty areas to write back. */ end_aligned = round_up(end_pos, i_blocksize(inode)); while ((rlen = iomap_find_dirty_range(folio, &pos, end_aligned))) { error = iomap_writeback_range(wpc, folio, pos, rlen, end_pos, &bytes_submitted); if (error) break; pos += rlen; } if (bytes_submitted) wpc->nr_folios++; /* * We can have dirty bits set past end of file in page_mkwrite path * while mapping the last partial folio. Hence it's better to clear * all the dirty bits in the folio here. */ iomap_clear_range_dirty(folio, 0, folio_size(folio)); /* * Usually the writeback bit is cleared by the I/O completion handler. * But we may end up either not actually writing any blocks, or (when * there are multiple blocks in a folio) all I/O might have finished * already at this point. In that case we need to clear the writeback * bit ourselves right after unlocking the page. */ if (ifs) { /* * Subtract any bytes that were initially accounted to * write_bytes_pending but skipped for writeback. */ size_t bytes_not_submitted = folio_size(folio) - bytes_submitted; if (bytes_not_submitted) iomap_finish_folio_write(inode, folio, bytes_not_submitted); } else if (!bytes_submitted) { folio_end_writeback(folio); } mapping_set_error(inode->i_mapping, error); return error; } EXPORT_SYMBOL_GPL(iomap_writeback_folio); int iomap_writepages(struct iomap_writepage_ctx *wpc) { struct address_space *mapping = wpc->inode->i_mapping; struct folio *folio = NULL; int error; /* * Writeback from reclaim context should never happen except in the case * of a VM regression so warn about it and refuse to write the data. */ if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC)) return -EIO; while ((folio = writeback_iter(mapping, wpc->wbc, folio, &error))) { error = iomap_writeback_folio(wpc, folio); folio_unlock(folio); } /* * If @error is non-zero, it means that we have a situation where some * part of the submission process has failed after we've marked pages * for writeback. * * We cannot cancel the writeback directly in that case, so always call * ->writeback_submit to run the I/O completion handler to clear the * writeback bit and let the file system proess the errors. */ if (wpc->wb_ctx) return wpc->ops->writeback_submit(wpc, error); return error; } EXPORT_SYMBOL_GPL(iomap_writepages); |
| 3 4 1 1 4 1 4 3 3 4 3 4 1 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 | // SPDX-License-Identifier: GPL-2.0-or-later /* * linux/ipc/msgutil.c * Copyright (C) 1999, 2004 Manfred Spraul */ #include <linux/spinlock.h> #include <linux/init.h> #include <linux/security.h> #include <linux/slab.h> #include <linux/ipc.h> #include <linux/msg.h> #include <linux/ipc_namespace.h> #include <linux/utsname.h> #include <linux/proc_ns.h> #include <linux/uaccess.h> #include <linux/sched.h> #include <linux/nstree.h> #include "util.h" DEFINE_SPINLOCK(mq_lock); /* * The next 2 defines are here bc this is the only file * compiled when either CONFIG_SYSVIPC and CONFIG_POSIX_MQUEUE * and not CONFIG_IPC_NS. */ struct ipc_namespace init_ipc_ns = { .ns = NS_COMMON_INIT(init_ipc_ns), .user_ns = &init_user_ns, }; struct msg_msgseg { struct msg_msgseg *next; /* the next part of the message follows immediately */ }; #define DATALEN_MSG ((size_t)PAGE_SIZE-sizeof(struct msg_msg)) #define DATALEN_SEG ((size_t)PAGE_SIZE-sizeof(struct msg_msgseg)) static kmem_buckets *msg_buckets __ro_after_init; static int __init init_msg_buckets(void) { msg_buckets = kmem_buckets_create("msg_msg", SLAB_ACCOUNT, sizeof(struct msg_msg), DATALEN_MSG, NULL); return 0; } subsys_initcall(init_msg_buckets); static struct msg_msg *alloc_msg(size_t len) { struct msg_msg *msg; struct msg_msgseg **pseg; size_t alen; alen = min(len, DATALEN_MSG); msg = kmem_buckets_alloc(msg_buckets, sizeof(*msg) + alen, GFP_KERNEL); if (msg == NULL) return NULL; msg->next = NULL; msg->security = NULL; len -= alen; pseg = &msg->next; while (len > 0) { struct msg_msgseg *seg; cond_resched(); alen = min(len, DATALEN_SEG); seg = kmalloc(sizeof(*seg) + alen, GFP_KERNEL_ACCOUNT); if (seg == NULL) goto out_err; *pseg = seg; seg->next = NULL; pseg = &seg->next; len -= alen; } return msg; out_err: free_msg(msg); return NULL; } struct msg_msg *load_msg(const void __user *src, size_t len) { struct msg_msg *msg; struct msg_msgseg *seg; int err = -EFAULT; size_t alen; msg = alloc_msg(len); if (msg == NULL) return ERR_PTR(-ENOMEM); alen = min(len, DATALEN_MSG); if (copy_from_user(msg + 1, src, alen)) goto out_err; for (seg = msg->next; seg != NULL; seg = seg->next) { len -= alen; src = (char __user *)src + alen; alen = min(len, DATALEN_SEG); if (copy_from_user(seg + 1, src, alen)) goto out_err; } err = security_msg_msg_alloc(msg); if (err) goto out_err; return msg; out_err: free_msg(msg); return ERR_PTR(err); } #ifdef CONFIG_CHECKPOINT_RESTORE struct msg_msg *copy_msg(struct msg_msg *src, struct msg_msg *dst) { struct msg_msgseg *dst_pseg, *src_pseg; size_t len = src->m_ts; size_t alen; if (src->m_ts > dst->m_ts) return ERR_PTR(-EINVAL); alen = min(len, DATALEN_MSG); memcpy(dst + 1, src + 1, alen); for (dst_pseg = dst->next, src_pseg = src->next; src_pseg != NULL; dst_pseg = dst_pseg->next, src_pseg = src_pseg->next) { len -= alen; alen = min(len, DATALEN_SEG); memcpy(dst_pseg + 1, src_pseg + 1, alen); } dst->m_type = src->m_type; dst->m_ts = src->m_ts; return dst; } #else struct msg_msg *copy_msg(struct msg_msg *src, struct msg_msg *dst) { return ERR_PTR(-ENOSYS); } #endif int store_msg(void __user *dest, struct msg_msg *msg, size_t len) { size_t alen; struct msg_msgseg *seg; alen = min(len, DATALEN_MSG); if (copy_to_user(dest, msg + 1, alen)) return -1; for (seg = msg->next; seg != NULL; seg = seg->next) { len -= alen; dest = (char __user *)dest + alen; alen = min(len, DATALEN_SEG); if (copy_to_user(dest, seg + 1, alen)) return -1; } return 0; } void free_msg(struct msg_msg *msg) { struct msg_msgseg *seg; security_msg_msg_free(msg); seg = msg->next; kfree(msg); while (seg != NULL) { struct msg_msgseg *tmp = seg->next; cond_resched(); kfree(seg); seg = tmp; } } |
| 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 | // SPDX-License-Identifier: GPL-2.0-or-later /* * USB CDC EEM network interface driver * Copyright (C) 2009 Oberthur Technologies * by Omar Laazimani, Olivier Condemine */ #include <linux/module.h> #include <linux/netdevice.h> #include <linux/etherdevice.h> #include <linux/ctype.h> #include <linux/ethtool.h> #include <linux/workqueue.h> #include <linux/mii.h> #include <linux/usb.h> #include <linux/crc32.h> #include <linux/usb/cdc.h> #include <linux/usb/usbnet.h> #include <linux/gfp.h> #include <linux/if_vlan.h> /* * This driver is an implementation of the CDC "Ethernet Emulation * Model" (EEM) specification, which encapsulates Ethernet frames * for transport over USB using a simpler USB device model than the * previous CDC "Ethernet Control Model" (ECM, or "CDC Ethernet"). * * For details, see https://usb.org/sites/default/files/CDC_EEM10.pdf * * This version has been tested with GIGAntIC WuaoW SIM Smart Card on 2.6.24, * 2.6.27 and 2.6.30rc2 kernel. * It has also been validated on Openmoko Om 2008.12 (based on 2.6.24 kernel). * build on 23-April-2009 */ #define EEM_HEAD 2 /* 2 byte header */ /*-------------------------------------------------------------------------*/ static void eem_linkcmd_complete(struct urb *urb) { dev_kfree_skb(urb->context); usb_free_urb(urb); } static void eem_linkcmd(struct usbnet *dev, struct sk_buff *skb) { struct urb *urb; int status; urb = usb_alloc_urb(0, GFP_ATOMIC); if (!urb) goto fail; usb_fill_bulk_urb(urb, dev->udev, dev->out, skb->data, skb->len, eem_linkcmd_complete, skb); status = usb_submit_urb(urb, GFP_ATOMIC); if (status) { usb_free_urb(urb); fail: dev_kfree_skb(skb); netdev_warn(dev->net, "link cmd failure\n"); return; } } static int eem_bind(struct usbnet *dev, struct usb_interface *intf) { int status = 0; status = usbnet_get_endpoints(dev, intf); if (status < 0) return status; /* no jumbogram (16K) support for now */ dev->net->hard_header_len += EEM_HEAD + ETH_FCS_LEN + VLAN_HLEN; dev->hard_mtu = dev->net->mtu + dev->net->hard_header_len; return 0; } /* * EEM permits packing multiple Ethernet frames into USB transfers * (a "bundle"), but for TX we don't try to do that. */ static struct sk_buff *eem_tx_fixup(struct usbnet *dev, struct sk_buff *skb, gfp_t flags) { struct sk_buff *skb2 = NULL; u16 len = skb->len; u32 crc = 0; int padlen = 0; /* When ((len + EEM_HEAD + ETH_FCS_LEN) % dev->maxpacket) is * zero, stick two bytes of zero length EEM packet on the end. * Else the framework would add invalid single byte padding, * since it can't know whether ZLPs will be handled right by * all the relevant hardware and software. */ if (!((len + EEM_HEAD + ETH_FCS_LEN) % dev->maxpacket)) padlen += 2; if (!skb_cloned(skb)) { int headroom = skb_headroom(skb); int tailroom = skb_tailroom(skb); if ((tailroom >= ETH_FCS_LEN + padlen) && (headroom >= EEM_HEAD)) goto done; if ((headroom + tailroom) > (EEM_HEAD + ETH_FCS_LEN + padlen)) { skb->data = memmove(skb->head + EEM_HEAD, skb->data, skb->len); skb_set_tail_pointer(skb, len); goto done; } } skb2 = skb_copy_expand(skb, EEM_HEAD, ETH_FCS_LEN + padlen, flags); dev_kfree_skb_any(skb); if (!skb2) return NULL; skb = skb2; done: /* we don't use the "no Ethernet CRC" option */ crc = crc32_le(~0, skb->data, skb->len); crc = ~crc; put_unaligned_le32(crc, skb_put(skb, 4)); /* EEM packet header format: * b0..13: length of ethernet frame * b14: bmCRC (1 == valid Ethernet CRC) * b15: bmType (0 == data) */ len = skb->len; put_unaligned_le16(BIT(14) | len, skb_push(skb, 2)); /* Bundle a zero length EEM packet if needed */ if (padlen) put_unaligned_le16(0, skb_put(skb, 2)); return skb; } static int eem_rx_fixup(struct usbnet *dev, struct sk_buff *skb) { /* * Our task here is to strip off framing, leaving skb with one * data frame for the usbnet framework code to process. But we * may have received multiple EEM payloads, or command payloads. * So we must process _everything_ as if it's a header, except * maybe the last data payload * * REVISIT the framework needs updating so that when we consume * all payloads (the last or only message was a command, or a * zero length EEM packet) that is not accounted as an rx_error. */ do { struct sk_buff *skb2 = NULL; u16 header; u16 len = 0; /* incomplete EEM header? */ if (skb->len < EEM_HEAD) return 0; /* * EEM packet header format: * b0..14: EEM type dependent (Data or Command) * b15: bmType */ header = get_unaligned_le16(skb->data); skb_pull(skb, EEM_HEAD); /* * The bmType bit helps to denote when EEM * packet is data or command : * bmType = 0 : EEM data payload * bmType = 1 : EEM (link) command */ if (header & BIT(15)) { u16 bmEEMCmd; /* * EEM (link) command packet: * b0..10: bmEEMCmdParam * b11..13: bmEEMCmd * b14: bmReserved (must be 0) * b15: 1 (EEM command) */ if (header & BIT(14)) { netdev_dbg(dev->net, "reserved command %04x\n", header); continue; } bmEEMCmd = (header >> 11) & 0x7; switch (bmEEMCmd) { /* Responding to echo requests is mandatory. */ case 0: /* Echo command */ len = header & 0x7FF; /* bogus command? */ if (skb->len < len) return 0; skb2 = skb_clone(skb, GFP_ATOMIC); if (unlikely(!skb2)) goto next; skb_trim(skb2, len); put_unaligned_le16(BIT(15) | BIT(11) | len, skb_push(skb2, 2)); eem_linkcmd(dev, skb2); break; /* * Host may choose to ignore hints. * - suspend: peripheral ready to suspend * - response: suggest N millisec polling * - response complete: suggest N sec polling * * Suspend is reported and maybe heeded. */ case 2: /* Suspend hint */ usbnet_device_suggests_idle(dev); continue; case 3: /* Response hint */ case 4: /* Response complete hint */ continue; /* * Hosts should never receive host-to-peripheral * or reserved command codes; or responses to an * echo command we didn't send. */ case 1: /* Echo response */ case 5: /* Tickle */ default: /* reserved */ netdev_warn(dev->net, "unexpected link command %d\n", bmEEMCmd); continue; } } else { u32 crc, crc2; int is_last; /* zero length EEM packet? */ if (header == 0) continue; /* * EEM data packet header : * b0..13: length of ethernet frame * b14: bmCRC * b15: 0 (EEM data) */ len = header & 0x3FFF; /* bogus EEM payload? */ if (skb->len < len) return 0; /* bogus ethernet frame? */ if (len < (ETH_HLEN + ETH_FCS_LEN)) goto next; /* * Treat the last payload differently: framework * code expects our "fixup" to have stripped off * headers, so "skb" is a data packet (or error). * Else if it's not the last payload, keep "skb" * for further processing. */ is_last = (len == skb->len); if (is_last) skb2 = skb; else { skb2 = skb_clone(skb, GFP_ATOMIC); if (unlikely(!skb2)) return 0; } /* * The bmCRC helps to denote when the CRC field in * the Ethernet frame contains a calculated CRC: * bmCRC = 1 : CRC is calculated * bmCRC = 0 : CRC = 0xDEADBEEF */ if (header & BIT(14)) { crc = get_unaligned_le32(skb2->data + len - ETH_FCS_LEN); crc2 = ~crc32_le(~0, skb2->data, skb2->len - ETH_FCS_LEN); } else { crc = get_unaligned_be32(skb2->data + len - ETH_FCS_LEN); crc2 = 0xdeadbeef; } skb_trim(skb2, len - ETH_FCS_LEN); if (is_last) return crc == crc2; if (unlikely(crc != crc2)) { dev->net->stats.rx_errors++; dev_kfree_skb_any(skb2); } else usbnet_skb_return(dev, skb2); } next: skb_pull(skb, len); } while (skb->len); return 1; } static const struct driver_info eem_info = { .description = "CDC EEM Device", .flags = FLAG_ETHER | FLAG_POINTTOPOINT, .bind = eem_bind, .rx_fixup = eem_rx_fixup, .tx_fixup = eem_tx_fixup, }; /*-------------------------------------------------------------------------*/ static const struct usb_device_id products[] = { { USB_INTERFACE_INFO(USB_CLASS_COMM, USB_CDC_SUBCLASS_EEM, USB_CDC_PROTO_EEM), .driver_info = (unsigned long) &eem_info, }, { /* EMPTY == end of list */ }, }; MODULE_DEVICE_TABLE(usb, products); static struct usb_driver eem_driver = { .name = "cdc_eem", .id_table = products, .probe = usbnet_probe, .disconnect = usbnet_disconnect, .suspend = usbnet_suspend, .resume = usbnet_resume, .disable_hub_initiated_lpm = 1, }; module_usb_driver(eem_driver); MODULE_AUTHOR("Omar Laazimani <omar.oberthur@gmail.com>"); MODULE_DESCRIPTION("USB CDC EEM"); MODULE_LICENSE("GPL"); |
| 18 3 19 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 | // SPDX-License-Identifier: GPL-2.0-only /* PIPAPO: PIle PAcket POlicies: set for arbitrary concatenations of ranges * * Copyright (c) 2019-2020 Red Hat GmbH * * Author: Stefano Brivio <sbrivio@redhat.com> */ /** * DOC: Theory of Operation * * * Problem * ------- * * Match packet bytes against entries composed of ranged or non-ranged packet * field specifiers, mapping them to arbitrary references. For example: * * :: * * --- fields ---> * | [net],[port],[net]... => [reference] * entries [net],[port],[net]... => [reference] * | [net],[port],[net]... => [reference] * V ... * * where [net] fields can be IP ranges or netmasks, and [port] fields are port * ranges. Arbitrary packet fields can be matched. * * * Algorithm Overview * ------------------ * * This algorithm is loosely inspired by [Ligatti 2010], and fundamentally * relies on the consideration that every contiguous range in a space of b bits * can be converted into b * 2 netmasks, from Theorem 3 in [Rottenstreich 2010], * as also illustrated in Section 9 of [Kogan 2014]. * * Classification against a number of entries, that require matching given bits * of a packet field, is performed by grouping those bits in sets of arbitrary * size, and classifying packet bits one group at a time. * * Example: * to match the source port (16 bits) of a packet, we can divide those 16 bits * in 4 groups of 4 bits each. Given the entry: * 0000 0001 0101 1001 * and a packet with source port: * 0000 0001 1010 1001 * first and second groups match, but the third doesn't. We conclude that the * packet doesn't match the given entry. * * Translate the set to a sequence of lookup tables, one per field. Each table * has two dimensions: bit groups to be matched for a single packet field, and * all the possible values of said groups (buckets). Input entries are * represented as one or more rules, depending on the number of composing * netmasks for the given field specifier, and a group match is indicated as a * set bit, with number corresponding to the rule index, in all the buckets * whose value matches the entry for a given group. * * Rules are mapped between fields through an array of x, n pairs, with each * item mapping a matched rule to one or more rules. The position of the pair in * the array indicates the matched rule to be mapped to the next field, x * indicates the first rule index in the next field, and n the amount of * next-field rules the current rule maps to. * * The mapping array for the last field maps to the desired references. * * To match, we perform table lookups using the values of grouped packet bits, * and use a sequence of bitwise operations to progressively evaluate rule * matching. * * A stand-alone, reference implementation, also including notes about possible * future optimisations, is available at: * https://pipapo.lameexcu.se/ * * Insertion * --------- * * - For each packet field: * * - divide the b packet bits we want to classify into groups of size t, * obtaining ceil(b / t) groups * * Example: match on destination IP address, with t = 4: 32 bits, 8 groups * of 4 bits each * * - allocate a lookup table with one column ("bucket") for each possible * value of a group, and with one row for each group * * Example: 8 groups, 2^4 buckets: * * :: * * bucket * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 * 0 * 1 * 2 * 3 * 4 * 5 * 6 * 7 * * - map the bits we want to classify for the current field, for a given * entry, to a single rule for non-ranged and netmask set items, and to one * or multiple rules for ranges. Ranges are expanded to composing netmasks * by pipapo_expand(). * * Example: 2 entries, 10.0.0.5:1024 and 192.168.1.0-192.168.2.1:2048 * - rule #0: 10.0.0.5 * - rule #1: 192.168.1.0/24 * - rule #2: 192.168.2.0/31 * * - insert references to the rules in the lookup table, selecting buckets * according to bit values of a rule in the given group. This is done by * pipapo_insert(). * * Example: given: * - rule #0: 10.0.0.5 mapping to buckets * < 0 10 0 0 0 0 0 5 > * - rule #1: 192.168.1.0/24 mapping to buckets * < 12 0 10 8 0 1 < 0..15 > < 0..15 > > * - rule #2: 192.168.2.0/31 mapping to buckets * < 12 0 10 8 0 2 0 < 0..1 > > * * these bits are set in the lookup table: * * :: * * bucket * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 * 0 0 1,2 * 1 1,2 0 * 2 0 1,2 * 3 0 1,2 * 4 0,1,2 * 5 0 1 2 * 6 0,1,2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 * 7 1,2 1,2 1 1 1 0,1 1 1 1 1 1 1 1 1 1 1 * * - if this is not the last field in the set, fill a mapping array that maps * rules from the lookup table to rules belonging to the same entry in * the next lookup table, done by pipapo_map(). * * Note that as rules map to contiguous ranges of rules, given how netmask * expansion and insertion is performed, &union nft_pipapo_map_bucket stores * this information as pairs of first rule index, rule count. * * Example: 2 entries, 10.0.0.5:1024 and 192.168.1.0-192.168.2.1:2048, * given lookup table #0 for field 0 (see example above): * * :: * * bucket * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 * 0 0 1,2 * 1 1,2 0 * 2 0 1,2 * 3 0 1,2 * 4 0,1,2 * 5 0 1 2 * 6 0,1,2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 * 7 1,2 1,2 1 1 1 0,1 1 1 1 1 1 1 1 1 1 1 * * and lookup table #1 for field 1 with: * - rule #0: 1024 mapping to buckets * < 0 0 4 0 > * - rule #1: 2048 mapping to buckets * < 0 0 5 0 > * * :: * * bucket * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 * 0 0,1 * 1 0,1 * 2 0 1 * 3 0,1 * * we need to map rules for 10.0.0.5 in lookup table #0 (rule #0) to 1024 * in lookup table #1 (rule #0) and rules for 192.168.1.0-192.168.2.1 * (rules #1, #2) to 2048 in lookup table #2 (rule #1): * * :: * * rule indices in current field: 0 1 2 * map to rules in next field: 0 1 1 * * - if this is the last field in the set, fill a mapping array that maps * rules from the last lookup table to element pointers, also done by * pipapo_map(). * * Note that, in this implementation, we have two elements (start, end) for * each entry. The pointer to the end element is stored in this array, and * the pointer to the start element is linked from it. * * Example: entry 10.0.0.5:1024 has a corresponding &struct nft_pipapo_elem * pointer, 0x66, and element for 192.168.1.0-192.168.2.1:2048 is at 0x42. * From the rules of lookup table #1 as mapped above: * * :: * * rule indices in last field: 0 1 * map to elements: 0x66 0x42 * * * Matching * -------- * * We use a result bitmap, with the size of a single lookup table bucket, to * represent the matching state that applies at every algorithm step. This is * done by pipapo_lookup(). * * - For each packet field: * * - start with an all-ones result bitmap (res_map in pipapo_lookup()) * * - perform a lookup into the table corresponding to the current field, * for each group, and at every group, AND the current result bitmap with * the value from the lookup table bucket * * :: * * Example: 192.168.1.5 < 12 0 10 8 0 1 0 5 >, with lookup table from * insertion examples. * Lookup table buckets are at least 3 bits wide, we'll assume 8 bits for * convenience in this example. Initial result bitmap is 0xff, the steps * below show the value of the result bitmap after each group is processed: * * bucket * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 * 0 0 1,2 * result bitmap is now: 0xff & 0x6 [bucket 12] = 0x6 * * 1 1,2 0 * result bitmap is now: 0x6 & 0x6 [bucket 0] = 0x6 * * 2 0 1,2 * result bitmap is now: 0x6 & 0x6 [bucket 10] = 0x6 * * 3 0 1,2 * result bitmap is now: 0x6 & 0x6 [bucket 8] = 0x6 * * 4 0,1,2 * result bitmap is now: 0x6 & 0x7 [bucket 0] = 0x6 * * 5 0 1 2 * result bitmap is now: 0x6 & 0x2 [bucket 1] = 0x2 * * 6 0,1,2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 * result bitmap is now: 0x2 & 0x7 [bucket 0] = 0x2 * * 7 1,2 1,2 1 1 1 0,1 1 1 1 1 1 1 1 1 1 1 * final result bitmap for this field is: 0x2 & 0x3 [bucket 5] = 0x2 * * - at the next field, start with a new, all-zeroes result bitmap. For each * bit set in the previous result bitmap, fill the new result bitmap * (fill_map in pipapo_lookup()) with the rule indices from the * corresponding buckets of the mapping field for this field, done by * pipapo_refill() * * Example: with mapping table from insertion examples, with the current * result bitmap from the previous example, 0x02: * * :: * * rule indices in current field: 0 1 2 * map to rules in next field: 0 1 1 * * the new result bitmap will be 0x02: rule 1 was set, and rule 1 will be * set. * * We can now extend this example to cover the second iteration of the step * above (lookup and AND bitmap): assuming the port field is * 2048 < 0 0 5 0 >, with starting result bitmap 0x2, and lookup table * for "port" field from pre-computation example: * * :: * * bucket * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 * 0 0,1 * 1 0,1 * 2 0 1 * 3 0,1 * * operations are: 0x2 & 0x3 [bucket 0] & 0x3 [bucket 0] & 0x2 [bucket 5] * & 0x3 [bucket 0], resulting bitmap is 0x2. * * - if this is the last field in the set, look up the value from the mapping * array corresponding to the final result bitmap * * Example: 0x2 resulting bitmap from 192.168.1.5:2048, mapping array for * last field from insertion example: * * :: * * rule indices in last field: 0 1 * map to elements: 0x66 0x42 * * the matching element is at 0x42. * * * References * ---------- * * [Ligatti 2010] * A Packet-classification Algorithm for Arbitrary Bitmask Rules, with * Automatic Time-space Tradeoffs * Jay Ligatti, Josh Kuhn, and Chris Gage. * Proceedings of the IEEE International Conference on Computer * Communication Networks (ICCCN), August 2010. * https://www.cse.usf.edu/~ligatti/papers/grouper-conf.pdf * * [Rottenstreich 2010] * Worst-Case TCAM Rule Expansion * Ori Rottenstreich and Isaac Keslassy. * 2010 Proceedings IEEE INFOCOM, San Diego, CA, 2010. * http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.4592&rep=rep1&type=pdf * * [Kogan 2014] * SAX-PAC (Scalable And eXpressive PAcket Classification) * Kirill Kogan, Sergey Nikolenko, Ori Rottenstreich, William Culhane, * and Patrick Eugster. * Proceedings of the 2014 ACM conference on SIGCOMM, August 2014. * https://www.sigcomm.org/sites/default/files/ccr/papers/2014/August/2619239-2626294.pdf */ #include <linux/kernel.h> #include <linux/init.h> #include <linux/module.h> #include <linux/netlink.h> #include <linux/netfilter.h> #include <linux/netfilter/nf_tables.h> #include <net/netfilter/nf_tables_core.h> #include <uapi/linux/netfilter/nf_tables.h> #include <linux/bitmap.h> #include <linux/bitops.h> #include "nft_set_pipapo_avx2.h" #include "nft_set_pipapo.h" /** * pipapo_refill() - For each set bit, set bits from selected mapping table item * @map: Bitmap to be scanned for set bits * @len: Length of bitmap in longs * @rules: Number of rules in field * @dst: Destination bitmap * @mt: Mapping table containing bit set specifiers * @match_only: Find a single bit and return, don't fill * * Iteration over set bits with __builtin_ctzl(): Daniel Lemire, public domain. * * For each bit set in map, select the bucket from mapping table with index * corresponding to the position of the bit set. Use start bit and amount of * bits specified in bucket to fill region in dst. * * Return: -1 on no match, bit position on 'match_only', 0 otherwise. */ int pipapo_refill(unsigned long *map, unsigned int len, unsigned int rules, unsigned long *dst, const union nft_pipapo_map_bucket *mt, bool match_only) { unsigned long bitset; unsigned int k; int ret = -1; for (k = 0; k < len; k++) { bitset = map[k]; while (bitset) { unsigned long t = bitset & -bitset; int r = __builtin_ctzl(bitset); int i = k * BITS_PER_LONG + r; if (unlikely(i >= rules)) { map[k] = 0; return -1; } if (match_only) { bitmap_clear(map, i, 1); return i; } ret = 0; bitmap_set(dst, mt[i].to, mt[i].n); bitset ^= t; } map[k] = 0; } return ret; } /** * pipapo_get_slow() - Get matching element reference given key data * @m: storage containing the set elements * @data: Key data to be matched against existing elements * @genmask: If set, check that element is active in given genmask * @tstamp: timestamp to check for expired elements * * For more details, see DOC: Theory of Operation. * * This is the main lookup function. It matches key data against either * the working match set or the uncommitted copy, depending on what the * caller passed to us. * nft_pipapo_get (lookup from userspace/control plane) and nft_pipapo_lookup * (datapath lookup) pass the active copy. * The insertion path will pass the uncommitted working copy. * * Return: pointer to &struct nft_pipapo_elem on match, NULL otherwise. */ static struct nft_pipapo_elem *pipapo_get_slow(const struct nft_pipapo_match *m, const u8 *data, u8 genmask, u64 tstamp) { unsigned long *res_map, *fill_map, *map; struct nft_pipapo_scratch *scratch; const struct nft_pipapo_field *f; bool map_index; int i; local_bh_disable(); scratch = *raw_cpu_ptr(m->scratch); if (unlikely(!scratch)) goto out; __local_lock_nested_bh(&scratch->bh_lock); map_index = scratch->map_index; map = NFT_PIPAPO_LT_ALIGN(&scratch->__map[0]); res_map = map + (map_index ? m->bsize_max : 0); fill_map = map + (map_index ? 0 : m->bsize_max); pipapo_resmap_init(m, res_map); nft_pipapo_for_each_field(f, i, m) { bool last = i == m->field_count - 1; int b; /* For each bit group: select lookup table bucket depending on * packet bytes value, then AND bucket value */ if (likely(f->bb == 8)) pipapo_and_field_buckets_8bit(f, res_map, data); else pipapo_and_field_buckets_4bit(f, res_map, data); NFT_PIPAPO_GROUP_BITS_ARE_8_OR_4; data += f->groups / NFT_PIPAPO_GROUPS_PER_BYTE(f); /* Now populate the bitmap for the next field, unless this is * the last field, in which case return the matched 'ext' * pointer if any. * * Now res_map contains the matching bitmap, and fill_map is the * bitmap for the next field. */ next_match: b = pipapo_refill(res_map, f->bsize, f->rules, fill_map, f->mt, last); if (b < 0) { scratch->map_index = map_index; __local_unlock_nested_bh(&scratch->bh_lock); local_bh_enable(); return NULL; } if (last) { struct nft_pipapo_elem *e; e = f->mt[b].e; if (unlikely(__nft_set_elem_expired(&e->ext, tstamp) || !nft_set_elem_active(&e->ext, genmask))) goto next_match; /* Last field: we're just returning the key without * filling the initial bitmap for the next field, so the * current inactive bitmap is clean and can be reused as * *next* bitmap (not initial) for the next packet. */ scratch->map_index = map_index; __local_unlock_nested_bh(&scratch->bh_lock); local_bh_enable(); return e; } /* Swap bitmap indices: res_map is the initial bitmap for the * next field, and fill_map is guaranteed to be all-zeroes at * this point. */ map_index = !map_index; swap(res_map, fill_map); data += NFT_PIPAPO_GROUPS_PADDING(f); } __local_unlock_nested_bh(&scratch->bh_lock); out: local_bh_enable(); return NULL; } /** * pipapo_get() - Get matching element reference given key data * @m: Storage containing the set elements * @data: Key data to be matched against existing elements * @genmask: If set, check that element is active in given genmask * @tstamp: Timestamp to check for expired elements * * This is a dispatcher function, either calling out the generic C * implementation or, if available, the AVX2 one. * This helper is only called from the control plane, with either RCU * read lock or transaction mutex held. * * Return: pointer to &struct nft_pipapo_elem on match, NULL otherwise. */ static struct nft_pipapo_elem *pipapo_get(const struct nft_pipapo_match *m, const u8 *data, u8 genmask, u64 tstamp) { struct nft_pipapo_elem *e; local_bh_disable(); #if defined(CONFIG_X86_64) && !defined(CONFIG_UML) if (boot_cpu_has(X86_FEATURE_AVX2) && irq_fpu_usable()) { e = pipapo_get_avx2(m, data, genmask, tstamp); local_bh_enable(); return e; } #endif e = pipapo_get_slow(m, data, genmask, tstamp); local_bh_enable(); return e; } /** * nft_pipapo_lookup() - Dataplane fronted for main lookup function * @net: Network namespace * @set: nftables API set representation * @key: pointer to nft registers containing key data * * This function is called from the data path. It will search for * an element matching the given key in the current active copy. * Unlike other set types, this uses 0 instead of nft_genmask_cur(). * * This is because new (future) elements are not reachable from * priv->match, they get added to priv->clone instead. * When the commit phase flips the generation bitmask, the * 'now old' entries are skipped but without the 'now current' * elements becoming visible. Using nft_genmask_cur() thus creates * inconsistent state: matching old entries get skipped but thew * newly matching entries are unreachable. * * GENMASK_ANY doesn't work for the same reason: old-gen entries get * skipped, new-gen entries are only reachable from priv->clone. * * nft_pipapo_commit swaps ->clone and ->match shortly after the * genbit flip. As ->clone doesn't contain the old entries in the first * place, lookup will only find the now-current ones. * * Return: ntables API extension pointer or NULL if no match. */ const struct nft_set_ext * nft_pipapo_lookup(const struct net *net, const struct nft_set *set, const u32 *key) { struct nft_pipapo *priv = nft_set_priv(set); const struct nft_pipapo_match *m; const struct nft_pipapo_elem *e; m = rcu_dereference(priv->match); e = pipapo_get_slow(m, (const u8 *)key, 0, get_jiffies_64()); return e ? &e->ext : NULL; } /** * nft_pipapo_get() - Get matching element reference given key data * @net: Network namespace * @set: nftables API set representation * @elem: nftables API element representation containing key data * @flags: Unused * * This function is called from the control plane path under * RCU read lock. * * Return: set element private pointer or ERR_PTR(-ENOENT). */ static struct nft_elem_priv * nft_pipapo_get(const struct net *net, const struct nft_set *set, const struct nft_set_elem *elem, unsigned int flags) { struct nft_pipapo *priv = nft_set_priv(set); struct nft_pipapo_match *m = rcu_dereference(priv->match); struct nft_pipapo_elem *e; e = pipapo_get(m, (const u8 *)elem->key.val.data, nft_genmask_cur(net), get_jiffies_64()); if (!e) return ERR_PTR(-ENOENT); return &e->priv; } /** * pipapo_realloc_mt() - Reallocate mapping table if needed upon resize * @f: Field containing mapping table * @old_rules: Amount of existing mapped rules * @rules: Amount of new rules to map * * Return: 0 on success, negative error code on failure. */ static int pipapo_realloc_mt(struct nft_pipapo_field *f, unsigned int old_rules, unsigned int rules) { union nft_pipapo_map_bucket *new_mt = NULL, *old_mt = f->mt; const unsigned int extra = PAGE_SIZE / sizeof(*new_mt); unsigned int rules_alloc = rules; might_sleep(); if (unlikely(rules == 0)) goto out_free; /* growing and enough space left, no action needed */ if (rules > old_rules && f->rules_alloc > rules) return 0; /* downsize and extra slack has not grown too large */ if (rules < old_rules) { unsigned int remove = f->rules_alloc - rules; if (remove < (2u * extra)) return 0; } /* If set needs more than one page of memory for rules then * allocate another extra page to avoid frequent reallocation. */ if (rules > extra && check_add_overflow(rules, extra, &rules_alloc)) return -EOVERFLOW; if (rules_alloc > (INT_MAX / sizeof(*new_mt))) return -ENOMEM; new_mt = kvmalloc_array(rules_alloc, sizeof(*new_mt), GFP_KERNEL_ACCOUNT); if (!new_mt) return -ENOMEM; if (old_mt) memcpy(new_mt, old_mt, min(old_rules, rules) * sizeof(*new_mt)); if (rules > old_rules) { memset(new_mt + old_rules, 0, (rules - old_rules) * sizeof(*new_mt)); } out_free: f->rules_alloc = rules_alloc; f->mt = new_mt; kvfree(old_mt); return 0; } /** * lt_calculate_size() - Get storage size for lookup table with overflow check * @groups: Amount of bit groups * @bb: Number of bits grouped together in lookup table buckets * @bsize: Size of each bucket in lookup table, in longs * * Return: allocation size including alignment overhead, negative on overflow */ static ssize_t lt_calculate_size(unsigned int groups, unsigned int bb, unsigned int bsize) { ssize_t ret = groups * NFT_PIPAPO_BUCKETS(bb) * sizeof(long); if (check_mul_overflow(ret, bsize, &ret)) return -1; if (check_add_overflow(ret, NFT_PIPAPO_ALIGN_HEADROOM, &ret)) return -1; if (ret > INT_MAX) return -1; return ret; } /** * pipapo_resize() - Resize lookup or mapping table, or both * @f: Field containing lookup and mapping tables * @old_rules: Previous amount of rules in field * @rules: New amount of rules * * Increase, decrease or maintain tables size depending on new amount of rules, * and copy data over. In case the new size is smaller, throw away data for * highest-numbered rules. * * Return: 0 on success, -ENOMEM on allocation failure. */ static int pipapo_resize(struct nft_pipapo_field *f, unsigned int old_rules, unsigned int rules) { long *new_lt = NULL, *new_p, *old_lt = f->lt, *old_p; unsigned int new_bucket_size, copy; int group, bucket, err; ssize_t lt_size; if (rules >= NFT_PIPAPO_RULE0_MAX) return -ENOSPC; new_bucket_size = DIV_ROUND_UP(rules, BITS_PER_LONG); #ifdef NFT_PIPAPO_ALIGN new_bucket_size = roundup(new_bucket_size, NFT_PIPAPO_ALIGN / sizeof(*new_lt)); #endif if (new_bucket_size == f->bsize) goto mt; if (new_bucket_size > f->bsize) copy = f->bsize; else copy = new_bucket_size; lt_size = lt_calculate_size(f->groups, f->bb, new_bucket_size); if (lt_size < 0) return -ENOMEM; new_lt = kvzalloc(lt_size, GFP_KERNEL_ACCOUNT); if (!new_lt) return -ENOMEM; new_p = NFT_PIPAPO_LT_ALIGN(new_lt); old_p = NFT_PIPAPO_LT_ALIGN(old_lt); for (group = 0; group < f->groups; group++) { for (bucket = 0; bucket < NFT_PIPAPO_BUCKETS(f->bb); bucket++) { memcpy(new_p, old_p, copy * sizeof(*new_p)); new_p += copy; old_p += copy; if (new_bucket_size > f->bsize) new_p += new_bucket_size - f->bsize; else old_p += f->bsize - new_bucket_size; } } mt: err = pipapo_realloc_mt(f, old_rules, rules); if (err) { kvfree(new_lt); return err; } if (new_lt) { f->bsize = new_bucket_size; f->lt = new_lt; kvfree(old_lt); } return 0; } /** * pipapo_bucket_set() - Set rule bit in bucket given group and group value * @f: Field containing lookup table * @rule: Rule index * @group: Group index * @v: Value of bit group */ static void pipapo_bucket_set(struct nft_pipapo_field *f, int rule, int group, int v) { unsigned long *pos; pos = NFT_PIPAPO_LT_ALIGN(f->lt); pos += f->bsize * NFT_PIPAPO_BUCKETS(f->bb) * group; pos += f->bsize * v; __set_bit(rule, pos); } /** * pipapo_lt_4b_to_8b() - Switch lookup table group width from 4 bits to 8 bits * @old_groups: Number of current groups * @bsize: Size of one bucket, in longs * @old_lt: Pointer to the current lookup table * @new_lt: Pointer to the new, pre-allocated lookup table * * Each bucket with index b in the new lookup table, belonging to group g, is * filled with the bit intersection between: * - bucket with index given by the upper 4 bits of b, from group g, and * - bucket with index given by the lower 4 bits of b, from group g + 1 * * That is, given buckets from the new lookup table N(x, y) and the old lookup * table O(x, y), with x bucket index, and y group index: * * N(b, g) := O(b / 16, g) & O(b % 16, g + 1) * * This ensures equivalence of the matching results on lookup. Two examples in * pictures: * * bucket * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ... 254 255 * 0 ^ * 1 | ^ * ... ( & ) | * / \ | * / \ .-( & )-. * / bucket \ | | * group 0 / 1 2 3 \ 4 5 6 7 8 9 10 11 12 13 |14 15 | * 0 / \ | | * 1 \ | | * 2 | --' * 3 '- * ... */ static void pipapo_lt_4b_to_8b(int old_groups, int bsize, unsigned long *old_lt, unsigned long *new_lt) { int g, b, i; for (g = 0; g < old_groups / 2; g++) { int src_g0 = g * 2, src_g1 = g * 2 + 1; for (b = 0; b < NFT_PIPAPO_BUCKETS(8); b++) { int src_b0 = b / NFT_PIPAPO_BUCKETS(4); int src_b1 = b % NFT_PIPAPO_BUCKETS(4); int src_i0 = src_g0 * NFT_PIPAPO_BUCKETS(4) + src_b0; int src_i1 = src_g1 * NFT_PIPAPO_BUCKETS(4) + src_b1; for (i = 0; i < bsize; i++) { *new_lt = old_lt[src_i0 * bsize + i] & old_lt[src_i1 * bsize + i]; new_lt++; } } } } /** * pipapo_lt_8b_to_4b() - Switch lookup table group width from 8 bits to 4 bits * @old_groups: Number of current groups * @bsize: Size of one bucket, in longs * @old_lt: Pointer to the current lookup table * @new_lt: Pointer to the new, pre-allocated lookup table * * Each bucket with index b in the new lookup table, belonging to group g, is * filled with the bit union of: * - all the buckets with index such that the upper four bits of the lower byte * equal b, from group g, with g odd * - all the buckets with index such that the lower four bits equal b, from * group g, with g even * * That is, given buckets from the new lookup table N(x, y) and the old lookup * table O(x, y), with x bucket index, and y group index: * * - with g odd: N(b, g) := U(O(x, g) for each x : x = (b & 0xf0) >> 4) * - with g even: N(b, g) := U(O(x, g) for each x : x = b & 0x0f) * * where U() denotes the arbitrary union operation (binary OR of n terms). This * ensures equivalence of the matching results on lookup. */ static void pipapo_lt_8b_to_4b(int old_groups, int bsize, unsigned long *old_lt, unsigned long *new_lt) { int g, b, bsrc, i; memset(new_lt, 0, old_groups * 2 * NFT_PIPAPO_BUCKETS(4) * bsize * sizeof(unsigned long)); for (g = 0; g < old_groups * 2; g += 2) { int src_g = g / 2; for (b = 0; b < NFT_PIPAPO_BUCKETS(4); b++) { for (bsrc = NFT_PIPAPO_BUCKETS(8) * src_g; bsrc < NFT_PIPAPO_BUCKETS(8) * (src_g + 1); bsrc++) { if (((bsrc & 0xf0) >> 4) != b) continue; for (i = 0; i < bsize; i++) new_lt[i] |= old_lt[bsrc * bsize + i]; } new_lt += bsize; } for (b = 0; b < NFT_PIPAPO_BUCKETS(4); b++) { for (bsrc = NFT_PIPAPO_BUCKETS(8) * src_g; bsrc < NFT_PIPAPO_BUCKETS(8) * (src_g + 1); bsrc++) { if ((bsrc & 0x0f) != b) continue; for (i = 0; i < bsize; i++) new_lt[i] |= old_lt[bsrc * bsize + i]; } new_lt += bsize; } } } /** * pipapo_lt_bits_adjust() - Adjust group size for lookup table if needed * @f: Field containing lookup table */ static void pipapo_lt_bits_adjust(struct nft_pipapo_field *f) { unsigned int groups, bb; unsigned long *new_lt; ssize_t lt_size; lt_size = f->groups * NFT_PIPAPO_BUCKETS(f->bb) * f->bsize * sizeof(*f->lt); if (f->bb == NFT_PIPAPO_GROUP_BITS_SMALL_SET && lt_size > NFT_PIPAPO_LT_SIZE_HIGH) { groups = f->groups * 2; bb = NFT_PIPAPO_GROUP_BITS_LARGE_SET; lt_size = lt_calculate_size(groups, bb, f->bsize); if (lt_size < 0) return; } else if (f->bb == NFT_PIPAPO_GROUP_BITS_LARGE_SET && lt_size < NFT_PIPAPO_LT_SIZE_LOW) { groups = f->groups / 2; bb = NFT_PIPAPO_GROUP_BITS_SMALL_SET; lt_size = lt_calculate_size(groups, bb, f->bsize); if (lt_size < 0) return; /* Don't increase group width if the resulting lookup table size * would exceed the upper size threshold for a "small" set. */ if (lt_size > NFT_PIPAPO_LT_SIZE_HIGH) return; } else { return; } new_lt = kvzalloc(lt_size, GFP_KERNEL_ACCOUNT); if (!new_lt) return; NFT_PIPAPO_GROUP_BITS_ARE_8_OR_4; if (f->bb == 4 && bb == 8) { pipapo_lt_4b_to_8b(f->groups, f->bsize, NFT_PIPAPO_LT_ALIGN(f->lt), NFT_PIPAPO_LT_ALIGN(new_lt)); } else if (f->bb == 8 && bb == 4) { pipapo_lt_8b_to_4b(f->groups, f->bsize, NFT_PIPAPO_LT_ALIGN(f->lt), NFT_PIPAPO_LT_ALIGN(new_lt)); } else { BUG(); } f->groups = groups; f->bb = bb; kvfree(f->lt); f->lt = new_lt; } /** * pipapo_insert() - Insert new rule in field given input key and mask length * @f: Field containing lookup table * @k: Input key for classification, without nftables padding * @mask_bits: Length of mask; matches field length for non-ranged entry * * Insert a new rule reference in lookup buckets corresponding to k and * mask_bits. * * Return: 1 on success (one rule inserted), negative error code on failure. */ static int pipapo_insert(struct nft_pipapo_field *f, const uint8_t *k, int mask_bits) { unsigned int rule = f->rules, group, ret, bit_offset = 0; ret = pipapo_resize(f, f->rules, f->rules + 1); if (ret) return ret; f->rules++; for (group = 0; group < f->groups; group++) { int i, v; u8 mask; v = k[group / (BITS_PER_BYTE / f->bb)]; v &= GENMASK(BITS_PER_BYTE - bit_offset - 1, 0); v >>= (BITS_PER_BYTE - bit_offset) - f->bb; bit_offset += f->bb; bit_offset %= BITS_PER_BYTE; if (mask_bits >= (group + 1) * f->bb) { /* Not masked */ pipapo_bucket_set(f, rule, group, v); } else if (mask_bits <= group * f->bb) { /* Completely masked */ for (i = 0; i < NFT_PIPAPO_BUCKETS(f->bb); i++) pipapo_bucket_set(f, rule, group, i); } else { /* The mask limit falls on this group */ mask = GENMASK(f->bb - 1, 0); mask >>= mask_bits - group * f->bb; for (i = 0; i < NFT_PIPAPO_BUCKETS(f->bb); i++) { if ((i & ~mask) == (v & ~mask)) pipapo_bucket_set(f, rule, group, i); } } } pipapo_lt_bits_adjust(f); return 1; } /** * pipapo_step_diff() - Check if setting @step bit in netmask would change it * @base: Mask we are expanding * @step: Step bit for given expansion step * @len: Total length of mask space (set and unset bits), bytes * * Convenience function for mask expansion. * * Return: true if step bit changes mask (i.e. isn't set), false otherwise. */ static bool pipapo_step_diff(u8 *base, int step, int len) { /* Network order, byte-addressed */ #ifdef __BIG_ENDIAN__ return !(BIT(step % BITS_PER_BYTE) & base[step / BITS_PER_BYTE]); #else return !(BIT(step % BITS_PER_BYTE) & base[len - 1 - step / BITS_PER_BYTE]); #endif } /** * pipapo_step_after_end() - Check if mask exceeds range end with given step * @base: Mask we are expanding * @end: End of range * @step: Step bit for given expansion step, highest bit to be set * @len: Total length of mask space (set and unset bits), bytes * * Convenience function for mask expansion. * * Return: true if mask exceeds range setting step bits, false otherwise. */ static bool pipapo_step_after_end(const u8 *base, const u8 *end, int step, int len) { u8 tmp[NFT_PIPAPO_MAX_BYTES]; int i; memcpy(tmp, base, len); /* Network order, byte-addressed */ for (i = 0; i <= step; i++) #ifdef __BIG_ENDIAN__ tmp[i / BITS_PER_BYTE] |= BIT(i % BITS_PER_BYTE); #else tmp[len - 1 - i / BITS_PER_BYTE] |= BIT(i % BITS_PER_BYTE); #endif return memcmp(tmp, end, len) > 0; } /** * pipapo_base_sum() - Sum step bit to given len-sized netmask base with carry * @base: Netmask base * @step: Step bit to sum * @len: Netmask length, bytes */ static void pipapo_base_sum(u8 *base, int step, int len) { bool carry = false; int i; /* Network order, byte-addressed */ #ifdef __BIG_ENDIAN__ for (i = step / BITS_PER_BYTE; i < len; i++) { #else for (i = len - 1 - step / BITS_PER_BYTE; i >= 0; i--) { #endif if (carry) base[i]++; else base[i] += 1 << (step % BITS_PER_BYTE); if (base[i]) break; carry = true; } } /** * pipapo_expand() - Expand to composing netmasks, insert into lookup table * @f: Field containing lookup table * @start: Start of range * @end: End of range * @len: Length of value in bits * * Expand range to composing netmasks and insert corresponding rule references * in lookup buckets. * * Return: number of inserted rules on success, negative error code on failure. */ static int pipapo_expand(struct nft_pipapo_field *f, const u8 *start, const u8 *end, int len) { int step, masks = 0, bytes = DIV_ROUND_UP(len, BITS_PER_BYTE); u8 base[NFT_PIPAPO_MAX_BYTES]; memcpy(base, start, bytes); while (memcmp(base, end, bytes) <= 0) { int err; step = 0; while (pipapo_step_diff(base, step, bytes)) { if (pipapo_step_after_end(base, end, step, bytes)) break; step++; if (step >= len) { if (!masks) { err = pipapo_insert(f, base, 0); if (err < 0) return err; masks = 1; } goto out; } } err = pipapo_insert(f, base, len - step); if (err < 0) return err; masks++; pipapo_base_sum(base, step, bytes); } out: return masks; } /** * pipapo_map() - Insert rules in mapping tables, mapping them between fields * @m: Matching data, including mapping table * @map: Table of rule maps: array of first rule and amount of rules * in next field a given rule maps to, for each field * @e: For last field, nft_set_ext pointer matching rules map to */ static void pipapo_map(struct nft_pipapo_match *m, union nft_pipapo_map_bucket map[NFT_PIPAPO_MAX_FIELDS], struct nft_pipapo_elem *e) { struct nft_pipapo_field *f; int i, j; for (i = 0, f = m->f; i < m->field_count - 1; i++, f++) { for (j = 0; j < map[i].n; j++) { f->mt[map[i].to + j].to = map[i + 1].to; f->mt[map[i].to + j].n = map[i + 1].n; } } /* Last field: map to ext instead of mapping to next field */ for (j = 0; j < map[i].n; j++) f->mt[map[i].to + j].e = e; } /** * pipapo_free_scratch() - Free per-CPU map at original address * @m: Matching data * @cpu: CPU number */ static void pipapo_free_scratch(const struct nft_pipapo_match *m, unsigned int cpu) { struct nft_pipapo_scratch *s; s = *per_cpu_ptr(m->scratch, cpu); kvfree(s); } /** * pipapo_realloc_scratch() - Reallocate scratch maps for partial match results * @clone: Copy of matching data with pending insertions and deletions * @bsize_max: Maximum bucket size, scratch maps cover two buckets * * Return: 0 on success, -ENOMEM on failure. */ static int pipapo_realloc_scratch(struct nft_pipapo_match *clone, unsigned long bsize_max) { int i; for_each_possible_cpu(i) { struct nft_pipapo_scratch *scratch; scratch = kvzalloc_node(struct_size(scratch, __map, bsize_max * 2) + NFT_PIPAPO_ALIGN_HEADROOM, GFP_KERNEL_ACCOUNT, cpu_to_node(i)); if (!scratch) { /* On failure, there's no need to undo previous * allocations: this means that some scratch maps have * a bigger allocated size now (this is only called on * insertion), but the extra space won't be used by any * CPU as new elements are not inserted and m->bsize_max * is not updated. */ return -ENOMEM; } pipapo_free_scratch(clone, i); local_lock_init(&scratch->bh_lock); *per_cpu_ptr(clone->scratch, i) = scratch; } return 0; } static bool nft_pipapo_transaction_mutex_held(const struct nft_set *set) { #ifdef CONFIG_PROVE_LOCKING const struct net *net = read_pnet(&set->net); return lockdep_is_held(&nft_pernet(net)->commit_mutex); #else return true; #endif } static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old); /** * pipapo_maybe_clone() - Build clone for pending data changes, if not existing * @set: nftables API set representation * * Return: newly created or existing clone, if any. NULL on allocation failure */ static struct nft_pipapo_match *pipapo_maybe_clone(const struct nft_set *set) { struct nft_pipapo *priv = nft_set_priv(set); struct nft_pipapo_match *m; if (priv->clone) return priv->clone; m = rcu_dereference_protected(priv->match, nft_pipapo_transaction_mutex_held(set)); priv->clone = pipapo_clone(m); return priv->clone; } /** * nft_pipapo_insert() - Validate and insert ranged elements * @net: Network namespace * @set: nftables API set representation * @elem: nftables API element representation containing key data * @elem_priv: Filled with pointer to &struct nft_set_ext in inserted element * * Return: 0 on success, error pointer on failure. */ static int nft_pipapo_insert(const struct net *net, const struct nft_set *set, const struct nft_set_elem *elem, struct nft_elem_priv **elem_priv) { const struct nft_set_ext *ext = nft_set_elem_ext(set, elem->priv); union nft_pipapo_map_bucket rulemap[NFT_PIPAPO_MAX_FIELDS]; const u8 *start = (const u8 *)elem->key.val.data, *end; struct nft_pipapo_match *m = pipapo_maybe_clone(set); u8 genmask = nft_genmask_next(net); struct nft_pipapo_elem *e, *dup; u64 tstamp = nft_net_tstamp(net); struct nft_pipapo_field *f; const u8 *start_p, *end_p; int i, bsize_max, err = 0; if (!m) return -ENOMEM; if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY_END)) end = (const u8 *)nft_set_ext_key_end(ext)->data; else end = start; dup = pipapo_get(m, start, genmask, tstamp); if (dup) { /* Check if we already have the same exact entry */ const struct nft_data *dup_key, *dup_end; dup_key = nft_set_ext_key(&dup->ext); if (nft_set_ext_exists(&dup->ext, NFT_SET_EXT_KEY_END)) dup_end = nft_set_ext_key_end(&dup->ext); else dup_end = dup_key; if (!memcmp(start, dup_key->data, set->klen) && !memcmp(end, dup_end->data, set->klen)) { *elem_priv = &dup->priv; return -EEXIST; } return -ENOTEMPTY; } /* Look for partially overlapping entries */ dup = pipapo_get(m, end, nft_genmask_next(net), tstamp); if (dup) { *elem_priv = &dup->priv; return -ENOTEMPTY; } /* Validate */ start_p = start; end_p = end; /* some helpers return -1, or 0 >= for valid rule pos, * so we cannot support more than INT_MAX rules at this time. */ BUILD_BUG_ON(NFT_PIPAPO_RULE0_MAX > INT_MAX); nft_pipapo_for_each_field(f, i, m) { if (f->rules >= NFT_PIPAPO_RULE0_MAX) return -ENOSPC; if (memcmp(start_p, end_p, f->groups / NFT_PIPAPO_GROUPS_PER_BYTE(f)) > 0) return -EINVAL; start_p += NFT_PIPAPO_GROUPS_PADDED_SIZE(f); end_p += NFT_PIPAPO_GROUPS_PADDED_SIZE(f); } /* Insert */ bsize_max = m->bsize_max; nft_pipapo_for_each_field(f, i, m) { int ret; rulemap[i].to = f->rules; ret = memcmp(start, end, f->groups / NFT_PIPAPO_GROUPS_PER_BYTE(f)); if (!ret) ret = pipapo_insert(f, start, f->groups * f->bb); else ret = pipapo_expand(f, start, end, f->groups * f->bb); if (ret < 0) return ret; if (f->bsize > bsize_max) bsize_max = f->bsize; rulemap[i].n = ret; start += NFT_PIPAPO_GROUPS_PADDED_SIZE(f); end += NFT_PIPAPO_GROUPS_PADDED_SIZE(f); } if (!*get_cpu_ptr(m->scratch) || bsize_max > m->bsize_max) { put_cpu_ptr(m->scratch); err = pipapo_realloc_scratch(m, bsize_max); if (err) return err; m->bsize_max = bsize_max; } else { put_cpu_ptr(m->scratch); } e = nft_elem_priv_cast(elem->priv); *elem_priv = &e->priv; pipapo_map(m, rulemap, e); return 0; } /** * pipapo_clone() - Clone matching data to create new working copy * @old: Existing matching data * * Return: copy of matching data passed as 'old' or NULL. */ static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old) { struct nft_pipapo_field *dst, *src; struct nft_pipapo_match *new; int i; new = kmalloc(struct_size(new, f, old->field_count), GFP_KERNEL_ACCOUNT); if (!new) return NULL; new->field_count = old->field_count; new->bsize_max = old->bsize_max; new->scratch = alloc_percpu(*new->scratch); if (!new->scratch) goto out_scratch; for_each_possible_cpu(i) *per_cpu_ptr(new->scratch, i) = NULL; if (pipapo_realloc_scratch(new, old->bsize_max)) goto out_scratch_realloc; rcu_head_init(&new->rcu); src = old->f; dst = new->f; for (i = 0; i < old->field_count; i++) { unsigned long *new_lt; ssize_t lt_size; memcpy(dst, src, offsetof(struct nft_pipapo_field, lt)); lt_size = lt_calculate_size(src->groups, src->bb, src->bsize); if (lt_size < 0) goto out_lt; new_lt = kvzalloc(lt_size, GFP_KERNEL_ACCOUNT); if (!new_lt) goto out_lt; dst->lt = new_lt; memcpy(NFT_PIPAPO_LT_ALIGN(new_lt), NFT_PIPAPO_LT_ALIGN(src->lt), src->bsize * sizeof(*dst->lt) * src->groups * NFT_PIPAPO_BUCKETS(src->bb)); if (src->rules > 0) { if (src->rules_alloc > (INT_MAX / sizeof(*src->mt))) goto out_mt; dst->mt = kvmalloc_array(src->rules_alloc, sizeof(*src->mt), GFP_KERNEL_ACCOUNT); if (!dst->mt) goto out_mt; memcpy(dst->mt, src->mt, src->rules * sizeof(*src->mt)); } else { dst->mt = NULL; dst->rules_alloc = 0; } src++; dst++; } return new; out_mt: kvfree(dst->lt); out_lt: for (dst--; i > 0; i--) { kvfree(dst->mt); kvfree(dst->lt); dst--; } out_scratch_realloc: for_each_possible_cpu(i) pipapo_free_scratch(new, i); out_scratch: free_percpu(new->scratch); kfree(new); return NULL; } /** * pipapo_rules_same_key() - Get number of rules originated from the same entry * @f: Field containing mapping table * @first: Index of first rule in set of rules mapping to same entry * * Using the fact that all rules in a field that originated from the same entry * will map to the same set of rules in the next field, or to the same element * reference, return the cardinality of the set of rules that originated from * the same entry as the rule with index @first, @first rule included. * * In pictures: * rules * field #0 0 1 2 3 4 * map to: 0 1 2-4 2-4 5-9 * . . ....... . ... * | | | | \ \ * | | | | \ \ * | | | | \ \ * ' ' ' ' ' \ * in field #1 0 1 2 3 4 5 ... * * if this is called for rule 2 on field #0, it will return 3, as also rules 2 * and 3 in field 0 map to the same set of rules (2, 3, 4) in the next field. * * For the last field in a set, we can rely on associated entries to map to the * same element references. * * Return: Number of rules that originated from the same entry as @first. */ static unsigned int pipapo_rules_same_key(struct nft_pipapo_field *f, unsigned int first) { struct nft_pipapo_elem *e = NULL; /* Keep gcc happy */ unsigned int r; for (r = first; r < f->rules; r++) { if (r != first && e != f->mt[r].e) return r - first; e = f->mt[r].e; } if (r != first) return r - first; return 0; } /** * pipapo_unmap() - Remove rules from mapping tables, renumber remaining ones * @mt: Mapping array * @rules: Original amount of rules in mapping table * @start: First rule index to be removed * @n: Amount of rules to be removed * @to_offset: First rule index, in next field, this group of rules maps to * @is_last: If this is the last field, delete reference from mapping array * * This is used to unmap rules from the mapping table for a single field, * maintaining consistency and compactness for the existing ones. * * In pictures: let's assume that we want to delete rules 2 and 3 from the * following mapping array: * * rules * 0 1 2 3 4 * map to: 4-10 4-10 11-15 11-15 16-18 * * the result will be: * * rules * 0 1 2 * map to: 4-10 4-10 11-13 * * for fields before the last one. In case this is the mapping table for the * last field in a set, and rules map to pointers to &struct nft_pipapo_elem: * * rules * 0 1 2 3 4 * element pointers: 0x42 0x42 0x33 0x33 0x44 * * the result will be: * * rules * 0 1 2 * element pointers: 0x42 0x42 0x44 */ static void pipapo_unmap(union nft_pipapo_map_bucket *mt, unsigned int rules, unsigned int start, unsigned int n, unsigned int to_offset, bool is_last) { int i; memmove(mt + start, mt + start + n, (rules - start - n) * sizeof(*mt)); memset(mt + rules - n, 0, n * sizeof(*mt)); if (is_last) return; for (i = start; i < rules - n; i++) mt[i].to -= to_offset; } /** * pipapo_drop() - Delete entry from lookup and mapping tables, given rule map * @m: Matching data * @rulemap: Table of rule maps, arrays of first rule and amount of rules * in next field a given entry maps to, for each field * * For each rule in lookup table buckets mapping to this set of rules, drop * all bits set in lookup table mapping. In pictures, assuming we want to drop * rules 0 and 1 from this lookup table: * * bucket * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 * 0 0 1,2 * 1 1,2 0 * 2 0 1,2 * 3 0 1,2 * 4 0,1,2 * 5 0 1 2 * 6 0,1,2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 * 7 1,2 1,2 1 1 1 0,1 1 1 1 1 1 1 1 1 1 1 * * rule 2 becomes rule 0, and the result will be: * * bucket * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 * 0 0 * 1 0 * 2 0 * 3 0 * 4 0 * 5 0 * 6 0 * 7 0 0 * * once this is done, call unmap() to drop all the corresponding rule references * from mapping tables. */ static void pipapo_drop(struct nft_pipapo_match *m, union nft_pipapo_map_bucket rulemap[]) { struct nft_pipapo_field *f; int i; nft_pipapo_for_each_field(f, i, m) { int g; for (g = 0; g < f->groups; g++) { unsigned long *pos; int b; pos = NFT_PIPAPO_LT_ALIGN(f->lt) + g * NFT_PIPAPO_BUCKETS(f->bb) * f->bsize; for (b = 0; b < NFT_PIPAPO_BUCKETS(f->bb); b++) { bitmap_cut(pos, pos, rulemap[i].to, rulemap[i].n, f->bsize * BITS_PER_LONG); pos += f->bsize; } } pipapo_unmap(f->mt, f->rules, rulemap[i].to, rulemap[i].n, rulemap[i + 1].n, i == m->field_count - 1); if (pipapo_resize(f, f->rules, f->rules - rulemap[i].n)) { /* We can ignore this, a failure to shrink tables down * doesn't make tables invalid. */ ; } f->rules -= rulemap[i].n; pipapo_lt_bits_adjust(f); } } static void nft_pipapo_gc_deactivate(struct net *net, struct nft_set *set, struct nft_pipapo_elem *e) { nft_setelem_data_deactivate(net, set, &e->priv); } /** * pipapo_gc() - Drop expired entries from set, destroy start and end elements * @set: nftables API set representation * @m: Matching data */ static void pipapo_gc(struct nft_set *set, struct nft_pipapo_match *m) { struct nft_pipapo *priv = nft_set_priv(set); struct net *net = read_pnet(&set->net); unsigned int rules_f0, first_rule = 0; u64 tstamp = nft_net_tstamp(net); struct nft_pipapo_elem *e; struct nft_trans_gc *gc; gc = nft_trans_gc_alloc(set, 0, GFP_KERNEL); if (!gc) return; while ((rules_f0 = pipapo_rules_same_key(m->f, first_rule))) { union nft_pipapo_map_bucket rulemap[NFT_PIPAPO_MAX_FIELDS]; const struct nft_pipapo_field *f; unsigned int i, start, rules_fx; start = first_rule; rules_fx = rules_f0; nft_pipapo_for_each_field(f, i, m) { rulemap[i].to = start; rulemap[i].n = rules_fx; if (i < m->field_count - 1) { rules_fx = f->mt[start].n; start = f->mt[start].to; } } /* Pick the last field, and its last index */ f--; i--; e = f->mt[rulemap[i].to].e; /* synchronous gc never fails, there is no need to set on * NFT_SET_ELEM_DEAD_BIT. */ if (__nft_set_elem_expired(&e->ext, tstamp)) { gc = nft_trans_gc_queue_sync(gc, GFP_KERNEL); if (!gc) return; nft_pipapo_gc_deactivate(net, set, e); pipapo_drop(m, rulemap); nft_trans_gc_elem_add(gc, e); /* And check again current first rule, which is now the * first we haven't checked. */ } else { first_rule += rules_f0; } } gc = nft_trans_gc_catchall_sync(gc); if (gc) { nft_trans_gc_queue_sync_done(gc); priv->last_gc = jiffies; } } /** * pipapo_free_fields() - Free per-field tables contained in matching data * @m: Matching data */ static void pipapo_free_fields(struct nft_pipapo_match *m) { struct nft_pipapo_field *f; int i; nft_pipapo_for_each_field(f, i, m) { kvfree(f->lt); kvfree(f->mt); } } static void pipapo_free_match(struct nft_pipapo_match *m) { int i; for_each_possible_cpu(i) pipapo_free_scratch(m, i); free_percpu(m->scratch); pipapo_free_fields(m); kfree(m); } /** * pipapo_reclaim_match - RCU callback to free fields from old matching data * @rcu: RCU head */ static void pipapo_reclaim_match(struct rcu_head *rcu) { struct nft_pipapo_match *m; m = container_of(rcu, struct nft_pipapo_match, rcu); pipapo_free_match(m); } /** * nft_pipapo_commit() - Replace lookup data with current working copy * @set: nftables API set representation * * While at it, check if we should perform garbage collection on the working * copy before committing it for lookup, and don't replace the table if the * working copy doesn't have pending changes. * * We also need to create a new working copy for subsequent insertions and * deletions. */ static void nft_pipapo_commit(struct nft_set *set) { struct nft_pipapo *priv = nft_set_priv(set); struct nft_pipapo_match *old; if (!priv->clone) return; if (time_after_eq(jiffies, priv->last_gc + nft_set_gc_interval(set))) pipapo_gc(set, priv->clone); old = rcu_replace_pointer(priv->match, priv->clone, nft_pipapo_transaction_mutex_held(set)); priv->clone = NULL; if (old) call_rcu(&old->rcu, pipapo_reclaim_match); } static void nft_pipapo_abort(const struct nft_set *set) { struct nft_pipapo *priv = nft_set_priv(set); if (!priv->clone) return; pipapo_free_match(priv->clone); priv->clone = NULL; } /** * nft_pipapo_activate() - Mark element reference as active given key, commit * @net: Network namespace * @set: nftables API set representation * @elem_priv: nftables API element representation containing key data * * On insertion, elements are added to a copy of the matching data currently * in use for lookups, and not directly inserted into current lookup data. Both * nft_pipapo_insert() and nft_pipapo_activate() are called once for each * element, hence we can't purpose either one as a real commit operation. */ static void nft_pipapo_activate(const struct net *net, const struct nft_set *set, struct nft_elem_priv *elem_priv) { struct nft_pipapo_elem *e = nft_elem_priv_cast(elem_priv); nft_clear(net, &e->ext); } /** * nft_pipapo_deactivate() - Search for element and make it inactive * @net: Network namespace * @set: nftables API set representation * @elem: nftables API element representation containing key data * * Return: deactivated element if found, NULL otherwise. */ static struct nft_elem_priv * nft_pipapo_deactivate(const struct net *net, const struct nft_set *set, const struct nft_set_elem *elem) { struct nft_pipapo_match *m = pipapo_maybe_clone(set); struct nft_pipapo_elem *e; /* removal must occur on priv->clone, if we are low on memory * we have no choice and must fail the removal request. */ if (!m) return NULL; e = pipapo_get(m, (const u8 *)elem->key.val.data, nft_genmask_next(net), nft_net_tstamp(net)); if (!e) return NULL; nft_set_elem_change_active(net, set, &e->ext); return &e->priv; } /** * nft_pipapo_flush() - make element inactive * @net: Network namespace * @set: nftables API set representation * @elem_priv: nftables API element representation containing key data * * This is functionally the same as nft_pipapo_deactivate(), with a slightly * different interface, and it's also called once for each element in a set * being flushed, so we can't implement, strictly speaking, a flush operation, * which would otherwise be as simple as allocating an empty copy of the * matching data. * * Note that we could in theory do that, mark the set as flushed, and ignore * subsequent calls, but we would leak all the elements after the first one, * because they wouldn't then be freed as result of API calls. * * Return: true if element was found and deactivated. */ static void nft_pipapo_flush(const struct net *net, const struct nft_set *set, struct nft_elem_priv *elem_priv) { struct nft_pipapo_elem *e = nft_elem_priv_cast(elem_priv); nft_set_elem_change_active(net, set, &e->ext); } /** * pipapo_get_boundaries() - Get byte interval for associated rules * @f: Field including lookup table * @first_rule: First rule (lowest index) * @rule_count: Number of associated rules * @left: Byte expression for left boundary (start of range) * @right: Byte expression for right boundary (end of range) * * Given the first rule and amount of rules that originated from the same entry, * build the original range associated with the entry, and calculate the length * of the originating netmask. * * In pictures: * * bucket * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 * 0 1,2 * 1 1,2 * 2 1,2 * 3 1,2 * 4 1,2 * 5 1 2 * 6 1,2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 * 7 1,2 1,2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 * * this is the lookup table corresponding to the IPv4 range * 192.168.1.0-192.168.2.1, which was expanded to the two composing netmasks, * rule #1: 192.168.1.0/24, and rule #2: 192.168.2.0/31. * * This function fills @left and @right with the byte values of the leftmost * and rightmost bucket indices for the lowest and highest rule indices, * respectively. If @first_rule is 1 and @rule_count is 2, we obtain, in * nibbles: * left: < 12, 0, 10, 8, 0, 1, 0, 0 > * right: < 12, 0, 10, 8, 0, 2, 2, 1 > * corresponding to bytes: * left: < 192, 168, 1, 0 > * right: < 192, 168, 2, 1 > * with mask length irrelevant here, unused on return, as the range is already * defined by its start and end points. The mask length is relevant for a single * ranged entry instead: if @first_rule is 1 and @rule_count is 1, we ignore * rule 2 above: @left becomes < 192, 168, 1, 0 >, @right becomes * < 192, 168, 1, 255 >, and the mask length, calculated from the distances * between leftmost and rightmost bucket indices for each group, would be 24. * * Return: mask length, in bits. */ static int pipapo_get_boundaries(struct nft_pipapo_field *f, int first_rule, int rule_count, u8 *left, u8 *right) { int g, mask_len = 0, bit_offset = 0; u8 *l = left, *r = right; for (g = 0; g < f->groups; g++) { int b, x0, x1; x0 = -1; x1 = -1; for (b = 0; b < NFT_PIPAPO_BUCKETS(f->bb); b++) { unsigned long *pos; pos = NFT_PIPAPO_LT_ALIGN(f->lt) + (g * NFT_PIPAPO_BUCKETS(f->bb) + b) * f->bsize; if (test_bit(first_rule, pos) && x0 == -1) x0 = b; if (test_bit(first_rule + rule_count - 1, pos)) x1 = b; } *l |= x0 << (BITS_PER_BYTE - f->bb - bit_offset); *r |= x1 << (BITS_PER_BYTE - f->bb - bit_offset); bit_offset += f->bb; if (bit_offset >= BITS_PER_BYTE) { bit_offset %= BITS_PER_BYTE; l++; r++; } if (x1 - x0 == 0) mask_len += 4; else if (x1 - x0 == 1) mask_len += 3; else if (x1 - x0 == 3) mask_len += 2; else if (x1 - x0 == 7) mask_len += 1; } return mask_len; } /** * pipapo_match_field() - Match rules against byte ranges * @f: Field including the lookup table * @first_rule: First of associated rules originating from same entry * @rule_count: Amount of associated rules * @start: Start of range to be matched * @end: End of range to be matched * * Return: true on match, false otherwise. */ static bool pipapo_match_field(struct nft_pipapo_field *f, int first_rule, int rule_count, const u8 *start, const u8 *end) { u8 right[NFT_PIPAPO_MAX_BYTES] = { 0 }; u8 left[NFT_PIPAPO_MAX_BYTES] = { 0 }; pipapo_get_boundaries(f, first_rule, rule_count, left, right); return !memcmp(start, left, f->groups / NFT_PIPAPO_GROUPS_PER_BYTE(f)) && !memcmp(end, right, f->groups / NFT_PIPAPO_GROUPS_PER_BYTE(f)); } /** * nft_pipapo_remove() - Remove element given key, commit * @net: Network namespace * @set: nftables API set representation * @elem_priv: nftables API element representation containing key data * * Similarly to nft_pipapo_activate(), this is used as commit operation by the * API, but it's called once per element in the pending transaction, so we can't * implement this as a single commit operation. Closest we can get is to remove * the matched element here, if any, and commit the updated matching data. */ static void nft_pipapo_remove(const struct net *net, const struct nft_set *set, struct nft_elem_priv *elem_priv) { struct nft_pipapo *priv = nft_set_priv(set); struct nft_pipapo_match *m = priv->clone; unsigned int rules_f0, first_rule = 0; struct nft_pipapo_elem *e; const u8 *data; e = nft_elem_priv_cast(elem_priv); data = (const u8 *)nft_set_ext_key(&e->ext); while ((rules_f0 = pipapo_rules_same_key(m->f, first_rule))) { union nft_pipapo_map_bucket rulemap[NFT_PIPAPO_MAX_FIELDS]; const u8 *match_start, *match_end; struct nft_pipapo_field *f; int i, start, rules_fx; match_start = data; if (nft_set_ext_exists(&e->ext, NFT_SET_EXT_KEY_END)) match_end = (const u8 *)nft_set_ext_key_end(&e->ext)->data; else match_end = data; start = first_rule; rules_fx = rules_f0; nft_pipapo_for_each_field(f, i, m) { bool last = i == m->field_count - 1; if (!pipapo_match_field(f, start, rules_fx, match_start, match_end)) break; rulemap[i].to = start; rulemap[i].n = rules_fx; rules_fx = f->mt[start].n; start = f->mt[start].to; match_start += NFT_PIPAPO_GROUPS_PADDED_SIZE(f); match_end += NFT_PIPAPO_GROUPS_PADDED_SIZE(f); if (last && f->mt[rulemap[i].to].e == e) { pipapo_drop(m, rulemap); return; } } first_rule += rules_f0; } WARN_ON_ONCE(1); /* elem_priv not found */ } /** * nft_pipapo_do_walk() - Walk over elements in m * @ctx: nftables API context * @set: nftables API set representation * @m: matching data pointing to key mapping array * @iter: Iterator * * As elements are referenced in the mapping array for the last field, directly * scan that array: there's no need to follow rule mappings from the first * field. @m is protected either by RCU read lock or by transaction mutex. */ static void nft_pipapo_do_walk(const struct nft_ctx *ctx, struct nft_set *set, const struct nft_pipapo_match *m, struct nft_set_iter *iter) { const struct nft_pipapo_field *f; unsigned int i, r; for (i = 0, f = m->f; i < m->field_count - 1; i++, f++) ; for (r = 0; r < f->rules; r++) { struct nft_pipapo_elem *e; if (r < f->rules - 1 && f->mt[r + 1].e == f->mt[r].e) continue; if (iter->count < iter->skip) goto cont; e = f->mt[r].e; iter->err = iter->fn(ctx, set, iter, &e->priv); if (iter->err < 0) return; cont: iter->count++; } } /** * nft_pipapo_walk() - Walk over elements * @ctx: nftables API context * @set: nftables API set representation * @iter: Iterator * * Test if destructive action is needed or not, clone active backend if needed * and call the real function to work on the data. */ static void nft_pipapo_walk(const struct nft_ctx *ctx, struct nft_set *set, struct nft_set_iter *iter) { struct nft_pipapo *priv = nft_set_priv(set); const struct nft_pipapo_match *m; switch (iter->type) { case NFT_ITER_UPDATE: m = pipapo_maybe_clone(set); if (!m) { iter->err = -ENOMEM; return; } nft_pipapo_do_walk(ctx, set, m, iter); break; case NFT_ITER_READ: rcu_read_lock(); m = rcu_dereference(priv->match); nft_pipapo_do_walk(ctx, set, m, iter); rcu_read_unlock(); break; default: iter->err = -EINVAL; WARN_ON_ONCE(1); break; } } /** * nft_pipapo_privsize() - Return the size of private data for the set * @nla: netlink attributes, ignored as size doesn't depend on them * @desc: Set description, ignored as size doesn't depend on it * * Return: size of private data for this set implementation, in bytes */ static u64 nft_pipapo_privsize(const struct nlattr * const nla[], const struct nft_set_desc *desc) { return sizeof(struct nft_pipapo); } /** * nft_pipapo_estimate() - Set size, space and lookup complexity * @desc: Set description, element count and field description used * @features: Flags: NFT_SET_INTERVAL needs to be there * @est: Storage for estimation data * * Return: true if set description is compatible, false otherwise */ static bool nft_pipapo_estimate(const struct nft_set_desc *desc, u32 features, struct nft_set_estimate *est) { if (!(features & NFT_SET_INTERVAL) || desc->field_count < NFT_PIPAPO_MIN_FIELDS) return false; est->size = pipapo_estimate_size(desc); if (!est->size) return false; est->lookup = NFT_SET_CLASS_O_LOG_N; est->space = NFT_SET_CLASS_O_N; return true; } /** * nft_pipapo_init() - Initialise data for a set instance * @set: nftables API set representation * @desc: Set description * @nla: netlink attributes * * Validate number and size of fields passed as NFTA_SET_DESC_CONCAT netlink * attributes, initialise internal set parameters, current instance of matching * data and a copy for subsequent insertions. * * Return: 0 on success, negative error code on failure. */ static int nft_pipapo_init(const struct nft_set *set, const struct nft_set_desc *desc, const struct nlattr * const nla[]) { struct nft_pipapo *priv = nft_set_priv(set); struct nft_pipapo_match *m; struct nft_pipapo_field *f; int err, i, field_count; BUILD_BUG_ON(offsetof(struct nft_pipapo_elem, priv) != 0); field_count = desc->field_count ? : 1; BUILD_BUG_ON(NFT_PIPAPO_MAX_FIELDS > 255); BUILD_BUG_ON(NFT_PIPAPO_MAX_FIELDS != NFT_REG32_COUNT); if (field_count > NFT_PIPAPO_MAX_FIELDS) return -EINVAL; m = kmalloc(struct_size(m, f, field_count), GFP_KERNEL); if (!m) return -ENOMEM; m->field_count = field_count; m->bsize_max = 0; m->scratch = alloc_percpu(struct nft_pipapo_scratch *); if (!m->scratch) { err = -ENOMEM; goto out_scratch; } for_each_possible_cpu(i) *per_cpu_ptr(m->scratch, i) = NULL; rcu_head_init(&m->rcu); nft_pipapo_for_each_field(f, i, m) { unsigned int len = desc->field_len[i] ? : set->klen; /* f->groups is u8 */ BUILD_BUG_ON((NFT_PIPAPO_MAX_BYTES * BITS_PER_BYTE / NFT_PIPAPO_GROUP_BITS_LARGE_SET) >= 256); f->bb = NFT_PIPAPO_GROUP_BITS_INIT; f->groups = len * NFT_PIPAPO_GROUPS_PER_BYTE(f); priv->width += round_up(len, sizeof(u32)); f->bsize = 0; f->rules = 0; f->rules_alloc = 0; f->lt = NULL; f->mt = NULL; } rcu_assign_pointer(priv->match, m); return 0; out_scratch: kfree(m); return err; } /** * nft_set_pipapo_match_destroy() - Destroy elements from key mapping array * @ctx: context * @set: nftables API set representation * @m: matching data pointing to key mapping array */ static void nft_set_pipapo_match_destroy(const struct nft_ctx *ctx, const struct nft_set *set, struct nft_pipapo_match *m) { struct nft_pipapo_field *f; unsigned int i, r; for (i = 0, f = m->f; i < m->field_count - 1; i++, f++) ; for (r = 0; r < f->rules; r++) { struct nft_pipapo_elem *e; if (r < f->rules - 1 && f->mt[r + 1].e == f->mt[r].e) continue; e = f->mt[r].e; nf_tables_set_elem_destroy(ctx, set, &e->priv); } } /** * nft_pipapo_destroy() - Free private data for set and all committed elements * @ctx: context * @set: nftables API set representation */ static void nft_pipapo_destroy(const struct nft_ctx *ctx, const struct nft_set *set) { struct nft_pipapo *priv = nft_set_priv(set); struct nft_pipapo_match *m; m = rcu_dereference_protected(priv->match, true); if (priv->clone) { nft_set_pipapo_match_destroy(ctx, set, priv->clone); pipapo_free_match(priv->clone); priv->clone = NULL; } else { nft_set_pipapo_match_destroy(ctx, set, m); } pipapo_free_match(m); } /** * nft_pipapo_gc_init() - Initialise garbage collection * @set: nftables API set representation * * Instead of actually setting up a periodic work for garbage collection, as * this operation requires a swap of matching data with the working copy, we'll * do that opportunistically with other commit operations if the interval is * elapsed, so we just need to set the current jiffies timestamp here. */ static void nft_pipapo_gc_init(const struct nft_set *set) { struct nft_pipapo *priv = nft_set_priv(set); priv->last_gc = jiffies; } const struct nft_set_type nft_set_pipapo_type = { .features = NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_OBJECT | NFT_SET_TIMEOUT, .ops = { .lookup = nft_pipapo_lookup, .insert = nft_pipapo_insert, .activate = nft_pipapo_activate, .deactivate = nft_pipapo_deactivate, .flush = nft_pipapo_flush, .remove = nft_pipapo_remove, .walk = nft_pipapo_walk, .get = nft_pipapo_get, .privsize = nft_pipapo_privsize, .estimate = nft_pipapo_estimate, .init = nft_pipapo_init, .destroy = nft_pipapo_destroy, .gc_init = nft_pipapo_gc_init, .commit = nft_pipapo_commit, .abort = nft_pipapo_abort, .elemsize = offsetof(struct nft_pipapo_elem, ext), }, }; #if defined(CONFIG_X86_64) && !defined(CONFIG_UML) const struct nft_set_type nft_set_pipapo_avx2_type = { .features = NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_OBJECT | NFT_SET_TIMEOUT, .ops = { .lookup = nft_pipapo_avx2_lookup, .insert = nft_pipapo_insert, .activate = nft_pipapo_activate, .deactivate = nft_pipapo_deactivate, .flush = nft_pipapo_flush, .remove = nft_pipapo_remove, .walk = nft_pipapo_walk, .get = nft_pipapo_get, .privsize = nft_pipapo_privsize, .estimate = nft_pipapo_avx2_estimate, .init = nft_pipapo_init, .destroy = nft_pipapo_destroy, .gc_init = nft_pipapo_gc_init, .commit = nft_pipapo_commit, .abort = nft_pipapo_abort, .elemsize = offsetof(struct nft_pipapo_elem, ext), }, }; #endif |
| 2 2 2 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 | // SPDX-License-Identifier: GPL-2.0 #include <linux/sysctl.h> #include <net/lwtunnel.h> #include <net/netfilter/nf_hooks_lwtunnel.h> #include <linux/netfilter.h> #include "nf_internals.h" static inline int nf_hooks_lwtunnel_get(void) { if (static_branch_unlikely(&nf_hooks_lwtunnel_enabled)) return 1; else return 0; } static inline int nf_hooks_lwtunnel_set(int enable) { if (static_branch_unlikely(&nf_hooks_lwtunnel_enabled)) { if (!enable) return -EBUSY; } else if (enable) { static_branch_enable(&nf_hooks_lwtunnel_enabled); } return 0; } #ifdef CONFIG_SYSCTL int nf_hooks_lwtunnel_sysctl_handler(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { int proc_nf_hooks_lwtunnel_enabled = 0; struct ctl_table tmp = { .procname = table->procname, .data = &proc_nf_hooks_lwtunnel_enabled, .maxlen = sizeof(int), .mode = table->mode, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }; int ret; if (!write) proc_nf_hooks_lwtunnel_enabled = nf_hooks_lwtunnel_get(); ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos); if (write && ret == 0) ret = nf_hooks_lwtunnel_set(proc_nf_hooks_lwtunnel_enabled); return ret; } EXPORT_SYMBOL_GPL(nf_hooks_lwtunnel_sysctl_handler); static struct ctl_table nf_lwtunnel_sysctl_table[] = { { .procname = "nf_hooks_lwtunnel", .data = NULL, .maxlen = sizeof(int), .mode = 0644, .proc_handler = nf_hooks_lwtunnel_sysctl_handler, }, }; static int __net_init nf_lwtunnel_net_init(struct net *net) { struct ctl_table_header *hdr; struct ctl_table *table; table = nf_lwtunnel_sysctl_table; if (!net_eq(net, &init_net)) { table = kmemdup(nf_lwtunnel_sysctl_table, sizeof(nf_lwtunnel_sysctl_table), GFP_KERNEL); if (!table) goto err_alloc; } hdr = register_net_sysctl_sz(net, "net/netfilter", table, ARRAY_SIZE(nf_lwtunnel_sysctl_table)); if (!hdr) goto err_reg; net->nf.nf_lwtnl_dir_header = hdr; return 0; err_reg: if (!net_eq(net, &init_net)) kfree(table); err_alloc: return -ENOMEM; } static void __net_exit nf_lwtunnel_net_exit(struct net *net) { const struct ctl_table *table; table = net->nf.nf_lwtnl_dir_header->ctl_table_arg; unregister_net_sysctl_table(net->nf.nf_lwtnl_dir_header); if (!net_eq(net, &init_net)) kfree(table); } static struct pernet_operations nf_lwtunnel_net_ops = { .init = nf_lwtunnel_net_init, .exit = nf_lwtunnel_net_exit, }; int __init netfilter_lwtunnel_init(void) { return register_pernet_subsys(&nf_lwtunnel_net_ops); } void netfilter_lwtunnel_fini(void) { unregister_pernet_subsys(&nf_lwtunnel_net_ops); } #else int __init netfilter_lwtunnel_init(void) { return 0; } void netfilter_lwtunnel_fini(void) {} #endif /* CONFIG_SYSCTL */ |
| 5 143 8 529 143 16 13 960 1335 9 472 4 19 5 8 2 2 134 481 525 129 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_CGROUP_H #define _LINUX_CGROUP_H /* * cgroup interface * * Copyright (C) 2003 BULL SA * Copyright (C) 2004-2006 Silicon Graphics, Inc. * */ #include <linux/sched.h> #include <linux/nodemask.h> #include <linux/list.h> #include <linux/rculist.h> #include <linux/cgroupstats.h> #include <linux/fs.h> #include <linux/seq_file.h> #include <linux/kernfs.h> #include <linux/jump_label.h> #include <linux/types.h> #include <linux/notifier.h> #include <linux/ns_common.h> #include <linux/nsproxy.h> #include <linux/user_namespace.h> #include <linux/refcount.h> #include <linux/kernel_stat.h> #include <linux/cgroup-defs.h> #include <linux/cgroup_namespace.h> struct kernel_clone_args; /* * All weight knobs on the default hierarchy should use the following min, * default and max values. The default value is the logarithmic center of * MIN and MAX and allows 100x to be expressed in both directions. */ #define CGROUP_WEIGHT_MIN 1 #define CGROUP_WEIGHT_DFL 100 #define CGROUP_WEIGHT_MAX 10000 #ifdef CONFIG_CGROUPS enum css_task_iter_flags { CSS_TASK_ITER_PROCS = (1U << 0), /* walk only threadgroup leaders */ CSS_TASK_ITER_THREADED = (1U << 1), /* walk all threaded css_sets in the domain */ CSS_TASK_ITER_SKIPPED = (1U << 16), /* internal flags */ }; /* a css_task_iter should be treated as an opaque object */ struct css_task_iter { struct cgroup_subsys *ss; unsigned int flags; struct list_head *cset_pos; struct list_head *cset_head; struct list_head *tcset_pos; struct list_head *tcset_head; struct list_head *task_pos; struct list_head *cur_tasks_head; struct css_set *cur_cset; struct css_set *cur_dcset; struct task_struct *cur_task; struct list_head iters_node; /* css_set->task_iters */ }; enum cgroup_lifetime_events { CGROUP_LIFETIME_ONLINE, CGROUP_LIFETIME_OFFLINE, }; extern struct file_system_type cgroup_fs_type; extern struct cgroup_root cgrp_dfl_root; extern struct css_set init_css_set; extern spinlock_t css_set_lock; extern struct blocking_notifier_head cgroup_lifetime_notifier; #define SUBSYS(_x) extern struct cgroup_subsys _x ## _cgrp_subsys; #include <linux/cgroup_subsys.h> #undef SUBSYS #define SUBSYS(_x) \ extern struct static_key_true _x ## _cgrp_subsys_enabled_key; \ extern struct static_key_true _x ## _cgrp_subsys_on_dfl_key; #include <linux/cgroup_subsys.h> #undef SUBSYS /** * cgroup_subsys_enabled - fast test on whether a subsys is enabled * @ss: subsystem in question */ #define cgroup_subsys_enabled(ss) \ static_branch_likely(&ss ## _enabled_key) /** * cgroup_subsys_on_dfl - fast test on whether a subsys is on default hierarchy * @ss: subsystem in question */ #define cgroup_subsys_on_dfl(ss) \ static_branch_likely(&ss ## _on_dfl_key) bool css_has_online_children(struct cgroup_subsys_state *css); struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss); struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgroup, struct cgroup_subsys *ss); struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup, struct cgroup_subsys *ss); struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry, struct cgroup_subsys *ss); struct cgroup *cgroup_get_from_path(const char *path); struct cgroup *cgroup_get_from_fd(int fd); struct cgroup *cgroup_v1v2_get_from_fd(int fd); int cgroup_attach_task_all(struct task_struct *from, struct task_struct *); int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from); int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts); int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts); int cgroup_add_cftypes(struct cgroup_subsys *ss, struct cftype *cfts); int cgroup_rm_cftypes(struct cftype *cfts); void cgroup_file_notify(struct cgroup_file *cfile); void cgroup_file_show(struct cgroup_file *cfile, bool show); int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry); int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *tsk); void cgroup_fork(struct task_struct *p); extern int cgroup_can_fork(struct task_struct *p, struct kernel_clone_args *kargs); extern void cgroup_cancel_fork(struct task_struct *p, struct kernel_clone_args *kargs); extern void cgroup_post_fork(struct task_struct *p, struct kernel_clone_args *kargs); void cgroup_task_exit(struct task_struct *p); void cgroup_task_dead(struct task_struct *p); void cgroup_task_release(struct task_struct *p); void cgroup_task_free(struct task_struct *p); int cgroup_init_early(void); int cgroup_init(void); int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v); /* * Iteration helpers and macros. */ struct cgroup_subsys_state *css_next_child(struct cgroup_subsys_state *pos, struct cgroup_subsys_state *parent); struct cgroup_subsys_state *css_next_descendant_pre(struct cgroup_subsys_state *pos, struct cgroup_subsys_state *css); struct cgroup_subsys_state *css_rightmost_descendant(struct cgroup_subsys_state *pos); struct cgroup_subsys_state *css_next_descendant_post(struct cgroup_subsys_state *pos, struct cgroup_subsys_state *css); struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset, struct cgroup_subsys_state **dst_cssp); struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset, struct cgroup_subsys_state **dst_cssp); void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags, struct css_task_iter *it); struct task_struct *css_task_iter_next(struct css_task_iter *it); void css_task_iter_end(struct css_task_iter *it); /** * css_for_each_child - iterate through children of a css * @pos: the css * to use as the loop cursor * @parent: css whose children to walk * * Walk @parent's children. Must be called under rcu_read_lock(). * * If a subsystem synchronizes ->css_online() and the start of iteration, a * css which finished ->css_online() is guaranteed to be visible in the * future iterations and will stay visible until the last reference is put. * A css which hasn't finished ->css_online() or already finished * ->css_offline() may show up during traversal. It's each subsystem's * responsibility to synchronize against on/offlining. * * It is allowed to temporarily drop RCU read lock during iteration. The * caller is responsible for ensuring that @pos remains accessible until * the start of the next iteration by, for example, bumping the css refcnt. */ #define css_for_each_child(pos, parent) \ for ((pos) = css_next_child(NULL, (parent)); (pos); \ (pos) = css_next_child((pos), (parent))) /** * css_for_each_descendant_pre - pre-order walk of a css's descendants * @pos: the css * to use as the loop cursor * @root: css whose descendants to walk * * Walk @root's descendants. @root is included in the iteration and the * first node to be visited. Must be called under rcu_read_lock(). * * If a subsystem synchronizes ->css_online() and the start of iteration, a * css which finished ->css_online() is guaranteed to be visible in the * future iterations and will stay visible until the last reference is put. * A css which hasn't finished ->css_online() or already finished * ->css_offline() may show up during traversal. It's each subsystem's * responsibility to synchronize against on/offlining. * * For example, the following guarantees that a descendant can't escape * state updates of its ancestors. * * my_online(@css) * { * Lock @css's parent and @css; * Inherit state from the parent; * Unlock both. * } * * my_update_state(@css) * { * css_for_each_descendant_pre(@pos, @css) { * Lock @pos; * if (@pos == @css) * Update @css's state; * else * Verify @pos is alive and inherit state from its parent; * Unlock @pos; * } * } * * As long as the inheriting step, including checking the parent state, is * enclosed inside @pos locking, double-locking the parent isn't necessary * while inheriting. The state update to the parent is guaranteed to be * visible by walking order and, as long as inheriting operations to the * same @pos are atomic to each other, multiple updates racing each other * still result in the correct state. It's guaranateed that at least one * inheritance happens for any css after the latest update to its parent. * * If checking parent's state requires locking the parent, each inheriting * iteration should lock and unlock both @pos->parent and @pos. * * Alternatively, a subsystem may choose to use a single global lock to * synchronize ->css_online() and ->css_offline() against tree-walking * operations. * * It is allowed to temporarily drop RCU read lock during iteration. The * caller is responsible for ensuring that @pos remains accessible until * the start of the next iteration by, for example, bumping the css refcnt. */ #define css_for_each_descendant_pre(pos, css) \ for ((pos) = css_next_descendant_pre(NULL, (css)); (pos); \ (pos) = css_next_descendant_pre((pos), (css))) /** * css_for_each_descendant_post - post-order walk of a css's descendants * @pos: the css * to use as the loop cursor * @css: css whose descendants to walk * * Similar to css_for_each_descendant_pre() but performs post-order * traversal instead. @root is included in the iteration and the last * node to be visited. * * If a subsystem synchronizes ->css_online() and the start of iteration, a * css which finished ->css_online() is guaranteed to be visible in the * future iterations and will stay visible until the last reference is put. * A css which hasn't finished ->css_online() or already finished * ->css_offline() may show up during traversal. It's each subsystem's * responsibility to synchronize against on/offlining. * * Note that the walk visibility guarantee example described in pre-order * walk doesn't apply the same to post-order walks. */ #define css_for_each_descendant_post(pos, css) \ for ((pos) = css_next_descendant_post(NULL, (css)); (pos); \ (pos) = css_next_descendant_post((pos), (css))) /** * cgroup_taskset_for_each - iterate cgroup_taskset * @task: the loop cursor * @dst_css: the destination css * @tset: taskset to iterate * * @tset may contain multiple tasks and they may belong to multiple * processes. * * On the v2 hierarchy, there may be tasks from multiple processes and they * may not share the source or destination csses. * * On traditional hierarchies, when there are multiple tasks in @tset, if a * task of a process is in @tset, all tasks of the process are in @tset. * Also, all are guaranteed to share the same source and destination csses. * * Iteration is not in any specific order. */ #define cgroup_taskset_for_each(task, dst_css, tset) \ for ((task) = cgroup_taskset_first((tset), &(dst_css)); \ (task); \ (task) = cgroup_taskset_next((tset), &(dst_css))) /** * cgroup_taskset_for_each_leader - iterate group leaders in a cgroup_taskset * @leader: the loop cursor * @dst_css: the destination css * @tset: taskset to iterate * * Iterate threadgroup leaders of @tset. For single-task migrations, @tset * may not contain any. */ #define cgroup_taskset_for_each_leader(leader, dst_css, tset) \ for ((leader) = cgroup_taskset_first((tset), &(dst_css)); \ (leader); \ (leader) = cgroup_taskset_next((tset), &(dst_css))) \ if ((leader) != (leader)->group_leader) \ ; \ else /* * Inline functions. */ #ifdef CONFIG_DEBUG_CGROUP_REF void css_get(struct cgroup_subsys_state *css); void css_get_many(struct cgroup_subsys_state *css, unsigned int n); bool css_tryget(struct cgroup_subsys_state *css); bool css_tryget_online(struct cgroup_subsys_state *css); void css_put(struct cgroup_subsys_state *css); void css_put_many(struct cgroup_subsys_state *css, unsigned int n); #else #define CGROUP_REF_FN_ATTRS static inline #define CGROUP_REF_EXPORT(fn) #include <linux/cgroup_refcnt.h> #endif static inline u64 cgroup_id(const struct cgroup *cgrp) { return cgrp->kn->id; } /** * css_is_dying - test whether the specified css is dying * @css: target css * * Test whether @css is in the process of offlining or already offline. In * most cases, ->css_online() and ->css_offline() callbacks should be * enough; however, the actual offline operations are RCU delayed and this * test returns %true also when @css is scheduled to be offlined. * * This is useful, for example, when the use case requires synchronous * behavior with respect to cgroup removal. cgroup removal schedules css * offlining but the css can seem alive while the operation is being * delayed. If the delay affects user visible semantics, this test can be * used to resolve the situation. */ static inline bool css_is_dying(struct cgroup_subsys_state *css) { return css->flags & CSS_DYING; } static inline bool css_is_online(struct cgroup_subsys_state *css) { return css->flags & CSS_ONLINE; } static inline bool css_is_self(struct cgroup_subsys_state *css) { if (css == &css->cgroup->self) { /* cgroup::self should not have subsystem association */ WARN_ON(css->ss != NULL); return true; } return false; } static inline void cgroup_get(struct cgroup *cgrp) { css_get(&cgrp->self); } static inline bool cgroup_tryget(struct cgroup *cgrp) { return css_tryget(&cgrp->self); } static inline void cgroup_put(struct cgroup *cgrp) { css_put(&cgrp->self); } extern struct mutex cgroup_mutex; static inline void cgroup_lock(void) { mutex_lock(&cgroup_mutex); } static inline void cgroup_unlock(void) { mutex_unlock(&cgroup_mutex); } /** * task_css_set_check - obtain a task's css_set with extra access conditions * @task: the task to obtain css_set for * @__c: extra condition expression to be passed to rcu_dereference_check() * * A task's css_set is RCU protected, initialized and exited while holding * task_lock(), and can only be modified while holding both cgroup_mutex * and task_lock() while the task is alive. This macro verifies that the * caller is inside proper critical section and returns @task's css_set. * * The caller can also specify additional allowed conditions via @__c, such * as locks used during the cgroup_subsys::attach() methods. */ #ifdef CONFIG_PROVE_RCU #define task_css_set_check(task, __c) \ rcu_dereference_check((task)->cgroups, \ rcu_read_lock_sched_held() || \ lockdep_is_held(&cgroup_mutex) || \ lockdep_is_held(&css_set_lock) || \ ((task)->flags & PF_EXITING) || (__c)) #else #define task_css_set_check(task, __c) \ rcu_dereference((task)->cgroups) #endif /** * task_css_check - obtain css for (task, subsys) w/ extra access conds * @task: the target task * @subsys_id: the target subsystem ID * @__c: extra condition expression to be passed to rcu_dereference_check() * * Return the cgroup_subsys_state for the (@task, @subsys_id) pair. The * synchronization rules are the same as task_css_set_check(). */ #define task_css_check(task, subsys_id, __c) \ task_css_set_check((task), (__c))->subsys[(subsys_id)] /** * task_css_set - obtain a task's css_set * @task: the task to obtain css_set for * * See task_css_set_check(). */ static inline struct css_set *task_css_set(struct task_struct *task) { return task_css_set_check(task, false); } /** * task_css - obtain css for (task, subsys) * @task: the target task * @subsys_id: the target subsystem ID * * See task_css_check(). */ static inline struct cgroup_subsys_state *task_css(struct task_struct *task, int subsys_id) { return task_css_check(task, subsys_id, false); } /** * task_get_css - find and get the css for (task, subsys) * @task: the target task * @subsys_id: the target subsystem ID * * Find the css for the (@task, @subsys_id) combination, increment a * reference on and return it. This function is guaranteed to return a * valid css. The returned css may already have been offlined. */ static inline struct cgroup_subsys_state * task_get_css(struct task_struct *task, int subsys_id) { struct cgroup_subsys_state *css; rcu_read_lock(); while (true) { css = task_css(task, subsys_id); /* * Can't use css_tryget_online() here. A task which has * PF_EXITING set may stay associated with an offline css. * If such task calls this function, css_tryget_online() * will keep failing. */ if (likely(css_tryget(css))) break; cpu_relax(); } rcu_read_unlock(); return css; } /** * task_css_is_root - test whether a task belongs to the root css * @task: the target task * @subsys_id: the target subsystem ID * * Test whether @task belongs to the root css on the specified subsystem. * May be invoked in any context. */ static inline bool task_css_is_root(struct task_struct *task, int subsys_id) { return task_css_check(task, subsys_id, true) == init_css_set.subsys[subsys_id]; } static inline struct cgroup *task_cgroup(struct task_struct *task, int subsys_id) { return task_css(task, subsys_id)->cgroup; } static inline struct cgroup *task_dfl_cgroup(struct task_struct *task) { return task_css_set(task)->dfl_cgrp; } static inline struct cgroup *cgroup_parent(struct cgroup *cgrp) { struct cgroup_subsys_state *parent_css = cgrp->self.parent; if (parent_css) return container_of(parent_css, struct cgroup, self); return NULL; } /** * cgroup_is_descendant - test ancestry * @cgrp: the cgroup to be tested * @ancestor: possible ancestor of @cgrp * * Test whether @cgrp is a descendant of @ancestor. It also returns %true * if @cgrp == @ancestor. This function is safe to call as long as @cgrp * and @ancestor are accessible. */ static inline bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor) { if (cgrp->root != ancestor->root || cgrp->level < ancestor->level) return false; return cgrp->ancestors[ancestor->level] == ancestor; } /** * cgroup_ancestor - find ancestor of cgroup * @cgrp: cgroup to find ancestor of * @ancestor_level: level of ancestor to find starting from root * * Find ancestor of cgroup at specified level starting from root if it exists * and return pointer to it. Return NULL if @cgrp doesn't have ancestor at * @ancestor_level. * * This function is safe to call as long as @cgrp is accessible. */ static inline struct cgroup *cgroup_ancestor(struct cgroup *cgrp, int ancestor_level) { if (ancestor_level < 0 || ancestor_level > cgrp->level) return NULL; return cgrp->ancestors[ancestor_level]; } /** * task_under_cgroup_hierarchy - test task's membership of cgroup ancestry * @task: the task to be tested * @ancestor: possible ancestor of @task's cgroup * * Tests whether @task's default cgroup hierarchy is a descendant of @ancestor. * It follows all the same rules as cgroup_is_descendant, and only applies * to the default hierarchy. */ static inline bool task_under_cgroup_hierarchy(struct task_struct *task, struct cgroup *ancestor) { struct css_set *cset = task_css_set(task); return cgroup_is_descendant(cset->dfl_cgrp, ancestor); } /* no synchronization, the result can only be used as a hint */ static inline bool cgroup_is_populated(struct cgroup *cgrp) { return cgrp->nr_populated_csets + cgrp->nr_populated_domain_children + cgrp->nr_populated_threaded_children; } /* returns ino associated with a cgroup */ static inline ino_t cgroup_ino(struct cgroup *cgrp) { return kernfs_ino(cgrp->kn); } /* cft/css accessors for cftype->write() operation */ static inline struct cftype *of_cft(struct kernfs_open_file *of) { return of->kn->priv; } struct cgroup_subsys_state *of_css(struct kernfs_open_file *of); /* cft/css accessors for cftype->seq_*() operations */ static inline struct cftype *seq_cft(struct seq_file *seq) { return of_cft(seq->private); } static inline struct cgroup_subsys_state *seq_css(struct seq_file *seq) { return of_css(seq->private); } /* * Name / path handling functions. All are thin wrappers around the kernfs * counterparts and can be called under any context. */ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen) { return kernfs_name(cgrp->kn, buf, buflen); } static inline int cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen) { return kernfs_path(cgrp->kn, buf, buflen); } static inline void pr_cont_cgroup_name(struct cgroup *cgrp) { pr_cont_kernfs_name(cgrp->kn); } static inline void pr_cont_cgroup_path(struct cgroup *cgrp) { pr_cont_kernfs_path(cgrp->kn); } bool cgroup_psi_enabled(void); static inline void cgroup_init_kthreadd(void) { /* * kthreadd is inherited by all kthreads, keep it in the root so * that the new kthreads are guaranteed to stay in the root until * initialization is finished. */ current->no_cgroup_migration = 1; } static inline void cgroup_kthread_ready(void) { /* * This kthread finished initialization. The creator should have * set PF_NO_SETAFFINITY if this kthread should stay in the root. */ current->no_cgroup_migration = 0; } void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen); struct cgroup *__cgroup_get_from_id(u64 id); struct cgroup *cgroup_get_from_id(u64 id); #else /* !CONFIG_CGROUPS */ struct cgroup_subsys_state; struct cgroup; static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; } static inline void css_get(struct cgroup_subsys_state *css) {} static inline void css_put(struct cgroup_subsys_state *css) {} static inline void cgroup_lock(void) {} static inline void cgroup_unlock(void) {} static inline int cgroup_attach_task_all(struct task_struct *from, struct task_struct *t) { return 0; } static inline int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry) { return -EINVAL; } static inline void cgroup_fork(struct task_struct *p) {} static inline int cgroup_can_fork(struct task_struct *p, struct kernel_clone_args *kargs) { return 0; } static inline void cgroup_cancel_fork(struct task_struct *p, struct kernel_clone_args *kargs) {} static inline void cgroup_post_fork(struct task_struct *p, struct kernel_clone_args *kargs) {} static inline void cgroup_task_exit(struct task_struct *p) {} static inline void cgroup_task_dead(struct task_struct *p) {} static inline void cgroup_task_release(struct task_struct *p) {} static inline void cgroup_task_free(struct task_struct *p) {} static inline int cgroup_init_early(void) { return 0; } static inline int cgroup_init(void) { return 0; } static inline void cgroup_init_kthreadd(void) {} static inline void cgroup_kthread_ready(void) {} static inline struct cgroup *cgroup_parent(struct cgroup *cgrp) { return NULL; } static inline bool cgroup_psi_enabled(void) { return false; } static inline bool task_under_cgroup_hierarchy(struct task_struct *task, struct cgroup *ancestor) { return true; } static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen) {} #endif /* !CONFIG_CGROUPS */ #ifdef CONFIG_CGROUPS /* * cgroup scalable recursive statistics. */ void css_rstat_updated(struct cgroup_subsys_state *css, int cpu); void css_rstat_flush(struct cgroup_subsys_state *css); /* * Basic resource stats. */ #ifdef CONFIG_CGROUP_CPUACCT void cpuacct_charge(struct task_struct *tsk, u64 cputime); void cpuacct_account_field(struct task_struct *tsk, int index, u64 val); #else static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {} static inline void cpuacct_account_field(struct task_struct *tsk, int index, u64 val) {} #endif void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec); void __cgroup_account_cputime_field(struct cgroup *cgrp, enum cpu_usage_stat index, u64 delta_exec); static inline void cgroup_account_cputime(struct task_struct *task, u64 delta_exec) { struct cgroup *cgrp; cpuacct_charge(task, delta_exec); cgrp = task_dfl_cgroup(task); if (cgroup_parent(cgrp)) __cgroup_account_cputime(cgrp, delta_exec); } static inline void cgroup_account_cputime_field(struct task_struct *task, enum cpu_usage_stat index, u64 delta_exec) { struct cgroup *cgrp; cpuacct_account_field(task, index, delta_exec); cgrp = task_dfl_cgroup(task); if (cgroup_parent(cgrp)) __cgroup_account_cputime_field(cgrp, index, delta_exec); } #else /* CONFIG_CGROUPS */ static inline void cgroup_account_cputime(struct task_struct *task, u64 delta_exec) {} static inline void cgroup_account_cputime_field(struct task_struct *task, enum cpu_usage_stat index, u64 delta_exec) {} #endif /* CONFIG_CGROUPS */ /* * sock->sk_cgrp_data handling. For more info, see sock_cgroup_data * definition in cgroup-defs.h. */ #ifdef CONFIG_SOCK_CGROUP_DATA void cgroup_sk_alloc(struct sock_cgroup_data *skcd); void cgroup_sk_clone(struct sock_cgroup_data *skcd); void cgroup_sk_free(struct sock_cgroup_data *skcd); static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd) { return skcd->cgroup; } #else /* CONFIG_CGROUP_DATA */ static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {} static inline void cgroup_sk_clone(struct sock_cgroup_data *skcd) {} static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {} #endif /* CONFIG_CGROUP_DATA */ #ifdef CONFIG_CGROUPS void cgroup_enter_frozen(void); void cgroup_leave_frozen(bool always_leave); void cgroup_update_frozen(struct cgroup *cgrp); void cgroup_freeze(struct cgroup *cgrp, bool freeze); void cgroup_freezer_migrate_task(struct task_struct *task, struct cgroup *src, struct cgroup *dst); static inline bool cgroup_task_frozen(struct task_struct *task) { return task->frozen; } #else /* !CONFIG_CGROUPS */ static inline void cgroup_enter_frozen(void) { } static inline void cgroup_leave_frozen(bool always_leave) { } static inline bool cgroup_task_frozen(struct task_struct *task) { return false; } #endif /* !CONFIG_CGROUPS */ #ifdef CONFIG_CGROUP_BPF static inline void cgroup_bpf_get(struct cgroup *cgrp) { percpu_ref_get(&cgrp->bpf.refcnt); } static inline void cgroup_bpf_put(struct cgroup *cgrp) { percpu_ref_put(&cgrp->bpf.refcnt); } #else /* CONFIG_CGROUP_BPF */ static inline void cgroup_bpf_get(struct cgroup *cgrp) {} static inline void cgroup_bpf_put(struct cgroup *cgrp) {} #endif /* CONFIG_CGROUP_BPF */ struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id); struct cgroup_of_peak *of_peak(struct kernfs_open_file *of); #endif /* _LINUX_CGROUP_H */ |
| 190 187 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Copyright (C) 2020 Google LLC. */ #ifndef _LINUX_BPF_LSM_H #define _LINUX_BPF_LSM_H #include <linux/sched.h> #include <linux/bpf.h> #include <linux/bpf_verifier.h> #include <linux/lsm_hooks.h> #ifdef CONFIG_BPF_LSM #define LSM_HOOK(RET, DEFAULT, NAME, ...) \ RET bpf_lsm_##NAME(__VA_ARGS__); #include <linux/lsm_hook_defs.h> #undef LSM_HOOK struct bpf_storage_blob { struct bpf_local_storage __rcu *storage; }; extern struct lsm_blob_sizes bpf_lsm_blob_sizes; int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog, const struct bpf_prog *prog); bool bpf_lsm_is_sleepable_hook(u32 btf_id); bool bpf_lsm_is_trusted(const struct bpf_prog *prog); static inline struct bpf_storage_blob *bpf_inode( const struct inode *inode) { if (unlikely(!inode->i_security)) return NULL; return inode->i_security + bpf_lsm_blob_sizes.lbs_inode; } extern const struct bpf_func_proto bpf_inode_storage_get_proto; extern const struct bpf_func_proto bpf_inode_storage_delete_proto; void bpf_inode_storage_free(struct inode *inode); void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog, bpf_func_t *bpf_func); int bpf_lsm_get_retval_range(const struct bpf_prog *prog, struct bpf_retval_range *range); int bpf_set_dentry_xattr_locked(struct dentry *dentry, const char *name__str, const struct bpf_dynptr *value_p, int flags); int bpf_remove_dentry_xattr_locked(struct dentry *dentry, const char *name__str); bool bpf_lsm_has_d_inode_locked(const struct bpf_prog *prog); #else /* !CONFIG_BPF_LSM */ static inline bool bpf_lsm_is_sleepable_hook(u32 btf_id) { return false; } static inline bool bpf_lsm_is_trusted(const struct bpf_prog *prog) { return false; } static inline int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog, const struct bpf_prog *prog) { return -EOPNOTSUPP; } static inline struct bpf_storage_blob *bpf_inode( const struct inode *inode) { return NULL; } static inline void bpf_inode_storage_free(struct inode *inode) { } static inline void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog, bpf_func_t *bpf_func) { } static inline int bpf_lsm_get_retval_range(const struct bpf_prog *prog, struct bpf_retval_range *range) { return -EOPNOTSUPP; } static inline int bpf_set_dentry_xattr_locked(struct dentry *dentry, const char *name__str, const struct bpf_dynptr *value_p, int flags) { return -EOPNOTSUPP; } static inline int bpf_remove_dentry_xattr_locked(struct dentry *dentry, const char *name__str) { return -EOPNOTSUPP; } static inline bool bpf_lsm_has_d_inode_locked(const struct bpf_prog *prog) { return false; } #endif /* CONFIG_BPF_LSM */ #endif /* _LINUX_BPF_LSM_H */ |
| 74 75 77 77 1 1 1 1 77 75 75 1 11 1 12 74 74 74 74 74 75 75 75 74 74 78 76 5 2 73 74 72 74 73 56 74 74 8 74 75 8 75 75 74 1 74 75 75 75 75 77 54 54 54 54 54 51 54 54 53 13 74 73 73 74 32 31 72 74 73 74 72 71 71 72 71 71 72 8 7 8 8 8 8 8 8 8 8 12 148 115 12 12 146 40 40 40 39 40 38 39 40 40 27 84 98 3 27 1 1 96 2 2 2 2 2 75 76 74 75 73 74 73 73 62 33 3 43 30 43 43 43 43 43 29 43 43 43 34 38 43 43 11 32 43 74 60 75 75 2 2 2 2 73 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 8 8 8 8 21 21 21 22 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 | // SPDX-License-Identifier: GPL-2.0-or-later /* Basic authentication token and access key management * * Copyright (C) 2004-2008 Red Hat, Inc. All Rights Reserved. * Written by David Howells (dhowells@redhat.com) */ #include <linux/export.h> #include <linux/init.h> #include <linux/poison.h> #include <linux/sched.h> #include <linux/slab.h> #include <linux/security.h> #include <linux/workqueue.h> #include <linux/random.h> #include <linux/err.h> #include "internal.h" struct kmem_cache *key_jar; struct rb_root key_serial_tree; /* tree of keys indexed by serial */ DEFINE_SPINLOCK(key_serial_lock); struct rb_root key_user_tree; /* tree of quota records indexed by UID */ DEFINE_SPINLOCK(key_user_lock); unsigned int key_quota_root_maxkeys = 1000000; /* root's key count quota */ unsigned int key_quota_root_maxbytes = 25000000; /* root's key space quota */ unsigned int key_quota_maxkeys = 200; /* general key count quota */ unsigned int key_quota_maxbytes = 20000; /* general key space quota */ static LIST_HEAD(key_types_list); static DECLARE_RWSEM(key_types_sem); /* We serialise key instantiation and link */ DEFINE_MUTEX(key_construction_mutex); #ifdef KEY_DEBUGGING void __key_check(const struct key *key) { printk("__key_check: key %p {%08x} should be {%08x}\n", key, key->magic, KEY_DEBUG_MAGIC); BUG(); } #endif /* * Get the key quota record for a user, allocating a new record if one doesn't * already exist. */ struct key_user *key_user_lookup(kuid_t uid) { struct key_user *candidate = NULL, *user; struct rb_node *parent, **p; try_again: parent = NULL; p = &key_user_tree.rb_node; spin_lock(&key_user_lock); /* search the tree for a user record with a matching UID */ while (*p) { parent = *p; user = rb_entry(parent, struct key_user, node); if (uid_lt(uid, user->uid)) p = &(*p)->rb_left; else if (uid_gt(uid, user->uid)) p = &(*p)->rb_right; else goto found; } /* if we get here, we failed to find a match in the tree */ if (!candidate) { /* allocate a candidate user record if we don't already have * one */ spin_unlock(&key_user_lock); user = NULL; candidate = kmalloc(sizeof(struct key_user), GFP_KERNEL); if (unlikely(!candidate)) goto out; /* the allocation may have scheduled, so we need to repeat the * search lest someone else added the record whilst we were * asleep */ goto try_again; } /* if we get here, then the user record still hadn't appeared on the * second pass - so we use the candidate record */ refcount_set(&candidate->usage, 1); atomic_set(&candidate->nkeys, 0); atomic_set(&candidate->nikeys, 0); candidate->uid = uid; candidate->qnkeys = 0; candidate->qnbytes = 0; spin_lock_init(&candidate->lock); mutex_init(&candidate->cons_lock); rb_link_node(&candidate->node, parent, p); rb_insert_color(&candidate->node, &key_user_tree); spin_unlock(&key_user_lock); user = candidate; goto out; /* okay - we found a user record for this UID */ found: refcount_inc(&user->usage); spin_unlock(&key_user_lock); kfree(candidate); out: return user; } /* * Dispose of a user structure */ void key_user_put(struct key_user *user) { if (refcount_dec_and_lock(&user->usage, &key_user_lock)) { rb_erase(&user->node, &key_user_tree); spin_unlock(&key_user_lock); kfree(user); } } /* * Allocate a serial number for a key. These are assigned randomly to avoid * security issues through covert channel problems. */ static inline void key_alloc_serial(struct key *key) { struct rb_node *parent, **p; struct key *xkey; /* propose a random serial number and look for a hole for it in the * serial number tree */ do { get_random_bytes(&key->serial, sizeof(key->serial)); key->serial >>= 1; /* negative numbers are not permitted */ } while (key->serial < 3); spin_lock(&key_serial_lock); attempt_insertion: parent = NULL; p = &key_serial_tree.rb_node; while (*p) { parent = *p; xkey = rb_entry(parent, struct key, serial_node); if (key->serial < xkey->serial) p = &(*p)->rb_left; else if (key->serial > xkey->serial) p = &(*p)->rb_right; else goto serial_exists; } /* we've found a suitable hole - arrange for this key to occupy it */ rb_link_node(&key->serial_node, parent, p); rb_insert_color(&key->serial_node, &key_serial_tree); spin_unlock(&key_serial_lock); return; /* we found a key with the proposed serial number - walk the tree from * that point looking for the next unused serial number */ serial_exists: for (;;) { key->serial++; if (key->serial < 3) { key->serial = 3; goto attempt_insertion; } parent = rb_next(parent); if (!parent) goto attempt_insertion; xkey = rb_entry(parent, struct key, serial_node); if (key->serial < xkey->serial) goto attempt_insertion; } } /** * key_alloc - Allocate a key of the specified type. * @type: The type of key to allocate. * @desc: The key description to allow the key to be searched out. * @uid: The owner of the new key. * @gid: The group ID for the new key's group permissions. * @cred: The credentials specifying UID namespace. * @perm: The permissions mask of the new key. * @flags: Flags specifying quota properties. * @restrict_link: Optional link restriction for new keyrings. * * Allocate a key of the specified type with the attributes given. The key is * returned in an uninstantiated state and the caller needs to instantiate the * key before returning. * * The restrict_link structure (if not NULL) will be freed when the * keyring is destroyed, so it must be dynamically allocated. * * The user's key count quota is updated to reflect the creation of the key and * the user's key data quota has the default for the key type reserved. The * instantiation function should amend this as necessary. If insufficient * quota is available, -EDQUOT will be returned. * * The LSM security modules can prevent a key being created, in which case * -EACCES will be returned. * * Returns a pointer to the new key if successful and an error code otherwise. * * Note that the caller needs to ensure the key type isn't uninstantiated. * Internally this can be done by locking key_types_sem. Externally, this can * be done by either never unregistering the key type, or making sure * key_alloc() calls don't race with module unloading. */ struct key *key_alloc(struct key_type *type, const char *desc, kuid_t uid, kgid_t gid, const struct cred *cred, key_perm_t perm, unsigned long flags, struct key_restriction *restrict_link) { struct key_user *user = NULL; struct key *key; size_t desclen, quotalen; int ret; unsigned long irqflags; key = ERR_PTR(-EINVAL); if (!desc || !*desc) goto error; if (type->vet_description) { ret = type->vet_description(desc); if (ret < 0) { key = ERR_PTR(ret); goto error; } } desclen = strlen(desc); quotalen = desclen + 1 + type->def_datalen; /* get hold of the key tracking for this user */ user = key_user_lookup(uid); if (!user) goto no_memory_1; /* check that the user's quota permits allocation of another key and * its description */ if (!(flags & KEY_ALLOC_NOT_IN_QUOTA)) { unsigned maxkeys = uid_eq(uid, GLOBAL_ROOT_UID) ? key_quota_root_maxkeys : key_quota_maxkeys; unsigned maxbytes = uid_eq(uid, GLOBAL_ROOT_UID) ? key_quota_root_maxbytes : key_quota_maxbytes; spin_lock_irqsave(&user->lock, irqflags); if (!(flags & KEY_ALLOC_QUOTA_OVERRUN)) { if (user->qnkeys + 1 > maxkeys || user->qnbytes + quotalen > maxbytes || user->qnbytes + quotalen < user->qnbytes) goto no_quota; } user->qnkeys++; user->qnbytes += quotalen; spin_unlock_irqrestore(&user->lock, irqflags); } /* allocate and initialise the key and its description */ key = kmem_cache_zalloc(key_jar, GFP_KERNEL); if (!key) goto no_memory_2; key->index_key.desc_len = desclen; key->index_key.description = kmemdup(desc, desclen + 1, GFP_KERNEL); if (!key->index_key.description) goto no_memory_3; key->index_key.type = type; key_set_index_key(&key->index_key); refcount_set(&key->usage, 1); init_rwsem(&key->sem); lockdep_set_class(&key->sem, &type->lock_class); key->user = user; key->quotalen = quotalen; key->datalen = type->def_datalen; key->uid = uid; key->gid = gid; key->perm = perm; key->expiry = TIME64_MAX; key->restrict_link = restrict_link; key->last_used_at = ktime_get_real_seconds(); key->flags |= 1 << KEY_FLAG_USER_ALIVE; if (!(flags & KEY_ALLOC_NOT_IN_QUOTA)) key->flags |= 1 << KEY_FLAG_IN_QUOTA; if (flags & KEY_ALLOC_BUILT_IN) key->flags |= 1 << KEY_FLAG_BUILTIN; if (flags & KEY_ALLOC_UID_KEYRING) key->flags |= 1 << KEY_FLAG_UID_KEYRING; if (flags & KEY_ALLOC_SET_KEEP) key->flags |= 1 << KEY_FLAG_KEEP; #ifdef KEY_DEBUGGING key->magic = KEY_DEBUG_MAGIC; #endif /* let the security module know about the key */ ret = security_key_alloc(key, cred, flags); if (ret < 0) goto security_error; /* publish the key by giving it a serial number */ refcount_inc(&key->domain_tag->usage); atomic_inc(&user->nkeys); key_alloc_serial(key); error: return key; security_error: kfree(key->description); kmem_cache_free(key_jar, key); if (!(flags & KEY_ALLOC_NOT_IN_QUOTA)) { spin_lock_irqsave(&user->lock, irqflags); user->qnkeys--; user->qnbytes -= quotalen; spin_unlock_irqrestore(&user->lock, irqflags); } key_user_put(user); key = ERR_PTR(ret); goto error; no_memory_3: kmem_cache_free(key_jar, key); no_memory_2: if (!(flags & KEY_ALLOC_NOT_IN_QUOTA)) { spin_lock_irqsave(&user->lock, irqflags); user->qnkeys--; user->qnbytes -= quotalen; spin_unlock_irqrestore(&user->lock, irqflags); } key_user_put(user); no_memory_1: key = ERR_PTR(-ENOMEM); goto error; no_quota: spin_unlock_irqrestore(&user->lock, irqflags); key_user_put(user); key = ERR_PTR(-EDQUOT); goto error; } EXPORT_SYMBOL(key_alloc); /** * key_payload_reserve - Adjust data quota reservation for the key's payload * @key: The key to make the reservation for. * @datalen: The amount of data payload the caller now wants. * * Adjust the amount of the owning user's key data quota that a key reserves. * If the amount is increased, then -EDQUOT may be returned if there isn't * enough free quota available. * * If successful, 0 is returned. */ int key_payload_reserve(struct key *key, size_t datalen) { int delta = (int)datalen - key->datalen; int ret = 0; key_check(key); /* contemplate the quota adjustment */ if (delta != 0 && test_bit(KEY_FLAG_IN_QUOTA, &key->flags)) { unsigned maxbytes = uid_eq(key->user->uid, GLOBAL_ROOT_UID) ? key_quota_root_maxbytes : key_quota_maxbytes; unsigned long flags; spin_lock_irqsave(&key->user->lock, flags); if (delta > 0 && |