/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/kernel.h>
#include <linux/slab.h>
#include <net/act_api.h>
#include <net/flow_offload.h>
#include <linux/rtnetlink.h>
#include <linux/mutex.h>
#include <linux/rhashtable.h>

struct flow_rule *flow_rule_alloc(unsigned int num_actions)
{
	struct flow_rule *rule;
	int i;

	rule = kzalloc(struct_size(rule, action.entries, num_actions),
		       GFP_KERNEL);
	if (!rule)
		return NULL;

	rule->action.num_entries = num_actions;
	/* Pre-fill each action hw_stats with DONT_CARE.
	 * Caller can override this if it wants stats for a given action.
	 */
	for (i = 0; i < num_actions; i++)
		rule->action.entries[i].hw_stats = FLOW_ACTION_HW_STATS_DONT_CARE;

	return rule;
}
EXPORT_SYMBOL(flow_rule_alloc);

struct flow_offload_action *offload_action_alloc(unsigned int num_actions)
{
	struct flow_offload_action *fl_action;
	int i;

	fl_action = kzalloc(struct_size(fl_action, action.entries, num_actions),
			    GFP_KERNEL);
	if (!fl_action)
		return NULL;

	fl_action->action.num_entries = num_actions;
	/* Pre-fill each action hw_stats with DONT_CARE.
	 * Caller can override this if it wants stats for a given action.
*/ for (i = 0; i < num_actions; i++) fl_action->action.entries[i].hw_stats = FLOW_ACTION_HW_STATS_DONT_CARE; return fl_action; } #define FLOW_DISSECTOR_MATCH(__rule, __type, __out) \ const struct flow_match *__m = &(__rule)->match; \ struct flow_dissector *__d = (__m)->dissector; \ \ (__out)->key = skb_flow_dissector_target(__d, __type, (__m)->key); \ (__out)->mask = skb_flow_dissector_target(__d, __type, (__m)->mask); \ void flow_rule_match_meta(const struct flow_rule *rule, struct flow_match_meta *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_META, out); } EXPORT_SYMBOL(flow_rule_match_meta); void flow_rule_match_basic(const struct flow_rule *rule, struct flow_match_basic *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_BASIC, out); } EXPORT_SYMBOL(flow_rule_match_basic); void flow_rule_match_control(const struct flow_rule *rule, struct flow_match_control *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_CONTROL, out); } EXPORT_SYMBOL(flow_rule_match_control); void flow_rule_match_eth_addrs(const struct flow_rule *rule, struct flow_match_eth_addrs *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_ETH_ADDRS, out); } EXPORT_SYMBOL(flow_rule_match_eth_addrs); void flow_rule_match_vlan(const struct flow_rule *rule, struct flow_match_vlan *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_VLAN, out); } EXPORT_SYMBOL(flow_rule_match_vlan); void flow_rule_match_cvlan(const struct flow_rule *rule, struct flow_match_vlan *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_CVLAN, out); } EXPORT_SYMBOL(flow_rule_match_cvlan); void flow_rule_match_arp(const struct flow_rule *rule, struct flow_match_arp *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_ARP, out); } EXPORT_SYMBOL(flow_rule_match_arp); void flow_rule_match_ipv4_addrs(const struct flow_rule *rule, struct flow_match_ipv4_addrs *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_IPV4_ADDRS, out); } EXPORT_SYMBOL(flow_rule_match_ipv4_addrs); void flow_rule_match_ipv6_addrs(const 
struct flow_rule *rule, struct flow_match_ipv6_addrs *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_IPV6_ADDRS, out); } EXPORT_SYMBOL(flow_rule_match_ipv6_addrs); void flow_rule_match_ip(const struct flow_rule *rule, struct flow_match_ip *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_IP, out); } EXPORT_SYMBOL(flow_rule_match_ip); void flow_rule_match_ports(const struct flow_rule *rule, struct flow_match_ports *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_PORTS, out); } EXPORT_SYMBOL(flow_rule_match_ports); void flow_rule_match_ports_range(const struct flow_rule *rule, struct flow_match_ports_range *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_PORTS_RANGE, out); } EXPORT_SYMBOL(flow_rule_match_ports_range); void flow_rule_match_tcp(const struct flow_rule *rule, struct flow_match_tcp *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_TCP, out); } EXPORT_SYMBOL(flow_rule_match_tcp); void flow_rule_match_ipsec(const struct flow_rule *rule, struct flow_match_ipsec *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_IPSEC, out); } EXPORT_SYMBOL(flow_rule_match_ipsec); void flow_rule_match_icmp(const struct flow_rule *rule, struct flow_match_icmp *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_ICMP, out); } EXPORT_SYMBOL(flow_rule_match_icmp); void flow_rule_match_mpls(const struct flow_rule *rule, struct flow_match_mpls *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_MPLS, out); } EXPORT_SYMBOL(flow_rule_match_mpls); void flow_rule_match_enc_control(const struct flow_rule *rule, struct flow_match_control *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_ENC_CONTROL, out); } EXPORT_SYMBOL(flow_rule_match_enc_control); void flow_rule_match_enc_ipv4_addrs(const struct flow_rule *rule, struct flow_match_ipv4_addrs *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS, out); } EXPORT_SYMBOL(flow_rule_match_enc_ipv4_addrs); void flow_rule_match_enc_ipv6_addrs(const struct flow_rule *rule, struct 
flow_match_ipv6_addrs *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS, out); } EXPORT_SYMBOL(flow_rule_match_enc_ipv6_addrs); void flow_rule_match_enc_ip(const struct flow_rule *rule, struct flow_match_ip *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_ENC_IP, out); } EXPORT_SYMBOL(flow_rule_match_enc_ip); void flow_rule_match_enc_ports(const struct flow_rule *rule, struct flow_match_ports *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_ENC_PORTS, out); } EXPORT_SYMBOL(flow_rule_match_enc_ports); void flow_rule_match_enc_keyid(const struct flow_rule *rule, struct flow_match_enc_keyid *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_ENC_KEYID, out); } EXPORT_SYMBOL(flow_rule_match_enc_keyid); void flow_rule_match_enc_opts(const struct flow_rule *rule, struct flow_match_enc_opts *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_ENC_OPTS, out); } EXPORT_SYMBOL(flow_rule_match_enc_opts); struct flow_action_cookie *flow_action_cookie_create(void *data, unsigned int len, gfp_t gfp) { struct flow_action_cookie *cookie; cookie = kmalloc(sizeof(*cookie) + len, gfp); if (!cookie) return NULL; cookie->cookie_len = len; memcpy(cookie->cookie, data, len); return cookie; } EXPORT_SYMBOL(flow_action_cookie_create); void flow_action_cookie_destroy(struct flow_action_cookie *cookie) { kfree(cookie); } EXPORT_SYMBOL(flow_action_cookie_destroy); void flow_rule_match_ct(const struct flow_rule *rule, struct flow_match_ct *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_CT, out); } EXPORT_SYMBOL(flow_rule_match_ct); void flow_rule_match_pppoe(const struct flow_rule *rule, struct flow_match_pppoe *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_PPPOE, out); } EXPORT_SYMBOL(flow_rule_match_pppoe); void flow_rule_match_l2tpv3(const struct flow_rule *rule, struct flow_match_l2tpv3 *out) { FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_L2TPV3, out); } EXPORT_SYMBOL(flow_rule_match_l2tpv3); struct flow_block_cb 
*flow_block_cb_alloc(flow_setup_cb_t *cb, void *cb_ident, void *cb_priv, void (*release)(void *cb_priv)) { struct flow_block_cb *block_cb; block_cb = kzalloc(sizeof(*block_cb), GFP_KERNEL); if (!block_cb) return ERR_PTR(-ENOMEM); block_cb->cb = cb; block_cb->cb_ident = cb_ident; block_cb->cb_priv = cb_priv; block_cb->release = release; return block_cb; } EXPORT_SYMBOL(flow_block_cb_alloc); void flow_block_cb_free(struct flow_block_cb *block_cb) { if (block_cb->release) block_cb->release(block_cb->cb_priv); kfree(block_cb); } EXPORT_SYMBOL(flow_block_cb_free); struct flow_block_cb *flow_block_cb_lookup(struct flow_block *block, flow_setup_cb_t *cb, void *cb_ident) { struct flow_block_cb *block_cb; list_for_each_entry(block_cb, &block->cb_list, list) { if (block_cb->cb == cb && block_cb->cb_ident == cb_ident) return block_cb; } return NULL; } EXPORT_SYMBOL(flow_block_cb_lookup); void *flow_block_cb_priv(struct flow_block_cb *block_cb) { return block_cb->cb_priv; } EXPORT_SYMBOL(flow_block_cb_priv); void flow_block_cb_incref(struct flow_block_cb *block_cb) { block_cb->refcnt++; } EXPORT_SYMBOL(flow_block_cb_incref); unsigned int flow_block_cb_decref(struct flow_block_cb *block_cb) { return --block_cb->refcnt; } EXPORT_SYMBOL(flow_block_cb_decref); bool flow_block_cb_is_busy(flow_setup_cb_t *cb, void *cb_ident, struct list_head *driver_block_list) { struct flow_block_cb *block_cb; list_for_each_entry(block_cb, driver_block_list, driver_list) { if (block_cb->cb == cb && block_cb->cb_ident == cb_ident) return true; } return false; } EXPORT_SYMBOL(flow_block_cb_is_busy); int flow_block_cb_setup_simple(struct flow_block_offload *f, struct list_head *driver_block_list, flow_setup_cb_t *cb, void *cb_ident, void *cb_priv, bool ingress_only) { struct flow_block_cb *block_cb; if (ingress_only && f->binder_type != FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS) return -EOPNOTSUPP; f->driver_block_list = driver_block_list; switch (f->command) { case FLOW_BLOCK_BIND: if 
(flow_block_cb_is_busy(cb, cb_ident, driver_block_list)) return -EBUSY; block_cb = flow_block_cb_alloc(cb, cb_ident, cb_priv, NULL); if (IS_ERR(block_cb)) return PTR_ERR(block_cb); flow_block_cb_add(block_cb, f); list_add_tail(&block_cb->driver_list, driver_block_list); return 0; case FLOW_BLOCK_UNBIND: block_cb = flow_block_cb_lookup(f->block, cb, cb_ident); if (!block_cb) return -ENOENT; flow_block_cb_remove(block_cb, f); list_del(&block_cb->driver_list); return 0; default: return -EOPNOTSUPP; } } EXPORT_SYMBOL(flow_block_cb_setup_simple); static DEFINE_MUTEX(flow_indr_block_lock); static LIST_HEAD(flow_block_indr_list); static LIST_HEAD(flow_block_indr_dev_list); static LIST_HEAD(flow_indir_dev_list); struct flow_indr_dev { struct list_head list; flow_indr_block_bind_cb_t *cb; void *cb_priv; refcount_t refcnt; }; static struct flow_indr_dev *flow_indr_dev_alloc(flow_indr_block_bind_cb_t *cb, void *cb_priv) { struct flow_indr_dev *indr_dev; indr_dev = kmalloc(sizeof(*indr_dev), GFP_KERNEL); if (!indr_dev) return NULL; indr_dev->cb = cb; indr_dev->cb_priv = cb_priv; refcount_set(&indr_dev->refcnt, 1); return indr_dev; } struct flow_indir_dev_info { void *data; struct net_device *dev; struct Qdisc *sch; enum tc_setup_type type; void (*cleanup)(struct flow_block_cb *block_cb); struct list_head list; enum flow_block_command command; enum flow_block_binder_type binder_type; struct list_head *cb_list; }; static void existing_qdiscs_register(flow_indr_block_bind_cb_t *cb, void *cb_priv) { struct flow_block_offload bo; struct flow_indir_dev_info *cur; list_for_each_entry(cur, &flow_indir_dev_list, list) { memset(&bo, 0, sizeof(bo)); bo.command = cur->command; bo.binder_type = cur->binder_type; INIT_LIST_HEAD(&bo.cb_list); cb(cur->dev, cur->sch, cb_priv, cur->type, &bo, cur->data, cur->cleanup); list_splice(&bo.cb_list, cur->cb_list); } } int flow_indr_dev_register(flow_indr_block_bind_cb_t *cb, void *cb_priv) { struct flow_indr_dev *indr_dev; 
mutex_lock(&flow_indr_block_lock); list_for_each_entry(indr_dev, &flow_block_indr_dev_list, list) { if (indr_dev->cb == cb && indr_dev->cb_priv == cb_priv) { refcount_inc(&indr_dev->refcnt); mutex_unlock(&flow_indr_block_lock); return 0; } } indr_dev = flow_indr_dev_alloc(cb, cb_priv); if (!indr_dev) { mutex_unlock(&flow_indr_block_lock); return -ENOMEM; } list_add(&indr_dev->list, &flow_block_indr_dev_list); existing_qdiscs_register(cb, cb_priv); mutex_unlock(&flow_indr_block_lock); tcf_action_reoffload_cb(cb, cb_priv, true); return 0; } EXPORT_SYMBOL(flow_indr_dev_register); static void __flow_block_indr_cleanup(void (*release)(void *cb_priv), void *cb_priv, struct list_head *cleanup_list) { struct flow_block_cb *this, *next; list_for_each_entry_safe(this, next, &flow_block_indr_list, indr.list) { if (this->release == release && this->indr.cb_priv == cb_priv) list_move(&this->indr.list, cleanup_list); } } static void flow_block_indr_notify(struct list_head *cleanup_list) { struct flow_block_cb *this, *next; list_for_each_entry_safe(this, next, cleanup_list, indr.list) { list_del(&this->indr.list); this->indr.cleanup(this); } } void flow_indr_dev_unregister(flow_indr_block_bind_cb_t *cb, void *cb_priv, void (*release)(void *cb_priv)) { struct flow_indr_dev *this, *next, *indr_dev = NULL; LIST_HEAD(cleanup_list); mutex_lock(&flow_indr_block_lock); list_for_each_entry_safe(this, next, &flow_block_indr_dev_list, list) { if (this->cb == cb && this->cb_priv == cb_priv && refcount_dec_and_test(&this->refcnt)) { indr_dev = this; list_del(&indr_dev->list); break; } } if (!indr_dev) { mutex_unlock(&flow_indr_block_lock); return; } __flow_block_indr_cleanup(release, cb_priv, &cleanup_list); mutex_unlock(&flow_indr_block_lock); tcf_action_reoffload_cb(cb, cb_priv, false); flow_block_indr_notify(&cleanup_list); kfree(indr_dev); } EXPORT_SYMBOL(flow_indr_dev_unregister); static void flow_block_indr_init(struct flow_block_cb *flow_block, struct flow_block_offload *bo, struct 
net_device *dev, struct Qdisc *sch, void *data, void *cb_priv, void (*cleanup)(struct flow_block_cb *block_cb)) { flow_block->indr.binder_type = bo->binder_type; flow_block->indr.data = data; flow_block->indr.cb_priv = cb_priv; flow_block->indr.dev = dev; flow_block->indr.sch = sch; flow_block->indr.cleanup = cleanup; } struct flow_block_cb *flow_indr_block_cb_alloc(flow_setup_cb_t *cb, void *cb_ident, void *cb_priv, void (*release)(void *cb_priv), struct flow_block_offload *bo, struct net_device *dev, struct Qdisc *sch, void *data, void *indr_cb_priv, void (*cleanup)(struct flow_block_cb *block_cb)) { struct flow_block_cb *block_cb; block_cb = flow_block_cb_alloc(cb, cb_ident, cb_priv, release); if (IS_ERR(block_cb)) goto out; flow_block_indr_init(block_cb, bo, dev, sch, data, indr_cb_priv, cleanup); list_add(&block_cb->indr.list, &flow_block_indr_list); out: return block_cb; } EXPORT_SYMBOL(flow_indr_block_cb_alloc); static struct flow_indir_dev_info *find_indir_dev(void *data) { struct flow_indir_dev_info *cur; list_for_each_entry(cur, &flow_indir_dev_list, list) { if (cur->data == data) return cur; } return NULL; } static int indir_dev_add(void *data, struct net_device *dev, struct Qdisc *sch, enum tc_setup_type type, void (*cleanup)(struct flow_block_cb *block_cb), struct flow_block_offload *bo) { struct flow_indir_dev_info *info; info = find_indir_dev(data); if (info) return -EEXIST; info = kzalloc(sizeof(*info), GFP_KERNEL); if (!info) return -ENOMEM; info->data = data; info->dev = dev; info->sch = sch; info->type = type; info->cleanup = cleanup; info->command = bo->command; info->binder_type = bo->binder_type; info->cb_list = bo->cb_list_head; list_add(&info->list, &flow_indir_dev_list); return 0; } static int indir_dev_remove(void *data) { struct flow_indir_dev_info *info; info = find_indir_dev(data); if (!info) return -ENOENT; list_del(&info->list); kfree(info); return 0; } int flow_indr_dev_setup_offload(struct net_device *dev, struct Qdisc *sch, enum 
tc_setup_type type, void *data, struct flow_block_offload *bo, void (*cleanup)(struct flow_block_cb *block_cb)) { struct flow_indr_dev *this; u32 count = 0; int err; mutex_lock(&flow_indr_block_lock); if (bo) { if (bo->command == FLOW_BLOCK_BIND) indir_dev_add(data, dev, sch, type, cleanup, bo); else if (bo->command == FLOW_BLOCK_UNBIND) indir_dev_remove(data); } list_for_each_entry(this, &flow_block_indr_dev_list, list) { err = this->cb(dev, sch, this->cb_priv, type, bo, data, cleanup); if (!err) count++; } mutex_unlock(&flow_indr_block_lock); return (bo && list_empty(&bo->cb_list)) ? -EOPNOTSUPP : count; } EXPORT_SYMBOL(flow_indr_dev_setup_offload); bool flow_indr_dev_exists(void) { return !list_empty(&flow_block_indr_dev_list); } EXPORT_SYMBOL(flow_indr_dev_exists); |
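The allocation pattern used by `flow_rule_alloc()` and `offload_action_alloc()` above (one zeroed allocation sized for a header plus a flexible array, followed by a pre-fill loop) can be tried out in user space. The following is only a minimal sketch under stated assumptions: `demo_rule`, `demo_entry`, `demo_rule_alloc()` and `DEMO_HW_STATS_DONT_CARE` are hypothetical names invented for illustration, `calloc()` stands in for `kzalloc()`, and the explicit overflow check stands in for the kernel's `struct_size()` helper.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for FLOW_ACTION_HW_STATS_DONT_CARE. */
enum { DEMO_HW_STATS_DONT_CARE = 0xff };

/* Hypothetical, simplified analogue of a flow action entry. */
struct demo_entry {
	int hw_stats;
};

/* Header followed by a flexible array member, like struct flow_rule. */
struct demo_rule {
	unsigned int num_entries;
	struct demo_entry entries[];
};

/* Allocate a zeroed rule with room for n entries and pre-fill hw_stats. */
static struct demo_rule *demo_rule_alloc(unsigned int n)
{
	struct demo_rule *rule;
	unsigned int i;

	/* Refuse sizes whose byte count would overflow size_t. */
	if (n > (SIZE_MAX - sizeof(*rule)) / sizeof(rule->entries[0]))
		return NULL;

	rule = calloc(1, sizeof(*rule) + (size_t)n * sizeof(rule->entries[0]));
	if (!rule)
		return NULL;

	rule->num_entries = n;
	/* Callers may override this default per entry afterwards. */
	for (i = 0; i < n; i++)
		rule->entries[i].hw_stats = DEMO_HW_STATS_DONT_CARE;

	return rule;
}
```

A caller would pair `demo_rule_alloc(n)` with a plain `free()`. The single allocation keeps the header and its entries contiguous, which is why the size has to be computed with overflow in mind.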
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 *	X.25 Packet Layer release 002
 *
 *	This is ALPHA test software. This code may break your machine, randomly fail to work with new
 *	releases, misbehave and/or generally screw up. It might even work.
 *
 *	This code REQUIRES 2.1.15 or higher
 *
 *	History
 *	X.25 001	Jonathan Naylor	Started coding.
 *	2000-09-04	Henner Eisen	Prevent freeing a dangling skb.
 */

#define pr_fmt(fmt) "X25: " fmt

#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/slab.h>
#include <net/sock.h>
#include <linux/if_arp.h>
#include <net/x25.h>
#include <net/x25device.h>

static int x25_receive_data(struct sk_buff *skb, struct x25_neigh *nb)
{
	struct sock *sk;
	unsigned short frametype;
	unsigned int lci;

	if (!pskb_may_pull(skb, X25_STD_MIN_LEN))
		return 0;

	frametype = skb->data[2];
	lci = ((skb->data[0] << 8) & 0xF00) + ((skb->data[1] << 0) & 0x0FF);

	/*
	 *	LCI of zero is always for us, and it's always a link control
	 *	frame.
	 */
	if (lci == 0) {
		x25_link_control(skb, nb, frametype);
		return 0;
	}

	/*
	 *	Find an existing socket.
	 */
	if ((sk = x25_find_socket(lci, nb)) != NULL) {
		int queued = 1;

		skb_reset_transport_header(skb);
		bh_lock_sock(sk);
		if (!sock_owned_by_user(sk)) {
			queued = x25_process_rx_frame(sk, skb);
		} else {
			queued = !sk_add_backlog(sk, skb, READ_ONCE(sk->sk_rcvbuf));
		}
		bh_unlock_sock(sk);
		sock_put(sk);
		return queued;
	}

	/*
	 *	Is it a Call Request? If so, process it.
	 */
	if (frametype == X25_CALL_REQUEST)
		return x25_rx_call_request(skb, nb, lci);

	/*
	 *	It's not a Call Request, nor is it a control frame.
	 *	Can we forward it?
	 */
	if (x25_forward_data(lci, nb, skb)) {
		if (frametype == X25_CLEAR_CONFIRMATION) {
			x25_clear_forward_by_lci(lci);
		}
		kfree_skb(skb);
		return 1;
	}

/*
	x25_transmit_clear_request(nb, lci, 0x0D);
*/

	if (frametype != X25_CLEAR_CONFIRMATION)
		pr_debug("x25_receive_data(): unknown frame type %2x\n", frametype);

	return 0;
}

int x25_lapb_receive_frame(struct sk_buff *skb, struct net_device *dev,
			   struct packet_type *ptype, struct net_device *orig_dev)
{
	struct sk_buff *nskb;
	struct x25_neigh *nb;

	if (!net_eq(dev_net(dev), &init_net))
		goto drop;

	nskb = skb_copy(skb, GFP_ATOMIC);
	if (!nskb)
		goto drop;
	kfree_skb(skb);
	skb = nskb;

	/*
	 * Packet received from unrecognised device, throw it away.
	 */
	nb = x25_get_neigh(dev);
	if (!nb) {
		pr_debug("unknown neighbour - %s\n", dev->name);
		goto drop;
	}

	if (!pskb_may_pull(skb, 1)) {
		x25_neigh_put(nb);
		goto drop;
	}

	switch (skb->data[0]) {

	case X25_IFACE_DATA:
		skb_pull(skb, 1);
		if (x25_receive_data(skb, nb)) {
			x25_neigh_put(nb);
			goto out;
		}
		break;

	case X25_IFACE_CONNECT:
		x25_link_established(nb);
		break;

	case X25_IFACE_DISCONNECT:
		x25_link_terminated(nb);
		break;
	}
	x25_neigh_put(nb);
drop:
	kfree_skb(skb);
out:
	return 0;
}

void x25_establish_link(struct x25_neigh *nb)
{
	struct sk_buff *skb;
	unsigned char *ptr;

	switch (nb->dev->type) {
	case ARPHRD_X25:
		if ((skb = alloc_skb(1, GFP_ATOMIC)) == NULL) {
			pr_err("x25_dev: out of memory\n");
			return;
		}
		ptr = skb_put(skb, 1);
		*ptr = X25_IFACE_CONNECT;
		break;

	default:
		return;
	}

	skb->protocol = htons(ETH_P_X25);
	skb->dev = nb->dev;

	dev_queue_xmit(skb);
}

void x25_terminate_link(struct x25_neigh *nb)
{
	struct sk_buff *skb;
	unsigned char *ptr;

	if (nb->dev->type != ARPHRD_X25)
		return;

	skb = alloc_skb(1, GFP_ATOMIC);
	if (!skb) {
		pr_err("x25_dev: out of memory\n");
		return;
	}

	ptr = skb_put(skb, 1);
	*ptr = X25_IFACE_DISCONNECT;

	skb->protocol = htons(ETH_P_X25);
	skb->dev = nb->dev;
	dev_queue_xmit(skb);
}

void x25_send_frame(struct sk_buff *skb, struct x25_neigh *nb)
{
	unsigned char *dptr;

	skb_reset_network_header(skb);

	switch (nb->dev->type) {
	case ARPHRD_X25:
		dptr = skb_push(skb, 1);
		*dptr = X25_IFACE_DATA;
		break;

	default:
		kfree_skb(skb);
		return;
	}

	skb->protocol = htons(ETH_P_X25);
	skb->dev = nb->dev;

	dev_queue_xmit(skb);
}
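The header parsing at the top of `x25_receive_data()` can be checked in isolation. The sketch below is an illustration, not kernel code: `x25_demo_lci()` and `x25_demo_frametype()` are hypothetical helper names, and the three-octet minimum length is an assumption standing in for `X25_STD_MIN_LEN`. It uses the same mask-and-shift expression to build the 12-bit logical channel identifier from the low nibble of octet 0 and all of octet 1, and reads the frame type from octet 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed minimum header length; stands in for X25_STD_MIN_LEN. */
#define X25_DEMO_MIN_LEN 3

/*
 * Hypothetical helper: return the 12-bit LCI encoded in the first two
 * header octets, or -1 if the buffer is shorter than the minimum header.
 */
static int x25_demo_lci(const uint8_t *data, size_t len)
{
	if (len < X25_DEMO_MIN_LEN)
		return -1;
	return ((data[0] << 8) & 0xF00) + (data[1] & 0x0FF);
}

/* Hypothetical helper: return the frame type octet, or -1 if too short. */
static int x25_demo_frametype(const uint8_t *data, size_t len)
{
	if (len < X25_DEMO_MIN_LEN)
		return -1;
	return data[2];
}
```

For a header of `0x12 0x34 0x0B`, the high nibble of the first octet is masked away, so the LCI is `0x234`. An LCI of zero corresponds to the link-control branch in the function above.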
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * Copyright (c) 2016 Mellanox Technologies. All rights reserved.
 * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
 */

#include <net/genetlink.h>
#define CREATE_TRACE_POINTS
#include <trace/events/devlink.h>

#include "devl_internal.h"

EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_hwmsg);
EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_hwerr);
EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_trap_report);

DEFINE_XARRAY_FLAGS(devlinks, XA_FLAGS_ALLOC);

static struct devlink *devlinks_xa_get(unsigned long index)
{
	struct devlink *devlink;

	rcu_read_lock();
	devlink = xa_find(&devlinks, &index, index, DEVLINK_REGISTERED);
	if (!devlink || !devlink_try_get(devlink))
		devlink = NULL;
	rcu_read_unlock();
	return devlink;
}

/* devlink_rels xarray contains 1:1 relationships between
 * devlink object and related nested devlink instance.
 * The xarray index is used to get the nested object from
 * the nested-in object code.
*/ static DEFINE_XARRAY_FLAGS(devlink_rels, XA_FLAGS_ALLOC1); #define DEVLINK_REL_IN_USE XA_MARK_0 struct devlink_rel { u32 index; refcount_t refcount; u32 devlink_index; struct { u32 devlink_index; u32 obj_index; devlink_rel_notify_cb_t *notify_cb; devlink_rel_cleanup_cb_t *cleanup_cb; struct delayed_work notify_work; } nested_in; }; static void devlink_rel_free(struct devlink_rel *rel) { xa_erase(&devlink_rels, rel->index); kfree(rel); } static void __devlink_rel_get(struct devlink_rel *rel) { refcount_inc(&rel->refcount); } static void __devlink_rel_put(struct devlink_rel *rel) { if (refcount_dec_and_test(&rel->refcount)) devlink_rel_free(rel); } static void devlink_rel_nested_in_notify_work(struct work_struct *work) { struct devlink_rel *rel = container_of(work, struct devlink_rel, nested_in.notify_work.work); struct devlink *devlink; devlink = devlinks_xa_get(rel->nested_in.devlink_index); if (!devlink) goto rel_put; if (!devl_trylock(devlink)) { devlink_put(devlink); goto reschedule_work; } if (!devl_is_registered(devlink)) { devl_unlock(devlink); devlink_put(devlink); goto rel_put; } if (!xa_get_mark(&devlink_rels, rel->index, DEVLINK_REL_IN_USE)) rel->nested_in.cleanup_cb(devlink, rel->nested_in.obj_index, rel->index); rel->nested_in.notify_cb(devlink, rel->nested_in.obj_index); devl_unlock(devlink); devlink_put(devlink); rel_put: __devlink_rel_put(rel); return; reschedule_work: schedule_delayed_work(&rel->nested_in.notify_work, 1); } static void devlink_rel_nested_in_notify_work_schedule(struct devlink_rel *rel) { __devlink_rel_get(rel); schedule_delayed_work(&rel->nested_in.notify_work, 0); } static struct devlink_rel *devlink_rel_alloc(void) { struct devlink_rel *rel; static u32 next; int err; rel = kzalloc(sizeof(*rel), GFP_KERNEL); if (!rel) return ERR_PTR(-ENOMEM); err = xa_alloc_cyclic(&devlink_rels, &rel->index, rel, xa_limit_32b, &next, GFP_KERNEL); if (err < 0) { kfree(rel); return ERR_PTR(err); } refcount_set(&rel->refcount, 1); 
INIT_DELAYED_WORK(&rel->nested_in.notify_work, &devlink_rel_nested_in_notify_work); return rel; } static void devlink_rel_put(struct devlink *devlink) { struct devlink_rel *rel = devlink->rel; if (!rel) return; xa_clear_mark(&devlink_rels, rel->index, DEVLINK_REL_IN_USE); devlink_rel_nested_in_notify_work_schedule(rel); __devlink_rel_put(rel); devlink->rel = NULL; } void devlink_rel_nested_in_clear(u32 rel_index) { xa_clear_mark(&devlink_rels, rel_index, DEVLINK_REL_IN_USE); } int devlink_rel_nested_in_add(u32 *rel_index, u32 devlink_index, u32 obj_index, devlink_rel_notify_cb_t *notify_cb, devlink_rel_cleanup_cb_t *cleanup_cb, struct devlink *devlink) { struct devlink_rel *rel = devlink_rel_alloc(); ASSERT_DEVLINK_NOT_REGISTERED(devlink); if (IS_ERR(rel)) return PTR_ERR(rel); rel->devlink_index = devlink->index; rel->nested_in.devlink_index = devlink_index; rel->nested_in.obj_index = obj_index; rel->nested_in.notify_cb = notify_cb; rel->nested_in.cleanup_cb = cleanup_cb; *rel_index = rel->index; xa_set_mark(&devlink_rels, rel->index, DEVLINK_REL_IN_USE); devlink->rel = rel; return 0; } /** * devlink_rel_nested_in_notify - Notify the object this devlink * instance is nested in. * @devlink: devlink * * This is called upon network namespace change of devlink instance. * In case this devlink instance is nested in another devlink object, * a notification of a change of this object should be sent * over netlink. The parent devlink instance lock needs to be * taken during the notification preparation. * However, since the devlink lock of nested instance is held here, * we would end with wrong devlink instance lock ordering and * deadlock. Therefore the work is utilized to avoid that. 
*/ void devlink_rel_nested_in_notify(struct devlink *devlink) { struct devlink_rel *rel = devlink->rel; if (!rel) return; devlink_rel_nested_in_notify_work_schedule(rel); } static struct devlink_rel *devlink_rel_find(unsigned long rel_index) { return xa_find(&devlink_rels, &rel_index, rel_index, DEVLINK_REL_IN_USE); } static struct devlink *devlink_rel_devlink_get(u32 rel_index) { struct devlink_rel *rel; u32 devlink_index; if (!rel_index) return NULL; xa_lock(&devlink_rels); rel = devlink_rel_find(rel_index); if (rel) devlink_index = rel->devlink_index; xa_unlock(&devlink_rels); if (!rel) return NULL; return devlinks_xa_get(devlink_index); } int devlink_rel_devlink_handle_put(struct sk_buff *msg, struct devlink *devlink, u32 rel_index, int attrtype, bool *msg_updated) { struct net *net = devlink_net(devlink); struct devlink *rel_devlink; int err; rel_devlink = devlink_rel_devlink_get(rel_index); if (!rel_devlink) return 0; err = devlink_nl_put_nested_handle(msg, net, rel_devlink, attrtype); devlink_put(rel_devlink); if (!err && msg_updated) *msg_updated = true; return err; } void *devlink_priv(struct devlink *devlink) { return &devlink->priv; } EXPORT_SYMBOL_GPL(devlink_priv); struct devlink *priv_to_devlink(void *priv) { return container_of(priv, struct devlink, priv); } EXPORT_SYMBOL_GPL(priv_to_devlink); struct device *devlink_to_dev(const struct devlink *devlink) { return devlink->dev; } EXPORT_SYMBOL_GPL(devlink_to_dev); struct net *devlink_net(const struct devlink *devlink) { return read_pnet(&devlink->_net); } EXPORT_SYMBOL_GPL(devlink_net); void devl_assert_locked(struct devlink *devlink) { lockdep_assert_held(&devlink->lock); } EXPORT_SYMBOL_GPL(devl_assert_locked); #ifdef CONFIG_LOCKDEP /* For use in conjunction with LOCKDEP only e.g. 
rcu_dereference_protected() */ bool devl_lock_is_held(struct devlink *devlink) { return lockdep_is_held(&devlink->lock); } EXPORT_SYMBOL_GPL(devl_lock_is_held); #endif void devl_lock(struct devlink *devlink) { mutex_lock(&devlink->lock); } EXPORT_SYMBOL_GPL(devl_lock); int devl_trylock(struct devlink *devlink) { return mutex_trylock(&devlink->lock); } EXPORT_SYMBOL_GPL(devl_trylock); void devl_unlock(struct devlink *devlink) { mutex_unlock(&devlink->lock); } EXPORT_SYMBOL_GPL(devl_unlock); /** * devlink_try_get() - try to obtain a reference on a devlink instance * @devlink: instance to reference * * Obtain a reference on a devlink instance. A reference on a devlink instance * only implies that it's safe to take the instance lock. It does not imply * that the instance is registered, use devl_is_registered() after taking * the instance lock to check registration status. */ struct devlink *__must_check devlink_try_get(struct devlink *devlink) { if (refcount_inc_not_zero(&devlink->refcount)) return devlink; return NULL; } static void devlink_release(struct work_struct *work) { struct devlink *devlink; devlink = container_of(to_rcu_work(work), struct devlink, rwork); mutex_destroy(&devlink->lock); lockdep_unregister_key(&devlink->lock_key); put_device(devlink->dev); kvfree(devlink); } void devlink_put(struct devlink *devlink) { if (refcount_dec_and_test(&devlink->refcount)) queue_rcu_work(system_wq, &devlink->rwork); } struct devlink *devlinks_xa_find_get(struct net *net, unsigned long *indexp) { struct devlink *devlink = NULL; rcu_read_lock(); retry: devlink = xa_find(&devlinks, indexp, ULONG_MAX, DEVLINK_REGISTERED); if (!devlink) goto unlock; if (!devlink_try_get(devlink)) goto next; if (!net_eq(devlink_net(devlink), net)) { devlink_put(devlink); goto next; } unlock: rcu_read_unlock(); return devlink; next: (*indexp)++; goto retry; } /** * devl_register - Register devlink instance * @devlink: devlink */ int devl_register(struct devlink *devlink) { 
	ASSERT_DEVLINK_NOT_REGISTERED(devlink);
	devl_assert_locked(devlink);

	xa_set_mark(&devlinks, devlink->index, DEVLINK_REGISTERED);
	devlink_notify_register(devlink);
	devlink_rel_nested_in_notify(devlink);

	return 0;
}
EXPORT_SYMBOL_GPL(devl_register);

void devlink_register(struct devlink *devlink)
{
	devl_lock(devlink);
	devl_register(devlink);
	devl_unlock(devlink);
}
EXPORT_SYMBOL_GPL(devlink_register);

/**
 * devl_unregister - Unregister devlink instance
 * @devlink: devlink
 */
void devl_unregister(struct devlink *devlink)
{
	ASSERT_DEVLINK_REGISTERED(devlink);
	devl_assert_locked(devlink);

	devlink_notify_unregister(devlink);
	xa_clear_mark(&devlinks, devlink->index, DEVLINK_REGISTERED);
	devlink_rel_put(devlink);
}
EXPORT_SYMBOL_GPL(devl_unregister);

void devlink_unregister(struct devlink *devlink)
{
	devl_lock(devlink);
	devl_unregister(devlink);
	devl_unlock(devlink);
}
EXPORT_SYMBOL_GPL(devlink_unregister);

/**
 * devlink_alloc_ns - Allocate new devlink instance resources
 * in specific namespace
 *
 * @ops: ops
 * @priv_size: size of user private data
 * @net: net namespace
 * @dev: parent device
 *
 * Allocate new devlink instance resources, including devlink index
 * and name.
 */
struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
				 size_t priv_size, struct net *net,
				 struct device *dev)
{
	struct devlink *devlink;
	static u32 last_id;
	int ret;

	WARN_ON(!ops || !dev);
	if (!devlink_reload_actions_valid(ops))
		return NULL;

	devlink = kvzalloc(struct_size(devlink, priv, priv_size), GFP_KERNEL);
	if (!devlink)
		return NULL;

	ret = xa_alloc_cyclic(&devlinks, &devlink->index, devlink, xa_limit_31b,
			      &last_id, GFP_KERNEL);
	if (ret < 0)
		goto err_xa_alloc;

	devlink->dev = get_device(dev);
	devlink->ops = ops;
	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
	xa_init_flags(&devlink->params, XA_FLAGS_ALLOC);
	xa_init_flags(&devlink->snapshot_ids, XA_FLAGS_ALLOC);
	xa_init_flags(&devlink->nested_rels, XA_FLAGS_ALLOC);
	write_pnet(&devlink->_net, net);
	INIT_LIST_HEAD(&devlink->rate_list);
	INIT_LIST_HEAD(&devlink->linecard_list);
	INIT_LIST_HEAD(&devlink->sb_list);
	INIT_LIST_HEAD_RCU(&devlink->dpipe_table_list);
	INIT_LIST_HEAD(&devlink->resource_list);
	INIT_LIST_HEAD(&devlink->region_list);
	INIT_LIST_HEAD(&devlink->reporter_list);
	INIT_LIST_HEAD(&devlink->trap_list);
	INIT_LIST_HEAD(&devlink->trap_group_list);
	INIT_LIST_HEAD(&devlink->trap_policer_list);
	INIT_RCU_WORK(&devlink->rwork, devlink_release);
	lockdep_register_key(&devlink->lock_key);
	mutex_init(&devlink->lock);
	lockdep_set_class(&devlink->lock, &devlink->lock_key);
	refcount_set(&devlink->refcount, 1);

	return devlink;

err_xa_alloc:
	kvfree(devlink);
	return NULL;
}
EXPORT_SYMBOL_GPL(devlink_alloc_ns);

/**
 * devlink_free - Free devlink instance resources
 *
 * @devlink: devlink
 */
void devlink_free(struct devlink *devlink)
{
	ASSERT_DEVLINK_NOT_REGISTERED(devlink);

	WARN_ON(!list_empty(&devlink->trap_policer_list));
	WARN_ON(!list_empty(&devlink->trap_group_list));
	WARN_ON(!list_empty(&devlink->trap_list));
	WARN_ON(!list_empty(&devlink->reporter_list));
	WARN_ON(!list_empty(&devlink->region_list));
	WARN_ON(!list_empty(&devlink->resource_list));
	WARN_ON(!list_empty(&devlink->dpipe_table_list));
	WARN_ON(!list_empty(&devlink->sb_list));
	WARN_ON(!list_empty(&devlink->rate_list));
	WARN_ON(!list_empty(&devlink->linecard_list));
	WARN_ON(!xa_empty(&devlink->ports));

	xa_destroy(&devlink->nested_rels);
	xa_destroy(&devlink->snapshot_ids);
	xa_destroy(&devlink->params);
	xa_destroy(&devlink->ports);

	xa_erase(&devlinks, devlink->index);

	devlink_put(devlink);
}
EXPORT_SYMBOL_GPL(devlink_free);

static void __net_exit devlink_pernet_pre_exit(struct net *net)
{
	struct devlink *devlink;
	u32 actions_performed;
	unsigned long index;
	int err;

	/* In case network namespace is getting destroyed, reload
	 * all devlink instances from this namespace into init_net.
	 */
	devlinks_xa_for_each_registered_get(net, index, devlink) {
		devl_dev_lock(devlink, true);
		err = 0;
		if (devl_is_registered(devlink))
			err = devlink_reload(devlink, &init_net,
					     DEVLINK_RELOAD_ACTION_DRIVER_REINIT,
					     DEVLINK_RELOAD_LIMIT_UNSPEC,
					     &actions_performed, NULL);
		devl_dev_unlock(devlink, true);
		devlink_put(devlink);
		if (err && err != -EOPNOTSUPP)
			pr_warn("Failed to reload devlink instance into init_net\n");
	}
}

static struct pernet_operations devlink_pernet_ops __net_initdata = {
	.pre_exit = devlink_pernet_pre_exit,
};

static struct notifier_block devlink_port_netdevice_nb = {
	.notifier_call = devlink_port_netdevice_event,
};

static int __init devlink_init(void)
{
	int err;

	err = register_pernet_subsys(&devlink_pernet_ops);
	if (err)
		goto out;
	err = genl_register_family(&devlink_nl_family);
	if (err)
		goto out_unreg_pernet_subsys;
	err = register_netdevice_notifier(&devlink_port_netdevice_nb);
	if (!err)
		return 0;

	genl_unregister_family(&devlink_nl_family);

out_unreg_pernet_subsys:
	unregister_pernet_subsys(&devlink_pernet_ops);
out:
	WARN_ON(err);
	return err;
}

subsys_initcall(devlink_init);
// SPDX-License-Identifier: GPL-2.0-only

#include <linux/net.h>
#include <linux/netdevice.h>
#include <linux/netlink.h>
#include <linux/types.h>
#include <net/pkt_sched.h>

#include "sch_mqprio_lib.h"

/* Returns true if the intervals [a, b) and [c, d) overlap. */
static bool intervals_overlap(int a, int b, int c, int d)
{
	int left = max(a, c), right = min(b, d);

	return left < right;
}

static int mqprio_validate_queue_counts(struct net_device *dev,
					const struct tc_mqprio_qopt *qopt,
					bool allow_overlapping_txqs,
					struct netlink_ext_ack *extack)
{
	int i, j;

	for (i = 0; i < qopt->num_tc; i++) {
		unsigned int last = qopt->offset[i] + qopt->count[i];

		if (!qopt->count[i]) {
			NL_SET_ERR_MSG_FMT_MOD(extack, "No queues for TC %d",
					       i);
			return -EINVAL;
		}

		/* Verify the queue count is in tx range being equal to the
		 * real_num_tx_queues indicates the last queue is in use.
		 */
		if (qopt->offset[i] >= dev->real_num_tx_queues ||
		    last > dev->real_num_tx_queues) {
			NL_SET_ERR_MSG_FMT_MOD(extack,
					       "Queues %d:%d for TC %d exceed the %d TX queues available",
					       qopt->count[i], qopt->offset[i],
					       i, dev->real_num_tx_queues);
			return -EINVAL;
		}

		if (allow_overlapping_txqs)
			continue;

		/* Verify that the offset and counts do not overlap */
		for (j = i + 1; j < qopt->num_tc; j++) {
			if (intervals_overlap(qopt->offset[i], last,
					      qopt->offset[j],
					      qopt->offset[j] +
					      qopt->count[j])) {
				NL_SET_ERR_MSG_FMT_MOD(extack,
						       "TC %d queues %d@%d overlap with TC %d queues %d@%d",
						       i, qopt->count[i], qopt->offset[i],
						       j, qopt->count[j], qopt->offset[j]);
				return -EINVAL;
			}
		}
	}

	return 0;
}

int mqprio_validate_qopt(struct net_device *dev, struct tc_mqprio_qopt *qopt,
			 bool validate_queue_counts,
			 bool allow_overlapping_txqs,
			 struct netlink_ext_ack *extack)
{
	int i, err;

	/* Verify num_tc is not out of max range */
	if (qopt->num_tc > TC_MAX_QUEUE) {
		NL_SET_ERR_MSG(extack,
			       "Number of traffic classes is outside valid range");
		return -EINVAL;
	}

	/* Verify priority mapping uses valid tcs */
	for (i = 0; i <= TC_BITMASK; i++) {
		if (qopt->prio_tc_map[i] >= qopt->num_tc) {
			NL_SET_ERR_MSG(extack,
				       "Invalid traffic class in priority to traffic class mapping");
			return -EINVAL;
		}
	}

	if (validate_queue_counts) {
		err = mqprio_validate_queue_counts(dev, qopt,
						   allow_overlapping_txqs,
						   extack);
		if (err)
			return err;
	}

	return 0;
}
EXPORT_SYMBOL_GPL(mqprio_validate_qopt);

void mqprio_qopt_reconstruct(struct net_device *dev, struct tc_mqprio_qopt *qopt)
{
	int tc, num_tc = netdev_get_num_tc(dev);

	qopt->num_tc = num_tc;
	memcpy(qopt->prio_tc_map, dev->prio_tc_map, sizeof(qopt->prio_tc_map));

	for (tc = 0; tc < num_tc; tc++) {
		qopt->count[tc] = dev->tc_to_txq[tc].count;
		qopt->offset[tc] = dev->tc_to_txq[tc].offset;
	}
}
EXPORT_SYMBOL_GPL(mqprio_qopt_reconstruct);

void mqprio_fp_to_offload(u32 fp[TC_QOPT_MAX_QUEUE],
			  struct tc_mqprio_qopt_offload *mqprio)
{
	unsigned long preemptible_tcs = 0;
	int tc;

	for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
		if (fp[tc] == TC_FP_PREEMPTIBLE)
			preemptible_tcs |= BIT(tc);

	mqprio->preemptible_tcs = preemptible_tcs;
}
EXPORT_SYMBOL_GPL(mqprio_fp_to_offload);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Shared mqprio qdisc code currently between taprio and mqprio");
// SPDX-License-Identifier: GPL-2.0
#include <linux/types.h>
#include <linux/netfilter.h>
#include <linux/slab.h>
#include <linux/module.h>
#include <linux/skbuff.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/percpu.h>
#include <linux/netdevice.h>
#include <linux/security.h>
#include <net/net_namespace.h>
#ifdef CONFIG_SYSCTL
#include <linux/sysctl.h>
#endif

#include <net/netfilter/nf_conntrack.h>
#include <net/netfilter/nf_conntrack_core.h>
#include <net/netfilter/nf_conntrack_l4proto.h>
#include <net/netfilter/nf_conntrack_expect.h>
#include <net/netfilter/nf_conntrack_helper.h>
#include <net/netfilter/nf_conntrack_acct.h>
#include <net/netfilter/nf_conntrack_zones.h>
#include <net/netfilter/nf_conntrack_timestamp.h>
#include <linux/rculist_nulls.h>

static bool enable_hooks __read_mostly;
MODULE_PARM_DESC(enable_hooks, "Always enable conntrack hooks");
module_param(enable_hooks, bool, 0000);

unsigned int nf_conntrack_net_id __read_mostly;

#ifdef CONFIG_NF_CONNTRACK_PROCFS
void
print_tuple(struct seq_file *s, const struct nf_conntrack_tuple *tuple,
	    const struct nf_conntrack_l4proto *l4proto)
{
	switch (tuple->src.l3num) {
	case NFPROTO_IPV4:
		seq_printf(s, "src=%pI4 dst=%pI4 ",
			   &tuple->src.u3.ip, &tuple->dst.u3.ip);
		break;
	case NFPROTO_IPV6:
		seq_printf(s, "src=%pI6 dst=%pI6 ",
			   tuple->src.u3.ip6, tuple->dst.u3.ip6);
		break;
	default:
		break;
	}

	switch (l4proto->l4proto) {
	case IPPROTO_ICMP:
		seq_printf(s, "type=%u code=%u id=%u ",
			   tuple->dst.u.icmp.type,
			   tuple->dst.u.icmp.code,
			   ntohs(tuple->src.u.icmp.id));
		break;
	case IPPROTO_TCP:
		seq_printf(s, "sport=%hu dport=%hu ",
			   ntohs(tuple->src.u.tcp.port),
			   ntohs(tuple->dst.u.tcp.port));
		break;
	case IPPROTO_UDPLITE:
	case IPPROTO_UDP:
		seq_printf(s, "sport=%hu dport=%hu ",
			   ntohs(tuple->src.u.udp.port),
			   ntohs(tuple->dst.u.udp.port));
		break;
	case IPPROTO_DCCP:
		seq_printf(s, "sport=%hu dport=%hu ",
			   ntohs(tuple->src.u.dccp.port),
			   ntohs(tuple->dst.u.dccp.port));
		break;
	case IPPROTO_SCTP:
		seq_printf(s, "sport=%hu dport=%hu ",
			   ntohs(tuple->src.u.sctp.port),
			   ntohs(tuple->dst.u.sctp.port));
		break;
	case IPPROTO_ICMPV6:
		seq_printf(s, "type=%u code=%u id=%u ",
			   tuple->dst.u.icmp.type,
			   tuple->dst.u.icmp.code,
			   ntohs(tuple->src.u.icmp.id));
		break;
	case IPPROTO_GRE:
		seq_printf(s, "srckey=0x%x dstkey=0x%x ",
			   ntohs(tuple->src.u.gre.key),
			   ntohs(tuple->dst.u.gre.key));
		break;
	default:
		break;
	}
}
EXPORT_SYMBOL_GPL(print_tuple);

struct ct_iter_state {
	struct seq_net_private p;
	struct hlist_nulls_head *hash;
	unsigned int htable_size;
	unsigned int bucket;
	u_int64_t time_now;
};

static struct hlist_nulls_node *ct_get_first(struct seq_file *seq)
{
	struct ct_iter_state *st = seq->private;
	struct hlist_nulls_node *n;

	for (st->bucket = 0;
	     st->bucket < st->htable_size;
	     st->bucket++) {
		n = rcu_dereference(
			hlist_nulls_first_rcu(&st->hash[st->bucket]));
		if (!is_a_nulls(n))
			return n;
	}
	return NULL;
}

static struct hlist_nulls_node *ct_get_next(struct seq_file *seq,
					    struct hlist_nulls_node *head)
{
	struct ct_iter_state *st = seq->private;

	head = rcu_dereference(hlist_nulls_next_rcu(head));
	while (is_a_nulls(head)) {
		if (likely(get_nulls_value(head) == st->bucket)) {
			if (++st->bucket >= st->htable_size)
				return NULL;
		}
		head = rcu_dereference(
			hlist_nulls_first_rcu(&st->hash[st->bucket]));
	}
	return head;
}

static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos)
{
	struct hlist_nulls_node *head = ct_get_first(seq);

	if (head)
		while (pos && (head = ct_get_next(seq, head)))
			pos--;
	return pos ? NULL : head;
}

static void *ct_seq_start(struct seq_file *seq, loff_t *pos)
	__acquires(RCU)
{
	struct ct_iter_state *st = seq->private;

	st->time_now = ktime_get_real_ns();
	rcu_read_lock();

	nf_conntrack_get_ht(&st->hash, &st->htable_size);
	return ct_get_idx(seq, *pos);
}

static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos)
{
	(*pos)++;
	return ct_get_next(s, v);
}

static void ct_seq_stop(struct seq_file *s, void *v)
	__releases(RCU)
{
	rcu_read_unlock();
}

#ifdef CONFIG_NF_CONNTRACK_SECMARK
static void ct_show_secctx(struct seq_file *s, const struct nf_conn *ct)
{
	struct lsm_context ctx;
	int ret;

	ret = security_secid_to_secctx(ct->secmark, &ctx);
	if (ret < 0)
		return;

	seq_printf(s, "secctx=%s ", ctx.context);

	security_release_secctx(&ctx);
}
#else
static inline void ct_show_secctx(struct seq_file *s, const struct nf_conn *ct)
{
}
#endif

#ifdef CONFIG_NF_CONNTRACK_ZONES
static void ct_show_zone(struct seq_file *s, const struct nf_conn *ct,
			 int dir)
{
	const struct nf_conntrack_zone *zone = nf_ct_zone(ct);

	if (zone->dir != dir)
		return;
	switch (zone->dir) {
	case NF_CT_DEFAULT_ZONE_DIR:
		seq_printf(s, "zone=%u ", zone->id);
		break;
	case NF_CT_ZONE_DIR_ORIG:
		seq_printf(s, "zone-orig=%u ", zone->id);
		break;
	case NF_CT_ZONE_DIR_REPL:
		seq_printf(s, "zone-reply=%u ", zone->id);
		break;
	default:
		break;
	}
}
#else
static inline void ct_show_zone(struct seq_file *s, const struct nf_conn *ct,
				int dir)
{
}
#endif

#ifdef CONFIG_NF_CONNTRACK_TIMESTAMP
static void ct_show_delta_time(struct seq_file *s, const struct nf_conn *ct)
{
	struct ct_iter_state *st = s->private;
	struct nf_conn_tstamp *tstamp;
	s64 delta_time;

	tstamp = nf_conn_tstamp_find(ct);
	if (tstamp) {
		delta_time = st->time_now - tstamp->start;
		if (delta_time > 0)
			delta_time = div_s64(delta_time, NSEC_PER_SEC);
		else
			delta_time = 0;

		seq_printf(s, "delta-time=%llu ",
			   (unsigned long long)delta_time);
	}
	return;
}
#else
static inline void
ct_show_delta_time(struct seq_file *s, const struct nf_conn *ct)
{
}
#endif

static const char* l3proto_name(u16 proto)
{
	switch (proto) {
	case AF_INET: return "ipv4";
	case AF_INET6: return "ipv6";
	}

	return "unknown";
}

static const char* l4proto_name(u16 proto)
{
	switch (proto) {
	case IPPROTO_ICMP: return "icmp";
	case IPPROTO_TCP: return "tcp";
	case IPPROTO_UDP: return "udp";
	case IPPROTO_DCCP: return "dccp";
	case IPPROTO_GRE: return "gre";
	case IPPROTO_SCTP: return "sctp";
	case IPPROTO_UDPLITE: return "udplite";
	case IPPROTO_ICMPV6: return "icmpv6";
	}

	return "unknown";
}

static void
seq_print_acct(struct seq_file *s, const struct nf_conn *ct, int dir)
{
	struct nf_conn_acct *acct;
	struct nf_conn_counter *counter;

	acct = nf_conn_acct_find(ct);
	if (!acct)
		return;

	counter = acct->counter;
	seq_printf(s, "packets=%llu bytes=%llu ",
		   (unsigned long long)atomic64_read(&counter[dir].packets),
		   (unsigned long long)atomic64_read(&counter[dir].bytes));
}

/* return 0 on success, 1 in case of error */
static int ct_seq_show(struct seq_file *s, void *v)
{
	struct nf_conntrack_tuple_hash *hash = v;
	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash);
	const struct nf_conntrack_l4proto *l4proto;
	struct net *net = seq_file_net(s);
	int ret = 0;

	WARN_ON(!ct);
	if (unlikely(!refcount_inc_not_zero(&ct->ct_general.use)))
		return 0;

	/* load ->status after refcount increase */
	smp_acquire__after_ctrl_dep();

	if (nf_ct_should_gc(ct)) {
		nf_ct_kill(ct);
		goto release;
	}

	/* we only want to print DIR_ORIGINAL */
	if (NF_CT_DIRECTION(hash))
		goto release;

	if (!net_eq(nf_ct_net(ct), net))
		goto release;

	l4proto = nf_ct_l4proto_find(nf_ct_protonum(ct));

	ret = -ENOSPC;
	seq_printf(s, "%-8s %u %-8s %u ",
		   l3proto_name(nf_ct_l3num(ct)), nf_ct_l3num(ct),
		   l4proto_name(l4proto->l4proto), nf_ct_protonum(ct));

	if (!test_bit(IPS_OFFLOAD_BIT, &ct->status))
		seq_printf(s, "%ld ", nf_ct_expires(ct) / HZ);

	if (l4proto->print_conntrack)
		l4proto->print_conntrack(s, ct);

	print_tuple(s, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
		    l4proto);

	ct_show_zone(s, ct, NF_CT_ZONE_DIR_ORIG);

	if (seq_has_overflowed(s))
		goto release;

	seq_print_acct(s, ct, IP_CT_DIR_ORIGINAL);

	if (!(test_bit(IPS_SEEN_REPLY_BIT, &ct->status)))
		seq_puts(s, "[UNREPLIED] ");

	print_tuple(s, &ct->tuplehash[IP_CT_DIR_REPLY].tuple, l4proto);

	ct_show_zone(s, ct, NF_CT_ZONE_DIR_REPL);

	seq_print_acct(s, ct, IP_CT_DIR_REPLY);

	if (test_bit(IPS_HW_OFFLOAD_BIT, &ct->status))
		seq_puts(s, "[HW_OFFLOAD] ");
	else if (test_bit(IPS_OFFLOAD_BIT, &ct->status))
		seq_puts(s, "[OFFLOAD] ");
	else if (test_bit(IPS_ASSURED_BIT, &ct->status))
		seq_puts(s, "[ASSURED] ");

	if (seq_has_overflowed(s))
		goto release;

#if defined(CONFIG_NF_CONNTRACK_MARK)
	seq_printf(s, "mark=%u ", READ_ONCE(ct->mark));
#endif

	ct_show_secctx(s, ct);
	ct_show_zone(s, ct, NF_CT_DEFAULT_ZONE_DIR);
	ct_show_delta_time(s, ct);

	seq_printf(s, "use=%u\n", refcount_read(&ct->ct_general.use));

	if (seq_has_overflowed(s))
		goto release;

	ret = 0;
release:
	nf_ct_put(ct);
	return ret;
}

static const struct seq_operations ct_seq_ops = {
	.start = ct_seq_start,
	.next = ct_seq_next,
	.stop = ct_seq_stop,
	.show = ct_seq_show
};

static void *ct_cpu_seq_start(struct seq_file *seq, loff_t *pos)
{
	struct net *net = seq_file_net(seq);
	int cpu;

	if (*pos == 0)
		return SEQ_START_TOKEN;

	for (cpu = *pos-1; cpu < nr_cpu_ids; ++cpu) {
		if (!cpu_possible(cpu))
			continue;
		*pos = cpu + 1;
		return per_cpu_ptr(net->ct.stat, cpu);
	}

	return NULL;
}

static void *ct_cpu_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
	struct net *net = seq_file_net(seq);
	int cpu;

	for (cpu = *pos; cpu < nr_cpu_ids; ++cpu) {
		if (!cpu_possible(cpu))
			continue;
		*pos = cpu + 1;
		return per_cpu_ptr(net->ct.stat, cpu);
	}
	(*pos)++;
	return NULL;
}

static void ct_cpu_seq_stop(struct seq_file *seq, void *v)
{
}

static int ct_cpu_seq_show(struct seq_file *seq, void *v)
{
	struct net *net = seq_file_net(seq);
	const struct ip_conntrack_stat *st = v;
	unsigned int nr_conntracks;

	if (v == SEQ_START_TOKEN) {
		seq_puts(seq, "entries clashres found new invalid ignore delete chainlength insert insert_failed drop early_drop icmp_error expect_new expect_create expect_delete search_restart\n");
		return 0;
	}

	nr_conntracks = nf_conntrack_count(net);

	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x "
			"%08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
		   nr_conntracks,
		   st->clash_resolve,
		   st->found,
		   0,
		   st->invalid,
		   0,
		   0,
		   st->chaintoolong,
		   st->insert,
		   st->insert_failed,
		   st->drop,
		   st->early_drop,
		   st->error,

		   st->expect_new,
		   st->expect_create,
		   st->expect_delete,
		   st->search_restart
		);
	return 0;
}

static const struct seq_operations ct_cpu_seq_ops = {
	.start	= ct_cpu_seq_start,
	.next	= ct_cpu_seq_next,
	.stop	= ct_cpu_seq_stop,
	.show	= ct_cpu_seq_show,
};

static int nf_conntrack_standalone_init_proc(struct net *net)
{
	struct proc_dir_entry *pde;
	kuid_t root_uid;
	kgid_t root_gid;

	pde = proc_create_net("nf_conntrack", 0440, net->proc_net, &ct_seq_ops,
			      sizeof(struct ct_iter_state));
	if (!pde)
		goto out_nf_conntrack;

	root_uid = make_kuid(net->user_ns, 0);
	root_gid = make_kgid(net->user_ns, 0);
	if (uid_valid(root_uid) && gid_valid(root_gid))
		proc_set_user(pde, root_uid, root_gid);

	pde = proc_create_net("nf_conntrack", 0444, net->proc_net_stat,
			      &ct_cpu_seq_ops, sizeof(struct seq_net_private));
	if (!pde)
		goto out_stat_nf_conntrack;
	return 0;

out_stat_nf_conntrack:
	remove_proc_entry("nf_conntrack", net->proc_net);
out_nf_conntrack:
	return -ENOMEM;
}

static void nf_conntrack_standalone_fini_proc(struct net *net)
{
	remove_proc_entry("nf_conntrack", net->proc_net_stat);
	remove_proc_entry("nf_conntrack", net->proc_net);
}
#else
static int nf_conntrack_standalone_init_proc(struct net *net)
{
	return 0;
}

static void nf_conntrack_standalone_fini_proc(struct net *net)
{
}
#endif /* CONFIG_NF_CONNTRACK_PROCFS */

u32 nf_conntrack_count(const struct net *net)
{
	const struct nf_conntrack_net *cnet = nf_ct_pernet(net);

	return atomic_read(&cnet->count);
}
EXPORT_SYMBOL_GPL(nf_conntrack_count);

/* Sysctl support */

#ifdef CONFIG_SYSCTL
/* size the user *wants to set */
static unsigned int nf_conntrack_htable_size_user __read_mostly;

static int
nf_conntrack_hash_sysctl(const struct ctl_table *table, int write,
			 void *buffer, size_t *lenp, loff_t *ppos)
{
	int ret;

	/* module_param hashsize could have changed value */
	nf_conntrack_htable_size_user = nf_conntrack_htable_size;

	ret = proc_dointvec(table, write, buffer, lenp, ppos);
	if (ret < 0 || !write)
		return ret;

	/* update ret, we might not be able to satisfy request */
	ret = nf_conntrack_hash_resize(nf_conntrack_htable_size_user);

	/* update it to the actual value used by conntrack */
	nf_conntrack_htable_size_user = nf_conntrack_htable_size;
	return ret;
}

static struct ctl_table_header *nf_ct_netfilter_header;

enum nf_ct_sysctl_index {
	NF_SYSCTL_CT_MAX,
	NF_SYSCTL_CT_COUNT,
	NF_SYSCTL_CT_BUCKETS,
	NF_SYSCTL_CT_CHECKSUM,
	NF_SYSCTL_CT_LOG_INVALID,
	NF_SYSCTL_CT_EXPECT_MAX,
	NF_SYSCTL_CT_ACCT,
#ifdef CONFIG_NF_CONNTRACK_EVENTS
	NF_SYSCTL_CT_EVENTS,
#endif
#ifdef CONFIG_NF_CONNTRACK_TIMESTAMP
	NF_SYSCTL_CT_TIMESTAMP,
#endif
	NF_SYSCTL_CT_PROTO_TIMEOUT_GENERIC,
	NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_SYN_SENT,
	NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_SYN_RECV,
	NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_ESTABLISHED,
	NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_FIN_WAIT,
	NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_CLOSE_WAIT,
	NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_LAST_ACK,
	NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_TIME_WAIT,
	NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_CLOSE,
	NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_RETRANS,
	NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_UNACK,
#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
	NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_OFFLOAD,
#endif
	NF_SYSCTL_CT_PROTO_TCP_LOOSE,
	NF_SYSCTL_CT_PROTO_TCP_LIBERAL,
	NF_SYSCTL_CT_PROTO_TCP_IGNORE_INVALID_RST,
	NF_SYSCTL_CT_PROTO_TCP_MAX_RETRANS,
	NF_SYSCTL_CT_PROTO_TIMEOUT_UDP,
	NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_STREAM,
#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
	NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD,
#endif
	NF_SYSCTL_CT_PROTO_TIMEOUT_ICMP,
	NF_SYSCTL_CT_PROTO_TIMEOUT_ICMPV6,
#ifdef CONFIG_NF_CT_PROTO_SCTP
	NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_CLOSED,
	NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_COOKIE_WAIT,
	NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_COOKIE_ECHOED,
	NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_ESTABLISHED,
	NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_SHUTDOWN_SENT,
	NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_SHUTDOWN_RECD,
	NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_SHUTDOWN_ACK_SENT,
	NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_HEARTBEAT_SENT,
#endif
#ifdef CONFIG_NF_CT_PROTO_DCCP
	NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_REQUEST,
	NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_RESPOND,
	NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_PARTOPEN,
	NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_OPEN,
	NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_CLOSEREQ,
	NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_CLOSING,
	NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_TIMEWAIT,
	NF_SYSCTL_CT_PROTO_DCCP_LOOSE,
#endif
#ifdef CONFIG_NF_CT_PROTO_GRE
	NF_SYSCTL_CT_PROTO_TIMEOUT_GRE,
	NF_SYSCTL_CT_PROTO_TIMEOUT_GRE_STREAM,
#endif

	NF_SYSCTL_CT_LAST_SYSCTL,
};

static struct ctl_table nf_ct_sysctl_table[] = {
	[NF_SYSCTL_CT_MAX] = {
		.procname	= "nf_conntrack_max",
		.data		= &nf_conntrack_max,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_INT_MAX,
	},
	[NF_SYSCTL_CT_COUNT] = {
		.procname	= "nf_conntrack_count",
		.maxlen		= sizeof(int),
		.mode		= 0444,
		.proc_handler	= proc_dointvec,
	},
	[NF_SYSCTL_CT_BUCKETS] = {
		.procname	= "nf_conntrack_buckets",
		.data		= &nf_conntrack_htable_size_user,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= nf_conntrack_hash_sysctl,
	},
	[NF_SYSCTL_CT_CHECKSUM] = {
		.procname	= "nf_conntrack_checksum",
		.data		= &init_net.ct.sysctl_checksum,
		.maxlen		= sizeof(u8),
		.mode		= 0644,
		.proc_handler	= proc_dou8vec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
	[NF_SYSCTL_CT_LOG_INVALID] = {
		.procname	= "nf_conntrack_log_invalid",
		.data		= &init_net.ct.sysctl_log_invalid,
		.maxlen		= sizeof(u8),
		.mode		= 0644,
		.proc_handler	= proc_dou8vec_minmax,
	},
	[NF_SYSCTL_CT_EXPECT_MAX] = {
		.procname	= "nf_conntrack_expect_max",
		.data		= &nf_ct_expect_max,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ONE,
		.extra2		= SYSCTL_INT_MAX,
	},
	[NF_SYSCTL_CT_ACCT] = {
		.procname	= "nf_conntrack_acct",
		.data		= &init_net.ct.sysctl_acct,
		.maxlen		= sizeof(u8),
		.mode		= 0644,
		.proc_handler	= proc_dou8vec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
#ifdef CONFIG_NF_CONNTRACK_EVENTS
	[NF_SYSCTL_CT_EVENTS] = {
		.procname	= "nf_conntrack_events",
		.data		= &init_net.ct.sysctl_events,
		.maxlen		= sizeof(u8),
		.mode		= 0644,
		.proc_handler	= proc_dou8vec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_TWO,
	},
#endif
#ifdef CONFIG_NF_CONNTRACK_TIMESTAMP
	[NF_SYSCTL_CT_TIMESTAMP] = {
		.procname	= "nf_conntrack_timestamp",
		.data		= &init_net.ct.sysctl_tstamp,
		.maxlen		= sizeof(u8),
		.mode		= 0644,
		.proc_handler	= proc_dou8vec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
#endif
	[NF_SYSCTL_CT_PROTO_TIMEOUT_GENERIC] = {
		.procname	= "nf_conntrack_generic_timeout",
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies,
	},
	[NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_SYN_SENT] = {
		.procname	= "nf_conntrack_tcp_timeout_syn_sent",
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies,
	},
	[NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_SYN_RECV] = {
		.procname	= "nf_conntrack_tcp_timeout_syn_recv",
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies,
	},
	[NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_ESTABLISHED] = {
		.procname	= "nf_conntrack_tcp_timeout_established",
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies,
	},
	[NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_FIN_WAIT] = {
		.procname	= "nf_conntrack_tcp_timeout_fin_wait",
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies,
	},
	[NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_CLOSE_WAIT] = {
		.procname	= "nf_conntrack_tcp_timeout_close_wait",
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies,
	},
	[NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_LAST_ACK] = {
		.procname	= "nf_conntrack_tcp_timeout_last_ack",
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies,
	},
	[NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_TIME_WAIT] = {
		.procname	= "nf_conntrack_tcp_timeout_time_wait",
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies,
	},
	[NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_CLOSE] = {
		.procname	= "nf_conntrack_tcp_timeout_close",
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies,
	},
	[NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_RETRANS] = {
		.procname	= "nf_conntrack_tcp_timeout_max_retrans",
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies,
	},
	[NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_UNACK] = {
		.procname	= "nf_conntrack_tcp_timeout_unacknowledged",
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies,
	},
#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
	[NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_OFFLOAD] = {
		.procname	= "nf_flowtable_tcp_timeout",
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies,
	},
#endif
	[NF_SYSCTL_CT_PROTO_TCP_LOOSE] = {
		.procname	= "nf_conntrack_tcp_loose",
		.maxlen		= sizeof(u8),
		.mode		= 0644,
		.proc_handler	= proc_dou8vec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
	[NF_SYSCTL_CT_PROTO_TCP_LIBERAL] = {
		.procname	= "nf_conntrack_tcp_be_liberal",
		.maxlen		= sizeof(u8),
		.mode		= 0644,
		.proc_handler	= proc_dou8vec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
[NF_SYSCTL_CT_PROTO_TCP_IGNORE_INVALID_RST] = { .procname = "nf_conntrack_tcp_ignore_invalid_rst", .maxlen = sizeof(u8), .mode = 0644, .proc_handler = proc_dou8vec_minmax, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, [NF_SYSCTL_CT_PROTO_TCP_MAX_RETRANS] = { .procname = "nf_conntrack_tcp_max_retrans", .maxlen = sizeof(u8), .mode = 0644, .proc_handler = proc_dou8vec_minmax, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_UDP] = { .procname = "nf_conntrack_udp_timeout", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_STREAM] = { .procname = "nf_conntrack_udp_timeout_stream", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, #if IS_ENABLED(CONFIG_NF_FLOW_TABLE) [NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD] = { .procname = "nf_flowtable_udp_timeout", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, #endif [NF_SYSCTL_CT_PROTO_TIMEOUT_ICMP] = { .procname = "nf_conntrack_icmp_timeout", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_ICMPV6] = { .procname = "nf_conntrack_icmpv6_timeout", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, #ifdef CONFIG_NF_CT_PROTO_SCTP [NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_CLOSED] = { .procname = "nf_conntrack_sctp_timeout_closed", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_COOKIE_WAIT] = { .procname = "nf_conntrack_sctp_timeout_cookie_wait", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_COOKIE_ECHOED] = { .procname = "nf_conntrack_sctp_timeout_cookie_echoed", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_ESTABLISHED] = { .procname = "nf_conntrack_sctp_timeout_established", .maxlen = 
sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_SHUTDOWN_SENT] = { .procname = "nf_conntrack_sctp_timeout_shutdown_sent", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_SHUTDOWN_RECD] = { .procname = "nf_conntrack_sctp_timeout_shutdown_recd", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_SHUTDOWN_ACK_SENT] = { .procname = "nf_conntrack_sctp_timeout_shutdown_ack_sent", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_HEARTBEAT_SENT] = { .procname = "nf_conntrack_sctp_timeout_heartbeat_sent", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, #endif #ifdef CONFIG_NF_CT_PROTO_DCCP [NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_REQUEST] = { .procname = "nf_conntrack_dccp_timeout_request", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_RESPOND] = { .procname = "nf_conntrack_dccp_timeout_respond", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_PARTOPEN] = { .procname = "nf_conntrack_dccp_timeout_partopen", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_OPEN] = { .procname = "nf_conntrack_dccp_timeout_open", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_CLOSEREQ] = { .procname = "nf_conntrack_dccp_timeout_closereq", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_CLOSING] = { .procname = "nf_conntrack_dccp_timeout_closing", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, 
}, [NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_TIMEWAIT] = { .procname = "nf_conntrack_dccp_timeout_timewait", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_DCCP_LOOSE] = { .procname = "nf_conntrack_dccp_loose", .maxlen = sizeof(u8), .mode = 0644, .proc_handler = proc_dou8vec_minmax, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, #endif #ifdef CONFIG_NF_CT_PROTO_GRE [NF_SYSCTL_CT_PROTO_TIMEOUT_GRE] = { .procname = "nf_conntrack_gre_timeout", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, [NF_SYSCTL_CT_PROTO_TIMEOUT_GRE_STREAM] = { .procname = "nf_conntrack_gre_timeout_stream", .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, #endif }; static struct ctl_table nf_ct_netfilter_table[] = { { .procname = "nf_conntrack_max", .data = &nf_conntrack_max, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_INT_MAX, }, }; static void nf_conntrack_standalone_init_tcp_sysctl(struct net *net, struct ctl_table *table) { struct nf_tcp_net *tn = nf_tcp_pernet(net); #define XASSIGN(XNAME, tn) \ table[NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_ ## XNAME].data = \ &(tn)->timeouts[TCP_CONNTRACK_ ## XNAME] XASSIGN(SYN_SENT, tn); XASSIGN(SYN_RECV, tn); XASSIGN(ESTABLISHED, tn); XASSIGN(FIN_WAIT, tn); XASSIGN(CLOSE_WAIT, tn); XASSIGN(LAST_ACK, tn); XASSIGN(TIME_WAIT, tn); XASSIGN(CLOSE, tn); XASSIGN(RETRANS, tn); XASSIGN(UNACK, tn); #undef XASSIGN #define XASSIGN(XNAME, rval) \ table[NF_SYSCTL_CT_PROTO_TCP_ ## XNAME].data = (rval) XASSIGN(LOOSE, &tn->tcp_loose); XASSIGN(LIBERAL, &tn->tcp_be_liberal); XASSIGN(MAX_RETRANS, &tn->tcp_max_retrans); XASSIGN(IGNORE_INVALID_RST, &tn->tcp_ignore_invalid_rst); #undef XASSIGN #if IS_ENABLED(CONFIG_NF_FLOW_TABLE) table[NF_SYSCTL_CT_PROTO_TIMEOUT_TCP_OFFLOAD].data = &tn->offload_timeout; #endif } static void nf_conntrack_standalone_init_sctp_sysctl(struct net *net, struct 
ctl_table *table) { #ifdef CONFIG_NF_CT_PROTO_SCTP struct nf_sctp_net *sn = nf_sctp_pernet(net); #define XASSIGN(XNAME, sn) \ table[NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_ ## XNAME].data = \ &(sn)->timeouts[SCTP_CONNTRACK_ ## XNAME] XASSIGN(CLOSED, sn); XASSIGN(COOKIE_WAIT, sn); XASSIGN(COOKIE_ECHOED, sn); XASSIGN(ESTABLISHED, sn); XASSIGN(SHUTDOWN_SENT, sn); XASSIGN(SHUTDOWN_RECD, sn); XASSIGN(SHUTDOWN_ACK_SENT, sn); XASSIGN(HEARTBEAT_SENT, sn); #undef XASSIGN #endif } static void nf_conntrack_standalone_init_dccp_sysctl(struct net *net, struct ctl_table *table) { #ifdef CONFIG_NF_CT_PROTO_DCCP struct nf_dccp_net *dn = nf_dccp_pernet(net); #define XASSIGN(XNAME, dn) \ table[NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_ ## XNAME].data = \ &(dn)->dccp_timeout[CT_DCCP_ ## XNAME] XASSIGN(REQUEST, dn); XASSIGN(RESPOND, dn); XASSIGN(PARTOPEN, dn); XASSIGN(OPEN, dn); XASSIGN(CLOSEREQ, dn); XASSIGN(CLOSING, dn); XASSIGN(TIMEWAIT, dn); #undef XASSIGN table[NF_SYSCTL_CT_PROTO_DCCP_LOOSE].data = &dn->dccp_loose; #endif } static void nf_conntrack_standalone_init_gre_sysctl(struct net *net, struct ctl_table *table) { #ifdef CONFIG_NF_CT_PROTO_GRE struct nf_gre_net *gn = nf_gre_pernet(net); table[NF_SYSCTL_CT_PROTO_TIMEOUT_GRE].data = &gn->timeouts[GRE_CT_UNREPLIED]; table[NF_SYSCTL_CT_PROTO_TIMEOUT_GRE_STREAM].data = &gn->timeouts[GRE_CT_REPLIED]; #endif } static int nf_conntrack_standalone_init_sysctl(struct net *net) { struct nf_conntrack_net *cnet = nf_ct_pernet(net); struct nf_udp_net *un = nf_udp_pernet(net); struct ctl_table *table; BUILD_BUG_ON(ARRAY_SIZE(nf_ct_sysctl_table) != NF_SYSCTL_CT_LAST_SYSCTL); table = kmemdup(nf_ct_sysctl_table, sizeof(nf_ct_sysctl_table), GFP_KERNEL); if (!table) return -ENOMEM; table[NF_SYSCTL_CT_COUNT].data = &cnet->count; table[NF_SYSCTL_CT_CHECKSUM].data = &net->ct.sysctl_checksum; table[NF_SYSCTL_CT_LOG_INVALID].data = &net->ct.sysctl_log_invalid; table[NF_SYSCTL_CT_ACCT].data = &net->ct.sysctl_acct; #ifdef CONFIG_NF_CONNTRACK_EVENTS 
table[NF_SYSCTL_CT_EVENTS].data = &net->ct.sysctl_events; #endif #ifdef CONFIG_NF_CONNTRACK_TIMESTAMP table[NF_SYSCTL_CT_TIMESTAMP].data = &net->ct.sysctl_tstamp; #endif table[NF_SYSCTL_CT_PROTO_TIMEOUT_GENERIC].data = &nf_generic_pernet(net)->timeout; table[NF_SYSCTL_CT_PROTO_TIMEOUT_ICMP].data = &nf_icmp_pernet(net)->timeout; table[NF_SYSCTL_CT_PROTO_TIMEOUT_ICMPV6].data = &nf_icmpv6_pernet(net)->timeout; table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP].data = &un->timeouts[UDP_CT_UNREPLIED]; table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_STREAM].data = &un->timeouts[UDP_CT_REPLIED]; #if IS_ENABLED(CONFIG_NF_FLOW_TABLE) table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD].data = &un->offload_timeout; #endif nf_conntrack_standalone_init_tcp_sysctl(net, table); nf_conntrack_standalone_init_sctp_sysctl(net, table); nf_conntrack_standalone_init_dccp_sysctl(net, table); nf_conntrack_standalone_init_gre_sysctl(net, table); /* Don't allow non-init_net ns to alter global sysctls */ if (!net_eq(&init_net, net)) { table[NF_SYSCTL_CT_MAX].mode = 0444; table[NF_SYSCTL_CT_EXPECT_MAX].mode = 0444; table[NF_SYSCTL_CT_BUCKETS].mode = 0444; } cnet->sysctl_header = register_net_sysctl_sz(net, "net/netfilter", table, ARRAY_SIZE(nf_ct_sysctl_table)); if (!cnet->sysctl_header) goto out_unregister_netfilter; return 0; out_unregister_netfilter: kfree(table); return -ENOMEM; } static void nf_conntrack_standalone_fini_sysctl(struct net *net) { struct nf_conntrack_net *cnet = nf_ct_pernet(net); const struct ctl_table *table; table = cnet->sysctl_header->ctl_table_arg; unregister_net_sysctl_table(cnet->sysctl_header); kfree(table); } #else static int nf_conntrack_standalone_init_sysctl(struct net *net) { return 0; } static void nf_conntrack_standalone_fini_sysctl(struct net *net) { } #endif /* CONFIG_SYSCTL */ static void nf_conntrack_fini_net(struct net *net) { if (enable_hooks) nf_ct_netns_put(net, NFPROTO_INET); nf_conntrack_standalone_fini_proc(net); nf_conntrack_standalone_fini_sysctl(net); } static int 
nf_conntrack_pernet_init(struct net *net) { int ret; net->ct.sysctl_checksum = 1; ret = nf_conntrack_standalone_init_sysctl(net); if (ret < 0) return ret; ret = nf_conntrack_standalone_init_proc(net); if (ret < 0) goto out_proc; ret = nf_conntrack_init_net(net); if (ret < 0) goto out_init_net; if (enable_hooks) { ret = nf_ct_netns_get(net, NFPROTO_INET); if (ret < 0) goto out_hooks; } return 0; out_hooks: nf_conntrack_cleanup_net(net); out_init_net: nf_conntrack_standalone_fini_proc(net); out_proc: nf_conntrack_standalone_fini_sysctl(net); return ret; } static void nf_conntrack_pernet_exit(struct list_head *net_exit_list) { struct net *net; list_for_each_entry(net, net_exit_list, exit_list) nf_conntrack_fini_net(net); nf_conntrack_cleanup_net_list(net_exit_list); } static struct pernet_operations nf_conntrack_net_ops = { .init = nf_conntrack_pernet_init, .exit_batch = nf_conntrack_pernet_exit, .id = &nf_conntrack_net_id, .size = sizeof(struct nf_conntrack_net), }; static int __init nf_conntrack_standalone_init(void) { int ret = nf_conntrack_init_start(); if (ret < 0) goto out_start; BUILD_BUG_ON(NFCT_INFOMASK <= IP_CT_NUMBER); #ifdef CONFIG_SYSCTL nf_ct_netfilter_header = register_net_sysctl(&init_net, "net", nf_ct_netfilter_table); if (!nf_ct_netfilter_header) { pr_err("nf_conntrack: can't register to sysctl.\n"); ret = -ENOMEM; goto out_sysctl; } nf_conntrack_htable_size_user = nf_conntrack_htable_size; #endif nf_conntrack_init_end(); ret = register_pernet_subsys(&nf_conntrack_net_ops); if (ret < 0) goto out_pernet; return 0; out_pernet: #ifdef CONFIG_SYSCTL unregister_net_sysctl_table(nf_ct_netfilter_header); out_sysctl: #endif nf_conntrack_cleanup_end(); out_start: return ret; } static void __exit nf_conntrack_standalone_fini(void) { nf_conntrack_cleanup_start(); unregister_pernet_subsys(&nf_conntrack_net_ops); #ifdef CONFIG_SYSCTL unregister_net_sysctl_table(nf_ct_netfilter_header); #endif nf_conntrack_cleanup_end(); } 
module_init(nf_conntrack_standalone_init); module_exit(nf_conntrack_standalone_fini); |
// SPDX-License-Identifier: GPL-2.0-only #include <linux/module.h> #include <linux/errno.h> #include <linux/socket.h> #include <linux/skbuff.h> #include <linux/ip.h> #include <linux/icmp.h> #include <linux/udp.h> #include <linux/types.h> #include <linux/kernel.h> #include <net/genetlink.h> #include <net/gro.h> #include <net/gue.h> #include <net/fou.h> #include
<net/ip.h> #include <net/protocol.h> #include <net/udp.h> #include <net/udp_tunnel.h> #include <uapi/linux/fou.h> #include <uapi/linux/genetlink.h> #include "fou_nl.h" struct fou { struct socket *sock; u8 protocol; u8 flags; __be16 port; u8 family; u16 type; struct list_head list; struct rcu_head rcu; }; #define FOU_F_REMCSUM_NOPARTIAL BIT(0) struct fou_cfg { u16 type; u8 protocol; u8 flags; struct udp_port_cfg udp_config; }; static unsigned int fou_net_id; struct fou_net { struct list_head fou_list; struct mutex fou_lock; }; static inline struct fou *fou_from_sock(struct sock *sk) { return rcu_dereference_sk_user_data(sk); } static int fou_recv_pull(struct sk_buff *skb, struct fou *fou, size_t len) { /* Remove 'len' bytes from the packet (UDP header and * FOU header if present). */ if (fou->family == AF_INET) ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len); else ipv6_hdr(skb)->payload_len = htons(ntohs(ipv6_hdr(skb)->payload_len) - len); __skb_pull(skb, len); skb_postpull_rcsum(skb, udp_hdr(skb), len); skb_reset_transport_header(skb); return iptunnel_pull_offloads(skb); } static int fou_udp_recv(struct sock *sk, struct sk_buff *skb) { struct fou *fou = fou_from_sock(sk); if (!fou) return 1; if (fou_recv_pull(skb, fou, sizeof(struct udphdr))) goto drop; return -fou->protocol; drop: kfree_skb(skb); return 0; } static struct guehdr *gue_remcsum(struct sk_buff *skb, struct guehdr *guehdr, void *data, size_t hdrlen, u8 ipproto, bool nopartial) { __be16 *pd = data; size_t start = ntohs(pd[0]); size_t offset = ntohs(pd[1]); size_t plen = sizeof(struct udphdr) + hdrlen + max_t(size_t, offset + sizeof(u16), start); if (skb->remcsum_offload) return guehdr; if (!pskb_may_pull(skb, plen)) return NULL; guehdr = (struct guehdr *)&udp_hdr(skb)[1]; skb_remcsum_process(skb, (void *)guehdr + hdrlen, start, offset, nopartial); return guehdr; } static int gue_control_message(struct sk_buff *skb, struct guehdr *guehdr) { /* No support yet */ kfree_skb(skb); return 0; } 
static int gue_udp_recv(struct sock *sk, struct sk_buff *skb) { struct fou *fou = fou_from_sock(sk); size_t len, optlen, hdrlen; struct guehdr *guehdr; void *data; u16 doffset = 0; u8 proto_ctype; if (!fou) return 1; len = sizeof(struct udphdr) + sizeof(struct guehdr); if (!pskb_may_pull(skb, len)) goto drop; guehdr = (struct guehdr *)&udp_hdr(skb)[1]; switch (guehdr->version) { case 0: /* Full GUE header present */ break; case 1: { /* Direct encapsulation of IPv4 or IPv6 */ int prot; switch (((struct iphdr *)guehdr)->version) { case 4: prot = IPPROTO_IPIP; break; case 6: prot = IPPROTO_IPV6; break; default: goto drop; } if (fou_recv_pull(skb, fou, sizeof(struct udphdr))) goto drop; return -prot; } default: /* Undefined version */ goto drop; } optlen = guehdr->hlen << 2; len += optlen; if (!pskb_may_pull(skb, len)) goto drop; /* guehdr may change after pull */ guehdr = (struct guehdr *)&udp_hdr(skb)[1]; if (validate_gue_flags(guehdr, optlen)) goto drop; hdrlen = sizeof(struct guehdr) + optlen; if (fou->family == AF_INET) ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len); else ipv6_hdr(skb)->payload_len = htons(ntohs(ipv6_hdr(skb)->payload_len) - len); /* Pull csum through the guehdr now. This can be used if * there is a remote checksum offload. 
*/ skb_postpull_rcsum(skb, udp_hdr(skb), len); data = &guehdr[1]; if (guehdr->flags & GUE_FLAG_PRIV) { __be32 flags = *(__be32 *)(data + doffset); doffset += GUE_LEN_PRIV; if (flags & GUE_PFLAG_REMCSUM) { guehdr = gue_remcsum(skb, guehdr, data + doffset, hdrlen, guehdr->proto_ctype, !!(fou->flags & FOU_F_REMCSUM_NOPARTIAL)); if (!guehdr) goto drop; data = &guehdr[1]; doffset += GUE_PLEN_REMCSUM; } } if (unlikely(guehdr->control)) return gue_control_message(skb, guehdr); proto_ctype = guehdr->proto_ctype; __skb_pull(skb, sizeof(struct udphdr) + hdrlen); skb_reset_transport_header(skb); if (iptunnel_pull_offloads(skb)) goto drop; return -proto_ctype; drop: kfree_skb(skb); return 0; } static struct sk_buff *fou_gro_receive(struct sock *sk, struct list_head *head, struct sk_buff *skb) { const struct net_offload __rcu **offloads; struct fou *fou = fou_from_sock(sk); const struct net_offload *ops; struct sk_buff *pp = NULL; u8 proto; if (!fou) goto out; proto = fou->protocol; /* We can clear the encap_mark for FOU as we are essentially doing * one of two possible things. We are either adding an L4 tunnel * header to the outer L3 tunnel header, or we are simply * treating the GRE tunnel header as though it is a UDP protocol * specific header such as VXLAN or GENEVE. */ NAPI_GRO_CB(skb)->encap_mark = 0; /* Flag this frame as already having an outer encap header */ NAPI_GRO_CB(skb)->is_fou = 1; offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads; ops = rcu_dereference(offloads[proto]); if (!ops || !ops->callbacks.gro_receive) goto out; pp = call_gro_receive(ops->callbacks.gro_receive, head, skb); out: return pp; } static int fou_gro_complete(struct sock *sk, struct sk_buff *skb, int nhoff) { const struct net_offload __rcu **offloads; struct fou *fou = fou_from_sock(sk); const struct net_offload *ops; u8 proto; int err; if (!fou) { err = -ENOENT; goto out; } proto = fou->protocol; offloads = NAPI_GRO_CB(skb)->is_ipv6 ? 
inet6_offloads : inet_offloads; ops = rcu_dereference(offloads[proto]); if (WARN_ON(!ops || !ops->callbacks.gro_complete)) { err = -ENOSYS; goto out; } err = ops->callbacks.gro_complete(skb, nhoff); skb_set_inner_mac_header(skb, nhoff); out: return err; } static struct guehdr *gue_gro_remcsum(struct sk_buff *skb, unsigned int off, struct guehdr *guehdr, void *data, size_t hdrlen, struct gro_remcsum *grc, bool nopartial) { __be16 *pd = data; size_t start = ntohs(pd[0]); size_t offset = ntohs(pd[1]); if (skb->remcsum_offload) return guehdr; if (!NAPI_GRO_CB(skb)->csum_valid) return NULL; guehdr = skb_gro_remcsum_process(skb, (void *)guehdr, off, hdrlen, start, offset, grc, nopartial); skb->remcsum_offload = 1; return guehdr; } static struct sk_buff *gue_gro_receive(struct sock *sk, struct list_head *head, struct sk_buff *skb) { const struct net_offload __rcu **offloads; const struct net_offload *ops; struct sk_buff *pp = NULL; struct sk_buff *p; struct guehdr *guehdr; size_t len, optlen, hdrlen, off; void *data; u16 doffset = 0; int flush = 1; struct fou *fou = fou_from_sock(sk); struct gro_remcsum grc; u8 proto; skb_gro_remcsum_init(&grc); if (!fou) goto out; off = skb_gro_offset(skb); len = off + sizeof(*guehdr); guehdr = skb_gro_header(skb, len, off); if (unlikely(!guehdr)) goto out; switch (guehdr->version) { case 0: break; case 1: switch (((struct iphdr *)guehdr)->version) { case 4: proto = IPPROTO_IPIP; break; case 6: proto = IPPROTO_IPV6; break; default: goto out; } goto next_proto; default: goto out; } optlen = guehdr->hlen << 2; len += optlen; if (!skb_gro_may_pull(skb, len)) { guehdr = skb_gro_header_slow(skb, len, off); if (unlikely(!guehdr)) goto out; } if (unlikely(guehdr->control) || guehdr->version != 0 || validate_gue_flags(guehdr, optlen)) goto out; hdrlen = sizeof(*guehdr) + optlen; /* Adjust NAPI_GRO_CB(skb)->csum to account for guehdr, * this is needed if there is a remote checksum offload. 
*/ skb_gro_postpull_rcsum(skb, guehdr, hdrlen); data = &guehdr[1]; if (guehdr->flags & GUE_FLAG_PRIV) { __be32 flags = *(__be32 *)(data + doffset); doffset += GUE_LEN_PRIV; if (flags & GUE_PFLAG_REMCSUM) { guehdr = gue_gro_remcsum(skb, off, guehdr, data + doffset, hdrlen, &grc, !!(fou->flags & FOU_F_REMCSUM_NOPARTIAL)); if (!guehdr) goto out; data = &guehdr[1]; doffset += GUE_PLEN_REMCSUM; } } skb_gro_pull(skb, hdrlen); list_for_each_entry(p, head, list) { const struct guehdr *guehdr2; if (!NAPI_GRO_CB(p)->same_flow) continue; guehdr2 = (struct guehdr *)(p->data + off); /* Compare base GUE header to be equal (covers * hlen, version, proto_ctype, and flags). */ if (guehdr->word != guehdr2->word) { NAPI_GRO_CB(p)->same_flow = 0; continue; } /* Check that the optional fields are the same. */ if (guehdr->hlen && memcmp(&guehdr[1], &guehdr2[1], guehdr->hlen << 2)) { NAPI_GRO_CB(p)->same_flow = 0; continue; } } proto = guehdr->proto_ctype; next_proto: /* We can clear the encap_mark for GUE as we are essentially doing * one of two possible things. We are either adding an L4 tunnel * header to the outer L3 tunnel header, or we are simply * treating the GRE tunnel header as though it is a UDP protocol * specific header such as VXLAN or GENEVE. */ NAPI_GRO_CB(skb)->encap_mark = 0; /* Flag this frame as already having an outer encap header */ NAPI_GRO_CB(skb)->is_fou = 1; offloads = NAPI_GRO_CB(skb)->is_ipv6 ? 
inet6_offloads : inet_offloads; ops = rcu_dereference(offloads[proto]); if (!ops || !ops->callbacks.gro_receive) goto out; pp = call_gro_receive(ops->callbacks.gro_receive, head, skb); flush = 0; out: skb_gro_flush_final_remcsum(skb, pp, flush, &grc); return pp; } static int gue_gro_complete(struct sock *sk, struct sk_buff *skb, int nhoff) { struct guehdr *guehdr = (struct guehdr *)(skb->data + nhoff); const struct net_offload __rcu **offloads; const struct net_offload *ops; unsigned int guehlen = 0; u8 proto; int err = -ENOENT; switch (guehdr->version) { case 0: proto = guehdr->proto_ctype; guehlen = sizeof(*guehdr) + (guehdr->hlen << 2); break; case 1: switch (((struct iphdr *)guehdr)->version) { case 4: proto = IPPROTO_IPIP; break; case 6: proto = IPPROTO_IPV6; break; default: return err; } break; default: return err; } offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads; ops = rcu_dereference(offloads[proto]); if (WARN_ON(!ops || !ops->callbacks.gro_complete)) goto out; err = ops->callbacks.gro_complete(skb, nhoff + guehlen); skb_set_inner_mac_header(skb, nhoff + guehlen); out: return err; } static bool fou_cfg_cmp(struct fou *fou, struct fou_cfg *cfg) { struct sock *sk = fou->sock->sk; struct udp_port_cfg *udp_cfg = &cfg->udp_config; if (fou->family != udp_cfg->family || fou->port != udp_cfg->local_udp_port || sk->sk_dport != udp_cfg->peer_udp_port || sk->sk_bound_dev_if != udp_cfg->bind_ifindex) return false; if (fou->family == AF_INET) { if (sk->sk_rcv_saddr != udp_cfg->local_ip.s_addr || sk->sk_daddr != udp_cfg->peer_ip.s_addr) return false; else return true; #if IS_ENABLED(CONFIG_IPV6) } else { if (ipv6_addr_cmp(&sk->sk_v6_rcv_saddr, &udp_cfg->local_ip6) || ipv6_addr_cmp(&sk->sk_v6_daddr, &udp_cfg->peer_ip6)) return false; else return true; #endif } return false; } static int fou_add_to_port_list(struct net *net, struct fou *fou, struct fou_cfg *cfg) { struct fou_net *fn = net_generic(net, fou_net_id); struct fou *fout; 
	mutex_lock(&fn->fou_lock);
	list_for_each_entry(fout, &fn->fou_list, list) {
		if (fou_cfg_cmp(fout, cfg)) {
			mutex_unlock(&fn->fou_lock);
			return -EALREADY;
		}
	}

	list_add(&fou->list, &fn->fou_list);

	mutex_unlock(&fn->fou_lock);

	return 0;
}

static void fou_release(struct fou *fou)
{
	struct socket *sock = fou->sock;

	list_del(&fou->list);
	udp_tunnel_sock_release(sock);

	kfree_rcu(fou, rcu);
}

static int fou_create(struct net *net, struct fou_cfg *cfg,
		      struct socket **sockp)
{
	struct socket *sock = NULL;
	struct fou *fou = NULL;
	struct sock *sk;
	struct udp_tunnel_sock_cfg tunnel_cfg;
	int err;

	/* Open UDP socket */
	err = udp_sock_create(net, &cfg->udp_config, &sock);
	if (err < 0)
		goto error;

	/* Allocate FOU port structure */
	fou = kzalloc(sizeof(*fou), GFP_KERNEL);
	if (!fou) {
		err = -ENOMEM;
		goto error;
	}

	sk = sock->sk;

	fou->port = cfg->udp_config.local_udp_port;
	fou->family = cfg->udp_config.family;
	fou->flags = cfg->flags;
	fou->type = cfg->type;
	fou->sock = sock;

	memset(&tunnel_cfg, 0, sizeof(tunnel_cfg));
	tunnel_cfg.encap_type = 1;
	tunnel_cfg.sk_user_data = fou;
	tunnel_cfg.encap_destroy = NULL;

	/* Initial for fou type */
	switch (cfg->type) {
	case FOU_ENCAP_DIRECT:
		tunnel_cfg.encap_rcv = fou_udp_recv;
		tunnel_cfg.gro_receive = fou_gro_receive;
		tunnel_cfg.gro_complete = fou_gro_complete;
		fou->protocol = cfg->protocol;
		break;
	case FOU_ENCAP_GUE:
		tunnel_cfg.encap_rcv = gue_udp_recv;
		tunnel_cfg.gro_receive = gue_gro_receive;
		tunnel_cfg.gro_complete = gue_gro_complete;
		break;
	default:
		err = -EINVAL;
		goto error;
	}

	setup_udp_tunnel_sock(net, sock, &tunnel_cfg);

	sk->sk_allocation = GFP_ATOMIC;

	err = fou_add_to_port_list(net, fou, cfg);
	if (err)
		goto error;

	if (sockp)
		*sockp = sock;

	return 0;

error:
	kfree(fou);
	if (sock)
		udp_tunnel_sock_release(sock);

	return err;
}

static int fou_destroy(struct net *net, struct fou_cfg *cfg)
{
	struct fou_net *fn = net_generic(net, fou_net_id);
	int err = -EINVAL;
	struct fou *fou;

	mutex_lock(&fn->fou_lock);
	list_for_each_entry(fou, &fn->fou_list, list) {
		if (fou_cfg_cmp(fou, cfg)) {
			fou_release(fou);
			err = 0;
			break;
		}
	}
	mutex_unlock(&fn->fou_lock);

	return err;
}

static struct genl_family fou_nl_family;

static int parse_nl_config(struct genl_info *info,
			   struct fou_cfg *cfg)
{
	bool has_local = false, has_peer = false;
	struct nlattr *attr;
	int ifindex;
	__be16 port;

	memset(cfg, 0, sizeof(*cfg));

	cfg->udp_config.family = AF_INET;

	if (info->attrs[FOU_ATTR_AF]) {
		u8 family = nla_get_u8(info->attrs[FOU_ATTR_AF]);

		switch (family) {
		case AF_INET:
			break;
		case AF_INET6:
			cfg->udp_config.ipv6_v6only = 1;
			break;
		default:
			return -EAFNOSUPPORT;
		}

		cfg->udp_config.family = family;
	}

	if (info->attrs[FOU_ATTR_PORT]) {
		port = nla_get_be16(info->attrs[FOU_ATTR_PORT]);
		cfg->udp_config.local_udp_port = port;
	}

	if (info->attrs[FOU_ATTR_IPPROTO])
		cfg->protocol = nla_get_u8(info->attrs[FOU_ATTR_IPPROTO]);

	if (info->attrs[FOU_ATTR_TYPE])
		cfg->type = nla_get_u8(info->attrs[FOU_ATTR_TYPE]);

	if (info->attrs[FOU_ATTR_REMCSUM_NOPARTIAL])
		cfg->flags |= FOU_F_REMCSUM_NOPARTIAL;

	if (cfg->udp_config.family == AF_INET) {
		if (info->attrs[FOU_ATTR_LOCAL_V4]) {
			attr = info->attrs[FOU_ATTR_LOCAL_V4];
			cfg->udp_config.local_ip.s_addr = nla_get_in_addr(attr);
			has_local = true;
		}

		if (info->attrs[FOU_ATTR_PEER_V4]) {
			attr = info->attrs[FOU_ATTR_PEER_V4];
			cfg->udp_config.peer_ip.s_addr = nla_get_in_addr(attr);
			has_peer = true;
		}
#if IS_ENABLED(CONFIG_IPV6)
	} else {
		if (info->attrs[FOU_ATTR_LOCAL_V6]) {
			attr = info->attrs[FOU_ATTR_LOCAL_V6];
			cfg->udp_config.local_ip6 = nla_get_in6_addr(attr);
			has_local = true;
		}

		if (info->attrs[FOU_ATTR_PEER_V6]) {
			attr = info->attrs[FOU_ATTR_PEER_V6];
			cfg->udp_config.peer_ip6 = nla_get_in6_addr(attr);
			has_peer = true;
		}
#endif
	}

	if (has_peer) {
		if (info->attrs[FOU_ATTR_PEER_PORT]) {
			port = nla_get_be16(info->attrs[FOU_ATTR_PEER_PORT]);
			cfg->udp_config.peer_udp_port = port;
		} else {
			return -EINVAL;
		}
	}

	if (info->attrs[FOU_ATTR_IFINDEX]) {
		if (!has_local)
			return -EINVAL;

		ifindex = nla_get_s32(info->attrs[FOU_ATTR_IFINDEX]);

		cfg->udp_config.bind_ifindex = ifindex;
	}

	return 0;
}

int fou_nl_add_doit(struct sk_buff *skb, struct genl_info *info)
{
	struct net *net = genl_info_net(info);
	struct fou_cfg cfg;
	int err;

	err = parse_nl_config(info, &cfg);
	if (err)
		return err;

	return fou_create(net, &cfg, NULL);
}

int fou_nl_del_doit(struct sk_buff *skb, struct genl_info *info)
{
	struct net *net = genl_info_net(info);
	struct fou_cfg cfg;
	int err;

	err = parse_nl_config(info, &cfg);
	if (err)
		return err;

	return fou_destroy(net, &cfg);
}

static int fou_fill_info(struct fou *fou, struct sk_buff *msg)
{
	struct sock *sk = fou->sock->sk;

	if (nla_put_u8(msg, FOU_ATTR_AF, fou->sock->sk->sk_family) ||
	    nla_put_be16(msg, FOU_ATTR_PORT, fou->port) ||
	    nla_put_be16(msg, FOU_ATTR_PEER_PORT, sk->sk_dport) ||
	    nla_put_u8(msg, FOU_ATTR_IPPROTO, fou->protocol) ||
	    nla_put_u8(msg, FOU_ATTR_TYPE, fou->type) ||
	    nla_put_s32(msg, FOU_ATTR_IFINDEX, sk->sk_bound_dev_if))
		return -1;

	if (fou->flags & FOU_F_REMCSUM_NOPARTIAL)
		if (nla_put_flag(msg, FOU_ATTR_REMCSUM_NOPARTIAL))
			return -1;

	if (fou->sock->sk->sk_family == AF_INET) {
		if (nla_put_in_addr(msg, FOU_ATTR_LOCAL_V4, sk->sk_rcv_saddr))
			return -1;

		if (nla_put_in_addr(msg, FOU_ATTR_PEER_V4, sk->sk_daddr))
			return -1;
#if IS_ENABLED(CONFIG_IPV6)
	} else {
		if (nla_put_in6_addr(msg, FOU_ATTR_LOCAL_V6,
				     &sk->sk_v6_rcv_saddr))
			return -1;

		if (nla_put_in6_addr(msg, FOU_ATTR_PEER_V6, &sk->sk_v6_daddr))
			return -1;
#endif
	}

	return 0;
}

static int fou_dump_info(struct fou *fou, u32 portid, u32 seq,
			 u32 flags, struct sk_buff *skb, u8 cmd)
{
	void *hdr;

	hdr = genlmsg_put(skb, portid, seq, &fou_nl_family, flags, cmd);
	if (!hdr)
		return -ENOMEM;

	if (fou_fill_info(fou, skb) < 0)
		goto nla_put_failure;

	genlmsg_end(skb, hdr);
	return 0;

nla_put_failure:
	genlmsg_cancel(skb, hdr);
	return -EMSGSIZE;
}

int fou_nl_get_doit(struct sk_buff *skb, struct genl_info *info)
{
	struct net *net = genl_info_net(info);
	struct fou_net *fn = net_generic(net, fou_net_id);
	struct sk_buff *msg;
	struct fou_cfg cfg;
	struct fou *fout;
	__be16 port;
	u8 family;
	int ret;

	ret = parse_nl_config(info, &cfg);
	if (ret)
		return ret;
	port = cfg.udp_config.local_udp_port;
	if (port == 0)
		return -EINVAL;

	family = cfg.udp_config.family;
	if (family != AF_INET && family != AF_INET6)
		return -EINVAL;

	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
	if (!msg)
		return -ENOMEM;

	ret = -ESRCH;
	mutex_lock(&fn->fou_lock);
	list_for_each_entry(fout, &fn->fou_list, list) {
		if (fou_cfg_cmp(fout, &cfg)) {
			ret = fou_dump_info(fout, info->snd_portid,
					    info->snd_seq, 0, msg,
					    info->genlhdr->cmd);
			break;
		}
	}
	mutex_unlock(&fn->fou_lock);
	if (ret < 0)
		goto out_free;

	return genlmsg_reply(msg, info);

out_free:
	nlmsg_free(msg);
	return ret;
}

int fou_nl_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
{
	struct net *net = sock_net(skb->sk);
	struct fou_net *fn = net_generic(net, fou_net_id);
	struct fou *fout;
	int idx = 0, ret;

	mutex_lock(&fn->fou_lock);
	list_for_each_entry(fout, &fn->fou_list, list) {
		if (idx++ < cb->args[0])
			continue;
		ret = fou_dump_info(fout, NETLINK_CB(cb->skb).portid,
				    cb->nlh->nlmsg_seq, NLM_F_MULTI,
				    skb, FOU_CMD_GET);
		if (ret)
			break;
	}
	mutex_unlock(&fn->fou_lock);

	cb->args[0] = idx;
	return skb->len;
}

static struct genl_family fou_nl_family __ro_after_init = {
	.hdrsize = 0,
	.name = FOU_GENL_NAME,
	.version = FOU_GENL_VERSION,
	.maxattr = FOU_ATTR_MAX,
	.policy = fou_nl_policy,
	.netnsok = true,
	.module = THIS_MODULE,
	.small_ops = fou_nl_ops,
	.n_small_ops = ARRAY_SIZE(fou_nl_ops),
	.resv_start_op = FOU_CMD_GET + 1,
};

size_t fou_encap_hlen(struct ip_tunnel_encap *e)
{
	return sizeof(struct udphdr);
}
EXPORT_SYMBOL(fou_encap_hlen);

size_t gue_encap_hlen(struct ip_tunnel_encap *e)
{
	size_t len;
	bool need_priv = false;

	len = sizeof(struct udphdr) + sizeof(struct guehdr);

	if (e->flags & TUNNEL_ENCAP_FLAG_REMCSUM) {
		len += GUE_PLEN_REMCSUM;
		need_priv = true;
	}

	len += need_priv ? GUE_LEN_PRIV : 0;

	return len;
}
EXPORT_SYMBOL(gue_encap_hlen);

int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
		       u8 *protocol, __be16 *sport, int type)
{
	int err;

	err = iptunnel_handle_offloads(skb, type);
	if (err)
		return err;

	*sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
						skb, 0, 0, false);

	return 0;
}
EXPORT_SYMBOL(__fou_build_header);

int __gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
		       u8 *protocol, __be16 *sport, int type)
{
	struct guehdr *guehdr;
	size_t hdrlen, optlen = 0;
	void *data;
	bool need_priv = false;
	int err;

	if ((e->flags & TUNNEL_ENCAP_FLAG_REMCSUM) &&
	    skb->ip_summed == CHECKSUM_PARTIAL) {
		optlen += GUE_PLEN_REMCSUM;
		type |= SKB_GSO_TUNNEL_REMCSUM;
		need_priv = true;
	}

	optlen += need_priv ? GUE_LEN_PRIV : 0;

	err = iptunnel_handle_offloads(skb, type);
	if (err)
		return err;

	/* Get source port (based on flow hash) before skb_push */
	*sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
						skb, 0, 0, false);

	hdrlen = sizeof(struct guehdr) + optlen;

	skb_push(skb, hdrlen);

	guehdr = (struct guehdr *)skb->data;

	guehdr->control = 0;
	guehdr->version = 0;
	guehdr->hlen = optlen >> 2;
	guehdr->flags = 0;
	guehdr->proto_ctype = *protocol;

	data = &guehdr[1];

	if (need_priv) {
		__be32 *flags = data;

		guehdr->flags |= GUE_FLAG_PRIV;
		*flags = 0;
		data += GUE_LEN_PRIV;

		if (type & SKB_GSO_TUNNEL_REMCSUM) {
			u16 csum_start = skb_checksum_start_offset(skb);
			__be16 *pd = data;

			if (csum_start < hdrlen)
				return -EINVAL;

			csum_start -= hdrlen;
			pd[0] = htons(csum_start);
			pd[1] = htons(csum_start + skb->csum_offset);

			if (!skb_is_gso(skb)) {
				skb->ip_summed = CHECKSUM_NONE;
				skb->encapsulation = 0;
			}

			*flags |= GUE_PFLAG_REMCSUM;
			data += GUE_PLEN_REMCSUM;
		}
	}

	return 0;
}
EXPORT_SYMBOL(__gue_build_header);

#ifdef CONFIG_NET_FOU_IP_TUNNELS

static void fou_build_udp(struct sk_buff *skb, struct ip_tunnel_encap *e,
			  struct flowi4 *fl4, u8 *protocol, __be16 sport)
{
	struct udphdr *uh;

	skb_push(skb, sizeof(struct udphdr));
	skb_reset_transport_header(skb);

	uh = udp_hdr(skb);

	uh->dest = e->dport;
	uh->source = sport;
	uh->len = htons(skb->len);
	udp_set_csum(!(e->flags & TUNNEL_ENCAP_FLAG_CSUM), skb,
		     fl4->saddr, fl4->daddr, skb->len);

	*protocol = IPPROTO_UDP;
}

static int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
			    u8 *protocol, struct flowi4 *fl4)
{
	int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM ? SKB_GSO_UDP_TUNNEL_CSUM :
						       SKB_GSO_UDP_TUNNEL;
	__be16 sport;
	int err;

	err = __fou_build_header(skb, e, protocol, &sport, type);
	if (err)
		return err;

	fou_build_udp(skb, e, fl4, protocol, sport);

	return 0;
}

static int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
			    u8 *protocol, struct flowi4 *fl4)
{
	int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM ? SKB_GSO_UDP_TUNNEL_CSUM :
						       SKB_GSO_UDP_TUNNEL;
	__be16 sport;
	int err;

	err = __gue_build_header(skb, e, protocol, &sport, type);
	if (err)
		return err;

	fou_build_udp(skb, e, fl4, protocol, sport);

	return 0;
}

static int gue_err_proto_handler(int proto, struct sk_buff *skb, u32 info)
{
	const struct net_protocol *ipprot = rcu_dereference(inet_protos[proto]);

	if (ipprot && ipprot->err_handler) {
		if (!ipprot->err_handler(skb, info))
			return 0;
	}

	return -ENOENT;
}

static int gue_err(struct sk_buff *skb, u32 info)
{
	int transport_offset = skb_transport_offset(skb);
	struct guehdr *guehdr;
	size_t len, optlen;
	int ret;

	len = sizeof(struct udphdr) + sizeof(struct guehdr);
	if (!pskb_may_pull(skb, transport_offset + len))
		return -EINVAL;

	guehdr = (struct guehdr *)&udp_hdr(skb)[1];

	switch (guehdr->version) {
	case 0: /* Full GUE header present */
		break;
	case 1: {
		/* Direct encapsulation of IPv4 or IPv6 */
		skb_set_transport_header(skb, -(int)sizeof(struct icmphdr));

		switch (((struct iphdr *)guehdr)->version) {
		case 4:
			ret = gue_err_proto_handler(IPPROTO_IPIP, skb, info);
			goto out;
#if IS_ENABLED(CONFIG_IPV6)
		case 6:
			ret = gue_err_proto_handler(IPPROTO_IPV6, skb, info);
			goto out;
#endif
		default:
			ret = -EOPNOTSUPP;
			goto out;
		}
	}
	default: /* Undefined version */
		return -EOPNOTSUPP;
	}

	if (guehdr->control)
		return -ENOENT;

	optlen = guehdr->hlen << 2;

	if (!pskb_may_pull(skb, transport_offset + len + optlen))
		return -EINVAL;

	guehdr = (struct guehdr *)&udp_hdr(skb)[1];
	if (validate_gue_flags(guehdr, optlen))
		return -EINVAL;

	/* Handling exceptions for direct UDP encapsulation in GUE would lead to
	 * recursion. Besides, this kind of encapsulation can't even be
	 * configured currently. Discard this.
	 */
	if (guehdr->proto_ctype == IPPROTO_UDP ||
	    guehdr->proto_ctype == IPPROTO_UDPLITE)
		return -EOPNOTSUPP;

	skb_set_transport_header(skb, -(int)sizeof(struct icmphdr));
	ret = gue_err_proto_handler(guehdr->proto_ctype, skb, info);

out:
	skb_set_transport_header(skb, transport_offset);
	return ret;
}

static const struct ip_tunnel_encap_ops fou_iptun_ops = {
	.encap_hlen = fou_encap_hlen,
	.build_header = fou_build_header,
	.err_handler = gue_err,
};

static const struct ip_tunnel_encap_ops gue_iptun_ops = {
	.encap_hlen = gue_encap_hlen,
	.build_header = gue_build_header,
	.err_handler = gue_err,
};

static int ip_tunnel_encap_add_fou_ops(void)
{
	int ret;

	ret = ip_tunnel_encap_add_ops(&fou_iptun_ops, TUNNEL_ENCAP_FOU);
	if (ret < 0) {
		pr_err("can't add fou ops\n");
		return ret;
	}

	ret = ip_tunnel_encap_add_ops(&gue_iptun_ops, TUNNEL_ENCAP_GUE);
	if (ret < 0) {
		pr_err("can't add gue ops\n");
		ip_tunnel_encap_del_ops(&fou_iptun_ops, TUNNEL_ENCAP_FOU);
		return ret;
	}

	return 0;
}

static void ip_tunnel_encap_del_fou_ops(void)
{
	ip_tunnel_encap_del_ops(&fou_iptun_ops, TUNNEL_ENCAP_FOU);
	ip_tunnel_encap_del_ops(&gue_iptun_ops, TUNNEL_ENCAP_GUE);
}

#else

static int ip_tunnel_encap_add_fou_ops(void)
{
	return 0;
}

static void ip_tunnel_encap_del_fou_ops(void)
{
}

#endif

static __net_init int fou_init_net(struct net *net)
{
	struct fou_net *fn = net_generic(net, fou_net_id);

	INIT_LIST_HEAD(&fn->fou_list);
	mutex_init(&fn->fou_lock);
	return 0;
}

static __net_exit void fou_exit_net(struct net *net)
{
	struct fou_net *fn = net_generic(net, fou_net_id);
	struct fou *fou, *next;

	/* Close all the FOU sockets */
	mutex_lock(&fn->fou_lock);
	list_for_each_entry_safe(fou, next, &fn->fou_list, list)
		fou_release(fou);
	mutex_unlock(&fn->fou_lock);
}

static struct pernet_operations fou_net_ops = {
	.init = fou_init_net,
	.exit = fou_exit_net,
	.id = &fou_net_id,
	.size = sizeof(struct fou_net),
};

static int __init fou_init(void)
{
	int ret;

	ret = register_pernet_device(&fou_net_ops);
	if (ret)
		goto exit;

	ret = genl_register_family(&fou_nl_family);
	if (ret < 0)
		goto unregister;

	ret = register_fou_bpf();
	if (ret < 0)
		goto kfunc_failed;

	ret = ip_tunnel_encap_add_fou_ops();
	if (ret == 0)
		return 0;

kfunc_failed:
	genl_unregister_family(&fou_nl_family);
unregister:
	unregister_pernet_device(&fou_net_ops);
exit:
	return ret;
}

static void __exit fou_fini(void)
{
	ip_tunnel_encap_del_fou_ops();
	genl_unregister_family(&fou_nl_family);
	unregister_pernet_device(&fou_net_ops);
}

module_init(fou_init);
module_exit(fou_fini);
MODULE_AUTHOR("Tom Herbert <therbert@google.com>");
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Foo over UDP");
/* SPDX-License-Identifier: GPL-2.0-or-later */
#ifndef __SOUND_INFO_H
#define __SOUND_INFO_H

/*
 *  Header file for info interface
 *  Copyright (c) by Jaroslav Kysela <perex@perex.cz>
 */

#include <linux/poll.h>
#include <linux/seq_file.h>
#include <sound/core.h>

/* buffer for information */
struct snd_info_buffer {
	char *buffer;		/* pointer to begin of buffer */
	unsigned int curr;	/* current position in buffer */
	unsigned int size;	/* current size */
	unsigned int len;	/* total length of buffer */
	int stop;		/* stop flag */
	int error;		/* error code */
};

#define SNDRV_INFO_CONTENT_TEXT		0
#define SNDRV_INFO_CONTENT_DATA		1

struct snd_info_entry;

struct snd_info_entry_text {
	void (*read)(struct snd_info_entry *entry,
		     struct snd_info_buffer *buffer);
	void (*write)(struct snd_info_entry *entry,
		      struct snd_info_buffer *buffer);
};

struct snd_info_entry_ops {
	int (*open)(struct snd_info_entry *entry,
		    unsigned short mode, void **file_private_data);
	int (*release)(struct snd_info_entry *entry,
		       unsigned short mode, void *file_private_data);
	ssize_t (*read)(struct snd_info_entry *entry, void *file_private_data,
			struct file *file, char __user *buf,
			size_t count, loff_t pos);
	ssize_t (*write)(struct snd_info_entry *entry, void *file_private_data,
			 struct file *file, const char __user *buf,
			 size_t count, loff_t pos);
	loff_t (*llseek)(struct snd_info_entry *entry,
			 void *file_private_data, struct file *file,
			 loff_t offset, int orig);
	__poll_t (*poll)(struct snd_info_entry *entry,
			 void *file_private_data, struct file *file,
			 poll_table *wait);
	int (*ioctl)(struct snd_info_entry *entry, void *file_private_data,
		     struct file *file, unsigned int cmd, unsigned long arg);
	int (*mmap)(struct snd_info_entry *entry, void *file_private_data,
		    struct inode *inode, struct file *file,
		    struct vm_area_struct *vma);
};

struct snd_info_entry {
	const char *name;
	umode_t mode;
	long size;
	unsigned short content;
	union {
		struct snd_info_entry_text text;
		const struct snd_info_entry_ops *ops;
	} c;
	struct snd_info_entry *parent;
	struct module *module;
	void *private_data;
	void (*private_free)(struct snd_info_entry *entry);
	struct proc_dir_entry *p;
	struct mutex access;
	struct list_head children;
	struct list_head list;
};

#if defined(CONFIG_SND_OSSEMUL) && defined(CONFIG_SND_PROC_FS)
int snd_info_minor_register(void);
#else
#define snd_info_minor_register()	0
#endif

#ifdef CONFIG_SND_PROC_FS

extern struct snd_info_entry *snd_seq_root;
#ifdef CONFIG_SND_OSSEMUL
extern struct snd_info_entry *snd_oss_root;
void snd_card_info_read_oss(struct snd_info_buffer *buffer);
#else
#define snd_oss_root NULL
static inline void snd_card_info_read_oss(struct snd_info_buffer *buffer) {}
#endif

/**
 * snd_iprintf - printf on the procfs buffer
 * @buf: the procfs buffer
 * @fmt: the printf format
 *
 * Outputs the string on the procfs buffer just like printf().
 *
 * Return: zero for success, or a negative error code.
 */
#define snd_iprintf(buf, fmt, args...) \
	seq_printf((struct seq_file *)(buf)->buffer, fmt, ##args)

int snd_info_init(void);
int snd_info_done(void);

int snd_info_get_line(struct snd_info_buffer *buffer, char *line, int len);
const char *snd_info_get_str(char *dest, const char *src, int len);
struct snd_info_entry *snd_info_create_module_entry(struct module *module,
						    const char *name,
						    struct snd_info_entry *parent);
struct snd_info_entry *snd_info_create_card_entry(struct snd_card *card,
						  const char *name,
						  struct snd_info_entry *parent);
void snd_info_free_entry(struct snd_info_entry *entry);

int snd_info_card_create(struct snd_card *card);
int snd_info_card_register(struct snd_card *card);
int snd_info_card_free(struct snd_card *card);
void snd_info_card_disconnect(struct snd_card *card);
void snd_info_card_id_change(struct snd_card *card);
int snd_info_register(struct snd_info_entry *entry);

/* for card drivers */
static inline int snd_card_proc_new(struct snd_card *card, const char *name,
				    struct snd_info_entry **entryp)
{
	*entryp = snd_info_create_card_entry(card, name, card->proc_root);
	return *entryp ? 0 : -ENOMEM;
}

static inline void snd_info_set_text_ops(struct snd_info_entry *entry,
	void *private_data,
	void (*read)(struct snd_info_entry *, struct snd_info_buffer *))
{
	entry->private_data = private_data;
	entry->c.text.read = read;
}

int snd_card_rw_proc_new(struct snd_card *card, const char *name,
			 void *private_data,
			 void (*read)(struct snd_info_entry *,
				      struct snd_info_buffer *),
			 void (*write)(struct snd_info_entry *entry,
				       struct snd_info_buffer *buffer));

int snd_info_check_reserved_words(const char *str);

#else

#define snd_seq_root NULL
#define snd_oss_root NULL

static inline int snd_iprintf(struct snd_info_buffer *buffer, char *fmt, ...) { return 0; }
static inline int snd_info_init(void) { return 0; }
static inline int snd_info_done(void) { return 0; }

static inline int snd_info_get_line(struct snd_info_buffer *buffer, char *line, int len) { return 0; }
static inline char *snd_info_get_str(char *dest, char *src, int len) { return NULL; }
static inline struct snd_info_entry *snd_info_create_module_entry(struct module *module, const char *name, struct snd_info_entry *parent) { return NULL; }
static inline struct snd_info_entry *snd_info_create_card_entry(struct snd_card *card, const char *name, struct snd_info_entry *parent) { return NULL; }
static inline void snd_info_free_entry(struct snd_info_entry *entry) { ; }

static inline int snd_info_card_create(struct snd_card *card) { return 0; }
static inline int snd_info_card_register(struct snd_card *card) { return 0; }
static inline int snd_info_card_free(struct snd_card *card) { return 0; }
static inline void snd_info_card_disconnect(struct snd_card *card) { }
static inline void snd_info_card_id_change(struct snd_card *card) { }
static inline int snd_info_register(struct snd_info_entry *entry) { return 0; }

static inline int snd_card_proc_new(struct snd_card *card, const char *name,
				    struct snd_info_entry **entryp) { return -EINVAL; }
static inline void snd_info_set_text_ops(struct snd_info_entry *entry __attribute__((unused)),
					 void *private_data,
					 void (*read)(struct snd_info_entry *, struct snd_info_buffer *)) {}
static inline int snd_card_rw_proc_new(struct snd_card *card, const char *name,
				       void *private_data,
				       void (*read)(struct snd_info_entry *,
						    struct snd_info_buffer *),
				       void (*write)(struct snd_info_entry *entry,
						     struct snd_info_buffer *buffer))
{
	return 0;
}
static inline int snd_info_check_reserved_words(const char *str) { return 1; }

#endif

/**
 * snd_card_ro_proc_new - Create a read-only text proc file entry for the card
 * @card: the card instance
 * @name: the file name
 * @private_data: the arbitrary private data
 * @read: the read callback
 *
 * This proc file entry will be registered via snd_card_register() call, and
 * it will be removed automatically at the card removal, too.
 */
static inline int
snd_card_ro_proc_new(struct snd_card *card, const char *name,
		     void *private_data,
		     void (*read)(struct snd_info_entry *,
				  struct snd_info_buffer *))
{
	return snd_card_rw_proc_new(card, name, private_data, read, NULL);
}

/*
 * OSS info part
 */

#if defined(CONFIG_SND_OSSEMUL) && defined(CONFIG_SND_PROC_FS)

#define SNDRV_OSS_INFO_DEV_AUDIO	0
#define SNDRV_OSS_INFO_DEV_SYNTH	1
#define SNDRV_OSS_INFO_DEV_MIDI		2
#define SNDRV_OSS_INFO_DEV_TIMERS	4
#define SNDRV_OSS_INFO_DEV_MIXERS	5

#define SNDRV_OSS_INFO_DEV_COUNT	6

int snd_oss_info_register(int dev, int num, char *string);
#define snd_oss_info_unregister(dev, num) snd_oss_info_register(dev, num, NULL)

#endif /* CONFIG_SND_OSSEMUL && CONFIG_SND_PROC_FS */

#endif /* __SOUND_INFO_H */
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * Copyright (c) 2015 Jiri Pirko <jiri@resnulli.us>
 */

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/skbuff.h>
#include <linux/rtnetlink.h>
#include <linux/filter.h>
#include <linux/bpf.h>

#include <net/netlink.h>
#include <net/sock.h>
#include <net/pkt_sched.h>
#include <net/pkt_cls.h>

#include <linux/tc_act/tc_bpf.h>
#include <net/tc_act/tc_bpf.h>
#include <net/tc_wrapper.h>

#define ACT_BPF_NAME_LEN	256

struct tcf_bpf_cfg {
	struct bpf_prog *filter;
	struct sock_filter *bpf_ops;
	const char *bpf_name;
	u16 bpf_num_ops;
	bool is_ebpf;
};

static struct tc_action_ops act_bpf_ops;

TC_INDIRECT_SCOPE int tcf_bpf_act(struct sk_buff *skb,
				  const struct tc_action *act,
				  struct tcf_result *res)
{
	bool at_ingress = skb_at_tc_ingress(skb);
	struct tcf_bpf *prog = to_bpf(act);
	struct bpf_prog *filter;
	int action, filter_res;

	tcf_lastuse_update(&prog->tcf_tm);
	bstats_update(this_cpu_ptr(prog->common.cpu_bstats), skb);

	filter = rcu_dereference(prog->filter);
	if (at_ingress) {
		__skb_push(skb, skb->mac_len);
		bpf_compute_data_pointers(skb);
		filter_res = bpf_prog_run(filter, skb);
		__skb_pull(skb, skb->mac_len);
	} else {
		bpf_compute_data_pointers(skb);
		filter_res = bpf_prog_run(filter, skb);
	}
	if (unlikely(!skb->tstamp && skb->tstamp_type))
		skb->tstamp_type = SKB_CLOCK_REALTIME;
	if (skb_sk_is_prefetched(skb) && filter_res != TC_ACT_OK)
		skb_orphan(skb);

	/* A BPF program may overwrite the default action opcode.
	 * Similarly as in cls_bpf, if filter_res == -1 we use the
	 * default action specified from tc.
	 *
	 * In case a different well-known TC_ACT opcode has been
	 * returned, it will overwrite the default one.
	 *
	 * For everything else that is unknown, TC_ACT_UNSPEC is
	 * returned.
	 */
	switch (filter_res) {
	case TC_ACT_PIPE:
	case TC_ACT_RECLASSIFY:
	case TC_ACT_OK:
	case TC_ACT_REDIRECT:
		action = filter_res;
		break;
	case TC_ACT_SHOT:
		action = filter_res;
		qstats_drop_inc(this_cpu_ptr(prog->common.cpu_qstats));
		break;
	case TC_ACT_UNSPEC:
		action = prog->tcf_action;
		break;
	default:
		action = TC_ACT_UNSPEC;
		break;
	}

	return action;
}

static bool tcf_bpf_is_ebpf(const struct tcf_bpf *prog)
{
	return !prog->bpf_ops;
}

static int tcf_bpf_dump_bpf_info(const struct tcf_bpf *prog,
				 struct sk_buff *skb)
{
	struct nlattr *nla;

	if (nla_put_u16(skb, TCA_ACT_BPF_OPS_LEN, prog->bpf_num_ops))
		return -EMSGSIZE;

	nla = nla_reserve(skb, TCA_ACT_BPF_OPS, prog->bpf_num_ops *
			  sizeof(struct sock_filter));
	if (nla == NULL)
		return -EMSGSIZE;

	memcpy(nla_data(nla), prog->bpf_ops, nla_len(nla));

	return 0;
}

static int tcf_bpf_dump_ebpf_info(const struct tcf_bpf *prog,
				  struct sk_buff *skb)
{
	struct nlattr *nla;

	if (prog->bpf_name &&
	    nla_put_string(skb, TCA_ACT_BPF_NAME, prog->bpf_name))
		return -EMSGSIZE;

	if (nla_put_u32(skb, TCA_ACT_BPF_ID, prog->filter->aux->id))
		return -EMSGSIZE;

	nla = nla_reserve(skb, TCA_ACT_BPF_TAG, sizeof(prog->filter->tag));
	if (nla == NULL)
		return -EMSGSIZE;

	memcpy(nla_data(nla), prog->filter->tag, nla_len(nla));

	return 0;
}

static int tcf_bpf_dump(struct sk_buff *skb, struct tc_action *act,
			int bind, int ref)
{
	unsigned char *tp = skb_tail_pointer(skb);
	struct tcf_bpf *prog = to_bpf(act);
	struct tc_act_bpf opt = {
		.index = prog->tcf_index,
		.refcnt = refcount_read(&prog->tcf_refcnt) - ref,
		.bindcnt = atomic_read(&prog->tcf_bindcnt) - bind,
	};
	struct tcf_t tm;
	int ret;

	spin_lock_bh(&prog->tcf_lock);
	opt.action = prog->tcf_action;
	if (nla_put(skb, TCA_ACT_BPF_PARMS, sizeof(opt), &opt))
		goto nla_put_failure;

	if (tcf_bpf_is_ebpf(prog))
		ret = tcf_bpf_dump_ebpf_info(prog, skb);
	else
		ret = tcf_bpf_dump_bpf_info(prog, skb);
	if (ret)
		goto nla_put_failure;

	tcf_tm_dump(&tm, &prog->tcf_tm);
	if (nla_put_64bit(skb, TCA_ACT_BPF_TM, sizeof(tm), &tm,
			  TCA_ACT_BPF_PAD))
		goto nla_put_failure;

	spin_unlock_bh(&prog->tcf_lock);
	return skb->len;

nla_put_failure:
	spin_unlock_bh(&prog->tcf_lock);
	nlmsg_trim(skb, tp);
	return -1;
}

static const struct nla_policy act_bpf_policy[TCA_ACT_BPF_MAX + 1] = {
	[TCA_ACT_BPF_PARMS] = { .len = sizeof(struct tc_act_bpf) },
	[TCA_ACT_BPF_FD] = { .type = NLA_U32 },
	[TCA_ACT_BPF_NAME] = { .type = NLA_NUL_STRING,
			       .len = ACT_BPF_NAME_LEN },
	[TCA_ACT_BPF_OPS_LEN] = { .type = NLA_U16 },
	[TCA_ACT_BPF_OPS] = { .type = NLA_BINARY,
			      .len = sizeof(struct sock_filter) * BPF_MAXINSNS },
};

static int tcf_bpf_init_from_ops(struct nlattr **tb, struct tcf_bpf_cfg *cfg)
{
	struct sock_filter *bpf_ops;
	struct sock_fprog_kern fprog_tmp;
	struct bpf_prog *fp;
	u16 bpf_size, bpf_num_ops;
	int ret;

	bpf_num_ops = nla_get_u16(tb[TCA_ACT_BPF_OPS_LEN]);
	if (bpf_num_ops > BPF_MAXINSNS || bpf_num_ops == 0)
		return -EINVAL;

	bpf_size = bpf_num_ops * sizeof(*bpf_ops);
	if (bpf_size != nla_len(tb[TCA_ACT_BPF_OPS]))
		return -EINVAL;

	bpf_ops = kmemdup(nla_data(tb[TCA_ACT_BPF_OPS]), bpf_size, GFP_KERNEL);
	if (bpf_ops == NULL)
		return -ENOMEM;

	fprog_tmp.len = bpf_num_ops;
	fprog_tmp.filter = bpf_ops;

	ret = bpf_prog_create(&fp, &fprog_tmp);
	if (ret < 0) {
		kfree(bpf_ops);
		return ret;
	}

	cfg->bpf_ops = bpf_ops;
	cfg->bpf_num_ops = bpf_num_ops;
	cfg->filter = fp;
	cfg->is_ebpf = false;

	return 0;
}

static int tcf_bpf_init_from_efd(struct nlattr **tb, struct tcf_bpf_cfg *cfg)
{
	struct bpf_prog *fp;
	char *name = NULL;
	u32 bpf_fd;

	bpf_fd = nla_get_u32(tb[TCA_ACT_BPF_FD]);

	fp = bpf_prog_get_type(bpf_fd, BPF_PROG_TYPE_SCHED_ACT);
	if (IS_ERR(fp))
		return PTR_ERR(fp);

	if (tb[TCA_ACT_BPF_NAME]) {
		name = nla_memdup(tb[TCA_ACT_BPF_NAME], GFP_KERNEL);
		if (!name) {
			bpf_prog_put(fp);
			return -ENOMEM;
		}
	}

	cfg->bpf_name = name;
	cfg->filter = fp;
	cfg->is_ebpf = true;

	return 0;
}

static void tcf_bpf_cfg_cleanup(const struct tcf_bpf_cfg *cfg)
{
	struct bpf_prog *filter = cfg->filter;

	if (filter) {
		if (cfg->is_ebpf)
			bpf_prog_put(filter);
		else
			bpf_prog_destroy(filter);
	}

	kfree(cfg->bpf_ops);
	kfree(cfg->bpf_name);
}

static void tcf_bpf_prog_fill_cfg(const struct tcf_bpf *prog,
				  struct tcf_bpf_cfg *cfg)
{
	cfg->is_ebpf = tcf_bpf_is_ebpf(prog);
	/* updates to prog->filter are prevented, since it's called either
	 * with tcf lock or during final cleanup in rcu callback
	 */
	cfg->filter = rcu_dereference_protected(prog->filter, 1);

	cfg->bpf_ops = prog->bpf_ops;
	cfg->bpf_name = prog->bpf_name;
}

static int tcf_bpf_init(struct net *net, struct nlattr *nla,
			struct nlattr *est, struct tc_action **act,
			struct tcf_proto *tp, u32 flags,
			struct netlink_ext_ack *extack)
{
	struct tc_action_net *tn = net_generic(net, act_bpf_ops.net_id);
	bool bind = flags & TCA_ACT_FLAGS_BIND;
	struct nlattr *tb[TCA_ACT_BPF_MAX + 1];
	struct tcf_chain *goto_ch = NULL;
	struct tcf_bpf_cfg cfg, old;
	struct tc_act_bpf *parm;
	struct tcf_bpf *prog;
	bool is_bpf, is_ebpf;
	int ret, res = 0;
	u32 index;

	if (!nla)
		return -EINVAL;

	ret = nla_parse_nested_deprecated(tb, TCA_ACT_BPF_MAX, nla,
					  act_bpf_policy, NULL);
	if (ret < 0)
		return ret;

	if (!tb[TCA_ACT_BPF_PARMS])
		return -EINVAL;

	parm = nla_data(tb[TCA_ACT_BPF_PARMS]);
	index = parm->index;
	ret = tcf_idr_check_alloc(tn, &index, act, bind);
	if (!ret) {
		ret = tcf_idr_create(tn, index, est, act,
				     &act_bpf_ops, bind, true, flags);
		if (ret < 0) {
			tcf_idr_cleanup(tn, index);
			return ret;
		}

		res = ACT_P_CREATED;
	} else if (ret > 0) {
		/* Don't override defaults. */
		if (bind)
			return ACT_P_BOUND;

		if (!(flags & TCA_ACT_FLAGS_REPLACE)) {
			tcf_idr_release(*act, bind);
			return -EEXIST;
		}
	} else {
		return ret;
	}

	ret = tcf_action_check_ctrlact(parm->action, tp, &goto_ch, extack);
	if (ret < 0)
		goto release_idr;

	is_bpf = tb[TCA_ACT_BPF_OPS_LEN] && tb[TCA_ACT_BPF_OPS];
	is_ebpf = tb[TCA_ACT_BPF_FD];

	if (is_bpf == is_ebpf) {
		ret = -EINVAL;
		goto put_chain;
	}

	memset(&cfg, 0, sizeof(cfg));

	ret = is_bpf ? tcf_bpf_init_from_ops(tb, &cfg) :
		       tcf_bpf_init_from_efd(tb, &cfg);
	if (ret < 0)
		goto put_chain;

	prog = to_bpf(*act);

	spin_lock_bh(&prog->tcf_lock);
	if (res != ACT_P_CREATED)
		tcf_bpf_prog_fill_cfg(prog, &old);

	prog->bpf_ops = cfg.bpf_ops;
	prog->bpf_name = cfg.bpf_name;

	if (cfg.bpf_num_ops)
		prog->bpf_num_ops = cfg.bpf_num_ops;

	goto_ch = tcf_action_set_ctrlact(*act, parm->action, goto_ch);
	rcu_assign_pointer(prog->filter, cfg.filter);
	spin_unlock_bh(&prog->tcf_lock);

	if (goto_ch)
		tcf_chain_put_by_act(goto_ch);

	if (res != ACT_P_CREATED) {
		/* make sure the program being replaced is no longer executing */
		synchronize_rcu();
		tcf_bpf_cfg_cleanup(&old);
	}

	return res;

put_chain:
	if (goto_ch)
		tcf_chain_put_by_act(goto_ch);

release_idr:
	tcf_idr_release(*act, bind);
	return ret;
}

static void tcf_bpf_cleanup(struct tc_action *act)
{
	struct tcf_bpf_cfg tmp;

	tcf_bpf_prog_fill_cfg(to_bpf(act), &tmp);
	tcf_bpf_cfg_cleanup(&tmp);
}

static struct tc_action_ops act_bpf_ops __read_mostly = {
	.kind = "bpf",
	.id = TCA_ID_BPF,
	.owner = THIS_MODULE,
	.act = tcf_bpf_act,
	.dump = tcf_bpf_dump,
	.cleanup = tcf_bpf_cleanup,
	.init = tcf_bpf_init,
	.size = sizeof(struct tcf_bpf),
};
MODULE_ALIAS_NET_ACT("bpf");

static __net_init int bpf_init_net(struct net *net)
{
	struct tc_action_net *tn = net_generic(net, act_bpf_ops.net_id);

	return tc_action_net_init(net, tn, &act_bpf_ops);
}

static void __net_exit bpf_exit_net(struct list_head *net_list)
{
	tc_action_net_exit(net_list, act_bpf_ops.net_id);
}

static struct pernet_operations bpf_net_ops = {
	.init = bpf_init_net,
	.exit_batch = bpf_exit_net,
	.id = &act_bpf_ops.net_id,
	.size = sizeof(struct tc_action_net),
};

static int __init bpf_init_module(void)
{
	return tcf_register_action(&act_bpf_ops, &bpf_net_ops);
}

static void __exit bpf_cleanup_module(void)
{
	tcf_unregister_action(&act_bpf_ops, &bpf_net_ops);
}

module_init(bpf_init_module);
module_exit(bpf_cleanup_module);

MODULE_AUTHOR("Jiri Pirko <jiri@resnulli.us>");
MODULE_DESCRIPTION("TC BPF based action");
MODULE_LICENSE("GPL v2");
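The opcode resolution described in the comment inside tcf_bpf_act() can be exercised outside the kernel. The following is a minimal userspace sketch, not the kernel implementation: the enum names and the function resolve_action() are hypothetical, the numeric values are assumed to mirror the TC_ACT_* constants from the uapi pkt_cls.h header, and the per-CPU drop counter that the kernel bumps for TC_ACT_SHOT is left out.

```c
/*
 * Hypothetical stand-ins for the TC_ACT_* opcodes; the values are assumed
 * to match include/uapi/linux/pkt_cls.h.
 */
enum {
	ACT_UNSPEC     = -1,
	ACT_OK         = 0,
	ACT_RECLASSIFY = 1,
	ACT_SHOT       = 2,
	ACT_PIPE       = 3,
	ACT_REDIRECT   = 7,
};

/*
 * Map the BPF program's return value to the action that is reported:
 * - a well-known opcode overrides the configured default,
 * - ACT_UNSPEC (-1) falls back to the configured default,
 * - anything else is reported as ACT_UNSPEC.
 */
int resolve_action(int filter_res, int default_action)
{
	switch (filter_res) {
	case ACT_PIPE:
	case ACT_RECLASSIFY:
	case ACT_OK:
	case ACT_REDIRECT:
	case ACT_SHOT:
		return filter_res;
	case ACT_UNSPEC:
		return default_action;
	default:
		return ACT_UNSPEC;
	}
}
```

For instance, a program returning -1 with a configured default of ACT_PIPE yields ACT_PIPE, while an unrecognised return value such as 42 yields ACT_UNSPEC.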
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * The "hash function" used as the core of the ChaCha stream cipher (RFC7539)
 *
 * Copyright (C) 2015 Martin Willi
 */

#include <linux/bug.h>
#include <linux/kernel.h>
#include <linux/export.h>
#include <linux/bitops.h>
#include <linux/string.h>
#include <linux/unaligned.h>
#include <crypto/chacha.h>

static void chacha_permute(u32 *x, int nrounds)
{
	int i;

	/* whitelist the allowed round counts */
	WARN_ON_ONCE(nrounds != 20 && nrounds != 12);

	for (i = 0; i < nrounds; i += 2) {
		x[0] += x[4]; x[12] = rol32(x[12] ^ x[0], 16);
		x[1] += x[5]; x[13] = rol32(x[13] ^ x[1], 16);
		x[2] += x[6]; x[14] = rol32(x[14] ^ x[2], 16);
		x[3] += x[7]; x[15] = rol32(x[15] ^ x[3], 16);

		x[8] += x[12]; x[4] = rol32(x[4] ^ x[8], 12);
		x[9] += x[13]; x[5] = rol32(x[5] ^ x[9], 12);
		x[10] += x[14]; x[6] = rol32(x[6] ^ x[10], 12);
		x[11] += x[15]; x[7] = rol32(x[7] ^ x[11], 12);

		x[0] += x[4]; x[12] = rol32(x[12] ^ x[0], 8);
		x[1] += x[5]; x[13] = rol32(x[13] ^ x[1], 8);
		x[2] += x[6]; x[14] = rol32(x[14] ^ x[2], 8);
		x[3] += x[7]; x[15] = rol32(x[15] ^ x[3], 8);

		x[8] += x[12]; x[4] = rol32(x[4] ^ x[8], 7);
		x[9] += x[13]; x[5] = rol32(x[5] ^ x[9], 7);
		x[10] += x[14]; x[6] = rol32(x[6] ^ x[10], 7);
		x[11] += x[15]; x[7] = rol32(x[7] ^ x[11], 7);

		x[0] += x[5]; x[15] = rol32(x[15] ^ x[0], 16);
		x[1] += x[6]; x[12] = rol32(x[12] ^ x[1], 16);
		x[2] += x[7]; x[13] = rol32(x[13] ^ x[2], 16);
		x[3] += x[4]; x[14] = rol32(x[14] ^ x[3], 16);

		x[10] += x[15]; x[5] = rol32(x[5] ^ x[10], 12);
		x[11] += x[12]; x[6] = rol32(x[6] ^ x[11], 12);
		x[8] += x[13]; x[7] = rol32(x[7] ^ x[8], 12);
		x[9] += x[14]; x[4] = rol32(x[4] ^ x[9], 12);

		x[0] += x[5]; x[15] = rol32(x[15] ^ x[0], 8);
		x[1] += x[6]; x[12] = rol32(x[12] ^ x[1], 8);
		x[2] += x[7]; x[13] = rol32(x[13] ^ x[2], 8);
		x[3] += x[4]; x[14] = rol32(x[14] ^ x[3], 8);

		x[10] += x[15]; x[5] = rol32(x[5] ^ x[10], 7);
		x[11] += x[12]; x[6] = rol32(x[6] ^ x[11], 7);
		x[8] += x[13]; x[7] = rol32(x[7] ^ x[8], 7);
		x[9] += x[14]; x[4] = rol32(x[4] ^ x[9], 7);
	}
}

/**
 * chacha_block_generic - generate one keystream block and increment block counter
 * @state: input state matrix (16 32-bit words)
 * @stream: output keystream block (64 bytes)
 * @nrounds: number of rounds (20 or 12; 20 is recommended)
 *
 * This is the ChaCha core, a function from 64-byte strings to 64-byte strings.
 * The caller has already converted the endianness of the input. This function
 * also handles incrementing the block counter in the input matrix.
 */
void chacha_block_generic(u32 *state, u8 *stream, int nrounds)
{
	u32 x[16];
	int i;

	memcpy(x, state, 64);

	chacha_permute(x, nrounds);

	for (i = 0; i < ARRAY_SIZE(x); i++)
		put_unaligned_le32(x[i] + state[i], &stream[i * sizeof(u32)]);

	state[12]++;
}
EXPORT_SYMBOL(chacha_block_generic);

/**
 * hchacha_block_generic - abbreviated ChaCha core, for XChaCha
 * @state: input state matrix (16 32-bit words)
 * @stream: output (8 32-bit words)
 * @nrounds: number of rounds (20 or 12; 20 is recommended)
 *
 * HChaCha is the ChaCha equivalent of HSalsa and is an intermediate step
 * towards XChaCha (see https://cr.yp.to/snuffle/xsalsa-20081128.pdf). HChaCha
 * skips the final addition of the initial state, and outputs only certain words
 * of the state. It should not be used for streaming directly.
 */
void hchacha_block_generic(const u32 *state, u32 *stream, int nrounds)
{
	u32 x[16];

	memcpy(x, state, 64);

	chacha_permute(x, nrounds);

	memcpy(&stream[0], &x[0], 16);
	memcpy(&stream[4], &x[12], 16);
}
EXPORT_SYMBOL(hchacha_block_generic);
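The unrolled loop body in chacha_permute() can be read as eight ChaCha quarter rounds: four on the columns of the 4x4 state, then four on its diagonals. The following is a minimal user-space sketch of that structure, not the kernel's code. The names rotl32, quarter_round, double_round and quarter_round_selftest are hypothetical helpers introduced here for illustration. The self-test uses the quarter-round test vector from RFC 7539, section 2.1.1.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32), like the kernel's rol32(). */
uint32_t rotl32(uint32_t v, unsigned int n)
{
	return (v << n) | (v >> (32 - n));
}

/* One ChaCha quarter round on the state words at indices a, b, c, d. */
void quarter_round(uint32_t x[16], int a, int b, int c, int d)
{
	x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 16);
	x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 12);
	x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 8);
	x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 7);
}

/*
 * One double round: four column rounds followed by four diagonal rounds.
 * The quarter rounds within each group touch disjoint words, so running
 * them one after another gives the same result as interleaving their steps.
 */
void double_round(uint32_t x[16])
{
	quarter_round(x, 0, 4, 8, 12);
	quarter_round(x, 1, 5, 9, 13);
	quarter_round(x, 2, 6, 10, 14);
	quarter_round(x, 3, 7, 11, 15);

	quarter_round(x, 0, 5, 10, 15);
	quarter_round(x, 1, 6, 11, 12);
	quarter_round(x, 2, 7, 8, 13);
	quarter_round(x, 3, 4, 9, 14);
}

/* Check the quarter round against the RFC 7539 section 2.1.1 test vector. */
void quarter_round_selftest(void)
{
	uint32_t x[16] = { 0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567 };

	quarter_round(x, 0, 1, 2, 3);
	assert(x[0] == 0xea2a92f4);
	assert(x[1] == 0xcb1cf8ce);
	assert(x[2] == 0x4581472e);
	assert(x[3] == 0x5881c4bb);
}
```

Under these assumptions, a 20-round permutation would call double_round() ten times, which corresponds to the loop stepping i by 2 up to nrounds.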
// SPDX-License-Identifier: GPL-2.0-only
/*
 *
 * Copyright (C) 2004 by Eric Van Hensbergen <ericvh@gmail.com>
 * Copyright (C) 2002 by Ron Minnich <rminnich@lanl.gov>
 */

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/stat.h>
#include <linux/string.h>
#include <linux/pagemap.h>
#include <linux/mount.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/statfs.h>
#include <linux/magic.h>
#include <linux/fscache.h>
#include <net/9p/9p.h>
#include <net/9p/client.h>

#include "v9fs.h"
#include "v9fs_vfs.h"
#include "fid.h"
#include "xattr.h"
#include "acl.h"

static const struct super_operations v9fs_super_ops,
v9fs_super_ops_dotl; /** * v9fs_set_super - set the superblock * @s: super block * @data: file system specific data * */ static int v9fs_set_super(struct super_block *s, void *data) { s->s_fs_info = data; return set_anon_super(s, data); } /** * v9fs_fill_super - populate superblock with info * @sb: superblock * @v9ses: session information * @flags: flags propagated from v9fs_mount() * */ static int v9fs_fill_super(struct super_block *sb, struct v9fs_session_info *v9ses, int flags) { int ret; sb->s_maxbytes = MAX_LFS_FILESIZE; sb->s_blocksize_bits = fls(v9ses->maxdata - 1); sb->s_blocksize = 1 << sb->s_blocksize_bits; sb->s_magic = V9FS_MAGIC; if (v9fs_proto_dotl(v9ses)) { sb->s_op = &v9fs_super_ops_dotl; if (!(v9ses->flags & V9FS_NO_XATTR)) sb->s_xattr = v9fs_xattr_handlers; } else { sb->s_op = &v9fs_super_ops; sb->s_time_max = U32_MAX; } sb->s_time_min = 0; ret = super_setup_bdi(sb); if (ret) return ret; if (!v9ses->cache) { sb->s_bdi->ra_pages = 0; sb->s_bdi->io_pages = 0; } else { sb->s_bdi->ra_pages = v9ses->maxdata >> PAGE_SHIFT; sb->s_bdi->io_pages = v9ses->maxdata >> PAGE_SHIFT; } sb->s_flags |= SB_ACTIVE; #ifdef CONFIG_9P_FS_POSIX_ACL if ((v9ses->flags & V9FS_ACL_MASK) == V9FS_POSIX_ACL) sb->s_flags |= SB_POSIXACL; #endif return 0; } /** * v9fs_mount - mount a superblock * @fs_type: file system type * @flags: mount flags * @dev_name: device name that was mounted * @data: mount options * */ static struct dentry *v9fs_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { struct super_block *sb = NULL; struct inode *inode = NULL; struct dentry *root = NULL; struct v9fs_session_info *v9ses = NULL; struct p9_fid *fid; int retval = 0; p9_debug(P9_DEBUG_VFS, "\n"); v9ses = kzalloc(sizeof(struct v9fs_session_info), GFP_KERNEL); if (!v9ses) return ERR_PTR(-ENOMEM); fid = v9fs_session_init(v9ses, dev_name, data); if (IS_ERR(fid)) { retval = PTR_ERR(fid); goto free_session; } sb = sget(fs_type, NULL, v9fs_set_super, flags, v9ses); if 
(IS_ERR(sb)) { retval = PTR_ERR(sb); goto clunk_fid; } retval = v9fs_fill_super(sb, v9ses, flags); if (retval) goto release_sb; if (v9ses->cache & (CACHE_META|CACHE_LOOSE)) sb->s_d_op = &v9fs_cached_dentry_operations; else sb->s_d_op = &v9fs_dentry_operations; inode = v9fs_get_new_inode_from_fid(v9ses, fid, sb); if (IS_ERR(inode)) { retval = PTR_ERR(inode); goto release_sb; } root = d_make_root(inode); if (!root) { retval = -ENOMEM; goto release_sb; } sb->s_root = root; retval = v9fs_get_acl(inode, fid); if (retval) goto release_sb; v9fs_fid_add(root, &fid); p9_debug(P9_DEBUG_VFS, " simple set mount, return 0\n"); return dget(sb->s_root); clunk_fid: p9_fid_put(fid); v9fs_session_close(v9ses); free_session: kfree(v9ses); return ERR_PTR(retval); release_sb: /* * we will do the session_close and root dentry release * in the below call. But we need to clunk fid, because we haven't * attached the fid to dentry so it won't get clunked * automatically. */ p9_fid_put(fid); deactivate_locked_super(sb); return ERR_PTR(retval); } /** * v9fs_kill_super - Kill Superblock * @s: superblock * */ static void v9fs_kill_super(struct super_block *s) { struct v9fs_session_info *v9ses = s->s_fs_info; p9_debug(P9_DEBUG_VFS, " %p\n", s); kill_anon_super(s); v9fs_session_cancel(v9ses); v9fs_session_close(v9ses); kfree(v9ses); s->s_fs_info = NULL; p9_debug(P9_DEBUG_VFS, "exiting kill_super\n"); } static void v9fs_umount_begin(struct super_block *sb) { struct v9fs_session_info *v9ses; v9ses = sb->s_fs_info; v9fs_session_begin_cancel(v9ses); } static int v9fs_statfs(struct dentry *dentry, struct kstatfs *buf) { struct v9fs_session_info *v9ses; struct p9_fid *fid; struct p9_rstatfs rs; int res; fid = v9fs_fid_lookup(dentry); if (IS_ERR(fid)) { res = PTR_ERR(fid); goto done; } v9ses = v9fs_dentry2v9ses(dentry); if (v9fs_proto_dotl(v9ses)) { res = p9_client_statfs(fid, &rs); if (res == 0) { buf->f_type = rs.type; buf->f_bsize = rs.bsize; buf->f_blocks = rs.blocks; buf->f_bfree = rs.bfree; 
buf->f_bavail = rs.bavail; buf->f_files = rs.files; buf->f_ffree = rs.ffree; buf->f_fsid = u64_to_fsid(rs.fsid); buf->f_namelen = rs.namelen; } if (res != -ENOSYS) goto done; } res = simple_statfs(dentry, buf); done: p9_fid_put(fid); return res; } static int v9fs_drop_inode(struct inode *inode) { struct v9fs_session_info *v9ses; v9ses = v9fs_inode2v9ses(inode); if (v9ses->cache & (CACHE_META|CACHE_LOOSE)) return generic_drop_inode(inode); /* * in case of non cached mode always drop the * inode because we want the inode attribute * to always match that on the server. */ return 1; } static int v9fs_write_inode(struct inode *inode, struct writeback_control *wbc) { /* * send an fsync request to server irrespective of * wbc->sync_mode. */ p9_debug(P9_DEBUG_VFS, "%s: inode %p\n", __func__, inode); return netfs_unpin_writeback(inode, wbc); } static int v9fs_write_inode_dotl(struct inode *inode, struct writeback_control *wbc) { p9_debug(P9_DEBUG_VFS, "%s: inode %p\n", __func__, inode); return netfs_unpin_writeback(inode, wbc); } static const struct super_operations v9fs_super_ops = { .alloc_inode = v9fs_alloc_inode, .free_inode = v9fs_free_inode, .statfs = simple_statfs, .drop_inode = v9fs_drop_inode, .evict_inode = v9fs_evict_inode, .show_options = v9fs_show_options, .umount_begin = v9fs_umount_begin, .write_inode = v9fs_write_inode, }; static const struct super_operations v9fs_super_ops_dotl = { .alloc_inode = v9fs_alloc_inode, .free_inode = v9fs_free_inode, .statfs = v9fs_statfs, .drop_inode = v9fs_drop_inode, .evict_inode = v9fs_evict_inode, .show_options = v9fs_show_options, .umount_begin = v9fs_umount_begin, .write_inode = v9fs_write_inode_dotl, }; struct file_system_type v9fs_fs_type = { .name = "9p", .mount = v9fs_mount, .kill_sb = v9fs_kill_super, .owner = THIS_MODULE, .fs_flags = FS_RENAME_DOES_D_MOVE, }; MODULE_ALIAS_FS("9p"); |
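v9fs_fill_super() above derives the superblock block size from the session's maxdata value: s_blocksize_bits is fls(maxdata - 1) and s_blocksize is 1 shifted left by that many bits, which rounds maxdata up to a power of two. The sketch below reproduces that arithmetic in user space. fls_sketch and blocksize_for_maxdata are hypothetical names, fls_sketch is only a stand-in for the kernel's fls(), and the sample maxdata values are illustrative rather than taken from the source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space stand-in for the kernel's fls(): returns the 1-based position
 * of the most significant set bit, or 0 when x is 0.
 */
int fls_sketch(uint32_t x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/*
 * Mirror of the two block-size assignments in v9fs_fill_super().
 * Assumes maxdata >= 1; the result is the smallest power of two that
 * is greater than or equal to maxdata.
 */
uint64_t blocksize_for_maxdata(uint32_t maxdata)
{
	int bits = fls_sketch(maxdata - 1);

	return (uint64_t)1 << bits;
}
```

For example, an illustrative maxdata of 8168 gives fls(8167) = 13 and a block size of 8192, while a maxdata that is already a power of two, such as 65536, is left unchanged.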
/*
 * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses. You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main directory of this source tree, or the
 * OpenIB.org BSD license below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 * - Redistributions of source code must retain the above
 * copyright notice, this list of conditions and the following
 * disclaimer.
 *
 * - Redistributions in binary form must reproduce the above
 * copyright notice, this list of conditions and the following
 * disclaimer in the documentation and/or other materials
 * provided with the distribution.
* * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * */ #include <linux/kernel.h> #include <linux/slab.h> #include <linux/in.h> #include <linux/module.h> #include <net/tcp.h> #include <net/net_namespace.h> #include <net/netns/generic.h> #include <net/addrconf.h> #include "rds.h" #include "tcp.h" /* only for info exporting */ static DEFINE_SPINLOCK(rds_tcp_tc_list_lock); static LIST_HEAD(rds_tcp_tc_list); /* rds_tcp_tc_count counts only IPv4 connections. * rds6_tcp_tc_count counts both IPv4 and IPv6 connections. */ static unsigned int rds_tcp_tc_count; #if IS_ENABLED(CONFIG_IPV6) static unsigned int rds6_tcp_tc_count; #endif /* Track rds_tcp_connection structs so they can be cleaned up */ static DEFINE_SPINLOCK(rds_tcp_conn_lock); static LIST_HEAD(rds_tcp_conn_list); static atomic_t rds_tcp_unloading = ATOMIC_INIT(0); static struct kmem_cache *rds_tcp_conn_slab; static int rds_tcp_sndbuf_handler(const struct ctl_table *ctl, int write, void *buffer, size_t *lenp, loff_t *fpos); static int rds_tcp_rcvbuf_handler(const struct ctl_table *ctl, int write, void *buffer, size_t *lenp, loff_t *fpos); static int rds_tcp_min_sndbuf = SOCK_MIN_SNDBUF; static int rds_tcp_min_rcvbuf = SOCK_MIN_RCVBUF; static struct ctl_table rds_tcp_sysctl_table[] = { #define RDS_TCP_SNDBUF 0 { .procname = "rds_tcp_sndbuf", /* data is per-net pointer */ .maxlen = sizeof(int), .mode = 0644, .proc_handler = rds_tcp_sndbuf_handler, .extra1 = &rds_tcp_min_sndbuf, }, #define RDS_TCP_RCVBUF 1 { .procname = "rds_tcp_rcvbuf", /* data is per-net pointer */ .maxlen = sizeof(int), .mode 
= 0644, .proc_handler = rds_tcp_rcvbuf_handler, .extra1 = &rds_tcp_min_rcvbuf, }, }; u32 rds_tcp_write_seq(struct rds_tcp_connection *tc) { /* seq# of the last byte of data in tcp send buffer */ return tcp_sk(tc->t_sock->sk)->write_seq; } u32 rds_tcp_snd_una(struct rds_tcp_connection *tc) { return tcp_sk(tc->t_sock->sk)->snd_una; } void rds_tcp_restore_callbacks(struct socket *sock, struct rds_tcp_connection *tc) { rdsdebug("restoring sock %p callbacks from tc %p\n", sock, tc); write_lock_bh(&sock->sk->sk_callback_lock); /* done under the callback_lock to serialize with write_space */ spin_lock(&rds_tcp_tc_list_lock); list_del_init(&tc->t_list_item); #if IS_ENABLED(CONFIG_IPV6) rds6_tcp_tc_count--; #endif if (!tc->t_cpath->cp_conn->c_isv6) rds_tcp_tc_count--; spin_unlock(&rds_tcp_tc_list_lock); tc->t_sock = NULL; sock->sk->sk_write_space = tc->t_orig_write_space; sock->sk->sk_data_ready = tc->t_orig_data_ready; sock->sk->sk_state_change = tc->t_orig_state_change; sock->sk->sk_user_data = NULL; write_unlock_bh(&sock->sk->sk_callback_lock); } /* * rds_tcp_reset_callbacks() switches to the new sock and * returns the existing tc->t_sock. * * The only functions that set tc->t_sock are rds_tcp_set_callbacks * and rds_tcp_reset_callbacks. Send and receive trust that * it is set. The absence of RDS_CONN_UP bit protects those paths * from being called while it isn't set. */ void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp) { struct rds_tcp_connection *tc = cp->cp_transport_data; struct socket *osock = tc->t_sock; if (!osock) goto newsock; /* Need to resolve a duelling SYN between peers. * We have an outstanding SYN to this peer, which may * potentially have transitioned to the RDS_CONN_UP state, * so we must quiesce any send threads before resetting * cp_transport_data. We quiesce these threads by setting * cp_state to something other than RDS_CONN_UP, and then * waiting for any existing threads in rds_send_xmit to * complete release_in_xmit().
(Subsequent threads entering * rds_send_xmit() will bail on !rds_conn_up().) * * However an incoming syn-ack at this point would end up * marking the conn as RDS_CONN_UP, and would again permit * rds_send_xmit() threads through, so ideally we would * synchronize on RDS_CONN_UP after lock_sock(), but cannot * do that: waiting on !RDS_IN_XMIT after lock_sock() may * end up deadlocking with tcp_sendmsg(), and the RDS_IN_XMIT * would not get set. As a result, we set cp_state to * RDS_CONN_RESETTING, to ensure that rds_tcp_state_change * cannot mark rds_conn_path_up() in the window before lock_sock() */ atomic_set(&cp->cp_state, RDS_CONN_RESETTING); wait_event(cp->cp_waitq, !test_bit(RDS_IN_XMIT, &cp->cp_flags)); /* reset receive side state for rds_tcp_data_recv() for osock */ cancel_delayed_work_sync(&cp->cp_send_w); cancel_delayed_work_sync(&cp->cp_recv_w); lock_sock(osock->sk); if (tc->t_tinc) { rds_inc_put(&tc->t_tinc->ti_inc); tc->t_tinc = NULL; } tc->t_tinc_hdr_rem = sizeof(struct rds_header); tc->t_tinc_data_rem = 0; rds_tcp_restore_callbacks(osock, tc); release_sock(osock->sk); sock_release(osock); newsock: rds_send_path_reset(cp); lock_sock(sock->sk); rds_tcp_set_callbacks(sock, cp); release_sock(sock->sk); } /* Add tc to rds_tcp_tc_list and set tc->t_sock.
See comments * above rds_tcp_reset_callbacks for notes about synchronization * with data path */ void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp) { struct rds_tcp_connection *tc = cp->cp_transport_data; rdsdebug("setting sock %p callbacks to tc %p\n", sock, tc); write_lock_bh(&sock->sk->sk_callback_lock); /* done under the callback_lock to serialize with write_space */ spin_lock(&rds_tcp_tc_list_lock); list_add_tail(&tc->t_list_item, &rds_tcp_tc_list); #if IS_ENABLED(CONFIG_IPV6) rds6_tcp_tc_count++; #endif if (!tc->t_cpath->cp_conn->c_isv6) rds_tcp_tc_count++; spin_unlock(&rds_tcp_tc_list_lock); /* accepted sockets need our listen data ready undone */ if (sock->sk->sk_data_ready == rds_tcp_listen_data_ready) sock->sk->sk_data_ready = sock->sk->sk_user_data; tc->t_sock = sock; tc->t_cpath = cp; tc->t_orig_data_ready = sock->sk->sk_data_ready; tc->t_orig_write_space = sock->sk->sk_write_space; tc->t_orig_state_change = sock->sk->sk_state_change; sock->sk->sk_user_data = cp; sock->sk->sk_data_ready = rds_tcp_data_ready; sock->sk->sk_write_space = rds_tcp_write_space; sock->sk->sk_state_change = rds_tcp_state_change; write_unlock_bh(&sock->sk->sk_callback_lock); } /* Handle RDS_INFO_TCP_SOCKETS socket option. It only returns IPv4 * connections for backward compatibility. 
*/ static void rds_tcp_tc_info(struct socket *rds_sock, unsigned int len, struct rds_info_iterator *iter, struct rds_info_lengths *lens) { struct rds_info_tcp_socket tsinfo; struct rds_tcp_connection *tc; unsigned long flags; spin_lock_irqsave(&rds_tcp_tc_list_lock, flags); if (len / sizeof(tsinfo) < rds_tcp_tc_count) goto out; list_for_each_entry(tc, &rds_tcp_tc_list, t_list_item) { struct inet_sock *inet = inet_sk(tc->t_sock->sk); if (tc->t_cpath->cp_conn->c_isv6) continue; tsinfo.local_addr = inet->inet_saddr; tsinfo.local_port = inet->inet_sport; tsinfo.peer_addr = inet->inet_daddr; tsinfo.peer_port = inet->inet_dport; tsinfo.hdr_rem = tc->t_tinc_hdr_rem; tsinfo.data_rem = tc->t_tinc_data_rem; tsinfo.last_sent_nxt = tc->t_last_sent_nxt; tsinfo.last_expected_una = tc->t_last_expected_una; tsinfo.last_seen_una = tc->t_last_seen_una; tsinfo.tos = tc->t_cpath->cp_conn->c_tos; rds_info_copy(iter, &tsinfo, sizeof(tsinfo)); } out: lens->nr = rds_tcp_tc_count; lens->each = sizeof(tsinfo); spin_unlock_irqrestore(&rds_tcp_tc_list_lock, flags); } #if IS_ENABLED(CONFIG_IPV6) /* Handle RDS6_INFO_TCP_SOCKETS socket option. It returns both IPv4 and * IPv6 connections. IPv4 connection address is returned in an IPv4 mapped * address. 
*/ static void rds6_tcp_tc_info(struct socket *sock, unsigned int len, struct rds_info_iterator *iter, struct rds_info_lengths *lens) { struct rds6_info_tcp_socket tsinfo6; struct rds_tcp_connection *tc; unsigned long flags; spin_lock_irqsave(&rds_tcp_tc_list_lock, flags); if (len / sizeof(tsinfo6) < rds6_tcp_tc_count) goto out; list_for_each_entry(tc, &rds_tcp_tc_list, t_list_item) { struct sock *sk = tc->t_sock->sk; struct inet_sock *inet = inet_sk(sk); tsinfo6.local_addr = sk->sk_v6_rcv_saddr; tsinfo6.local_port = inet->inet_sport; tsinfo6.peer_addr = sk->sk_v6_daddr; tsinfo6.peer_port = inet->inet_dport; tsinfo6.hdr_rem = tc->t_tinc_hdr_rem; tsinfo6.data_rem = tc->t_tinc_data_rem; tsinfo6.last_sent_nxt = tc->t_last_sent_nxt; tsinfo6.last_expected_una = tc->t_last_expected_una; tsinfo6.last_seen_una = tc->t_last_seen_una; rds_info_copy(iter, &tsinfo6, sizeof(tsinfo6)); } out: lens->nr = rds6_tcp_tc_count; lens->each = sizeof(tsinfo6); spin_unlock_irqrestore(&rds_tcp_tc_list_lock, flags); } #endif int rds_tcp_laddr_check(struct net *net, const struct in6_addr *addr, __u32 scope_id) { struct net_device *dev = NULL; #if IS_ENABLED(CONFIG_IPV6) int ret; #endif if (ipv6_addr_v4mapped(addr)) { if (inet_addr_type(net, addr->s6_addr32[3]) == RTN_LOCAL) return 0; return -EADDRNOTAVAIL; } /* If the scope_id is specified, check only those addresses * hosted on the specified interface. */ if (scope_id != 0) { rcu_read_lock(); dev = dev_get_by_index_rcu(net, scope_id); /* scope_id is not valid... 
*/ if (!dev) { rcu_read_unlock(); return -EADDRNOTAVAIL; } rcu_read_unlock(); } #if IS_ENABLED(CONFIG_IPV6) ret = ipv6_chk_addr(net, addr, dev, 0); if (ret) return 0; #endif return -EADDRNOTAVAIL; } static void rds_tcp_conn_free(void *arg) { struct rds_tcp_connection *tc = arg; unsigned long flags; rdsdebug("freeing tc %p\n", tc); spin_lock_irqsave(&rds_tcp_conn_lock, flags); if (!tc->t_tcp_node_detached) list_del(&tc->t_tcp_node); spin_unlock_irqrestore(&rds_tcp_conn_lock, flags); kmem_cache_free(rds_tcp_conn_slab, tc); } static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp) { struct rds_tcp_connection *tc; int i, j; int ret = 0; for (i = 0; i < RDS_MPATH_WORKERS; i++) { tc = kmem_cache_alloc(rds_tcp_conn_slab, gfp); if (!tc) { ret = -ENOMEM; goto fail; } mutex_init(&tc->t_conn_path_lock); tc->t_sock = NULL; tc->t_tinc = NULL; tc->t_tinc_hdr_rem = sizeof(struct rds_header); tc->t_tinc_data_rem = 0; conn->c_path[i].cp_transport_data = tc; tc->t_cpath = &conn->c_path[i]; tc->t_tcp_node_detached = true; rdsdebug("rds_conn_path [%d] tc %p\n", i, conn->c_path[i].cp_transport_data); } spin_lock_irq(&rds_tcp_conn_lock); for (i = 0; i < RDS_MPATH_WORKERS; i++) { tc = conn->c_path[i].cp_transport_data; tc->t_tcp_node_detached = false; list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list); } spin_unlock_irq(&rds_tcp_conn_lock); fail: if (ret) { for (j = 0; j < i; j++) rds_tcp_conn_free(conn->c_path[j].cp_transport_data); } return ret; } static bool list_has_conn(struct list_head *list, struct rds_connection *conn) { struct rds_tcp_connection *tc, *_tc; list_for_each_entry_safe(tc, _tc, list, t_tcp_node) { if (tc->t_cpath->cp_conn == conn) return true; } return false; } static void rds_tcp_set_unloading(void) { atomic_set(&rds_tcp_unloading, 1); } static bool rds_tcp_is_unloading(struct rds_connection *conn) { return atomic_read(&rds_tcp_unloading) != 0; } static void rds_tcp_destroy_conns(void) { struct rds_tcp_connection *tc, *_tc; LIST_HEAD(tmp_list); /* 
avoid calling conn_destroy with irqs off */ spin_lock_irq(&rds_tcp_conn_lock); list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) { if (!list_has_conn(&tmp_list, tc->t_cpath->cp_conn)) list_move_tail(&tc->t_tcp_node, &tmp_list); } spin_unlock_irq(&rds_tcp_conn_lock); list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node) rds_conn_destroy(tc->t_cpath->cp_conn); } static void rds_tcp_exit(void); static u8 rds_tcp_get_tos_map(u8 tos) { /* all user tos mapped to default 0 for TCP transport */ return 0; } struct rds_transport rds_tcp_transport = { .laddr_check = rds_tcp_laddr_check, .xmit_path_prepare = rds_tcp_xmit_path_prepare, .xmit_path_complete = rds_tcp_xmit_path_complete, .xmit = rds_tcp_xmit, .recv_path = rds_tcp_recv_path, .conn_alloc = rds_tcp_conn_alloc, .conn_free = rds_tcp_conn_free, .conn_path_connect = rds_tcp_conn_path_connect, .conn_path_shutdown = rds_tcp_conn_path_shutdown, .inc_copy_to_user = rds_tcp_inc_copy_to_user, .inc_free = rds_tcp_inc_free, .stats_info_copy = rds_tcp_stats_info_copy, .exit = rds_tcp_exit, .get_tos_map = rds_tcp_get_tos_map, .t_owner = THIS_MODULE, .t_name = "tcp", .t_type = RDS_TRANS_TCP, .t_prefer_loopback = 1, .t_mp_capable = 1, .t_unloading = rds_tcp_is_unloading, }; static unsigned int rds_tcp_netid; /* per-network namespace private data for this module */ struct rds_tcp_net { struct socket *rds_tcp_listen_sock; struct work_struct rds_tcp_accept_w; struct ctl_table_header *rds_tcp_sysctl; struct ctl_table *ctl_table; int sndbuf_size; int rcvbuf_size; }; /* All module specific customizations to the RDS-TCP socket should be done in * rds_tcp_tune() and applied after socket creation. */ bool rds_tcp_tune(struct socket *sock) { struct sock *sk = sock->sk; struct net *net = sock_net(sk); struct rds_tcp_net *rtn; tcp_sock_set_nodelay(sock->sk); lock_sock(sk); /* TCP timer functions might access net namespace even after * a process which created this net namespace terminated. 
*/ if (!sk->sk_net_refcnt) { if (!maybe_get_net(net)) { release_sock(sk); return false; } sk_net_refcnt_upgrade(sk); put_net(net); } rtn = net_generic(net, rds_tcp_netid); if (rtn->sndbuf_size > 0) { sk->sk_sndbuf = rtn->sndbuf_size; sk->sk_userlocks |= SOCK_SNDBUF_LOCK; } if (rtn->rcvbuf_size > 0) { sk->sk_rcvbuf = rtn->rcvbuf_size; sk->sk_userlocks |= SOCK_RCVBUF_LOCK; } release_sock(sk); return true; } static void rds_tcp_accept_worker(struct work_struct *work) { struct rds_tcp_net *rtn = container_of(work, struct rds_tcp_net, rds_tcp_accept_w); while (rds_tcp_accept_one(rtn->rds_tcp_listen_sock) == 0) cond_resched(); } void rds_tcp_accept_work(struct sock *sk) { struct net *net = sock_net(sk); struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); queue_work(rds_wq, &rtn->rds_tcp_accept_w); } static __net_init int rds_tcp_init_net(struct net *net) { struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); struct ctl_table *tbl; int err = 0; memset(rtn, 0, sizeof(*rtn)); /* {snd, rcv}buf_size default to 0, which implies we let the * stack pick the value, and permit auto-tuning of buffer size. 
*/ if (net == &init_net) { tbl = rds_tcp_sysctl_table; } else { tbl = kmemdup(rds_tcp_sysctl_table, sizeof(rds_tcp_sysctl_table), GFP_KERNEL); if (!tbl) { pr_warn("could not set allocate sysctl table\n"); return -ENOMEM; } rtn->ctl_table = tbl; } tbl[RDS_TCP_SNDBUF].data = &rtn->sndbuf_size; tbl[RDS_TCP_RCVBUF].data = &rtn->rcvbuf_size; rtn->rds_tcp_sysctl = register_net_sysctl_sz(net, "net/rds/tcp", tbl, ARRAY_SIZE(rds_tcp_sysctl_table)); if (!rtn->rds_tcp_sysctl) { pr_warn("could not register sysctl\n"); err = -ENOMEM; goto fail; } #if IS_ENABLED(CONFIG_IPV6) rtn->rds_tcp_listen_sock = rds_tcp_listen_init(net, true); #else rtn->rds_tcp_listen_sock = rds_tcp_listen_init(net, false); #endif if (!rtn->rds_tcp_listen_sock) { pr_warn("could not set up IPv6 listen sock\n"); #if IS_ENABLED(CONFIG_IPV6) /* Try IPv4 as some systems disable IPv6 */ rtn->rds_tcp_listen_sock = rds_tcp_listen_init(net, false); if (!rtn->rds_tcp_listen_sock) { #endif unregister_net_sysctl_table(rtn->rds_tcp_sysctl); rtn->rds_tcp_sysctl = NULL; err = -EAFNOSUPPORT; goto fail; #if IS_ENABLED(CONFIG_IPV6) } #endif } INIT_WORK(&rtn->rds_tcp_accept_w, rds_tcp_accept_worker); return 0; fail: if (net != &init_net) kfree(tbl); return err; } static void rds_tcp_kill_sock(struct net *net) { struct rds_tcp_connection *tc, *_tc; LIST_HEAD(tmp_list); struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); struct socket *lsock = rtn->rds_tcp_listen_sock; rtn->rds_tcp_listen_sock = NULL; rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w); spin_lock_irq(&rds_tcp_conn_lock); list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) { struct net *c_net = read_pnet(&tc->t_cpath->cp_conn->c_net); if (net != c_net) continue; if (!list_has_conn(&tmp_list, tc->t_cpath->cp_conn)) { list_move_tail(&tc->t_tcp_node, &tmp_list); } else { list_del(&tc->t_tcp_node); tc->t_tcp_node_detached = true; } } spin_unlock_irq(&rds_tcp_conn_lock); list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node) 
rds_conn_destroy(tc->t_cpath->cp_conn); } static void __net_exit rds_tcp_exit_net(struct net *net) { struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); rds_tcp_kill_sock(net); if (rtn->rds_tcp_sysctl) unregister_net_sysctl_table(rtn->rds_tcp_sysctl); if (net != &init_net) kfree(rtn->ctl_table); } static struct pernet_operations rds_tcp_net_ops = { .init = rds_tcp_init_net, .exit = rds_tcp_exit_net, .id = &rds_tcp_netid, .size = sizeof(struct rds_tcp_net), }; void *rds_tcp_listen_sock_def_readable(struct net *net) { struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); struct socket *lsock = rtn->rds_tcp_listen_sock; if (!lsock) return NULL; return lsock->sk->sk_user_data; } /* When sysctl is used to modify some kernel socket parameters, this * function resets the RDS connections in that netns so that we can * restart with new parameters. The assumption is that such reset * events are few and far-between. */ static void rds_tcp_sysctl_reset(struct net *net) { struct rds_tcp_connection *tc, *_tc; spin_lock_irq(&rds_tcp_conn_lock); list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) { struct net *c_net = read_pnet(&tc->t_cpath->cp_conn->c_net); if (net != c_net || !tc->t_sock) continue; /* reconnect with new parameters */ rds_conn_path_drop(tc->t_cpath, false); } spin_unlock_irq(&rds_tcp_conn_lock); } static int rds_tcp_skbuf_handler(struct rds_tcp_net *rtn, const struct ctl_table *ctl, int write, void *buffer, size_t *lenp, loff_t *fpos) { int err; err = proc_dointvec_minmax(ctl, write, buffer, lenp, fpos); if (err < 0) { pr_warn("Invalid input.
Must be >= %d\n", *(int *)(ctl->extra1)); return err; } if (write && rtn->rds_tcp_listen_sock && rtn->rds_tcp_listen_sock->sk) { struct net *net = sock_net(rtn->rds_tcp_listen_sock->sk); rds_tcp_sysctl_reset(net); } return 0; } static int rds_tcp_sndbuf_handler(const struct ctl_table *ctl, int write, void *buffer, size_t *lenp, loff_t *fpos) { struct rds_tcp_net *rtn = container_of(ctl->data, struct rds_tcp_net, sndbuf_size); return rds_tcp_skbuf_handler(rtn, ctl, write, buffer, lenp, fpos); } static int rds_tcp_rcvbuf_handler(const struct ctl_table *ctl, int write, void *buffer, size_t *lenp, loff_t *fpos) { struct rds_tcp_net *rtn = container_of(ctl->data, struct rds_tcp_net, rcvbuf_size); return rds_tcp_skbuf_handler(rtn, ctl, write, buffer, lenp, fpos); } static void rds_tcp_exit(void) { rds_tcp_set_unloading(); synchronize_rcu(); rds_info_deregister_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info); #if IS_ENABLED(CONFIG_IPV6) rds_info_deregister_func(RDS6_INFO_TCP_SOCKETS, rds6_tcp_tc_info); #endif unregister_pernet_device(&rds_tcp_net_ops); rds_tcp_destroy_conns(); rds_trans_unregister(&rds_tcp_transport); rds_tcp_recv_exit(); kmem_cache_destroy(rds_tcp_conn_slab); } module_exit(rds_tcp_exit); static int __init rds_tcp_init(void) { int ret; rds_tcp_conn_slab = KMEM_CACHE(rds_tcp_connection, 0); if (!rds_tcp_conn_slab) { ret = -ENOMEM; goto out; } ret = rds_tcp_recv_init(); if (ret) goto out_slab; ret = register_pernet_device(&rds_tcp_net_ops); if (ret) goto out_recv; rds_trans_register(&rds_tcp_transport); rds_info_register_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info); #if IS_ENABLED(CONFIG_IPV6) rds_info_register_func(RDS6_INFO_TCP_SOCKETS, rds6_tcp_tc_info); #endif goto out; out_recv: rds_tcp_recv_exit(); out_slab: kmem_cache_destroy(rds_tcp_conn_slab); out: return ret; } module_init(rds_tcp_init); MODULE_AUTHOR("Oracle Corporation <rds-devel@oss.oracle.com>"); MODULE_DESCRIPTION("RDS: TCP transport"); MODULE_LICENSE("Dual BSD/GPL"); |
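The two sysctl handlers above, rds_tcp_sndbuf_handler() and rds_tcp_rcvbuf_handler(), recover the per-namespace struct rds_tcp_net from ctl->data, which rds_tcp_init_net() points at the sndbuf_size or rcvbuf_size member of that struct. The user-space sketch below illustrates the pointer arithmetic behind that pattern. container_of_sketch is a simplified stand-in for the kernel's container_of() without its type checking, and struct tcp_net_sketch, net_from_sndbuf and net_from_rcvbuf are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's container_of(); no type checking. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, trimmed-down counterpart of struct rds_tcp_net. */
struct tcp_net_sketch {
	void *listen_sock;
	int sndbuf_size;
	int rcvbuf_size;
};

/*
 * Recover the enclosing struct from a pointer to its sndbuf_size member,
 * the way rds_tcp_sndbuf_handler() does with ctl->data.
 */
struct tcp_net_sketch *net_from_sndbuf(void *data)
{
	return container_of_sketch(data, struct tcp_net_sketch, sndbuf_size);
}

/* Same idea for the rcvbuf_size member. */
struct tcp_net_sketch *net_from_rcvbuf(void *data)
{
	return container_of_sketch(data, struct tcp_net_sketch, rcvbuf_size);
}
```

This only works because each table entry's data pointer refers to a member embedded in the per-namespace struct, so the handler needs no separate back-pointer to find it.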
// SPDX-License-Identifier: GPL-2.0
/*
 * NVMe over Fabrics common host code.
 * Copyright (c) 2015-2016 HGST, a Western Digital Company.
 */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/init.h>
#include <linux/miscdevice.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/parser.h>
#include <linux/seq_file.h>
#include "nvme.h"
#include "fabrics.h"
#include <linux/nvme-keyring.h>

static LIST_HEAD(nvmf_transports);
static DECLARE_RWSEM(nvmf_transports_rwsem);

static LIST_HEAD(nvmf_hosts);
static DEFINE_MUTEX(nvmf_hosts_mutex);

static struct nvmf_host *nvmf_default_host;

static struct nvmf_host *nvmf_host_alloc(const char *hostnqn, uuid_t *id)
{
	struct nvmf_host *host;

	host = kmalloc(sizeof(*host), GFP_KERNEL);
	if (!host)
		return NULL;

	kref_init(&host->ref);
	uuid_copy(&host->id, id);
	strscpy(host->nqn, hostnqn, NVMF_NQN_SIZE);

	return host;
}

static struct nvmf_host *nvmf_host_add(const char *hostnqn, uuid_t *id)
{
	struct nvmf_host *host;

	mutex_lock(&nvmf_hosts_mutex);

	/*
	 * We have defined a host as how it is perceived by the target.
	 * Therefore, we don't allow different Host NQNs with the same Host ID.
* Similarly, we do not allow the usage of the same Host NQN with * different Host IDs. This'll maintain unambiguous host identification. */ list_for_each_entry(host, &nvmf_hosts, list) { bool same_hostnqn = !strcmp(host->nqn, hostnqn); bool same_hostid = uuid_equal(&host->id, id); if (same_hostnqn && same_hostid) { kref_get(&host->ref); goto out_unlock; } if (same_hostnqn) { pr_err("found same hostnqn %s but different hostid %pUb\n", hostnqn, id); host = ERR_PTR(-EINVAL); goto out_unlock; } if (same_hostid) { pr_err("found same hostid %pUb but different hostnqn %s\n", id, hostnqn); host = ERR_PTR(-EINVAL); goto out_unlock; } } host = nvmf_host_alloc(hostnqn, id); if (!host) { host = ERR_PTR(-ENOMEM); goto out_unlock; } list_add_tail(&host->list, &nvmf_hosts); out_unlock: mutex_unlock(&nvmf_hosts_mutex); return host; } static struct nvmf_host *nvmf_host_default(void) { struct nvmf_host *host; char nqn[NVMF_NQN_SIZE]; uuid_t id; uuid_gen(&id); snprintf(nqn, NVMF_NQN_SIZE, "nqn.2014-08.org.nvmexpress:uuid:%pUb", &id); host = nvmf_host_alloc(nqn, &id); if (!host) return NULL; mutex_lock(&nvmf_hosts_mutex); list_add_tail(&host->list, &nvmf_hosts); mutex_unlock(&nvmf_hosts_mutex); return host; } static void nvmf_host_destroy(struct kref *ref) { struct nvmf_host *host = container_of(ref, struct nvmf_host, ref); mutex_lock(&nvmf_hosts_mutex); list_del(&host->list); mutex_unlock(&nvmf_hosts_mutex); kfree(host); } static void nvmf_host_put(struct nvmf_host *host) { if (host) kref_put(&host->ref, nvmf_host_destroy); } /** * nvmf_get_address() - Get address/port * @ctrl: Host NVMe controller instance which we got the address * @buf: OUTPUT parameter that will contain the address/port * @size: buffer size */ int nvmf_get_address(struct nvme_ctrl *ctrl, char *buf, int size) { int len = 0; if (ctrl->opts->mask & NVMF_OPT_TRADDR) len += scnprintf(buf, size, "traddr=%s", ctrl->opts->traddr); if (ctrl->opts->mask & NVMF_OPT_TRSVCID) len += scnprintf(buf + len, size - len, 
"%strsvcid=%s", (len) ? "," : "", ctrl->opts->trsvcid); if (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR) len += scnprintf(buf + len, size - len, "%shost_traddr=%s", (len) ? "," : "", ctrl->opts->host_traddr); if (ctrl->opts->mask & NVMF_OPT_HOST_IFACE) len += scnprintf(buf + len, size - len, "%shost_iface=%s", (len) ? "," : "", ctrl->opts->host_iface); len += scnprintf(buf + len, size - len, "\n"); return len; } EXPORT_SYMBOL_GPL(nvmf_get_address); /** * nvmf_reg_read32() - NVMe Fabrics "Property Get" API function. * @ctrl: Host NVMe controller instance maintaining the admin * queue used to submit the property read command to * the allocated NVMe controller resource on the target system. * @off: Starting offset value of the targeted property * register (see the fabrics section of the NVMe standard). * @val: OUTPUT parameter that will contain the value of * the property after a successful read. * * Used by the host system to retrieve a 32-bit capsule property value * from an NVMe controller on the target system. * * ("Capsule property" is an "PCIe register concept" applied to the * NVMe fabrics space.) * * Return: * 0: successful read * > 0: NVMe error status code * < 0: Linux errno error code */ int nvmf_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val) { struct nvme_command cmd = { }; union nvme_result res; int ret; cmd.prop_get.opcode = nvme_fabrics_command; cmd.prop_get.fctype = nvme_fabrics_type_property_get; cmd.prop_get.offset = cpu_to_le32(off); ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, NULL, 0, NVME_QID_ANY, NVME_SUBMIT_RESERVED); if (ret >= 0) *val = le64_to_cpu(res.u64); if (unlikely(ret != 0)) dev_err(ctrl->device, "Property Get error: %d, offset %#x\n", ret > 0 ? ret & ~NVME_STATUS_DNR : ret, off); return ret; } EXPORT_SYMBOL_GPL(nvmf_reg_read32); /** * nvmf_reg_read64() - NVMe Fabrics "Property Get" API function. 
 * @ctrl:	Host NVMe controller instance maintaining the admin
 *		queue used to submit the property read command to
 *		the allocated controller resource on the target system.
 * @off:	Starting offset value of the targeted property
 *		register (see the fabrics section of the NVMe standard).
 * @val:	OUTPUT parameter that will contain the value of
 *		the property after a successful read.
 *
 * Used by the host system to retrieve a 64-bit capsule property value
 * from an NVMe controller on the target system.
 *
 * ("Capsule property" is a "PCIe register concept" applied to the
 * NVMe fabrics space.)
 *
 * Return:
 *	0: successful read
 *	> 0: NVMe error status code
 *	< 0: Linux errno error code
 */
int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val)
{
	struct nvme_command cmd = { };
	union nvme_result res;
	int ret;

	cmd.prop_get.opcode = nvme_fabrics_command;
	cmd.prop_get.fctype = nvme_fabrics_type_property_get;
	cmd.prop_get.attrib = 1;
	cmd.prop_get.offset = cpu_to_le32(off);

	ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, NULL, 0,
			NVME_QID_ANY, NVME_SUBMIT_RESERVED);

	if (ret >= 0)
		*val = le64_to_cpu(res.u64);
	if (unlikely(ret != 0))
		dev_err(ctrl->device,
			"Property Get error: %d, offset %#x\n",
			ret > 0 ? ret & ~NVME_STATUS_DNR : ret, off);
	return ret;
}
EXPORT_SYMBOL_GPL(nvmf_reg_read64);

/**
 * nvmf_reg_write32() - NVMe Fabrics "Property Set" API function.
 * @ctrl:	Host NVMe controller instance maintaining the admin
 *		queue used to submit the property write command to
 *		the allocated NVMe controller resource on the target system.
 * @off:	Starting offset value of the targeted property
 *		register (see the fabrics section of the NVMe standard).
 * @val:	Input parameter that contains the value to be
 *		written to the property.
 *
 * Used by the NVMe host system to write a 32-bit capsule property value
 * to an NVMe controller on the target system.
 *
 * ("Capsule property" is a "PCIe register concept" applied to the
 * NVMe fabrics space.)
* * Return: * 0: successful write * > 0: NVMe error status code * < 0: Linux errno error code */ int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val) { struct nvme_command cmd = { }; int ret; cmd.prop_set.opcode = nvme_fabrics_command; cmd.prop_set.fctype = nvme_fabrics_type_property_set; cmd.prop_set.attrib = 0; cmd.prop_set.offset = cpu_to_le32(off); cmd.prop_set.value = cpu_to_le64(val); ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, NULL, NULL, 0, NVME_QID_ANY, NVME_SUBMIT_RESERVED); if (unlikely(ret)) dev_err(ctrl->device, "Property Set error: %d, offset %#x\n", ret > 0 ? ret & ~NVME_STATUS_DNR : ret, off); return ret; } EXPORT_SYMBOL_GPL(nvmf_reg_write32); int nvmf_subsystem_reset(struct nvme_ctrl *ctrl) { int ret; if (!nvme_wait_reset(ctrl)) return -EBUSY; ret = ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, NVME_SUBSYS_RESET); if (ret) return ret; return nvme_try_sched_reset(ctrl); } EXPORT_SYMBOL_GPL(nvmf_subsystem_reset); /** * nvmf_log_connect_error() - Error-parsing-diagnostic print out function for * connect() errors. * @ctrl: The specific /dev/nvmeX device that had the error. * @errval: Error code to be decoded in a more human-friendly * printout. * @offset: For use with the NVMe error code * NVME_SC_CONNECT_INVALID_PARAM. * @cmd: This is the SQE portion of a submission capsule. * @data: This is the "Data" portion of a submission capsule. 
*/ static void nvmf_log_connect_error(struct nvme_ctrl *ctrl, int errval, int offset, struct nvme_command *cmd, struct nvmf_connect_data *data) { int err_sctype = errval & ~NVME_STATUS_DNR; if (errval < 0) { dev_err(ctrl->device, "Connect command failed, errno: %d\n", errval); return; } switch (err_sctype) { case NVME_SC_CONNECT_INVALID_PARAM: if (offset >> 16) { char *inv_data = "Connect Invalid Data Parameter"; switch (offset & 0xffff) { case (offsetof(struct nvmf_connect_data, cntlid)): dev_err(ctrl->device, "%s, cntlid: %d\n", inv_data, data->cntlid); break; case (offsetof(struct nvmf_connect_data, hostnqn)): dev_err(ctrl->device, "%s, hostnqn \"%s\"\n", inv_data, data->hostnqn); break; case (offsetof(struct nvmf_connect_data, subsysnqn)): dev_err(ctrl->device, "%s, subsysnqn \"%s\"\n", inv_data, data->subsysnqn); break; default: dev_err(ctrl->device, "%s, starting byte offset: %d\n", inv_data, offset & 0xffff); break; } } else { char *inv_sqe = "Connect Invalid SQE Parameter"; switch (offset) { case (offsetof(struct nvmf_connect_command, qid)): dev_err(ctrl->device, "%s, qid %d\n", inv_sqe, cmd->connect.qid); break; default: dev_err(ctrl->device, "%s, starting byte offset: %d\n", inv_sqe, offset); } } break; case NVME_SC_CONNECT_INVALID_HOST: dev_err(ctrl->device, "Connect for subsystem %s is not allowed, hostnqn: %s\n", data->subsysnqn, data->hostnqn); break; case NVME_SC_CONNECT_CTRL_BUSY: dev_err(ctrl->device, "Connect command failed: controller is busy or not available\n"); break; case NVME_SC_CONNECT_FORMAT: dev_err(ctrl->device, "Connect incompatible format: %d", cmd->connect.recfmt); break; case NVME_SC_HOST_PATH_ERROR: dev_err(ctrl->device, "Connect command failed: host path error\n"); break; case NVME_SC_AUTH_REQUIRED: dev_err(ctrl->device, "Connect command failed: authentication required\n"); break; default: dev_err(ctrl->device, "Connect command failed, error wo/DNR bit: %d\n", err_sctype); break; } } static struct nvmf_connect_data 
*nvmf_connect_data_prep(struct nvme_ctrl *ctrl, u16 cntlid)
{
	struct nvmf_connect_data *data;

	data = kzalloc(sizeof(*data), GFP_KERNEL);
	if (!data)
		return NULL;

	uuid_copy(&data->hostid, &ctrl->opts->host->id);
	data->cntlid = cpu_to_le16(cntlid);
	strscpy(data->subsysnqn, ctrl->opts->subsysnqn, NVMF_NQN_SIZE);
	strscpy(data->hostnqn, ctrl->opts->host->nqn, NVMF_NQN_SIZE);

	return data;
}

static void nvmf_connect_cmd_prep(struct nvme_ctrl *ctrl, u16 qid,
		struct nvme_command *cmd)
{
	cmd->connect.opcode = nvme_fabrics_command;
	cmd->connect.fctype = nvme_fabrics_type_connect;
	cmd->connect.qid = cpu_to_le16(qid);

	if (qid) {
		cmd->connect.sqsize = cpu_to_le16(ctrl->sqsize);
	} else {
		cmd->connect.sqsize = cpu_to_le16(NVME_AQ_DEPTH - 1);

		/*
		 * ctrl->kato is kept in seconds; the Connect command
		 * carries the keep-alive timeout in milliseconds
		 * (seconds * 1000).
		 */
		cmd->connect.kato = cpu_to_le32(ctrl->kato * 1000);
	}

	if (ctrl->opts->disable_sqflow)
		cmd->connect.cattr |= NVME_CONNECT_DISABLE_SQFLOW;
}

/**
 * nvmf_connect_admin_queue() - NVMe Fabrics Admin Queue "Connect"
 *				API function.
 * @ctrl:	Host nvme controller instance used to request
 *		a new NVMe controller allocation on the target
 *		system and establish an NVMe Admin connection to
 *		that controller.
 *
 * This function enables an NVMe host device to request a new allocation of
 * an NVMe controller resource on a target system as well as establish a
 * fabrics-protocol connection of the NVMe Admin queue between the
 * host system device and the allocated NVMe controller on the
 * target system via an NVMe Fabrics "Connect" command.
*/ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl) { struct nvme_command cmd = { }; union nvme_result res; struct nvmf_connect_data *data; int ret; u32 result; nvmf_connect_cmd_prep(ctrl, 0, &cmd); data = nvmf_connect_data_prep(ctrl, 0xffff); if (!data) return -ENOMEM; ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, data, sizeof(*data), NVME_QID_ANY, NVME_SUBMIT_AT_HEAD | NVME_SUBMIT_NOWAIT | NVME_SUBMIT_RESERVED); if (ret) { nvmf_log_connect_error(ctrl, ret, le32_to_cpu(res.u32), &cmd, data); goto out_free_data; } result = le32_to_cpu(res.u32); ctrl->cntlid = result & 0xFFFF; if (result & (NVME_CONNECT_AUTHREQ_ATR | NVME_CONNECT_AUTHREQ_ASCR)) { /* Check for secure concatenation */ if ((result & NVME_CONNECT_AUTHREQ_ASCR) && !ctrl->opts->concat) { dev_warn(ctrl->device, "qid 0: secure concatenation is not supported\n"); ret = -EOPNOTSUPP; goto out_free_data; } /* Authentication required */ ret = nvme_auth_negotiate(ctrl, 0); if (ret) { dev_warn(ctrl->device, "qid 0: authentication setup failed\n"); goto out_free_data; } ret = nvme_auth_wait(ctrl, 0); if (ret) { dev_warn(ctrl->device, "qid 0: authentication failed, error %d\n", ret); } else dev_info(ctrl->device, "qid 0: authenticated\n"); } out_free_data: kfree(data); return ret; } EXPORT_SYMBOL_GPL(nvmf_connect_admin_queue); /** * nvmf_connect_io_queue() - NVMe Fabrics I/O Queue "Connect" * API function. * @ctrl: Host nvme controller instance used to establish an * NVMe I/O queue connection to the already allocated NVMe * controller on the target system. * @qid: NVMe I/O queue number for the new I/O connection between * host and target (note qid == 0 is illegal as this is * the Admin queue, per NVMe standard). * * This function issues a fabrics-protocol connection * of a NVMe I/O queue (via NVMe Fabrics "Connect" command) * between the host system device and the allocated NVMe controller * on the target system. 
 *
 * Return:
 *	0: success
 *	> 0: NVMe error status code
 *	< 0: Linux errno error code
 */
int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid)
{
	struct nvme_command cmd = { };
	struct nvmf_connect_data *data;
	union nvme_result res;
	int ret;
	u32 result;

	nvmf_connect_cmd_prep(ctrl, qid, &cmd);

	data = nvmf_connect_data_prep(ctrl, ctrl->cntlid);
	if (!data)
		return -ENOMEM;

	ret = __nvme_submit_sync_cmd(ctrl->connect_q, &cmd, &res,
			data, sizeof(*data), qid,
			NVME_SUBMIT_AT_HEAD |
			NVME_SUBMIT_RESERVED |
			NVME_SUBMIT_NOWAIT);
	if (ret) {
		nvmf_log_connect_error(ctrl, ret, le32_to_cpu(res.u32),
				       &cmd, data);
		goto out_free_data;
	}
	result = le32_to_cpu(res.u32);
	if (result & (NVME_CONNECT_AUTHREQ_ATR | NVME_CONNECT_AUTHREQ_ASCR)) {
		/* Secure concatenation is not implemented */
		if (result & NVME_CONNECT_AUTHREQ_ASCR) {
			dev_warn(ctrl->device,
				 "qid %d: secure concatenation is not supported\n", qid);
			ret = -EOPNOTSUPP;
			goto out_free_data;
		}
		/* Authentication required */
		ret = nvme_auth_negotiate(ctrl, qid);
		if (ret) {
			dev_warn(ctrl->device,
				 "qid %d: authentication setup failed\n", qid);
			goto out_free_data;
		}
		ret = nvme_auth_wait(ctrl, qid);
		if (ret) {
			dev_warn(ctrl->device,
				 "qid %u: authentication failed, error %d\n",
				 qid, ret);
		}
	}
out_free_data:
	kfree(data);
	return ret;
}
EXPORT_SYMBOL_GPL(nvmf_connect_io_queue);

/*
 * Evaluate the status information returned by the transport in order to decide
 * if a reconnect attempt should be scheduled.
 *
 * Do not retry when:
 *
 * - the DNR bit is set and the specification states no further connect
 *   attempts with the same set of parameters should be attempted.
 *
 * - when the authentication attempt fails, because the key was invalid.
 *   This error code is set on the host side.
*/ bool nvmf_should_reconnect(struct nvme_ctrl *ctrl, int status) { if (status > 0 && (status & NVME_STATUS_DNR)) return false; if (status == -EKEYREJECTED) return false; if (ctrl->opts->max_reconnects == -1 || ctrl->nr_reconnects < ctrl->opts->max_reconnects) return true; return false; } EXPORT_SYMBOL_GPL(nvmf_should_reconnect); /** * nvmf_register_transport() - NVMe Fabrics Library registration function. * @ops: Transport ops instance to be registered to the * common fabrics library. * * API function that registers the type of specific transport fabric * being implemented to the common NVMe fabrics library. Part of * the overall init sequence of starting up a fabrics driver. */ int nvmf_register_transport(struct nvmf_transport_ops *ops) { if (!ops->create_ctrl) return -EINVAL; down_write(&nvmf_transports_rwsem); list_add_tail(&ops->entry, &nvmf_transports); up_write(&nvmf_transports_rwsem); return 0; } EXPORT_SYMBOL_GPL(nvmf_register_transport); /** * nvmf_unregister_transport() - NVMe Fabrics Library unregistration function. * @ops: Transport ops instance to be unregistered from the * common fabrics library. * * Fabrics API function that unregisters the type of specific transport * fabric being implemented from the common NVMe fabrics library. * Part of the overall exit sequence of unloading the implemented driver. 
*/ void nvmf_unregister_transport(struct nvmf_transport_ops *ops) { down_write(&nvmf_transports_rwsem); list_del(&ops->entry); up_write(&nvmf_transports_rwsem); } EXPORT_SYMBOL_GPL(nvmf_unregister_transport); static struct nvmf_transport_ops *nvmf_lookup_transport( struct nvmf_ctrl_options *opts) { struct nvmf_transport_ops *ops; lockdep_assert_held(&nvmf_transports_rwsem); list_for_each_entry(ops, &nvmf_transports, entry) { if (strcmp(ops->name, opts->transport) == 0) return ops; } return NULL; } static struct key *nvmf_parse_key(int key_id) { struct key *key; if (!IS_ENABLED(CONFIG_NVME_TCP_TLS)) { pr_err("TLS is not supported\n"); return ERR_PTR(-EINVAL); } key = nvme_tls_key_lookup(key_id); if (IS_ERR(key)) pr_err("key id %08x not found\n", key_id); else pr_debug("Using key id %08x\n", key_id); return key; } static const match_table_t opt_tokens = { { NVMF_OPT_TRANSPORT, "transport=%s" }, { NVMF_OPT_TRADDR, "traddr=%s" }, { NVMF_OPT_TRSVCID, "trsvcid=%s" }, { NVMF_OPT_NQN, "nqn=%s" }, { NVMF_OPT_QUEUE_SIZE, "queue_size=%d" }, { NVMF_OPT_NR_IO_QUEUES, "nr_io_queues=%d" }, { NVMF_OPT_RECONNECT_DELAY, "reconnect_delay=%d" }, { NVMF_OPT_CTRL_LOSS_TMO, "ctrl_loss_tmo=%d" }, { NVMF_OPT_KATO, "keep_alive_tmo=%d" }, { NVMF_OPT_HOSTNQN, "hostnqn=%s" }, { NVMF_OPT_HOST_TRADDR, "host_traddr=%s" }, { NVMF_OPT_HOST_IFACE, "host_iface=%s" }, { NVMF_OPT_HOST_ID, "hostid=%s" }, { NVMF_OPT_DUP_CONNECT, "duplicate_connect" }, { NVMF_OPT_DISABLE_SQFLOW, "disable_sqflow" }, { NVMF_OPT_HDR_DIGEST, "hdr_digest" }, { NVMF_OPT_DATA_DIGEST, "data_digest" }, { NVMF_OPT_NR_WRITE_QUEUES, "nr_write_queues=%d" }, { NVMF_OPT_NR_POLL_QUEUES, "nr_poll_queues=%d" }, { NVMF_OPT_TOS, "tos=%d" }, #ifdef CONFIG_NVME_TCP_TLS { NVMF_OPT_KEYRING, "keyring=%d" }, { NVMF_OPT_TLS_KEY, "tls_key=%d" }, #endif { NVMF_OPT_FAIL_FAST_TMO, "fast_io_fail_tmo=%d" }, { NVMF_OPT_DISCOVERY, "discovery" }, #ifdef CONFIG_NVME_HOST_AUTH { NVMF_OPT_DHCHAP_SECRET, "dhchap_secret=%s" }, { NVMF_OPT_DHCHAP_CTRL_SECRET, 
"dhchap_ctrl_secret=%s" }, #endif #ifdef CONFIG_NVME_TCP_TLS { NVMF_OPT_TLS, "tls" }, { NVMF_OPT_CONCAT, "concat" }, #endif { NVMF_OPT_ERR, NULL } }; static int nvmf_parse_options(struct nvmf_ctrl_options *opts, const char *buf) { substring_t args[MAX_OPT_ARGS]; char *options, *o, *p; int token, ret = 0; size_t nqnlen = 0; int ctrl_loss_tmo = NVMF_DEF_CTRL_LOSS_TMO, key_id; uuid_t hostid; char hostnqn[NVMF_NQN_SIZE]; struct key *key; /* Set defaults */ opts->queue_size = NVMF_DEF_QUEUE_SIZE; opts->nr_io_queues = num_online_cpus(); opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY; opts->kato = 0; opts->duplicate_connect = false; opts->fast_io_fail_tmo = NVMF_DEF_FAIL_FAST_TMO; opts->hdr_digest = false; opts->data_digest = false; opts->tos = -1; /* < 0 == use transport default */ opts->tls = false; opts->tls_key = NULL; opts->keyring = NULL; opts->concat = false; options = o = kstrdup(buf, GFP_KERNEL); if (!options) return -ENOMEM; /* use default host if not given by user space */ uuid_copy(&hostid, &nvmf_default_host->id); strscpy(hostnqn, nvmf_default_host->nqn, NVMF_NQN_SIZE); while ((p = strsep(&o, ",\n")) != NULL) { if (!*p) continue; token = match_token(p, opt_tokens, args); opts->mask |= token; switch (token) { case NVMF_OPT_TRANSPORT: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } kfree(opts->transport); opts->transport = p; break; case NVMF_OPT_NQN: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } kfree(opts->subsysnqn); opts->subsysnqn = p; nqnlen = strlen(opts->subsysnqn); if (nqnlen >= NVMF_NQN_SIZE) { pr_err("%s needs to be < %d bytes\n", opts->subsysnqn, NVMF_NQN_SIZE); ret = -EINVAL; goto out; } opts->discovery_nqn = !(strcmp(opts->subsysnqn, NVME_DISC_SUBSYS_NAME)); break; case NVMF_OPT_TRADDR: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } kfree(opts->traddr); opts->traddr = p; break; case NVMF_OPT_TRSVCID: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } kfree(opts->trsvcid); opts->trsvcid = p; 
break; case NVMF_OPT_QUEUE_SIZE: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token < NVMF_MIN_QUEUE_SIZE || token > NVMF_MAX_QUEUE_SIZE) { pr_err("Invalid queue_size %d\n", token); ret = -EINVAL; goto out; } opts->queue_size = token; break; case NVMF_OPT_NR_IO_QUEUES: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token <= 0) { pr_err("Invalid number of IOQs %d\n", token); ret = -EINVAL; goto out; } if (opts->discovery_nqn) { pr_debug("Ignoring nr_io_queues value for discovery controller\n"); break; } opts->nr_io_queues = min_t(unsigned int, num_online_cpus(), token); break; case NVMF_OPT_KATO: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token < 0) { pr_err("Invalid keep_alive_tmo %d\n", token); ret = -EINVAL; goto out; } else if (token == 0 && !opts->discovery_nqn) { /* Allowed for debug */ pr_warn("keep_alive_tmo 0 won't execute keep alives!!!\n"); } opts->kato = token; break; case NVMF_OPT_CTRL_LOSS_TMO: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token < 0) pr_warn("ctrl_loss_tmo < 0 will reconnect forever\n"); ctrl_loss_tmo = token; break; case NVMF_OPT_FAIL_FAST_TMO: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token >= 0) pr_warn("I/O fail on reconnect controller after %d sec\n", token); else token = -1; opts->fast_io_fail_tmo = token; break; case NVMF_OPT_HOSTNQN: if (opts->host) { pr_err("hostnqn already user-assigned: %s\n", opts->host->nqn); ret = -EADDRINUSE; goto out; } p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } nqnlen = strlen(p); if (nqnlen >= NVMF_NQN_SIZE) { pr_err("%s needs to be < %d bytes\n", p, NVMF_NQN_SIZE); kfree(p); ret = -EINVAL; goto out; } strscpy(hostnqn, p, NVMF_NQN_SIZE); kfree(p); break; case NVMF_OPT_RECONNECT_DELAY: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token <= 0) { pr_err("Invalid reconnect_delay %d\n", token); ret = -EINVAL; goto out; } opts->reconnect_delay = token; break; case NVMF_OPT_HOST_TRADDR: p 
= match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } kfree(opts->host_traddr); opts->host_traddr = p; break; case NVMF_OPT_HOST_IFACE: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } kfree(opts->host_iface); opts->host_iface = p; break; case NVMF_OPT_HOST_ID: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } ret = uuid_parse(p, &hostid); if (ret) { pr_err("Invalid hostid %s\n", p); ret = -EINVAL; kfree(p); goto out; } kfree(p); break; case NVMF_OPT_DUP_CONNECT: opts->duplicate_connect = true; break; case NVMF_OPT_DISABLE_SQFLOW: opts->disable_sqflow = true; break; case NVMF_OPT_HDR_DIGEST: opts->hdr_digest = true; break; case NVMF_OPT_DATA_DIGEST: opts->data_digest = true; break; case NVMF_OPT_NR_WRITE_QUEUES: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token <= 0) { pr_err("Invalid nr_write_queues %d\n", token); ret = -EINVAL; goto out; } opts->nr_write_queues = token; break; case NVMF_OPT_NR_POLL_QUEUES: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token <= 0) { pr_err("Invalid nr_poll_queues %d\n", token); ret = -EINVAL; goto out; } opts->nr_poll_queues = token; break; case NVMF_OPT_TOS: if (match_int(args, &token)) { ret = -EINVAL; goto out; } if (token < 0) { pr_err("Invalid type of service %d\n", token); ret = -EINVAL; goto out; } if (token > 255) { pr_warn("Clamping type of service to 255\n"); token = 255; } opts->tos = token; break; case NVMF_OPT_KEYRING: if (match_int(args, &key_id) || key_id <= 0) { ret = -EINVAL; goto out; } key = nvmf_parse_key(key_id); if (IS_ERR(key)) { ret = PTR_ERR(key); goto out; } key_put(opts->keyring); opts->keyring = key; break; case NVMF_OPT_TLS_KEY: if (match_int(args, &key_id) || key_id <= 0) { ret = -EINVAL; goto out; } key = nvmf_parse_key(key_id); if (IS_ERR(key)) { ret = PTR_ERR(key); goto out; } key_put(opts->tls_key); opts->tls_key = key; break; case NVMF_OPT_DISCOVERY: opts->discovery_nqn = true; break; case NVMF_OPT_DHCHAP_SECRET: p = 
match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } if (strlen(p) < 11 || strncmp(p, "DHHC-1:", 7)) { pr_err("Invalid DH-CHAP secret %s\n", p); ret = -EINVAL; goto out; } kfree(opts->dhchap_secret); opts->dhchap_secret = p; break; case NVMF_OPT_DHCHAP_CTRL_SECRET: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } if (strlen(p) < 11 || strncmp(p, "DHHC-1:", 7)) { pr_err("Invalid DH-CHAP secret %s\n", p); ret = -EINVAL; goto out; } kfree(opts->dhchap_ctrl_secret); opts->dhchap_ctrl_secret = p; break; case NVMF_OPT_TLS: if (!IS_ENABLED(CONFIG_NVME_TCP_TLS)) { pr_err("TLS is not supported\n"); ret = -EINVAL; goto out; } opts->tls = true; break; case NVMF_OPT_CONCAT: if (!IS_ENABLED(CONFIG_NVME_TCP_TLS)) { pr_err("TLS is not supported\n"); ret = -EINVAL; goto out; } opts->concat = true; break; default: pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n", p); ret = -EINVAL; goto out; } } if (opts->discovery_nqn) { opts->nr_io_queues = 0; opts->nr_write_queues = 0; opts->nr_poll_queues = 0; opts->duplicate_connect = true; } else { if (!opts->kato) opts->kato = NVME_DEFAULT_KATO; } if (ctrl_loss_tmo < 0) { opts->max_reconnects = -1; } else { opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo, opts->reconnect_delay); if (ctrl_loss_tmo < opts->fast_io_fail_tmo) pr_warn("failfast tmo (%d) larger than controller loss tmo (%d)\n", opts->fast_io_fail_tmo, ctrl_loss_tmo); } if (opts->concat) { if (opts->tls) { pr_err("Secure concatenation over TLS is not supported\n"); ret = -EINVAL; goto out; } if (opts->tls_key) { pr_err("Cannot specify a TLS key for secure concatenation\n"); ret = -EINVAL; goto out; } if (!opts->dhchap_secret) { pr_err("Need to enable DH-CHAP for secure concatenation\n"); ret = -EINVAL; goto out; } } opts->host = nvmf_host_add(hostnqn, &hostid); if (IS_ERR(opts->host)) { ret = PTR_ERR(opts->host); opts->host = NULL; goto out; } out: kfree(options); return ret; } void nvmf_set_io_queues(struct nvmf_ctrl_options *opts, u32 
nr_io_queues, u32 io_queues[HCTX_MAX_TYPES]) { if (opts->nr_write_queues && opts->nr_io_queues < nr_io_queues) { /* * separate read/write queues * hand out dedicated default queues only after we have * sufficient read queues. */ io_queues[HCTX_TYPE_READ] = opts->nr_io_queues; nr_io_queues -= io_queues[HCTX_TYPE_READ]; io_queues[HCTX_TYPE_DEFAULT] = min(opts->nr_write_queues, nr_io_queues); nr_io_queues -= io_queues[HCTX_TYPE_DEFAULT]; } else { /* * shared read/write queues * either no write queues were requested, or we don't have * sufficient queue count to have dedicated default queues. */ io_queues[HCTX_TYPE_DEFAULT] = min(opts->nr_io_queues, nr_io_queues); nr_io_queues -= io_queues[HCTX_TYPE_DEFAULT]; } if (opts->nr_poll_queues && nr_io_queues) { /* map dedicated poll queues only if we have queues left */ io_queues[HCTX_TYPE_POLL] = min(opts->nr_poll_queues, nr_io_queues); } } EXPORT_SYMBOL_GPL(nvmf_set_io_queues); void nvmf_map_queues(struct blk_mq_tag_set *set, struct nvme_ctrl *ctrl, u32 io_queues[HCTX_MAX_TYPES]) { struct nvmf_ctrl_options *opts = ctrl->opts; if (opts->nr_write_queues && io_queues[HCTX_TYPE_READ]) { /* separate read/write queues */ set->map[HCTX_TYPE_DEFAULT].nr_queues = io_queues[HCTX_TYPE_DEFAULT]; set->map[HCTX_TYPE_DEFAULT].queue_offset = 0; set->map[HCTX_TYPE_READ].nr_queues = io_queues[HCTX_TYPE_READ]; set->map[HCTX_TYPE_READ].queue_offset = io_queues[HCTX_TYPE_DEFAULT]; } else { /* shared read/write queues */ set->map[HCTX_TYPE_DEFAULT].nr_queues = io_queues[HCTX_TYPE_DEFAULT]; set->map[HCTX_TYPE_DEFAULT].queue_offset = 0; set->map[HCTX_TYPE_READ].nr_queues = io_queues[HCTX_TYPE_DEFAULT]; set->map[HCTX_TYPE_READ].queue_offset = 0; } blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]); blk_mq_map_queues(&set->map[HCTX_TYPE_READ]); if (opts->nr_poll_queues && io_queues[HCTX_TYPE_POLL]) { /* map dedicated poll queues only if we have queues left */ set->map[HCTX_TYPE_POLL].nr_queues = io_queues[HCTX_TYPE_POLL]; 
set->map[HCTX_TYPE_POLL].queue_offset = io_queues[HCTX_TYPE_DEFAULT] + io_queues[HCTX_TYPE_READ]; blk_mq_map_queues(&set->map[HCTX_TYPE_POLL]); } dev_info(ctrl->device, "mapped %d/%d/%d default/read/poll queues.\n", io_queues[HCTX_TYPE_DEFAULT], io_queues[HCTX_TYPE_READ], io_queues[HCTX_TYPE_POLL]); } EXPORT_SYMBOL_GPL(nvmf_map_queues); static int nvmf_check_required_opts(struct nvmf_ctrl_options *opts, unsigned int required_opts) { if ((opts->mask & required_opts) != required_opts) { unsigned int i; for (i = 0; i < ARRAY_SIZE(opt_tokens); i++) { if ((opt_tokens[i].token & required_opts) && !(opt_tokens[i].token & opts->mask)) { pr_warn("missing parameter '%s'\n", opt_tokens[i].pattern); } } return -EINVAL; } return 0; } bool nvmf_ip_options_match(struct nvme_ctrl *ctrl, struct nvmf_ctrl_options *opts) { if (!nvmf_ctlr_matches_baseopts(ctrl, opts) || strcmp(opts->traddr, ctrl->opts->traddr) || strcmp(opts->trsvcid, ctrl->opts->trsvcid)) return false; /* * Checking the local address or host interfaces is rough. * * In most cases, none is specified and the host port or * host interface is selected by the stack. * * Assume no match if: * - local address or host interface is specified and address * or host interface is not the same * - local address or host interface is not specified but * remote is, or vice versa (admin using specific * host_traddr/host_iface when it matters). 
*/ if ((opts->mask & NVMF_OPT_HOST_TRADDR) && (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR)) { if (strcmp(opts->host_traddr, ctrl->opts->host_traddr)) return false; } else if ((opts->mask & NVMF_OPT_HOST_TRADDR) || (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR)) { return false; } if ((opts->mask & NVMF_OPT_HOST_IFACE) && (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)) { if (strcmp(opts->host_iface, ctrl->opts->host_iface)) return false; } else if ((opts->mask & NVMF_OPT_HOST_IFACE) || (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)) { return false; } return true; } EXPORT_SYMBOL_GPL(nvmf_ip_options_match); static int nvmf_check_allowed_opts(struct nvmf_ctrl_options *opts, unsigned int allowed_opts) { if (opts->mask & ~allowed_opts) { unsigned int i; for (i = 0; i < ARRAY_SIZE(opt_tokens); i++) { if ((opt_tokens[i].token & opts->mask) && (opt_tokens[i].token & ~allowed_opts)) { pr_warn("invalid parameter '%s'\n", opt_tokens[i].pattern); } } return -EINVAL; } return 0; } void nvmf_free_options(struct nvmf_ctrl_options *opts) { nvmf_host_put(opts->host); key_put(opts->keyring); key_put(opts->tls_key); kfree(opts->transport); kfree(opts->traddr); kfree(opts->trsvcid); kfree(opts->subsysnqn); kfree(opts->host_traddr); kfree(opts->host_iface); kfree(opts->dhchap_secret); kfree(opts->dhchap_ctrl_secret); kfree(opts); } EXPORT_SYMBOL_GPL(nvmf_free_options); #define NVMF_REQUIRED_OPTS (NVMF_OPT_TRANSPORT | NVMF_OPT_NQN) #define NVMF_ALLOWED_OPTS (NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \ NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \ NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\ NVMF_OPT_DISABLE_SQFLOW | NVMF_OPT_DISCOVERY |\ NVMF_OPT_FAIL_FAST_TMO | NVMF_OPT_DHCHAP_SECRET |\ NVMF_OPT_DHCHAP_CTRL_SECRET) static struct nvme_ctrl * nvmf_create_ctrl(struct device *dev, const char *buf) { struct nvmf_ctrl_options *opts; struct nvmf_transport_ops *ops; struct nvme_ctrl *ctrl; int ret; opts = kzalloc(sizeof(*opts), GFP_KERNEL); if (!opts) return ERR_PTR(-ENOMEM); ret = nvmf_parse_options(opts, buf); if (ret) 
goto out_free_opts; request_module("nvme-%s", opts->transport); /* * Check the generic options first as we need a valid transport for * the lookup below. Then clear the generic flags so that transport * drivers don't have to care about them. */ ret = nvmf_check_required_opts(opts, NVMF_REQUIRED_OPTS); if (ret) goto out_free_opts; opts->mask &= ~NVMF_REQUIRED_OPTS; down_read(&nvmf_transports_rwsem); ops = nvmf_lookup_transport(opts); if (!ops) { pr_info("no handler found for transport %s.\n", opts->transport); ret = -EINVAL; goto out_unlock; } if (!try_module_get(ops->module)) { ret = -EBUSY; goto out_unlock; } up_read(&nvmf_transports_rwsem); ret = nvmf_check_required_opts(opts, ops->required_opts); if (ret) goto out_module_put; ret = nvmf_check_allowed_opts(opts, NVMF_ALLOWED_OPTS | ops->allowed_opts | ops->required_opts); if (ret) goto out_module_put; ctrl = ops->create_ctrl(dev, opts); if (IS_ERR(ctrl)) { ret = PTR_ERR(ctrl); goto out_module_put; } module_put(ops->module); return ctrl; out_module_put: module_put(ops->module); goto out_free_opts; out_unlock: up_read(&nvmf_transports_rwsem); out_free_opts: nvmf_free_options(opts); return ERR_PTR(ret); } static const struct class nvmf_class = { .name = "nvme-fabrics", }; static struct device *nvmf_device; static DEFINE_MUTEX(nvmf_dev_mutex); static ssize_t nvmf_dev_write(struct file *file, const char __user *ubuf, size_t count, loff_t *pos) { struct seq_file *seq_file = file->private_data; struct nvme_ctrl *ctrl; const char *buf; int ret = 0; if (count > PAGE_SIZE) return -ENOMEM; buf = memdup_user_nul(ubuf, count); if (IS_ERR(buf)) return PTR_ERR(buf); mutex_lock(&nvmf_dev_mutex); if (seq_file->private) { ret = -EINVAL; goto out_unlock; } ctrl = nvmf_create_ctrl(nvmf_device, buf); if (IS_ERR(ctrl)) { ret = PTR_ERR(ctrl); goto out_unlock; } seq_file->private = ctrl; out_unlock: mutex_unlock(&nvmf_dev_mutex); kfree(buf); return ret ? 
ret : count; } static void __nvmf_concat_opt_tokens(struct seq_file *seq_file) { const struct match_token *tok; int idx; /* * Add dummy entries for instance and cntlid to * signal an invalid/non-existing controller */ seq_puts(seq_file, "instance=-1,cntlid=-1"); for (idx = 0; idx < ARRAY_SIZE(opt_tokens); idx++) { tok = &opt_tokens[idx]; if (tok->token == NVMF_OPT_ERR) continue; seq_putc(seq_file, ','); seq_puts(seq_file, tok->pattern); } seq_putc(seq_file, '\n'); } static int nvmf_dev_show(struct seq_file *seq_file, void *private) { struct nvme_ctrl *ctrl; mutex_lock(&nvmf_dev_mutex); ctrl = seq_file->private; if (!ctrl) { __nvmf_concat_opt_tokens(seq_file); goto out_unlock; } seq_printf(seq_file, "instance=%d,cntlid=%d\n", ctrl->instance, ctrl->cntlid); out_unlock: mutex_unlock(&nvmf_dev_mutex); return 0; } static int nvmf_dev_open(struct inode *inode, struct file *file) { /* * The miscdevice code initializes file->private_data, but doesn't * make use of it later. */ file->private_data = NULL; return single_open(file, nvmf_dev_show, NULL); } static int nvmf_dev_release(struct inode *inode, struct file *file) { struct seq_file *seq_file = file->private_data; struct nvme_ctrl *ctrl = seq_file->private; if (ctrl) nvme_put_ctrl(ctrl); return single_release(inode, file); } static const struct file_operations nvmf_dev_fops = { .owner = THIS_MODULE, .write = nvmf_dev_write, .read = seq_read, .open = nvmf_dev_open, .release = nvmf_dev_release, }; static struct miscdevice nvmf_misc = { .minor = MISC_DYNAMIC_MINOR, .name = "nvme-fabrics", .fops = &nvmf_dev_fops, }; static int __init nvmf_init(void) { int ret; nvmf_default_host = nvmf_host_default(); if (!nvmf_default_host) return -ENOMEM; ret = class_register(&nvmf_class); if (ret) { pr_err("couldn't register class nvme-fabrics\n"); goto out_free_host; } nvmf_device = device_create(&nvmf_class, NULL, MKDEV(0, 0), NULL, "ctl"); if (IS_ERR(nvmf_device)) { pr_err("couldn't create nvme-fabrics device!\n"); ret = 
PTR_ERR(nvmf_device); goto out_destroy_class; } ret = misc_register(&nvmf_misc); if (ret) { pr_err("couldn't register misc device: %d\n", ret); goto out_destroy_device; } return 0; out_destroy_device: device_destroy(&nvmf_class, MKDEV(0, 0)); out_destroy_class: class_unregister(&nvmf_class); out_free_host: nvmf_host_put(nvmf_default_host); return ret; } static void __exit nvmf_exit(void) { misc_deregister(&nvmf_misc); device_destroy(&nvmf_class, MKDEV(0, 0)); class_unregister(&nvmf_class); nvmf_host_put(nvmf_default_host); BUILD_BUG_ON(sizeof(struct nvmf_common_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_connect_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_property_get_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_property_set_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_auth_send_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_auth_receive_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_connect_data) != 1024); BUILD_BUG_ON(sizeof(struct nvmf_auth_dhchap_negotiate_data) != 8); BUILD_BUG_ON(sizeof(struct nvmf_auth_dhchap_challenge_data) != 16); BUILD_BUG_ON(sizeof(struct nvmf_auth_dhchap_reply_data) != 16); BUILD_BUG_ON(sizeof(struct nvmf_auth_dhchap_success1_data) != 16); BUILD_BUG_ON(sizeof(struct nvmf_auth_dhchap_success2_data) != 16); } MODULE_LICENSE("GPL v2"); MODULE_DESCRIPTION("NVMe host fabrics library"); module_init(nvmf_init); module_exit(nvmf_exit); |
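The queue split performed by `nvmf_set_io_queues()` above can be tried out in user space. The following is a minimal sketch, not the kernel code: `struct queue_request`, the `Q_*` indices, `min_u32()` and `split_io_queues()` are hypothetical stand-ins for `struct nvmf_ctrl_options`, `HCTX_TYPE_*`, `min()` and the kernel function. Unlike the original, the sketch zeroes the output array itself instead of assuming the caller did.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's HCTX_TYPE_* indices. */
enum { Q_DEFAULT, Q_READ, Q_POLL, Q_MAX };

/* Hypothetical subset of struct nvmf_ctrl_options. */
struct queue_request {
	uint32_t nr_io_queues;
	uint32_t nr_write_queues;
	uint32_t nr_poll_queues;
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * Distribute 'granted' queues over default/read/poll, following the same
 * order of preference as the listing above.
 */
static void split_io_queues(const struct queue_request *req,
			    uint32_t granted, uint32_t out[Q_MAX])
{
	out[Q_DEFAULT] = out[Q_READ] = out[Q_POLL] = 0;

	if (req->nr_write_queues && req->nr_io_queues < granted) {
		/* Separate read/write queues: read queues are served first. */
		out[Q_READ] = req->nr_io_queues;
		granted -= out[Q_READ];
		out[Q_DEFAULT] = min_u32(req->nr_write_queues, granted);
		granted -= out[Q_DEFAULT];
	} else {
		/* Shared read/write queues. */
		out[Q_DEFAULT] = min_u32(req->nr_io_queues, granted);
		granted -= out[Q_DEFAULT];
	}

	/* Poll queues only get whatever is left over. */
	if (req->nr_poll_queues && granted)
		out[Q_POLL] = min_u32(req->nr_poll_queues, granted);
}
```

With a request of 4 I/O, 2 write and 2 poll queues, a grant of 8 queues yields a 2/4/2 default/read/poll split, while a grant of only 4 falls back to 4 shared default queues and no poll queues.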
// SPDX-License-Identifier: GPL-2.0
/* -*- linux-c -*-
 * sysctl_net_core.c: sysctl interface to net core subsystem.
 *
 * Begun April 1, 1996, Mike Shaver.
 * Added /proc/sys/net/core directory entry (empty =) ).
[MS] */ #include <linux/filter.h> #include <linux/mm.h> #include <linux/sysctl.h> #include <linux/module.h> #include <linux/socket.h> #include <linux/netdevice.h> #include <linux/ratelimit.h> #include <linux/vmalloc.h> #include <linux/init.h> #include <linux/slab.h> #include <linux/sched/isolation.h> #include <net/ip.h> #include <net/sock.h> #include <net/net_ratelimit.h> #include <net/busy_poll.h> #include <net/pkt_sched.h> #include <net/hotdata.h> #include <net/proto_memory.h> #include <net/rps.h> #include "dev.h" static int int_3600 = 3600; static int min_sndbuf = SOCK_MIN_SNDBUF; static int min_rcvbuf = SOCK_MIN_RCVBUF; static int max_skb_frags = MAX_SKB_FRAGS; static int min_mem_pcpu_rsv = SK_MEMORY_PCPU_RESERVE; static int netdev_budget_usecs_min = 2 * USEC_PER_SEC / HZ; static int net_msg_warn; /* Unused, but still a sysctl */ int sysctl_fb_tunnels_only_for_init_net __read_mostly = 0; EXPORT_SYMBOL(sysctl_fb_tunnels_only_for_init_net); /* 0 - Keep current behavior: * IPv4: inherit all current settings from init_net * IPv6: reset all settings to default * 1 - Both inherit all current settings from init_net * 2 - Both reset all settings to default * 3 - Both inherit all settings from current netns */ int sysctl_devconf_inherit_init_net __read_mostly; EXPORT_SYMBOL(sysctl_devconf_inherit_init_net); #if IS_ENABLED(CONFIG_NET_FLOW_LIMIT) || IS_ENABLED(CONFIG_RPS) static int dump_cpumask(void *buffer, size_t *lenp, loff_t *ppos, struct cpumask *mask) { char *kbuf; int len; if (*ppos || !*lenp) { *lenp = 0; return 0; } /* CPUs are displayed as a hex bitmap + a comma between each groups of 8 * nibbles (except the last one which has a newline instead). * Guesstimate the buffer size at the group granularity level. 
*/ len = min(DIV_ROUND_UP(nr_cpumask_bits, 32) * (8 + 1), *lenp); kbuf = kmalloc(len, GFP_KERNEL); if (!kbuf) { *lenp = 0; return -ENOMEM; } len = scnprintf(kbuf, len, "%*pb", cpumask_pr_args(mask)); if (!len) { *lenp = 0; goto free_buf; } /* scnprintf writes a trailing null char not counted in the returned * length, override it with a newline. */ kbuf[len++] = '\n'; memcpy(buffer, kbuf, len); *lenp = len; *ppos += len; free_buf: kfree(kbuf); return 0; } #endif #ifdef CONFIG_RPS static struct cpumask *rps_default_mask_cow_alloc(struct net *net) { struct cpumask *rps_default_mask; if (net->core.rps_default_mask) return net->core.rps_default_mask; rps_default_mask = kzalloc(cpumask_size(), GFP_KERNEL); if (!rps_default_mask) return NULL; /* pairs with READ_ONCE in rx_queue_default_mask() */ WRITE_ONCE(net->core.rps_default_mask, rps_default_mask); return rps_default_mask; } static int rps_default_mask_sysctl(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { struct net *net = (struct net *)table->data; int err = 0; rtnl_lock(); if (write) { struct cpumask *rps_default_mask = rps_default_mask_cow_alloc(net); err = -ENOMEM; if (!rps_default_mask) goto done; err = cpumask_parse(buffer, rps_default_mask); if (err) goto done; err = rps_cpumask_housekeeping(rps_default_mask); if (err) goto done; } else { err = dump_cpumask(buffer, lenp, ppos, net->core.rps_default_mask ? : cpu_none_mask); } done: rtnl_unlock(); return err; } static int rps_sock_flow_sysctl(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { unsigned int orig_size, size; int ret, i; struct ctl_table tmp = { .data = &size, .maxlen = sizeof(size), .mode = table->mode }; struct rps_sock_flow_table *orig_sock_table, *sock_table; static DEFINE_MUTEX(sock_flow_mutex); mutex_lock(&sock_flow_mutex); orig_sock_table = rcu_dereference_protected( net_hotdata.rps_sock_flow_table, lockdep_is_held(&sock_flow_mutex)); size = orig_size = orig_sock_table ? 
orig_sock_table->mask + 1 : 0; ret = proc_dointvec(&tmp, write, buffer, lenp, ppos); if (write) { if (size) { if (size > 1<<29) { /* Enforce limit to prevent overflow */ mutex_unlock(&sock_flow_mutex); return -EINVAL; } size = roundup_pow_of_two(size); if (size != orig_size) { sock_table = vmalloc(RPS_SOCK_FLOW_TABLE_SIZE(size)); if (!sock_table) { mutex_unlock(&sock_flow_mutex); return -ENOMEM; } net_hotdata.rps_cpu_mask = roundup_pow_of_two(nr_cpu_ids) - 1; sock_table->mask = size - 1; } else sock_table = orig_sock_table; for (i = 0; i < size; i++) sock_table->ents[i] = RPS_NO_CPU; } else sock_table = NULL; if (sock_table != orig_sock_table) { rcu_assign_pointer(net_hotdata.rps_sock_flow_table, sock_table); if (sock_table) { static_branch_inc(&rps_needed); static_branch_inc(&rfs_needed); } if (orig_sock_table) { static_branch_dec(&rps_needed); static_branch_dec(&rfs_needed); kvfree_rcu_mightsleep(orig_sock_table); } } } mutex_unlock(&sock_flow_mutex); return ret; } #endif /* CONFIG_RPS */ #ifdef CONFIG_NET_FLOW_LIMIT static DEFINE_MUTEX(flow_limit_update_mutex); static int flow_limit_cpu_sysctl(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { struct sd_flow_limit *cur; struct softnet_data *sd; cpumask_var_t mask; int i, len, ret = 0; if (!alloc_cpumask_var(&mask, GFP_KERNEL)) return -ENOMEM; if (write) { ret = cpumask_parse(buffer, mask); if (ret) goto done; mutex_lock(&flow_limit_update_mutex); len = sizeof(*cur) + netdev_flow_limit_table_len; for_each_possible_cpu(i) { sd = &per_cpu(softnet_data, i); cur = rcu_dereference_protected(sd->flow_limit, lockdep_is_held(&flow_limit_update_mutex)); if (cur && !cpumask_test_cpu(i, mask)) { RCU_INIT_POINTER(sd->flow_limit, NULL); kfree_rcu_mightsleep(cur); } else if (!cur && cpumask_test_cpu(i, mask)) { cur = kzalloc_node(len, GFP_KERNEL, cpu_to_node(i)); if (!cur) { /* not unwinding previous changes */ ret = -ENOMEM; goto write_unlock; } cur->num_buckets = 
netdev_flow_limit_table_len; rcu_assign_pointer(sd->flow_limit, cur); } } write_unlock: mutex_unlock(&flow_limit_update_mutex); } else { cpumask_clear(mask); rcu_read_lock(); for_each_possible_cpu(i) { sd = &per_cpu(softnet_data, i); if (rcu_dereference(sd->flow_limit)) cpumask_set_cpu(i, mask); } rcu_read_unlock(); ret = dump_cpumask(buffer, lenp, ppos, mask); } done: free_cpumask_var(mask); return ret; } static int flow_limit_table_len_sysctl(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { unsigned int old, *ptr; int ret; mutex_lock(&flow_limit_update_mutex); ptr = table->data; old = *ptr; ret = proc_dointvec(table, write, buffer, lenp, ppos); if (!ret && write && !is_power_of_2(*ptr)) { *ptr = old; ret = -EINVAL; } mutex_unlock(&flow_limit_update_mutex); return ret; } #endif /* CONFIG_NET_FLOW_LIMIT */ #ifdef CONFIG_NET_SCHED static int set_default_qdisc(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { char id[IFNAMSIZ]; struct ctl_table tbl = { .data = id, .maxlen = IFNAMSIZ, }; int ret; qdisc_get_default(id, IFNAMSIZ); ret = proc_dostring(&tbl, write, buffer, lenp, ppos); if (write && ret == 0) ret = qdisc_set_default(id); return ret; } #endif static int proc_do_dev_weight(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { static DEFINE_MUTEX(dev_weight_mutex); int ret, weight; mutex_lock(&dev_weight_mutex); ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); if (!ret && write) { weight = READ_ONCE(weight_p); WRITE_ONCE(net_hotdata.dev_rx_weight, weight * dev_weight_rx_bias); WRITE_ONCE(net_hotdata.dev_tx_weight, weight * dev_weight_tx_bias); } mutex_unlock(&dev_weight_mutex); return ret; } static int proc_do_rss_key(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { struct ctl_table fake_table; char buf[NETDEV_RSS_KEY_LEN * 3]; snprintf(buf, sizeof(buf), "%*phC", NETDEV_RSS_KEY_LEN, netdev_rss_key); 
fake_table.data = buf; fake_table.maxlen = sizeof(buf); return proc_dostring(&fake_table, write, buffer, lenp, ppos); } #ifdef CONFIG_BPF_JIT static int proc_dointvec_minmax_bpf_enable(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { int ret, jit_enable = *(int *)table->data; int min = *(int *)table->extra1; int max = *(int *)table->extra2; struct ctl_table tmp = *table; if (write && !capable(CAP_SYS_ADMIN)) return -EPERM; tmp.data = &jit_enable; ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos); if (write && !ret) { if (jit_enable < 2 || (jit_enable == 2 && bpf_dump_raw_ok(current_cred()))) { *(int *)table->data = jit_enable; if (jit_enable == 2) pr_warn("bpf_jit_enable = 2 was set! NEVER use this in production, only for JIT debugging!\n"); } else { ret = -EPERM; } } if (write && ret && min == max) pr_info_once("CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1.\n"); return ret; } # ifdef CONFIG_HAVE_EBPF_JIT static int proc_dointvec_minmax_bpf_restricted(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { if (!capable(CAP_SYS_ADMIN)) return -EPERM; return proc_dointvec_minmax(table, write, buffer, lenp, ppos); } # endif /* CONFIG_HAVE_EBPF_JIT */ static int proc_dolongvec_minmax_bpf_restricted(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { if (!capable(CAP_SYS_ADMIN)) return -EPERM; return proc_doulongvec_minmax(table, write, buffer, lenp, ppos); } #endif static struct ctl_table net_core_table[] = { { .procname = "mem_pcpu_rsv", .data = &net_hotdata.sysctl_mem_pcpu_rsv, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = &min_mem_pcpu_rsv, }, { .procname = "dev_weight", .data = &weight_p, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_do_dev_weight, .extra1 = SYSCTL_ONE, }, { .procname = "dev_weight_rx_bias", .data = &dev_weight_rx_bias, .maxlen = sizeof(int), .mode = 0644, 
.proc_handler = proc_do_dev_weight, .extra1 = SYSCTL_ONE, }, { .procname = "dev_weight_tx_bias", .data = &dev_weight_tx_bias, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_do_dev_weight, .extra1 = SYSCTL_ONE, }, { .procname = "netdev_max_backlog", .data = &net_hotdata.max_backlog, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec }, { .procname = "netdev_rss_key", .data = &netdev_rss_key, .maxlen = sizeof(int), .mode = 0444, .proc_handler = proc_do_rss_key, }, #ifdef CONFIG_BPF_JIT { .procname = "bpf_jit_enable", .data = &bpf_jit_enable, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_minmax_bpf_enable, # ifdef CONFIG_BPF_JIT_ALWAYS_ON .extra1 = SYSCTL_ONE, .extra2 = SYSCTL_ONE, # else .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_TWO, # endif }, # ifdef CONFIG_HAVE_EBPF_JIT { .procname = "bpf_jit_harden", .data = &bpf_jit_harden, .maxlen = sizeof(int), .mode = 0600, .proc_handler = proc_dointvec_minmax_bpf_restricted, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_TWO, }, { .procname = "bpf_jit_kallsyms", .data = &bpf_jit_kallsyms, .maxlen = sizeof(int), .mode = 0600, .proc_handler = proc_dointvec_minmax_bpf_restricted, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, # endif { .procname = "bpf_jit_limit", .data = &bpf_jit_limit, .maxlen = sizeof(long), .mode = 0600, .proc_handler = proc_dolongvec_minmax_bpf_restricted, .extra1 = SYSCTL_LONG_ONE, .extra2 = &bpf_jit_limit_max, }, #endif { .procname = "netdev_tstamp_prequeue", .data = &net_hotdata.tstamp_prequeue, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec }, { .procname = "message_cost", .data = &net_ratelimit_state.interval, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, { .procname = "message_burst", .data = &net_ratelimit_state.burst, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec, }, #ifdef CONFIG_RPS { .procname = "rps_sock_flow_entries", .maxlen = sizeof(int), .mode = 0644, .proc_handler = 
rps_sock_flow_sysctl }, #endif #ifdef CONFIG_NET_FLOW_LIMIT { .procname = "flow_limit_cpu_bitmap", .mode = 0644, .proc_handler = flow_limit_cpu_sysctl }, { .procname = "flow_limit_table_len", .data = &netdev_flow_limit_table_len, .maxlen = sizeof(int), .mode = 0644, .proc_handler = flow_limit_table_len_sysctl }, #endif /* CONFIG_NET_FLOW_LIMIT */ #ifdef CONFIG_NET_RX_BUSY_POLL { .procname = "busy_poll", .data = &sysctl_net_busy_poll, .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, }, { .procname = "busy_read", .data = &sysctl_net_busy_read, .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, }, #endif #ifdef CONFIG_NET_SCHED { .procname = "default_qdisc", .mode = 0644, .maxlen = IFNAMSIZ, .proc_handler = set_default_qdisc }, #endif { .procname = "netdev_budget", .data = &net_hotdata.netdev_budget, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec }, { .procname = "warnings", .data = &net_msg_warn, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec }, { .procname = "max_skb_frags", .data = &net_hotdata.sysctl_max_skb_frags, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ONE, .extra2 = &max_skb_frags, }, { .procname = "netdev_budget_usecs", .data = &net_hotdata.netdev_budget_usecs, .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = &netdev_budget_usecs_min, }, { .procname = "fb_tunnels_only_for_init_net", .data = &sysctl_fb_tunnels_only_for_init_net, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_TWO, }, { .procname = "devconf_inherit_init_net", .data = &sysctl_devconf_inherit_init_net, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_THREE, }, { .procname = "high_order_alloc_disable", .data = 
&net_high_order_alloc_disable_key.key, .maxlen = sizeof(net_high_order_alloc_disable_key), .mode = 0644, .proc_handler = proc_do_static_key, }, { .procname = "gro_normal_batch", .data = &net_hotdata.gro_normal_batch, .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ONE, }, { .procname = "netdev_unregister_timeout_secs", .data = &netdev_unregister_timeout_secs, .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ONE, .extra2 = &int_3600, }, { .procname = "skb_defer_max", .data = &net_hotdata.sysctl_skb_defer_max, .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, }, }; static struct ctl_table netns_core_table[] = { #if IS_ENABLED(CONFIG_RPS) { .procname = "rps_default_mask", .data = &init_net, .mode = 0644, .proc_handler = rps_default_mask_sysctl }, #endif { .procname = "somaxconn", .data = &init_net.core.sysctl_somaxconn, .maxlen = sizeof(int), .mode = 0644, .extra1 = SYSCTL_ZERO, .proc_handler = proc_dointvec_minmax }, { .procname = "optmem_max", .data = &init_net.core.sysctl_optmem_max, .maxlen = sizeof(int), .mode = 0644, .extra1 = SYSCTL_ZERO, .proc_handler = proc_dointvec_minmax }, { .procname = "txrehash", .data = &init_net.core.sysctl_txrehash, .maxlen = sizeof(u8), .mode = 0644, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, .proc_handler = proc_dou8vec_minmax, }, { .procname = "tstamp_allow_data", .data = &init_net.core.sysctl_tstamp_allow_data, .maxlen = sizeof(u8), .mode = 0644, .proc_handler = proc_dou8vec_minmax, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE }, /* sysctl_core_net_init() will set the values after this * to readonly in network namespaces */ { .procname = "wmem_max", .data = &sysctl_wmem_max, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = &min_sndbuf, }, { .procname = "rmem_max", .data = &sysctl_rmem_max, .maxlen = sizeof(int), .mode = 0644, 
.proc_handler = proc_dointvec_minmax, .extra1 = &min_rcvbuf, }, { .procname = "wmem_default", .data = &sysctl_wmem_default, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = &min_sndbuf, }, { .procname = "rmem_default", .data = &sysctl_rmem_default, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = &min_rcvbuf, }, }; static int __init fb_tunnels_only_for_init_net_sysctl_setup(char *str) { /* fallback tunnels for initns only */ if (!strncmp(str, "initns", 6)) sysctl_fb_tunnels_only_for_init_net = 1; /* no fallback tunnels anywhere */ else if (!strncmp(str, "none", 4)) sysctl_fb_tunnels_only_for_init_net = 2; return 1; } __setup("fb_tunnels=", fb_tunnels_only_for_init_net_sysctl_setup); static __net_init int sysctl_core_net_init(struct net *net) { size_t table_size = ARRAY_SIZE(netns_core_table); struct ctl_table *tbl; tbl = netns_core_table; if (!net_eq(net, &init_net)) { int i; tbl = kmemdup(tbl, sizeof(netns_core_table), GFP_KERNEL); if (tbl == NULL) goto err_dup; for (i = 0; i < table_size; ++i) { if (tbl[i].data == &sysctl_wmem_max) break; tbl[i].data += (char *)net - (char *)&init_net; } for (; i < table_size; ++i) tbl[i].mode &= ~0222; } net->core.sysctl_hdr = register_net_sysctl_sz(net, "net/core", tbl, table_size); if (net->core.sysctl_hdr == NULL) goto err_reg; return 0; err_reg: if (tbl != netns_core_table) kfree(tbl); err_dup: return -ENOMEM; } static __net_exit void sysctl_core_net_exit(struct net *net) { const struct ctl_table *tbl; tbl = net->core.sysctl_hdr->ctl_table_arg; unregister_net_sysctl_table(net->core.sysctl_hdr); BUG_ON(tbl == netns_core_table); #if IS_ENABLED(CONFIG_RPS) kfree(net->core.rps_default_mask); #endif kfree(tbl); } static __net_initdata struct pernet_operations sysctl_core_ops = { .init = sysctl_core_net_init, .exit = sysctl_core_net_exit, }; static __init int sysctl_core_init(void) { register_net_sysctl(&init_net, "net/core", net_core_table); return 
register_pernet_subsys(&sysctl_core_ops); } fs_initcall(sysctl_core_init); |
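The size handling in `rps_sock_flow_sysctl()` above (reject values above `1<<29`, round the rest up to a power of two, store `size - 1` as the mask) can be sketched in user space as follows. This is an illustration under assumptions, not the kernel code: `FLOW_TABLE_MAX_ENTRIES`, `roundup_pow2_u32()` and `normalize_flow_entries()` are hypothetical names, and the loop-based helper merely stands in for the kernel's `roundup_pow_of_two()`.

```c
#include <stdint.h>

/* Same upper bound as the "size > 1<<29" check in the listing above. */
#define FLOW_TABLE_MAX_ENTRIES (1u << 29)

/* Stand-in for roundup_pow_of_two(); valid for 1 <= n <= 1u << 31. */
static uint32_t roundup_pow2_u32(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Normalize a requested entry count.
 * Returns 0 on success and -1 if the request exceeds the limit.
 * A request of 0 means "no table": entries and mask are both set to 0.
 */
static int normalize_flow_entries(uint32_t requested,
				  uint32_t *entries, uint32_t *mask)
{
	if (requested > FLOW_TABLE_MAX_ENTRIES)
		return -1;

	if (requested == 0) {
		*entries = 0;
		*mask = 0;
		return 0;
	}

	*entries = roundup_pow2_u32(requested);
	*mask = *entries - 1;	/* a flow hash is reduced with (hash & mask) */
	return 0;
}
```

Keeping the table size a power of two is what allows a lookup to reduce a hash with a single AND against the mask instead of a modulo.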
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _NF_QUEUE_H
#define _NF_QUEUE_H

#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/jhash.h>
#include <linux/netfilter.h>
#include <linux/skbuff.h>

/* Each queued (to userspace) skbuff has one of these. */
struct nf_queue_entry {
	struct list_head	list;
	struct sk_buff		*skb;
	unsigned int		id;
	unsigned int		hook_index;	/* index in hook_entries->hook[] */
#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
	struct net_device	*physin;
	struct net_device	*physout;
#endif
	struct nf_hook_state	state;
	u16			size; /* sizeof(entry) + saved route keys */

	/* extra space to store route keys */
};

#define nf_queue_entry_reroute(x) ((void *)x + sizeof(struct nf_queue_entry))

/* Packet queuing */
struct nf_queue_handler {
	int		(*outfn)(struct nf_queue_entry *entry,
				 unsigned int queuenum);
	void		(*nf_hook_drop)(struct net *net);
};

void nf_register_queue_handler(const struct nf_queue_handler *qh);
void nf_unregister_queue_handler(void);

bool nf_queue_entry_get_refs(struct nf_queue_entry *entry);
void nf_queue_entry_free(struct nf_queue_entry *entry);

static inline void init_hashrandom(u32 *jhash_initval)
{
	while (*jhash_initval == 0)
		*jhash_initval = get_random_u32();
}

static inline u32 hash_v4(const struct iphdr *iph, u32 initval)
{
	/* packets in either direction go into same queue */
	if ((__force u32)iph->saddr < (__force u32)iph->daddr)
		return jhash_3words((__force u32)iph->saddr,
			(__force u32)iph->daddr, iph->protocol, initval);

	return jhash_3words((__force u32)iph->daddr,
			(__force u32)iph->saddr, iph->protocol, initval);
}
static inline u32 hash_v6(const struct ipv6hdr *ip6h, u32 initval) { u32 a, b, c; if ((__force u32)ip6h->saddr.s6_addr32[3] < (__force u32)ip6h->daddr.s6_addr32[3]) { a = (__force u32) ip6h->saddr.s6_addr32[3]; b = (__force u32) ip6h->daddr.s6_addr32[3]; } else { b = (__force u32) ip6h->saddr.s6_addr32[3]; a = (__force u32) ip6h->daddr.s6_addr32[3]; } if ((__force u32)ip6h->saddr.s6_addr32[1] < (__force u32)ip6h->daddr.s6_addr32[1]) c = (__force u32) ip6h->saddr.s6_addr32[1]; else c = (__force u32) ip6h->daddr.s6_addr32[1]; return jhash_3words(a, b, c, initval); } static inline u32 hash_bridge(const struct sk_buff *skb, u32 initval) { struct ipv6hdr *ip6h, _ip6h; struct iphdr *iph, _iph; switch (eth_hdr(skb)->h_proto) { case htons(ETH_P_IP): iph = skb_header_pointer(skb, skb_network_offset(skb), sizeof(*iph), &_iph); if (iph) return hash_v4(iph, initval); break; case htons(ETH_P_IPV6): ip6h = skb_header_pointer(skb, skb_network_offset(skb), sizeof(*ip6h), &_ip6h); if (ip6h) return hash_v6(ip6h, initval); break; } return 0; } static inline u32 nfqueue_hash(const struct sk_buff *skb, u16 queue, u16 queues_total, u8 family, u32 initval) { switch (family) { case NFPROTO_IPV4: queue += reciprocal_scale(hash_v4(ip_hdr(skb), initval), queues_total); break; case NFPROTO_IPV6: queue += reciprocal_scale(hash_v6(ipv6_hdr(skb), initval), queues_total); break; case NFPROTO_BRIDGE: queue += reciprocal_scale(hash_bridge(skb, initval), queues_total); break; } return queue; } int nf_queue(struct sk_buff *skb, struct nf_hook_state *state, unsigned int index, unsigned int verdict); #endif /* _NF_QUEUE_H */ |
// SPDX-License-Identifier: GPL-2.0-or-later #include <crypto/hash.h> #include <linux/cpu.h> #include <linux/kref.h> #include <linux/module.h> #include <linux/mutex.h> #include <linux/percpu.h> #include <linux/workqueue.h> #include <net/tcp.h> static size_t __scratch_size; struct sigpool_scratch { local_lock_t bh_lock; void __rcu *pad; }; static DEFINE_PER_CPU(struct sigpool_scratch, sigpool_scratch) = { .bh_lock = INIT_LOCAL_LOCK(bh_lock), }; struct sigpool_entry { struct crypto_ahash *hash; const char
*alg; struct kref kref; uint16_t needs_key:1, reserved:15; }; #define CPOOL_SIZE (PAGE_SIZE / sizeof(struct sigpool_entry)) static struct sigpool_entry cpool[CPOOL_SIZE]; static unsigned int cpool_populated; static DEFINE_MUTEX(cpool_mutex); /* Slow-path */ struct scratches_to_free { struct rcu_head rcu; unsigned int cnt; void *scratches[]; }; static void free_old_scratches(struct rcu_head *head) { struct scratches_to_free *stf; stf = container_of(head, struct scratches_to_free, rcu); while (stf->cnt--) kfree(stf->scratches[stf->cnt]); kfree(stf); } /** * sigpool_reserve_scratch - re-allocates scratch buffer, slow-path * @size: request size for the scratch/temp buffer */ static int sigpool_reserve_scratch(size_t size) { struct scratches_to_free *stf; size_t stf_sz = struct_size(stf, scratches, num_possible_cpus()); int cpu, err = 0; lockdep_assert_held(&cpool_mutex); if (__scratch_size >= size) return 0; stf = kmalloc(stf_sz, GFP_KERNEL); if (!stf) return -ENOMEM; stf->cnt = 0; size = max(size, __scratch_size); cpus_read_lock(); for_each_possible_cpu(cpu) { void *scratch, *old_scratch; scratch = kmalloc_node(size, GFP_KERNEL, cpu_to_node(cpu)); if (!scratch) { err = -ENOMEM; break; } old_scratch = rcu_replace_pointer(per_cpu(sigpool_scratch.pad, cpu), scratch, lockdep_is_held(&cpool_mutex)); if (!cpu_online(cpu) || !old_scratch) { kfree(old_scratch); continue; } stf->scratches[stf->cnt++] = old_scratch; } cpus_read_unlock(); if (!err) __scratch_size = size; call_rcu(&stf->rcu, free_old_scratches); return err; } static void sigpool_scratch_free(void) { int cpu; for_each_possible_cpu(cpu) kfree(rcu_replace_pointer(per_cpu(sigpool_scratch.pad, cpu), NULL, lockdep_is_held(&cpool_mutex))); __scratch_size = 0; } static int __cpool_try_clone(struct crypto_ahash *hash) { struct crypto_ahash *tmp; tmp = crypto_clone_ahash(hash); if (IS_ERR(tmp)) return PTR_ERR(tmp); crypto_free_ahash(tmp); return 0; } static int __cpool_alloc_ahash(struct sigpool_entry *e, const char *alg) 
{ struct crypto_ahash *cpu0_hash; int ret; e->alg = kstrdup(alg, GFP_KERNEL); if (!e->alg) return -ENOMEM; cpu0_hash = crypto_alloc_ahash(alg, 0, CRYPTO_ALG_ASYNC); if (IS_ERR(cpu0_hash)) { ret = PTR_ERR(cpu0_hash); goto out_free_alg; } e->needs_key = crypto_ahash_get_flags(cpu0_hash) & CRYPTO_TFM_NEED_KEY; ret = __cpool_try_clone(cpu0_hash); if (ret) goto out_free_cpu0_hash; e->hash = cpu0_hash; kref_init(&e->kref); return 0; out_free_cpu0_hash: crypto_free_ahash(cpu0_hash); out_free_alg: kfree(e->alg); e->alg = NULL; return ret; } /** * tcp_sigpool_alloc_ahash - allocates pool for ahash requests * @alg: name of async hash algorithm * @scratch_size: reserve a tcp_sigpool::scratch buffer of this size */ int tcp_sigpool_alloc_ahash(const char *alg, size_t scratch_size) { int i, ret; /* slow-path */ mutex_lock(&cpool_mutex); ret = sigpool_reserve_scratch(scratch_size); if (ret) goto out; for (i = 0; i < cpool_populated; i++) { if (!cpool[i].alg) continue; if (strcmp(cpool[i].alg, alg)) continue; /* pairs with tcp_sigpool_release() */ if (!kref_get_unless_zero(&cpool[i].kref)) kref_init(&cpool[i].kref); ret = i; goto out; } for (i = 0; i < cpool_populated; i++) { if (!cpool[i].alg) break; } if (i >= CPOOL_SIZE) { ret = -ENOSPC; goto out; } ret = __cpool_alloc_ahash(&cpool[i], alg); if (!ret) { ret = i; if (i == cpool_populated) cpool_populated++; } out: mutex_unlock(&cpool_mutex); return ret; } EXPORT_SYMBOL_GPL(tcp_sigpool_alloc_ahash); static void __cpool_free_entry(struct sigpool_entry *e) { crypto_free_ahash(e->hash); kfree(e->alg); memset(e, 0, sizeof(*e)); } static void cpool_cleanup_work_cb(struct work_struct *work) { bool free_scratch = true; unsigned int i; mutex_lock(&cpool_mutex); for (i = 0; i < cpool_populated; i++) { if (kref_read(&cpool[i].kref) > 0) { free_scratch = false; continue; } if (!cpool[i].alg) continue; __cpool_free_entry(&cpool[i]); } if (free_scratch) sigpool_scratch_free(); mutex_unlock(&cpool_mutex); } static 
DECLARE_WORK(cpool_cleanup_work, cpool_cleanup_work_cb); static void cpool_schedule_cleanup(struct kref *kref) { schedule_work(&cpool_cleanup_work); } /** * tcp_sigpool_release - decreases number of users for a pool. If it was * the last user of the pool, releases any memory that was consumed. * @id: tcp_sigpool that was previously allocated by tcp_sigpool_alloc_ahash() */ void tcp_sigpool_release(unsigned int id) { if (WARN_ON_ONCE(id >= cpool_populated || !cpool[id].alg)) return; /* slow-path */ kref_put(&cpool[id].kref, cpool_schedule_cleanup); } EXPORT_SYMBOL_GPL(tcp_sigpool_release); /** * tcp_sigpool_get - increases number of users (refcounter) for a pool * @id: tcp_sigpool that was previously allocated by tcp_sigpool_alloc_ahash() */ void tcp_sigpool_get(unsigned int id) { if (WARN_ON_ONCE(id >= cpool_populated || !cpool[id].alg)) return; kref_get(&cpool[id].kref); } EXPORT_SYMBOL_GPL(tcp_sigpool_get); int tcp_sigpool_start(unsigned int id, struct tcp_sigpool *c) __cond_acquires(RCU_BH) { struct crypto_ahash *hash; rcu_read_lock_bh(); if (WARN_ON_ONCE(id >= cpool_populated || !cpool[id].alg)) { rcu_read_unlock_bh(); return -EINVAL; } hash = crypto_clone_ahash(cpool[id].hash); if (IS_ERR(hash)) { rcu_read_unlock_bh(); return PTR_ERR(hash); } c->req = ahash_request_alloc(hash, GFP_ATOMIC); if (!c->req) { crypto_free_ahash(hash); rcu_read_unlock_bh(); return -ENOMEM; } ahash_request_set_callback(c->req, 0, NULL, NULL); /* Pairs with tcp_sigpool_reserve_scratch(), scratch area is * valid (allocated) until tcp_sigpool_end(). 
*/ local_lock_nested_bh(&sigpool_scratch.bh_lock); c->scratch = rcu_dereference_bh(*this_cpu_ptr(&sigpool_scratch.pad)); return 0; } EXPORT_SYMBOL_GPL(tcp_sigpool_start); void tcp_sigpool_end(struct tcp_sigpool *c) __releases(RCU_BH) { struct crypto_ahash *hash = crypto_ahash_reqtfm(c->req); local_unlock_nested_bh(&sigpool_scratch.bh_lock); rcu_read_unlock_bh(); ahash_request_free(c->req); crypto_free_ahash(hash); } EXPORT_SYMBOL_GPL(tcp_sigpool_end); /** * tcp_sigpool_algo - return algorithm of tcp_sigpool * @id: tcp_sigpool that was previously allocated by tcp_sigpool_alloc_ahash() * @buf: buffer to return name of algorithm * @buf_len: size of @buf */ size_t tcp_sigpool_algo(unsigned int id, char *buf, size_t buf_len) { if (WARN_ON_ONCE(id >= cpool_populated || !cpool[id].alg)) return -EINVAL; return strscpy(buf, cpool[id].alg, buf_len); } EXPORT_SYMBOL_GPL(tcp_sigpool_algo); /** * tcp_sigpool_hash_skb_data - hash data in skb with initialized tcp_sigpool * @hp: tcp_sigpool pointer * @skb: buffer to add sign for * @header_len: TCP header length for this segment */ int tcp_sigpool_hash_skb_data(struct tcp_sigpool *hp, const struct sk_buff *skb, unsigned int header_len) { const unsigned int head_data_len = skb_headlen(skb) > header_len ? 
skb_headlen(skb) - header_len : 0; const struct skb_shared_info *shi = skb_shinfo(skb); const struct tcphdr *tp = tcp_hdr(skb); struct ahash_request *req = hp->req; struct sk_buff *frag_iter; struct scatterlist sg; unsigned int i; sg_init_table(&sg, 1); sg_set_buf(&sg, ((u8 *)tp) + header_len, head_data_len); ahash_request_set_crypt(req, &sg, NULL, head_data_len); if (crypto_ahash_update(req)) return 1; for (i = 0; i < shi->nr_frags; ++i) { const skb_frag_t *f = &shi->frags[i]; unsigned int offset = skb_frag_off(f); struct page *page; page = skb_frag_page(f) + (offset >> PAGE_SHIFT); sg_set_page(&sg, page, skb_frag_size(f), offset_in_page(offset)); ahash_request_set_crypt(req, &sg, NULL, skb_frag_size(f)); if (crypto_ahash_update(req)) return 1; } skb_walk_frags(skb, frag_iter) if (tcp_sigpool_hash_skb_data(hp, frag_iter, 0)) return 1; return 0; } EXPORT_SYMBOL(tcp_sigpool_hash_skb_data); MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("Per-CPU pool of crypto requests"); |
// SPDX-License-Identifier: GPL-2.0-or-later /* * net/sched/act_police.c Input police filter * * Authors: Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru> * J Hadi Salim (action changes) */ #include <linux/module.h> #include <linux/types.h> #include <linux/kernel.h> #include <linux/string.h> #include <linux/errno.h> #include <linux/skbuff.h> #include <linux/rtnetlink.h> #include <linux/init.h> #include <linux/slab.h> #include <net/act_api.h> #include <net/gso.h> #include <net/netlink.h> #include <net/pkt_cls.h> #include <net/tc_act/tc_police.h> #include <net/tc_wrapper.h> /* Each policer is serialized by its individual spinlock */ static struct tc_action_ops act_police_ops; static const struct nla_policy police_policy[TCA_POLICE_MAX + 1] = { [TCA_POLICE_RATE] = { .len = TC_RTAB_SIZE }, [TCA_POLICE_PEAKRATE] = { .len = TC_RTAB_SIZE }, [TCA_POLICE_AVRATE] = { .type = NLA_U32 }, [TCA_POLICE_RESULT] = { .type = NLA_U32 }, [TCA_POLICE_RATE64] = { .type = NLA_U64 }, [TCA_POLICE_PEAKRATE64] = { .type = NLA_U64 }, [TCA_POLICE_PKTRATE64] = { .type = NLA_U64, .min = 1 }, [TCA_POLICE_PKTBURST64] = { .type = NLA_U64, .min = 1 }, }; static int tcf_police_init(struct net *net, struct nlattr *nla, struct nlattr *est, struct tc_action **a, struct tcf_proto *tp, u32 flags, struct netlink_ext_ack *extack) { int ret = 0, tcfp_result = TC_ACT_OK, err, size; bool bind = flags & TCA_ACT_FLAGS_BIND; struct nlattr *tb[TCA_POLICE_MAX + 1]; struct tcf_chain *goto_ch = NULL; struct tc_police *parm; struct tcf_police *police; struct qdisc_rate_table *R_tab = NULL, *P_tab = NULL; struct tc_action_net *tn = net_generic(net, act_police_ops.net_id); struct tcf_police_params *new; bool
exists = false; u32 index; u64 rate64, prate64; u64 pps, ppsburst; if (nla == NULL) return -EINVAL; err = nla_parse_nested_deprecated(tb, TCA_POLICE_MAX, nla, police_policy, NULL); if (err < 0) return err; if (tb[TCA_POLICE_TBF] == NULL) return -EINVAL; size = nla_len(tb[TCA_POLICE_TBF]); if (size != sizeof(*parm) && size != sizeof(struct tc_police_compat)) return -EINVAL; parm = nla_data(tb[TCA_POLICE_TBF]); index = parm->index; err = tcf_idr_check_alloc(tn, &index, a, bind); if (err < 0) return err; exists = err; if (exists && bind) return ACT_P_BOUND; if (!exists) { ret = tcf_idr_create(tn, index, NULL, a, &act_police_ops, bind, true, flags); if (ret) { tcf_idr_cleanup(tn, index); return ret; } ret = ACT_P_CREATED; spin_lock_init(&(to_police(*a)->tcfp_lock)); } else if (!(flags & TCA_ACT_FLAGS_REPLACE)) { tcf_idr_release(*a, bind); return -EEXIST; } err = tcf_action_check_ctrlact(parm->action, tp, &goto_ch, extack); if (err < 0) goto release_idr; police = to_police(*a); if (parm->rate.rate) { err = -ENOMEM; R_tab = qdisc_get_rtab(&parm->rate, tb[TCA_POLICE_RATE], NULL); if (R_tab == NULL) goto failure; if (parm->peakrate.rate) { P_tab = qdisc_get_rtab(&parm->peakrate, tb[TCA_POLICE_PEAKRATE], NULL); if (P_tab == NULL) goto failure; } } if (est) { err = gen_replace_estimator(&police->tcf_bstats, police->common.cpu_bstats, &police->tcf_rate_est, &police->tcf_lock, false, est); if (err) goto failure; } else if (tb[TCA_POLICE_AVRATE] && (ret == ACT_P_CREATED || !gen_estimator_active(&police->tcf_rate_est))) { err = -EINVAL; goto failure; } if (tb[TCA_POLICE_RESULT]) { tcfp_result = nla_get_u32(tb[TCA_POLICE_RESULT]); if (TC_ACT_EXT_CMP(tcfp_result, TC_ACT_GOTO_CHAIN)) { NL_SET_ERR_MSG(extack, "goto chain not allowed on fallback"); err = -EINVAL; goto failure; } } if ((tb[TCA_POLICE_PKTRATE64] && !tb[TCA_POLICE_PKTBURST64]) || (!tb[TCA_POLICE_PKTRATE64] && tb[TCA_POLICE_PKTBURST64])) { NL_SET_ERR_MSG(extack, "Both or neither packet-per-second burst and rate must be 
provided"); err = -EINVAL; goto failure; } if (tb[TCA_POLICE_PKTRATE64] && R_tab) { NL_SET_ERR_MSG(extack, "packet-per-second and byte-per-second rate limits not allowed in same action"); err = -EINVAL; goto failure; } new = kzalloc(sizeof(*new), GFP_KERNEL); if (unlikely(!new)) { err = -ENOMEM; goto failure; } /* No failure allowed after this point */ new->tcfp_result = tcfp_result; new->tcfp_mtu = parm->mtu; if (!new->tcfp_mtu) { new->tcfp_mtu = ~0; if (R_tab) new->tcfp_mtu = 255 << R_tab->rate.cell_log; } if (R_tab) { new->rate_present = true; rate64 = nla_get_u64_default(tb[TCA_POLICE_RATE64], 0); psched_ratecfg_precompute(&new->rate, &R_tab->rate, rate64); qdisc_put_rtab(R_tab); } else { new->rate_present = false; } if (P_tab) { new->peak_present = true; prate64 = nla_get_u64_default(tb[TCA_POLICE_PEAKRATE64], 0); psched_ratecfg_precompute(&new->peak, &P_tab->rate, prate64); qdisc_put_rtab(P_tab); } else { new->peak_present = false; } new->tcfp_burst = PSCHED_TICKS2NS(parm->burst); if (new->peak_present) new->tcfp_mtu_ptoks = (s64)psched_l2t_ns(&new->peak, new->tcfp_mtu); if (tb[TCA_POLICE_AVRATE]) new->tcfp_ewma_rate = nla_get_u32(tb[TCA_POLICE_AVRATE]); if (tb[TCA_POLICE_PKTRATE64]) { pps = nla_get_u64(tb[TCA_POLICE_PKTRATE64]); ppsburst = nla_get_u64(tb[TCA_POLICE_PKTBURST64]); new->pps_present = true; new->tcfp_pkt_burst = PSCHED_TICKS2NS(ppsburst); psched_ppscfg_precompute(&new->ppsrate, pps); } spin_lock_bh(&police->tcf_lock); spin_lock_bh(&police->tcfp_lock); police->tcfp_t_c = ktime_get_ns(); police->tcfp_toks = new->tcfp_burst; if (new->peak_present) police->tcfp_ptoks = new->tcfp_mtu_ptoks; spin_unlock_bh(&police->tcfp_lock); goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch); new = rcu_replace_pointer(police->params, new, lockdep_is_held(&police->tcf_lock)); spin_unlock_bh(&police->tcf_lock); if (goto_ch) tcf_chain_put_by_act(goto_ch); if (new) kfree_rcu(new, rcu); return ret; failure: qdisc_put_rtab(P_tab); qdisc_put_rtab(R_tab); if 
(goto_ch) tcf_chain_put_by_act(goto_ch); release_idr: tcf_idr_release(*a, bind); return err; } static bool tcf_police_mtu_check(struct sk_buff *skb, u32 limit) { u32 len; if (skb_is_gso(skb)) return skb_gso_validate_mac_len(skb, limit); len = qdisc_pkt_len(skb); if (skb_at_tc_ingress(skb)) len += skb->mac_len; return len <= limit; } TC_INDIRECT_SCOPE int tcf_police_act(struct sk_buff *skb, const struct tc_action *a, struct tcf_result *res) { struct tcf_police *police = to_police(a); s64 now, toks, ppstoks = 0, ptoks = 0; struct tcf_police_params *p; int ret; tcf_lastuse_update(&police->tcf_tm); bstats_update(this_cpu_ptr(police->common.cpu_bstats), skb); ret = READ_ONCE(police->tcf_action); p = rcu_dereference_bh(police->params); if (p->tcfp_ewma_rate) { struct gnet_stats_rate_est64 sample; if (!gen_estimator_read(&police->tcf_rate_est, &sample) || sample.bps >= p->tcfp_ewma_rate) goto inc_overlimits; } if (tcf_police_mtu_check(skb, p->tcfp_mtu)) { if (!p->rate_present && !p->pps_present) { ret = p->tcfp_result; goto end; } now = ktime_get_ns(); spin_lock_bh(&police->tcfp_lock); toks = min_t(s64, now - police->tcfp_t_c, p->tcfp_burst); if (p->peak_present) { ptoks = toks + police->tcfp_ptoks; if (ptoks > p->tcfp_mtu_ptoks) ptoks = p->tcfp_mtu_ptoks; ptoks -= (s64)psched_l2t_ns(&p->peak, qdisc_pkt_len(skb)); } if (p->rate_present) { toks += police->tcfp_toks; if (toks > p->tcfp_burst) toks = p->tcfp_burst; toks -= (s64)psched_l2t_ns(&p->rate, qdisc_pkt_len(skb)); } else if (p->pps_present) { ppstoks = min_t(s64, now - police->tcfp_t_c, p->tcfp_pkt_burst); ppstoks += police->tcfp_pkttoks; if (ppstoks > p->tcfp_pkt_burst) ppstoks = p->tcfp_pkt_burst; ppstoks -= (s64)psched_pkt2t_ns(&p->ppsrate, 1); } if ((toks | ptoks | ppstoks) >= 0) { police->tcfp_t_c = now; police->tcfp_toks = toks; police->tcfp_ptoks = ptoks; police->tcfp_pkttoks = ppstoks; spin_unlock_bh(&police->tcfp_lock); ret = p->tcfp_result; goto inc_drops; } spin_unlock_bh(&police->tcfp_lock); } 
inc_overlimits: qstats_overlimit_inc(this_cpu_ptr(police->common.cpu_qstats)); inc_drops: if (ret == TC_ACT_SHOT) qstats_drop_inc(this_cpu_ptr(police->common.cpu_qstats)); end: return ret; } static void tcf_police_cleanup(struct tc_action *a) { struct tcf_police *police = to_police(a); struct tcf_police_params *p; p = rcu_dereference_protected(police->params, 1); if (p) kfree_rcu(p, rcu); } static void tcf_police_stats_update(struct tc_action *a, u64 bytes, u64 packets, u64 drops, u64 lastuse, bool hw) { struct tcf_police *police = to_police(a); struct tcf_t *tm = &police->tcf_tm; tcf_action_update_stats(a, bytes, packets, drops, hw); tm->lastuse = max_t(u64, tm->lastuse, lastuse); } static int tcf_police_dump(struct sk_buff *skb, struct tc_action *a, int bind, int ref) { unsigned char *b = skb_tail_pointer(skb); struct tcf_police *police = to_police(a); struct tcf_police_params *p; struct tc_police opt = { .index = police->tcf_index, .refcnt = refcount_read(&police->tcf_refcnt) - ref, .bindcnt = atomic_read(&police->tcf_bindcnt) - bind, }; struct tcf_t t; spin_lock_bh(&police->tcf_lock); opt.action = police->tcf_action; p = rcu_dereference_protected(police->params, lockdep_is_held(&police->tcf_lock)); opt.mtu = p->tcfp_mtu; opt.burst = PSCHED_NS2TICKS(p->tcfp_burst); if (p->rate_present) { psched_ratecfg_getrate(&opt.rate, &p->rate); if ((p->rate.rate_bytes_ps >= (1ULL << 32)) && nla_put_u64_64bit(skb, TCA_POLICE_RATE64, p->rate.rate_bytes_ps, TCA_POLICE_PAD)) goto nla_put_failure; } if (p->peak_present) { psched_ratecfg_getrate(&opt.peakrate, &p->peak); if ((p->peak.rate_bytes_ps >= (1ULL << 32)) && nla_put_u64_64bit(skb, TCA_POLICE_PEAKRATE64, p->peak.rate_bytes_ps, TCA_POLICE_PAD)) goto nla_put_failure; } if (p->pps_present) { if (nla_put_u64_64bit(skb, TCA_POLICE_PKTRATE64, p->ppsrate.rate_pkts_ps, TCA_POLICE_PAD)) goto nla_put_failure; if (nla_put_u64_64bit(skb, TCA_POLICE_PKTBURST64, PSCHED_NS2TICKS(p->tcfp_pkt_burst), TCA_POLICE_PAD)) goto nla_put_failure; 
} if (nla_put(skb, TCA_POLICE_TBF, sizeof(opt), &opt)) goto nla_put_failure; if (p->tcfp_result && nla_put_u32(skb, TCA_POLICE_RESULT, p->tcfp_result)) goto nla_put_failure; if (p->tcfp_ewma_rate && nla_put_u32(skb, TCA_POLICE_AVRATE, p->tcfp_ewma_rate)) goto nla_put_failure; tcf_tm_dump(&t, &police->tcf_tm); if (nla_put_64bit(skb, TCA_POLICE_TM, sizeof(t), &t, TCA_POLICE_PAD)) goto nla_put_failure; spin_unlock_bh(&police->tcf_lock); return skb->len; nla_put_failure: spin_unlock_bh(&police->tcf_lock); nlmsg_trim(skb, b); return -1; } static int tcf_police_act_to_flow_act(int tc_act, u32 *extval, struct netlink_ext_ack *extack) { int act_id = -EOPNOTSUPP; if (!TC_ACT_EXT_OPCODE(tc_act)) { if (tc_act == TC_ACT_OK) act_id = FLOW_ACTION_ACCEPT; else if (tc_act == TC_ACT_SHOT) act_id = FLOW_ACTION_DROP; else if (tc_act == TC_ACT_PIPE) act_id = FLOW_ACTION_PIPE; else if (tc_act == TC_ACT_RECLASSIFY) NL_SET_ERR_MSG_MOD(extack, "Offload not supported when conform/exceed action is \"reclassify\""); else NL_SET_ERR_MSG_MOD(extack, "Unsupported conform/exceed action offload"); } else if (TC_ACT_EXT_CMP(tc_act, TC_ACT_GOTO_CHAIN)) { act_id = FLOW_ACTION_GOTO; *extval = tc_act & TC_ACT_EXT_VAL_MASK; } else if (TC_ACT_EXT_CMP(tc_act, TC_ACT_JUMP)) { act_id = FLOW_ACTION_JUMP; *extval = tc_act & TC_ACT_EXT_VAL_MASK; } else if (tc_act == TC_ACT_UNSPEC) { act_id = FLOW_ACTION_CONTINUE; } else { NL_SET_ERR_MSG_MOD(extack, "Unsupported conform/exceed action offload"); } return act_id; } static int tcf_police_offload_act_setup(struct tc_action *act, void *entry_data, u32 *index_inc, bool bind, struct netlink_ext_ack *extack) { if (bind) { struct flow_action_entry *entry = entry_data; struct tcf_police *police = to_police(act); struct tcf_police_params *p; int act_id; p = rcu_dereference_protected(police->params, lockdep_is_held(&police->tcf_lock)); entry->id = FLOW_ACTION_POLICE; entry->police.burst = tcf_police_burst(act); entry->police.rate_bytes_ps = tcf_police_rate_bytes_ps(act); 
entry->police.peakrate_bytes_ps = tcf_police_peakrate_bytes_ps(act); entry->police.avrate = tcf_police_tcfp_ewma_rate(act); entry->police.overhead = tcf_police_rate_overhead(act); entry->police.burst_pkt = tcf_police_burst_pkt(act); entry->police.rate_pkt_ps = tcf_police_rate_pkt_ps(act); entry->police.mtu = tcf_police_tcfp_mtu(act); act_id = tcf_police_act_to_flow_act(police->tcf_action, &entry->police.exceed.extval, extack); if (act_id < 0) return act_id; entry->police.exceed.act_id = act_id; act_id = tcf_police_act_to_flow_act(p->tcfp_result, &entry->police.notexceed.extval, extack); if (act_id < 0) return act_id; entry->police.notexceed.act_id = act_id; *index_inc = 1; } else { struct flow_offload_action *fl_action = entry_data; fl_action->id = FLOW_ACTION_POLICE; } return 0; } MODULE_AUTHOR("Alexey Kuznetsov"); MODULE_DESCRIPTION("Policing actions"); MODULE_LICENSE("GPL"); static struct tc_action_ops act_police_ops = { .kind = "police", .id = TCA_ID_POLICE, .owner = THIS_MODULE, .stats_update = tcf_police_stats_update, .act = tcf_police_act, .dump = tcf_police_dump, .init = tcf_police_init, .cleanup = tcf_police_cleanup, .offload_act_setup = tcf_police_offload_act_setup, .size = sizeof(struct tcf_police), }; MODULE_ALIAS_NET_ACT("police"); static __net_init int police_init_net(struct net *net) { struct tc_action_net *tn = net_generic(net, act_police_ops.net_id); return tc_action_net_init(net, tn, &act_police_ops); } static void __net_exit police_exit_net(struct list_head *net_list) { tc_action_net_exit(net_list, act_police_ops.net_id); } static struct pernet_operations police_net_ops = { .init = police_init_net, .exit_batch = police_exit_net, .id = &act_police_ops.net_id, .size = sizeof(struct tc_action_net), }; static int __init police_init_module(void) { return tcf_register_action(&act_police_ops, &police_net_ops); } static void __exit police_cleanup_module(void) { tcf_unregister_action(&act_police_ops, &police_net_ops); } module_init(police_init_module); 
module_exit(police_cleanup_module); |
/* SPDX-License-Identifier: GPL-2.0 */ #ifndef __KVM_X86_PAGE_TRACK_H #define __KVM_X86_PAGE_TRACK_H #include <linux/kvm_host.h> #include <asm/kvm_page_track.h> bool kvm_page_track_write_tracking_enabled(struct kvm *kvm); int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot); void kvm_page_track_free_memslot(struct kvm_memory_slot *slot); int kvm_page_track_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, unsigned long npages); void __kvm_write_track_add_gfn(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn); void __kvm_write_track_remove_gfn(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn); bool kvm_gfn_is_write_tracked(struct kvm *kvm, const struct kvm_memory_slot *slot, gfn_t gfn); #ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING int kvm_page_track_init(struct kvm *kvm); void kvm_page_track_cleanup(struct kvm *kvm); void __kvm_page_track_write(struct kvm *kvm, gpa_t gpa, const u8 *new, int bytes); void kvm_page_track_delete_slot(struct kvm *kvm, struct kvm_memory_slot *slot); static inline bool kvm_page_track_has_external_user(struct kvm *kvm) { return !hlist_empty(&kvm->arch.track_notifier_head.track_notifier_list); } #else static inline int kvm_page_track_init(struct kvm *kvm) { return 0; } static inline void kvm_page_track_cleanup(struct kvm *kvm) { } static inline void __kvm_page_track_write(struct kvm *kvm, gpa_t gpa, const u8 *new, int bytes) { } static inline void kvm_page_track_delete_slot(struct kvm *kvm, struct kvm_memory_slot *slot) { } static inline bool kvm_page_track_has_external_user(struct kvm *kvm) { return false; } #endif /* CONFIG_KVM_EXTERNAL_WRITE_TRACKING */ static inline void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes) { __kvm_page_track_write(vcpu->kvm, gpa, new, bytes); kvm_mmu_track_write(vcpu,
gpa, new, bytes); } #endif /* __KVM_X86_PAGE_TRACK_H */ |
// SPDX-License-Identifier: GPL-2.0 /* * fs/sysfs/group.c - Operations for adding/removing multiple files at once. * * Copyright (c) 2003 Patrick Mochel * Copyright (c) 2003 Open Source Development Lab * Copyright (c) 2013 Greg Kroah-Hartman * Copyright (c) 2013 The Linux Foundation */ #include <linux/kobject.h> #include <linux/module.h> #include <linux/dcache.h> #include <linux/namei.h> #include <linux/err.h> #include <linux/fs.h> #include "sysfs.h" static void remove_files(struct kernfs_node *parent, const struct attribute_group *grp) { struct attribute *const *attr; struct bin_attribute *const *bin_attr; if (grp->attrs) for (attr = grp->attrs; *attr; attr++) kernfs_remove_by_name(parent, (*attr)->name); if (grp->bin_attrs) for (bin_attr = grp->bin_attrs; *bin_attr; bin_attr++) kernfs_remove_by_name(parent, (*bin_attr)->attr.name); } static umode_t __first_visible(const struct attribute_group *grp, struct kobject *kobj) { if (grp->attrs && grp->attrs[0] && grp->is_visible) return grp->is_visible(kobj, grp->attrs[0], 0); if (grp->bin_attrs && grp->bin_attrs[0] && grp->is_bin_visible) return grp->is_bin_visible(kobj, grp->bin_attrs[0], 0); return 0; } static int create_files(struct kernfs_node *parent, struct kobject *kobj, kuid_t uid, kgid_t gid, const struct attribute_group *grp, int update) { struct attribute *const *attr; struct bin_attribute
		*const *bin_attr;
	int error = 0, i;

	if (grp->attrs) {
		for (i = 0, attr = grp->attrs; *attr && !error; i++, attr++) {
			umode_t mode = (*attr)->mode;

			/*
			 * In update mode, we're changing the permissions or
			 * visibility.  Do this by first removing then
			 * re-adding (if required) the file.
			 */
			if (update)
				kernfs_remove_by_name(parent, (*attr)->name);
			if (grp->is_visible) {
				mode = grp->is_visible(kobj, *attr, i);
				mode &= ~SYSFS_GROUP_INVISIBLE;
				if (!mode)
					continue;
			}

			WARN(mode & ~(SYSFS_PREALLOC | 0664),
			     "Attribute %s: Invalid permissions 0%o\n",
			     (*attr)->name, mode);

			mode &= SYSFS_PREALLOC | 0664;
			error = sysfs_add_file_mode_ns(parent, *attr, mode, uid,
						       gid, NULL);
			if (unlikely(error))
				break;
		}
		if (error) {
			remove_files(parent, grp);
			goto exit;
		}
	}

	if (grp->bin_attrs) {
		for (i = 0, bin_attr = grp->bin_attrs; *bin_attr; i++, bin_attr++) {
			umode_t mode = (*bin_attr)->attr.mode;
			size_t size = (*bin_attr)->size;

			if (update)
				kernfs_remove_by_name(parent,
						(*bin_attr)->attr.name);
			if (grp->is_bin_visible) {
				mode = grp->is_bin_visible(kobj, *bin_attr, i);
				mode &= ~SYSFS_GROUP_INVISIBLE;
				if (!mode)
					continue;
			}
			if (grp->bin_size)
				size = grp->bin_size(kobj, *bin_attr, i);

			WARN(mode & ~(SYSFS_PREALLOC | 0664),
			     "Attribute %s: Invalid permissions 0%o\n",
			     (*bin_attr)->attr.name, mode);

			mode &= SYSFS_PREALLOC | 0664;
			error = sysfs_add_bin_file_mode_ns(parent, *bin_attr,
							   mode, size, uid, gid,
							   NULL);
			if (error)
				break;
		}
		if (error)
			remove_files(parent, grp);
	}
exit:
	return error;
}


static int internal_create_group(struct kobject *kobj, int update,
				 const struct attribute_group *grp)
{
	struct kernfs_node *kn;
	kuid_t uid;
	kgid_t gid;
	int error;

	if (WARN_ON(!kobj || (!update && !kobj->sd)))
		return -EINVAL;

	/* Updates may happen before the object has been instantiated */
	if (unlikely(update && !kobj->sd))
		return -EINVAL;

	if (!grp->attrs && !grp->bin_attrs) {
		pr_debug("sysfs: (bin_)attrs not set by subsystem for group: %s/%s, skipping\n",
			 kobj->name, grp->name ?: "");
		return 0;
	}

	kobject_get_ownership(kobj, &uid, &gid);
	if (grp->name) {
		umode_t mode = __first_visible(grp, kobj);

		if (mode & SYSFS_GROUP_INVISIBLE)
			mode = 0;
		else
			mode = S_IRWXU | S_IRUGO | S_IXUGO;

		if (update) {
			kn = kernfs_find_and_get(kobj->sd, grp->name);
			if (!kn) {
				pr_debug("attr grp %s/%s not created yet\n",
					 kobj->name, grp->name);
				/* may have been invisible prior to this update */
				update = 0;
			} else if (!mode) {
				sysfs_remove_group(kobj, grp);
				kernfs_put(kn);
				return 0;
			}
		}

		if (!update) {
			if (!mode)
				return 0;
			kn = kernfs_create_dir_ns(kobj->sd, grp->name, mode,
						  uid, gid, kobj, NULL);
			if (IS_ERR(kn)) {
				if (PTR_ERR(kn) == -EEXIST)
					sysfs_warn_dup(kobj->sd, grp->name);
				return PTR_ERR(kn);
			}
		}
	} else {
		kn = kobj->sd;
	}

	kernfs_get(kn);
	error = create_files(kn, kobj, uid, gid, grp, update);
	if (error) {
		if (grp->name)
			kernfs_remove(kn);
	}
	kernfs_put(kn);

	if (grp->name && update)
		kernfs_put(kn);

	return error;
}

/**
 * sysfs_create_group - given a directory kobject, create an attribute group
 * @kobj:	The kobject to create the group on
 * @grp:	The attribute group to create
 *
 * This function creates a group for the first time.  It will explicitly
 * warn and error if any of the attribute files being created already exist.
 *
 * Returns 0 on success or error code on failure.
 */
int sysfs_create_group(struct kobject *kobj,
		       const struct attribute_group *grp)
{
	return internal_create_group(kobj, 0, grp);
}
EXPORT_SYMBOL_GPL(sysfs_create_group);

static int internal_create_groups(struct kobject *kobj, int update,
				  const struct attribute_group **groups)
{
	int error = 0;
	int i;

	if (!groups)
		return 0;

	for (i = 0; groups[i]; i++) {
		error = internal_create_group(kobj, update, groups[i]);
		if (error) {
			/* Unwind: remove the groups created so far, newest first. */
			while (--i >= 0)
				sysfs_remove_group(kobj, groups[i]);
			break;
		}
	}
	return error;
}

/**
 * sysfs_create_groups - given a directory kobject, create a bunch of attribute groups
 * @kobj:	The kobject to create the group on
 * @groups:	The attribute groups to create, NULL terminated
 *
 * This function creates a bunch of attribute groups.  If an error occurs when
 * creating a group, all previously created groups will be removed, unwinding
 * everything back to the original state when this function was called.
 * It will explicitly warn and error if any of the attribute files being
 * created already exist.
 *
 * Returns 0 on success or error code from sysfs_create_group on failure.
 */
int sysfs_create_groups(struct kobject *kobj,
			const struct attribute_group **groups)
{
	return internal_create_groups(kobj, 0, groups);
}
EXPORT_SYMBOL_GPL(sysfs_create_groups);

/**
 * sysfs_update_groups - given a directory kobject, update a bunch of attribute groups
 * @kobj:	The kobject to update the group on
 * @groups:	The attribute groups to update, NULL terminated
 *
 * This function updates a bunch of attribute groups.  If an error occurs when
 * updating a group, all previously updated groups will be removed together
 * with already existing (not updated) attributes.
 *
 * Returns 0 on success or error code from sysfs_update_group on failure.
 */
int sysfs_update_groups(struct kobject *kobj,
			const struct attribute_group **groups)
{
	return internal_create_groups(kobj, 1, groups);
}
EXPORT_SYMBOL_GPL(sysfs_update_groups);

/**
 * sysfs_update_group - given a directory kobject, update an attribute group
 * @kobj:	The kobject to update the group on
 * @grp:	The attribute group to update
 *
 * This function updates an attribute group.  Unlike
 * sysfs_create_group(), it will explicitly not warn or error if any
 * of the attribute files being created already exist.  Furthermore,
 * if the visibility of the files has changed through the is_visible()
 * callback, it will update the permissions and add or remove the
 * relevant files. Changing a group's name (subdirectory name under
 * kobj's directory in sysfs) is not allowed.
 *
 * The primary use for this function is to call it after making a change
 * that affects group visibility.
 *
 * Returns 0 on success or error code on failure.
 */
int sysfs_update_group(struct kobject *kobj,
		       const struct attribute_group *grp)
{
	return internal_create_group(kobj, 1, grp);
}
EXPORT_SYMBOL_GPL(sysfs_update_group);

/**
 * sysfs_remove_group: remove a group from a kobject
 * @kobj:	kobject to remove the group from
 * @grp:	group to remove
 *
 * This function removes a group of attributes from a kobject.  The attributes
 * previously have to have been created for this group, otherwise it will fail.
 */
void sysfs_remove_group(struct kobject *kobj,
			const struct attribute_group *grp)
{
	struct kernfs_node *parent = kobj->sd;
	struct kernfs_node *kn;

	if (grp->name) {
		kn = kernfs_find_and_get(parent, grp->name);
		if (!kn) {
			pr_debug("sysfs group '%s' not found for kobject '%s'\n",
				 grp->name, kobject_name(kobj));
			return;
		}
	} else {
		kn = parent;
		kernfs_get(kn);
	}

	remove_files(kn, grp);
	if (grp->name)
		kernfs_remove(kn);

	kernfs_put(kn);
}
EXPORT_SYMBOL_GPL(sysfs_remove_group);

/**
 * sysfs_remove_groups - remove a list of groups
 *
 * @kobj:	The kobject for the groups to be removed from
 * @groups:	NULL terminated list of groups to be removed
 *
 * If groups is not NULL, remove the specified groups from the kobject.
 */
void sysfs_remove_groups(struct kobject *kobj,
			 const struct attribute_group **groups)
{
	int i;

	if (!groups)
		return;
	for (i = 0; groups[i]; i++)
		sysfs_remove_group(kobj, groups[i]);
}
EXPORT_SYMBOL_GPL(sysfs_remove_groups);

/**
 * sysfs_merge_group - merge files into a pre-existing named attribute group.
 * @kobj:	The kobject containing the group.
 * @grp:	The files to create and the attribute group they belong to.
 *
 * This function returns an error if the group doesn't exist, the .name field is
 * NULL or any of the files already exist in that group, in which case none of
 * the new files are created.
 */
int sysfs_merge_group(struct kobject *kobj,
		      const struct attribute_group *grp)
{
	struct kernfs_node *parent;
	kuid_t uid;
	kgid_t gid;
	int error = 0;
	struct attribute *const *attr;
	int i;

	parent = kernfs_find_and_get(kobj->sd, grp->name);
	if (!parent)
		return -ENOENT;

	kobject_get_ownership(kobj, &uid, &gid);

	for ((i = 0, attr = grp->attrs); *attr && !error; (++i, ++attr))
		error = sysfs_add_file_mode_ns(parent, *attr, (*attr)->mode,
					       uid, gid, NULL);
	if (error) {
		while (--i >= 0)
			kernfs_remove_by_name(parent, (*--attr)->name);
	}
	kernfs_put(parent);

	return error;
}
EXPORT_SYMBOL_GPL(sysfs_merge_group);

/**
 * sysfs_unmerge_group - remove files from a pre-existing named attribute group.
 * @kobj:	The kobject containing the group.
 * @grp:	The files to remove and the attribute group they belong to.
 */
void sysfs_unmerge_group(struct kobject *kobj,
			 const struct attribute_group *grp)
{
	struct kernfs_node *parent;
	struct attribute *const *attr;

	parent = kernfs_find_and_get(kobj->sd, grp->name);
	if (parent) {
		for (attr = grp->attrs; *attr; ++attr)
			kernfs_remove_by_name(parent, (*attr)->name);
		kernfs_put(parent);
	}
}
EXPORT_SYMBOL_GPL(sysfs_unmerge_group);

/**
 * sysfs_add_link_to_group - add a symlink to an attribute group.
 * @kobj:	The kobject containing the group.
 * @group_name:	The name of the group.
 * @target:	The target kobject of the symlink to create.
 * @link_name:	The name of the symlink to create.
 */
int sysfs_add_link_to_group(struct kobject *kobj, const char *group_name,
			    struct kobject *target, const char *link_name)
{
	struct kernfs_node *parent;
	int error = 0;

	parent = kernfs_find_and_get(kobj->sd, group_name);
	if (!parent)
		return -ENOENT;

	error = sysfs_create_link_sd(parent, target, link_name);
	kernfs_put(parent);

	return error;
}
EXPORT_SYMBOL_GPL(sysfs_add_link_to_group);

/**
 * sysfs_remove_link_from_group - remove a symlink from an attribute group.
 * @kobj:	The kobject containing the group.
 * @group_name:	The name of the group.
 * @link_name:	The name of the symlink to remove.
 */
void sysfs_remove_link_from_group(struct kobject *kobj, const char *group_name,
				  const char *link_name)
{
	struct kernfs_node *parent;

	parent = kernfs_find_and_get(kobj->sd, group_name);
	if (parent) {
		kernfs_remove_by_name(parent, link_name);
		kernfs_put(parent);
	}
}
EXPORT_SYMBOL_GPL(sysfs_remove_link_from_group);

/**
 * compat_only_sysfs_link_entry_to_kobj - add a symlink to a kobject pointing
 * to a group or an attribute
 * @kobj:		The kobject containing the group.
 * @target_kobj:	The target kobject.
 * @target_name:	The name of the target group or attribute.
 * @symlink_name:	The name of the symlink file (target_name will be
 *			considered if symlink_name is NULL).
 */
int compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
					 struct kobject *target_kobj,
					 const char *target_name,
					 const char *symlink_name)
{
	struct kernfs_node *target;
	struct kernfs_node *entry;
	struct kernfs_node *link;

	/*
	 * We don't own @target_kobj and it may be removed at any time.
	 * Synchronize using sysfs_symlink_target_lock. See sysfs_remove_dir()
	 * for details.
	 */
	spin_lock(&sysfs_symlink_target_lock);
	target = target_kobj->sd;
	if (target)
		kernfs_get(target);
	spin_unlock(&sysfs_symlink_target_lock);
	if (!target)
		return -ENOENT;

	entry = kernfs_find_and_get(target, target_name);
	if (!entry) {
		kernfs_put(target);
		return -ENOENT;
	}

	if (!symlink_name)
		symlink_name = target_name;

	link = kernfs_create_link(kobj->sd, symlink_name, entry);
	if (PTR_ERR(link) == -EEXIST)
		sysfs_warn_dup(kobj->sd, symlink_name);

	kernfs_put(entry);
	kernfs_put(target);
	return PTR_ERR_OR_ZERO(link);
}
EXPORT_SYMBOL_GPL(compat_only_sysfs_link_entry_to_kobj);

static int sysfs_group_attrs_change_owner(struct kernfs_node *grp_kn,
					  const struct attribute_group *grp,
					  struct iattr *newattrs)
{
	struct kernfs_node *kn;
	int error;

	if (grp->attrs) {
		struct attribute *const *attr;

		for (attr = grp->attrs; *attr; attr++) {
			kn = kernfs_find_and_get(grp_kn, (*attr)->name);
			if (!kn)
				return -ENOENT;

			error = kernfs_setattr(kn, newattrs);
			kernfs_put(kn);
			if (error)
				return error;
		}
	}

	if (grp->bin_attrs) {
		struct bin_attribute *const *bin_attr;

		for (bin_attr = grp->bin_attrs; *bin_attr; bin_attr++) {
			kn = kernfs_find_and_get(grp_kn, (*bin_attr)->attr.name);
			if (!kn)
				return -ENOENT;

			error = kernfs_setattr(kn, newattrs);
			kernfs_put(kn);
			if (error)
				return error;
		}
	}

	return 0;
}

/**
 * sysfs_group_change_owner - change owner of an attribute group.
 * @kobj:	The kobject containing the group.
 * @grp:	The attribute group.
 * @kuid:	new owner's kuid
 * @kgid:	new owner's kgid
 *
 * Returns 0 on success or error code on failure.
 */
int sysfs_group_change_owner(struct kobject *kobj,
			     const struct attribute_group *grp, kuid_t kuid,
			     kgid_t kgid)
{
	struct kernfs_node *grp_kn;
	int error;
	struct iattr newattrs = {
		.ia_valid = ATTR_UID | ATTR_GID,
		.ia_uid = kuid,
		.ia_gid = kgid,
	};

	if (!kobj->state_in_sysfs)
		return -EINVAL;

	if (grp->name) {
		grp_kn = kernfs_find_and_get(kobj->sd, grp->name);
	} else {
		kernfs_get(kobj->sd);
		grp_kn = kobj->sd;
	}
	if (!grp_kn)
		return -ENOENT;

	error = kernfs_setattr(grp_kn, &newattrs);
	if (!error)
		error = sysfs_group_attrs_change_owner(grp_kn, grp, &newattrs);

	kernfs_put(grp_kn);

	return error;
}
EXPORT_SYMBOL_GPL(sysfs_group_change_owner);

/**
 * sysfs_groups_change_owner - change owner of a set of attribute groups.
 * @kobj:	The kobject containing the groups.
 * @groups:	The attribute groups.
 * @kuid:	new owner's kuid
 * @kgid:	new owner's kgid
 *
 * Returns 0 on success or error code on failure.
 */
int sysfs_groups_change_owner(struct kobject *kobj,
			      const struct attribute_group **groups,
			      kuid_t kuid, kgid_t kgid)
{
	int error = 0, i;

	if (!kobj->state_in_sysfs)
		return -EINVAL;

	if (!groups)
		return 0;

	for (i = 0; groups[i]; i++) {
		error = sysfs_group_change_owner(kobj, groups[i], kuid, kgid);
		if (error)
			break;
	}

	return error;
}
EXPORT_SYMBOL_GPL(sysfs_groups_change_owner);
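The unwind loop in internal_create_groups() (walk a NULL-terminated array, and on the first failure remove everything created so far in reverse order) can be tried outside the kernel. The following is a minimal user-space sketch of that pattern only, not kernel code: `struct fake_group`, `fake_create()`, `fake_remove()` and `create_all()` are hypothetical names made up for this illustration and stand in for `struct attribute_group`, internal_create_group(), sysfs_remove_group() and internal_create_groups() respectively.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct attribute_group; not a kernel type. */
struct fake_group {
	const char *name;
	int fail;	/* non-zero: creating this group fails */
	int created;	/* non-zero while the group "exists" */
};

/* Hypothetical stand-in for internal_create_group(). */
static int fake_create(struct fake_group *grp)
{
	if (grp->fail)
		return -1;
	grp->created = 1;
	return 0;
}

/* Hypothetical stand-in for sysfs_remove_group(). */
static void fake_remove(struct fake_group *grp)
{
	grp->created = 0;
}

/*
 * Create every group in a NULL-terminated array. On the first failure,
 * remove the groups created so far, newest first, and return the error.
 * This mirrors the "while (--i >= 0)" unwind in internal_create_groups().
 */
static int create_all(struct fake_group **groups)
{
	int error = 0;
	int i;

	if (!groups)
		return 0;

	for (i = 0; groups[i]; i++) {
		error = fake_create(groups[i]);
		if (error) {
			while (--i >= 0)
				fake_remove(groups[i]);
			break;
		}
	}
	return error;
}
```

The point of the sketch is that the caller sees an all-or-nothing result: either every group in the array exists afterwards, or none of the ones this call created does.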
/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM compaction

#if !defined(_TRACE_COMPACTION_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_COMPACTION_H

#include <linux/types.h>
#include <linux/list.h>
#include <linux/tracepoint.h>
#include <trace/events/mmflags.h>


DECLARE_EVENT_CLASS(mm_compaction_isolate_template,

	TP_PROTO(
		unsigned long start_pfn,
		unsigned long end_pfn,
		unsigned long nr_scanned,
		unsigned long nr_taken),

	TP_ARGS(start_pfn, end_pfn, nr_scanned, nr_taken),

	TP_STRUCT__entry(
		__field(unsigned long, start_pfn)
		__field(unsigned long, end_pfn)
		__field(unsigned long, nr_scanned)
		__field(unsigned long, nr_taken)
	),

	TP_fast_assign(
		__entry->start_pfn = start_pfn;
		__entry->end_pfn = end_pfn;
		__entry->nr_scanned = nr_scanned;
		__entry->nr_taken = nr_taken;
	),

	TP_printk("range=(0x%lx ~ 0x%lx) nr_scanned=%lu nr_taken=%lu",
		__entry->start_pfn,
		__entry->end_pfn,
		__entry->nr_scanned,
		__entry->nr_taken)
);

DEFINE_EVENT(mm_compaction_isolate_template, mm_compaction_isolate_migratepages,

	TP_PROTO(
		unsigned long start_pfn,
		unsigned long end_pfn,
		unsigned long nr_scanned,
		unsigned long nr_taken),

	TP_ARGS(start_pfn, end_pfn, nr_scanned, nr_taken)
);

DEFINE_EVENT(mm_compaction_isolate_template, mm_compaction_isolate_freepages,

	TP_PROTO(
		unsigned long start_pfn,
		unsigned long end_pfn,
		unsigned long nr_scanned,
		unsigned long nr_taken),

	TP_ARGS(start_pfn, end_pfn, nr_scanned, nr_taken)
);

DEFINE_EVENT(mm_compaction_isolate_template, mm_compaction_fast_isolate_freepages,

	TP_PROTO(
		unsigned long start_pfn,
		unsigned long end_pfn,
		unsigned long nr_scanned,
		unsigned long nr_taken),

	TP_ARGS(start_pfn, end_pfn, nr_scanned, nr_taken)
);

#ifdef CONFIG_COMPACTION
TRACE_EVENT(mm_compaction_migratepages,

	TP_PROTO(unsigned int nr_migratepages,
		unsigned int nr_succeeded),

	TP_ARGS(nr_migratepages, nr_succeeded),

	TP_STRUCT__entry(
		__field(unsigned long, nr_migrated)
		__field(unsigned long, nr_failed)
	),

	TP_fast_assign(
		__entry->nr_migrated = nr_succeeded;
		__entry->nr_failed = nr_migratepages - nr_succeeded;
	),

	TP_printk("nr_migrated=%lu nr_failed=%lu",
		__entry->nr_migrated,
		__entry->nr_failed)
);

TRACE_EVENT(mm_compaction_begin,
	TP_PROTO(struct compact_control *cc, unsigned long zone_start,
		unsigned long zone_end, bool sync),

	TP_ARGS(cc, zone_start, zone_end, sync),

	TP_STRUCT__entry(
		__field(unsigned long, zone_start)
		__field(unsigned long, migrate_pfn)
		__field(unsigned long, free_pfn)
		__field(unsigned long, zone_end)
		__field(bool, sync)
	),

	TP_fast_assign(
		__entry->zone_start = zone_start;
		__entry->migrate_pfn = cc->migrate_pfn;
		__entry->free_pfn = cc->free_pfn;
		__entry->zone_end = zone_end;
		__entry->sync = sync;
	),

	TP_printk("zone_start=0x%lx migrate_pfn=0x%lx free_pfn=0x%lx zone_end=0x%lx, mode=%s",
		__entry->zone_start,
		__entry->migrate_pfn,
		__entry->free_pfn,
		__entry->zone_end,
		__entry->sync ? "sync" : "async")
);

TRACE_EVENT(mm_compaction_end,
	TP_PROTO(struct compact_control *cc, unsigned long zone_start,
		unsigned long zone_end, bool sync,
		int status),

	TP_ARGS(cc, zone_start, zone_end, sync, status),

	TP_STRUCT__entry(
		__field(unsigned long, zone_start)
		__field(unsigned long, migrate_pfn)
		__field(unsigned long, free_pfn)
		__field(unsigned long, zone_end)
		__field(bool, sync)
		__field(int, status)
	),

	TP_fast_assign(
		__entry->zone_start = zone_start;
		__entry->migrate_pfn = cc->migrate_pfn;
		__entry->free_pfn = cc->free_pfn;
		__entry->zone_end = zone_end;
		__entry->sync = sync;
		__entry->status = status;
	),

	TP_printk("zone_start=0x%lx migrate_pfn=0x%lx free_pfn=0x%lx zone_end=0x%lx, mode=%s status=%s",
		__entry->zone_start,
		__entry->migrate_pfn,
		__entry->free_pfn,
		__entry->zone_end,
		__entry->sync ? "sync" : "async",
		__print_symbolic(__entry->status, COMPACTION_STATUS))
);

TRACE_EVENT(mm_compaction_try_to_compact_pages,

	TP_PROTO(
		int order,
		gfp_t gfp_mask,
		int prio),

	TP_ARGS(order, gfp_mask, prio),

	TP_STRUCT__entry(
		__field(int, order)
		__field(unsigned long, gfp_mask)
		__field(int, prio)
	),

	TP_fast_assign(
		__entry->order = order;
		__entry->gfp_mask = (__force unsigned long)gfp_mask;
		__entry->prio = prio;
	),

	TP_printk("order=%d gfp_mask=%s priority=%d",
		__entry->order,
		show_gfp_flags(__entry->gfp_mask),
		__entry->prio)
);

DECLARE_EVENT_CLASS(mm_compaction_suitable_template,

	TP_PROTO(struct zone *zone,
		int order,
		int ret),

	TP_ARGS(zone, order, ret),

	TP_STRUCT__entry(
		__field(int, nid)
		__field(enum zone_type, idx)
		__field(int, order)
		__field(int, ret)
	),

	TP_fast_assign(
		__entry->nid = zone_to_nid(zone);
		__entry->idx = zone_idx(zone);
		__entry->order = order;
		__entry->ret = ret;
	),

	TP_printk("node=%d zone=%-8s order=%d ret=%s",
		__entry->nid,
		__print_symbolic(__entry->idx, ZONE_TYPE),
		__entry->order,
		__print_symbolic(__entry->ret, COMPACTION_STATUS))
);

DEFINE_EVENT(mm_compaction_suitable_template, mm_compaction_finished,

	TP_PROTO(struct zone *zone,
		int order,
		int ret),

	TP_ARGS(zone, order, ret)
);

DEFINE_EVENT(mm_compaction_suitable_template, mm_compaction_suitable,

	TP_PROTO(struct zone *zone,
		int order,
		int ret),

	TP_ARGS(zone, order, ret)
);

DECLARE_EVENT_CLASS(mm_compaction_defer_template,

	TP_PROTO(struct zone *zone, int order),

	TP_ARGS(zone, order),

	TP_STRUCT__entry(
		__field(int, nid)
		__field(enum zone_type, idx)
		__field(int, order)
		__field(unsigned int, considered)
		__field(unsigned int, defer_shift)
		__field(int, order_failed)
	),

	TP_fast_assign(
		__entry->nid = zone_to_nid(zone);
		__entry->idx = zone_idx(zone);
		__entry->order = order;
		__entry->considered = zone->compact_considered;
		__entry->defer_shift = zone->compact_defer_shift;
		__entry->order_failed = zone->compact_order_failed;
	),

	TP_printk("node=%d zone=%-8s order=%d order_failed=%d consider=%u limit=%lu",
		__entry->nid,
		__print_symbolic(__entry->idx, ZONE_TYPE),
		__entry->order,
		__entry->order_failed,
		__entry->considered,
		1UL << __entry->defer_shift)
);

DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_deferred,

	TP_PROTO(struct zone *zone, int order),

	TP_ARGS(zone, order)
);

DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_defer_compaction,

	TP_PROTO(struct zone *zone, int order),

	TP_ARGS(zone, order)
);

DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_defer_reset,

	TP_PROTO(struct zone *zone, int order),

	TP_ARGS(zone, order)
);

TRACE_EVENT(mm_compaction_kcompactd_sleep,

	TP_PROTO(int nid),

	TP_ARGS(nid),

	TP_STRUCT__entry(
		__field(int, nid)
	),

	TP_fast_assign(
		__entry->nid = nid;
	),

	TP_printk("nid=%d", __entry->nid)
);

DECLARE_EVENT_CLASS(kcompactd_wake_template,

	TP_PROTO(int nid, int order, enum zone_type highest_zoneidx),

	TP_ARGS(nid, order, highest_zoneidx),

	TP_STRUCT__entry(
		__field(int, nid)
		__field(int, order)
		__field(enum zone_type, highest_zoneidx)
	),

	TP_fast_assign(
		__entry->nid = nid;
		__entry->order = order;
		__entry->highest_zoneidx = highest_zoneidx;
	),

	/*
	 * classzone_idx is previous name of the highest_zoneidx.
	 * Reason not to change it is the ABI requirement of the tracepoint.
	 */
	TP_printk("nid=%d order=%d classzone_idx=%-8s",
		__entry->nid,
		__entry->order,
		__print_symbolic(__entry->highest_zoneidx, ZONE_TYPE))
);

DEFINE_EVENT(kcompactd_wake_template, mm_compaction_wakeup_kcompactd,

	TP_PROTO(int nid, int order, enum zone_type highest_zoneidx),

	TP_ARGS(nid, order, highest_zoneidx)
);

DEFINE_EVENT(kcompactd_wake_template, mm_compaction_kcompactd_wake,

	TP_PROTO(int nid, int order, enum zone_type highest_zoneidx),

	TP_ARGS(nid, order, highest_zoneidx)
);
#endif

#endif /* _TRACE_COMPACTION_H */

/* This part must be outside protection */
#include <trace/define_trace.h>
// SPDX-License-Identifier: GPL-2.0
#include <linux/slab.h>
#include <linux/lockdep.h>
#include <linux/sysfs.h>
#include <linux/kobject.h>
#include <linux/memory.h>
#include <linux/memory-tiers.h>
#include <linux/notifier.h>
#include <linux/sched/sysctl.h>

#include "internal.h"

struct memory_tier {
	/* hierarchy of memory tiers */
	struct list_head list;
	/* list of all memory types part of this tier */
	struct list_head memory_types;
	/*
	 * start value of abstract distance. memory tier maps
	 * an abstract distance range,
	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
	 */
	int adistance_start;
	struct device dev;
	/* All the nodes that are part of all the lower memory tiers. */
	nodemask_t lower_tier_mask;
};

struct demotion_nodes {
	nodemask_t preferred;
};

struct node_memory_type_map {
	struct memory_dev_type *memtype;
	int map_count;
};

static DEFINE_MUTEX(memory_tier_lock);
static LIST_HEAD(memory_tiers);
/*
 * The list is used to store all memory types that are not created
 * by a device driver.
 */
static LIST_HEAD(default_memory_types);
static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
struct memory_dev_type *default_dram_type;
nodemask_t default_dram_nodes __initdata = NODE_MASK_NONE;

static const struct bus_type memory_tier_subsys = {
	.name = "memory_tiering",
	.dev_name = "memory_tier",
};

#ifdef CONFIG_NUMA_BALANCING
/**
 * folio_use_access_time - check if a folio reuses cpupid for page access time
 * @folio: folio to check
 *
 * folio's _last_cpupid field is repurposed by memory tiering. In memory
 * tiering mode, cpupid of slow memory folio (not toptier memory) is used to
 * record page access time.
 *
 * Return: the folio _last_cpupid is used to record page access time
 */
bool folio_use_access_time(struct folio *folio)
{
	return (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
	       !node_is_toptier(folio_nid(folio));
}
#endif

#ifdef CONFIG_MIGRATION
static int top_tier_adistance;
/*
 * node_demotion[] examples:
 *
 * Example 1:
 *
 * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
 *
 * node distances:
 * node   0    1    2    3
 *    0  10   20   30   40
 *    1  20   10   40   30
 *    2  30   40   10   40
 *    3  40   30   40   10
 *
 * memory_tiers0 = 0-1
 * memory_tiers1 = 2-3
 *
 * node_demotion[0].preferred = 2
 * node_demotion[1].preferred = 3
 * node_demotion[2].preferred = <empty>
 * node_demotion[3].preferred = <empty>
 *
 * Example 2:
 *
 * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
 *
 * node distances:
 * node   0    1    2
 *    0  10   20   30
 *    1  20   10   30
 *    2  30   30   10
 *
 * memory_tiers0 = 0-2
 *
 * node_demotion[0].preferred = <empty>
 * node_demotion[1].preferred = <empty>
 * node_demotion[2].preferred = <empty>
 *
 * Example 3:
 *
 * Node 0 is a CPU + DRAM node, node 1 is an HBM node, node 2 is a PMEM node.
 *
 * node distances:
 * node   0    1    2
 *    0  10   20   30
 *    1  20   10   40
 *    2  30   40   10
 *
 * memory_tiers0 = 1
 * memory_tiers1 = 0
 * memory_tiers2 = 2
 *
 * node_demotion[0].preferred = 2
 * node_demotion[1].preferred = 0
 * node_demotion[2].preferred = <empty>
 *
 */
static struct demotion_nodes *node_demotion __read_mostly;
#endif /* CONFIG_MIGRATION */

static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);

/* The lock is used to protect `default_dram_perf*` info and nid. */
static DEFINE_MUTEX(default_dram_perf_lock);
static bool default_dram_perf_error;
static struct access_coordinate default_dram_perf;
static int default_dram_perf_ref_nid = NUMA_NO_NODE;
static const char *default_dram_perf_ref_source;

static inline struct memory_tier *to_memory_tier(struct device *device)
{
	return container_of(device, struct memory_tier, dev);
}

static __always_inline nodemask_t get_memtier_nodemask(struct memory_tier *memtier)
{
	nodemask_t nodes = NODE_MASK_NONE;
	struct memory_dev_type *memtype;

	list_for_each_entry(memtype, &memtier->memory_types, tier_sibling)
		nodes_or(nodes, nodes, memtype->nodes);

	return nodes;
}

static void memory_tier_device_release(struct device *dev)
{
	struct memory_tier *tier = to_memory_tier(dev);
	/*
	 * synchronize_rcu in clear_node_memory_tier makes sure
	 * we don't have rcu access to this memory tier.
	 */
	kfree(tier);
}

static ssize_t nodelist_show(struct device *dev,
			     struct device_attribute *attr, char *buf)
{
	int ret;
	nodemask_t nmask;

	mutex_lock(&memory_tier_lock);
	nmask = get_memtier_nodemask(to_memory_tier(dev));
	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
	mutex_unlock(&memory_tier_lock);
	return ret;
}
static DEVICE_ATTR_RO(nodelist);

static struct attribute *memtier_dev_attrs[] = {
	&dev_attr_nodelist.attr,
	NULL
};

static const struct attribute_group memtier_dev_group = {
	.attrs = memtier_dev_attrs,
};

static const struct attribute_group *memtier_dev_groups[] = {
	&memtier_dev_group,
	NULL
};

static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
{
	int ret;
	bool found_slot = false;
	struct memory_tier *memtier, *new_memtier;
	int adistance = memtype->adistance;
	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;

	lockdep_assert_held_once(&memory_tier_lock);

	adistance = round_down(adistance, memtier_adistance_chunk_size);
	/*
	 * If the memtype is already part of a memory tier,
	 * just return that.
	 */
	if (!list_empty(&memtype->tier_sibling)) {
		list_for_each_entry(memtier, &memory_tiers, list) {
			if (adistance == memtier->adistance_start)
				return memtier;
		}
		WARN_ON(1);
		return ERR_PTR(-EINVAL);
	}

	list_for_each_entry(memtier, &memory_tiers, list) {
		if (adistance == memtier->adistance_start) {
			goto link_memtype;
		} else if (adistance < memtier->adistance_start) {
			found_slot = true;
			break;
		}
	}

	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
	if (!new_memtier)
		return ERR_PTR(-ENOMEM);

	new_memtier->adistance_start = adistance;
	INIT_LIST_HEAD(&new_memtier->list);
	INIT_LIST_HEAD(&new_memtier->memory_types);
	if (found_slot)
		list_add_tail(&new_memtier->list, &memtier->list);
	else
		list_add_tail(&new_memtier->list, &memory_tiers);

	new_memtier->dev.id = adistance >> MEMTIER_CHUNK_BITS;
	new_memtier->dev.bus = &memory_tier_subsys;
	new_memtier->dev.release = memory_tier_device_release;
	new_memtier->dev.groups = memtier_dev_groups;

	ret = device_register(&new_memtier->dev);
	if (ret) {
		list_del(&new_memtier->list);
		put_device(&new_memtier->dev);
		return ERR_PTR(ret);
	}
	memtier = new_memtier;

link_memtype:
	list_add(&memtype->tier_sibling, &memtier->memory_types);
	return memtier;
}

static struct memory_tier *__node_get_memory_tier(int node)
{
	pg_data_t *pgdat;

	pgdat = NODE_DATA(node);
	if (!pgdat)
		return NULL;
	/*
	 * Since we hold memory_tier_lock, we can avoid
	 * RCU read locks when accessing the details. No
	 * parallel updates are possible here.
*/ return rcu_dereference_check(pgdat->memtier, lockdep_is_held(&memory_tier_lock)); } #ifdef CONFIG_MIGRATION bool node_is_toptier(int node) { bool toptier; pg_data_t *pgdat; struct memory_tier *memtier; pgdat = NODE_DATA(node); if (!pgdat) return false; rcu_read_lock(); memtier = rcu_dereference(pgdat->memtier); if (!memtier) { toptier = true; goto out; } if (memtier->adistance_start <= top_tier_adistance) toptier = true; else toptier = false; out: rcu_read_unlock(); return toptier; } void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) { struct memory_tier *memtier; /* * pg_data_t.memtier updates include a synchronize_rcu(), * which ensures that we either find NULL or a valid memtier * in NODE_DATA. Protect the access via rcu_read_lock(). */ rcu_read_lock(); memtier = rcu_dereference(pgdat->memtier); if (memtier) *targets = memtier->lower_tier_mask; else *targets = NODE_MASK_NONE; rcu_read_unlock(); } /** * next_demotion_node() - Get the next node in the demotion path * @node: The starting node from which to look up the next node * * Return: node id for next memory node in the demotion path hierarchy * from @node; NUMA_NO_NODE if @node is terminal. This does not keep * @node online or guarantee that it *continues* to be the next demotion * target. */ int next_demotion_node(int node) { struct demotion_nodes *nd; int target; if (!node_demotion) return NUMA_NO_NODE; nd = &node_demotion[node]; /* * node_demotion[] is updated without excluding this * function from running. * * Make sure to use RCU over entire code blocks if * node_demotion[] reads need to be consistent. */ rcu_read_lock(); /* * If there are multiple target nodes, just select one * target node randomly. * * In addition, we can also use round-robin to select * target node, but we should introduce another variable * for node_demotion[] to record last selected target node, * that may cause cache ping-pong due to the changing of * last target node. 
Or introducing per-cpu data to avoid * caching issue, which seems more complicated. So selecting * target node randomly seems better until now. */ target = node_random(&nd->preferred); rcu_read_unlock(); return target; } static void disable_all_demotion_targets(void) { struct memory_tier *memtier; int node; for_each_node_state(node, N_MEMORY) { node_demotion[node].preferred = NODE_MASK_NONE; /* * We are holding memory_tier_lock, so it is safe * to access pgdat->memtier. */ memtier = __node_get_memory_tier(node); if (memtier) memtier->lower_tier_mask = NODE_MASK_NONE; } /* * Ensure that the "disable" is visible across the system. * Readers will see either a combination of before+disable * state or disable+after. They will never see before and * after state together. */ synchronize_rcu(); } static void dump_demotion_targets(void) { int node; for_each_node_state(node, N_MEMORY) { struct memory_tier *memtier = __node_get_memory_tier(node); nodemask_t preferred = node_demotion[node].preferred; if (!memtier) continue; if (nodes_empty(preferred)) pr_info("Demotion targets for Node %d: null\n", node); else pr_info("Demotion targets for Node %d: preferred: %*pbl, fallback: %*pbl\n", node, nodemask_pr_args(&preferred), nodemask_pr_args(&memtier->lower_tier_mask)); } } /* * Find an automatic demotion target for all memory * nodes. Failing here is OK. It might just indicate * being at the end of a chain. */ static void establish_demotion_targets(void) { struct memory_tier *memtier; struct demotion_nodes *nd; int target = NUMA_NO_NODE, node; int distance, best_distance; nodemask_t tier_nodes, lower_tier; lockdep_assert_held_once(&memory_tier_lock); if (!node_demotion) return; disable_all_demotion_targets(); for_each_node_state(node, N_MEMORY) { best_distance = -1; nd = &node_demotion[node]; memtier = __node_get_memory_tier(node); if (!memtier || list_is_last(&memtier->list, &memory_tiers)) continue; /* * Get the lower memtier to find the demotion node list. 
*/ memtier = list_next_entry(memtier, list); tier_nodes = get_memtier_nodemask(memtier); /* * find_next_best_node() uses the 'used' nodemask as a skip list. * Add all memory nodes except the selected memory tier * nodelist to skip list so that we find the best node from the * memtier nodelist. */ nodes_andnot(tier_nodes, node_states[N_MEMORY], tier_nodes); /* * Find all the nodes in the memory tier node list with the same best * distance and add them to the preferred mask. We randomly select between nodes * in the preferred mask when allocating pages during demotion. */ do { target = find_next_best_node(node, &tier_nodes); if (target == NUMA_NO_NODE) break; distance = node_distance(node, target); if (distance == best_distance || best_distance == -1) { best_distance = distance; node_set(target, nd->preferred); } else { break; } } while (1); } /* * Promotion is allowed from a memory tier to higher * memory tier only if the memory tier doesn't include * compute. We want to skip promotion from a memory tier, * if any node that is part of the memory tier has CPUs. * Once we detect such a memory tier, we consider that tier * as the top tier from which promotion is not allowed. */ list_for_each_entry_reverse(memtier, &memory_tiers, list) { tier_nodes = get_memtier_nodemask(memtier); nodes_and(tier_nodes, node_states[N_CPU], tier_nodes); if (!nodes_empty(tier_nodes)) { /* * abstract distance below the max value of this memtier * is considered toptier. */ top_tier_adistance = memtier->adistance_start + MEMTIER_CHUNK_SIZE - 1; break; } } /* * Now build the lower_tier mask for each node collecting node mask from * all memory tiers below it. This allows us to fall back demotion page * allocation to a set of nodes that is closer to the above selected * preferred node. */ lower_tier = node_states[N_MEMORY]; list_for_each_entry(memtier, &memory_tiers, list) { /* * Keep removing current tier from lower_tier nodes. * This will remove all nodes in current and above * memory tier from the lower_tier mask. 
*/ tier_nodes = get_memtier_nodemask(memtier); nodes_andnot(lower_tier, lower_tier, tier_nodes); memtier->lower_tier_mask = lower_tier; } dump_demotion_targets(); } #else static inline void establish_demotion_targets(void) {} #endif /* CONFIG_MIGRATION */ static inline void __init_node_memory_type(int node, struct memory_dev_type *memtype) { if (!node_memory_types[node].memtype) node_memory_types[node].memtype = memtype; /* * For each device getting added in the same NUMA node * with this specific memtype, bump the map count. We * only take memtype device reference once, so that * changing a node memtype can be done by dropping the * only reference count taken here. */ if (node_memory_types[node].memtype == memtype) { if (!node_memory_types[node].map_count++) kref_get(&memtype->kref); } } static struct memory_tier *set_node_memory_tier(int node) { struct memory_tier *memtier; struct memory_dev_type *memtype = default_dram_type; int adist = MEMTIER_ADISTANCE_DRAM; pg_data_t *pgdat = NODE_DATA(node); lockdep_assert_held_once(&memory_tier_lock); if (!node_state(node, N_MEMORY)) return ERR_PTR(-EINVAL); mt_calc_adistance(node, &adist); if (!node_memory_types[node].memtype) { memtype = mt_find_alloc_memory_type(adist, &default_memory_types); if (IS_ERR(memtype)) { memtype = default_dram_type; pr_info("Failed to allocate a memory type. Fall back.\n"); } } __init_node_memory_type(node, memtype); memtype = node_memory_types[node].memtype; node_set(node, memtype->nodes); memtier = find_create_memory_tier(memtype); if (!IS_ERR(memtier)) rcu_assign_pointer(pgdat->memtier, memtier); return memtier; } static void destroy_memory_tier(struct memory_tier *memtier) { list_del(&memtier->list); device_unregister(&memtier->dev); } static bool clear_node_memory_tier(int node) { bool cleared = false; pg_data_t *pgdat; struct memory_tier *memtier; pgdat = NODE_DATA(node); if (!pgdat) return false; /* * Make sure that anybody looking at NODE_DATA who finds * a valid memtier finds memory_dev_types with nodes still * linked to the memtier. We achieve this by waiting for * RCU read-side critical sections to finish using synchronize_rcu(). * This also enables us to free the destroyed memory tier * with kfree() instead of kfree_rcu(). */ memtier = __node_get_memory_tier(node); if (memtier) { struct memory_dev_type *memtype; rcu_assign_pointer(pgdat->memtier, NULL); synchronize_rcu(); memtype = node_memory_types[node].memtype; node_clear(node, memtype->nodes); if (nodes_empty(memtype->nodes)) { list_del_init(&memtype->tier_sibling); if (list_empty(&memtier->memory_types)) destroy_memory_tier(memtier); } cleared = true; } return cleared; } static void release_memtype(struct kref *kref) { struct memory_dev_type *memtype; memtype = container_of(kref, struct memory_dev_type, kref); kfree(memtype); } struct memory_dev_type *alloc_memory_type(int adistance) { struct memory_dev_type *memtype; memtype = kmalloc(sizeof(*memtype), GFP_KERNEL); if (!memtype) return ERR_PTR(-ENOMEM); memtype->adistance = adistance; INIT_LIST_HEAD(&memtype->tier_sibling); memtype->nodes = NODE_MASK_NONE; kref_init(&memtype->kref); return memtype; } EXPORT_SYMBOL_GPL(alloc_memory_type); void put_memory_type(struct memory_dev_type *memtype) { kref_put(&memtype->kref, release_memtype); } EXPORT_SYMBOL_GPL(put_memory_type); void init_node_memory_type(int node, struct 
memory_dev_type *memtype) { mutex_lock(&memory_tier_lock); __init_node_memory_type(node, memtype); mutex_unlock(&memory_tier_lock); } EXPORT_SYMBOL_GPL(init_node_memory_type); void clear_node_memory_type(int node, struct memory_dev_type *memtype) { mutex_lock(&memory_tier_lock); if (node_memory_types[node].memtype == memtype || !memtype) node_memory_types[node].map_count--; /* * If we unmapped all the devices attached to this node, * clear the node memory type. */ if (!node_memory_types[node].map_count) { memtype = node_memory_types[node].memtype; node_memory_types[node].memtype = NULL; put_memory_type(memtype); } mutex_unlock(&memory_tier_lock); } EXPORT_SYMBOL_GPL(clear_node_memory_type); struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types) { struct memory_dev_type *mtype; list_for_each_entry(mtype, memory_types, list) if (mtype->adistance == adist) return mtype; mtype = alloc_memory_type(adist); if (IS_ERR(mtype)) return mtype; list_add(&mtype->list, memory_types); return mtype; } EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type); void mt_put_memory_types(struct list_head *memory_types) { struct memory_dev_type *mtype, *mtn; list_for_each_entry_safe(mtype, mtn, memory_types, list) { list_del(&mtype->list); put_memory_type(mtype); } } EXPORT_SYMBOL_GPL(mt_put_memory_types); /* * This is invoked via `late_initcall()` to initialize memory tiers for * memory nodes, both with and without CPUs. After the initialization of * firmware and devices, adistance algorithms are expected to be provided. */ static int __init memory_tier_late_init(void) { int nid; struct memory_tier *memtier; get_online_mems(); guard(mutex)(&memory_tier_lock); /* Assign each uninitialized N_MEMORY node to a memory tier. */ for_each_node_state(nid, N_MEMORY) { /* * Some device drivers may have initialized * memory tiers, potentially bringing memory nodes * online and configuring memory tiers. * Exclude them here. 
*/ if (node_memory_types[nid].memtype) continue; memtier = set_node_memory_tier(nid); if (IS_ERR(memtier)) continue; } establish_demotion_targets(); put_online_mems(); return 0; } late_initcall(memory_tier_late_init); static void dump_hmem_attrs(struct access_coordinate *coord, const char *prefix) { pr_info( "%sread_latency: %u, write_latency: %u, read_bandwidth: %u, write_bandwidth: %u\n", prefix, coord->read_latency, coord->write_latency, coord->read_bandwidth, coord->write_bandwidth); } int mt_set_default_dram_perf(int nid, struct access_coordinate *perf, const char *source) { guard(mutex)(&default_dram_perf_lock); if (default_dram_perf_error) return -EIO; if (perf->read_latency + perf->write_latency == 0 || perf->read_bandwidth + perf->write_bandwidth == 0) return -EINVAL; if (default_dram_perf_ref_nid == NUMA_NO_NODE) { default_dram_perf = *perf; default_dram_perf_ref_nid = nid; default_dram_perf_ref_source = kstrdup(source, GFP_KERNEL); return 0; } /* * The performance of all default DRAM nodes is expected to be * same (that is, the variation is less than 10%). And it * will be used as base to calculate the abstract distance of * other memory nodes. 
*/ if (abs(perf->read_latency - default_dram_perf.read_latency) * 10 > default_dram_perf.read_latency || abs(perf->write_latency - default_dram_perf.write_latency) * 10 > default_dram_perf.write_latency || abs(perf->read_bandwidth - default_dram_perf.read_bandwidth) * 10 > default_dram_perf.read_bandwidth || abs(perf->write_bandwidth - default_dram_perf.write_bandwidth) * 10 > default_dram_perf.write_bandwidth) { pr_info( "memory-tiers: the performance of DRAM node %d mismatches that of the reference\n" "DRAM node %d.\n", nid, default_dram_perf_ref_nid); pr_info(" performance of reference DRAM node %d from %s:\n", default_dram_perf_ref_nid, default_dram_perf_ref_source); dump_hmem_attrs(&default_dram_perf, " "); pr_info(" performance of DRAM node %d from %s:\n", nid, source); dump_hmem_attrs(perf, " "); pr_info( " disable default DRAM node performance based abstract distance algorithm.\n"); default_dram_perf_error = true; return -EINVAL; } return 0; } int mt_perf_to_adistance(struct access_coordinate *perf, int *adist) { guard(mutex)(&default_dram_perf_lock); if (default_dram_perf_error) return -EIO; if (perf->read_latency + perf->write_latency == 0 || perf->read_bandwidth + perf->write_bandwidth == 0) return -EINVAL; if (default_dram_perf_ref_nid == NUMA_NO_NODE) return -ENOENT; /* * The abstract distance of a memory node is in direct proportion to * its memory latency (read + write) and inversely proportional to its * memory bandwidth (read + write). The abstract distance, memory * latency, and memory bandwidth of the default DRAM nodes are used as * the base. 
*/ *adist = MEMTIER_ADISTANCE_DRAM * (perf->read_latency + perf->write_latency) / (default_dram_perf.read_latency + default_dram_perf.write_latency) * (default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) / (perf->read_bandwidth + perf->write_bandwidth); return 0; } EXPORT_SYMBOL_GPL(mt_perf_to_adistance); /** * register_mt_adistance_algorithm() - Register memory tiering abstract distance algorithm * @nb: The notifier block which describes the algorithm * * Return: 0 on success, errno on error. * * Every memory tiering abstract distance algorithm provider needs to * register the algorithm with register_mt_adistance_algorithm(). To * calculate the abstract distance for a specified memory node, the * notifier function will be called unless some higher-priority * algorithm has already provided a result. The prototype of the notifier * function is as follows: * * int (*algorithm_notifier)(struct notifier_block *nb, * unsigned long nid, void *data); * * Where "nid" specifies the memory node and "data" is the pointer to the * returned abstract distance (that is, "int *adist"). If the * algorithm provides the result, NOTIFY_STOP should be returned. * Otherwise, return a value for which return_value & %NOTIFY_STOP_MASK == 0 to allow the next * algorithm in the chain to provide the result. */ int register_mt_adistance_algorithm(struct notifier_block *nb) { return blocking_notifier_chain_register(&mt_adistance_algorithms, nb); } EXPORT_SYMBOL_GPL(register_mt_adistance_algorithm); /** * unregister_mt_adistance_algorithm() - Unregister memory tiering abstract distance algorithm * @nb: the notifier block which describes the algorithm * * Return: 0 on success, errno on error. 
*/ int unregister_mt_adistance_algorithm(struct notifier_block *nb) { return blocking_notifier_chain_unregister(&mt_adistance_algorithms, nb); } EXPORT_SYMBOL_GPL(unregister_mt_adistance_algorithm); /** * mt_calc_adistance() - Calculate abstract distance with registered algorithms * @node: the node to calculate abstract distance for * @adist: the returned abstract distance * * Return: if return_value & %NOTIFY_STOP_MASK != 0, then some * abstract distance algorithm provides the result, and return it via * @adist. Otherwise, no algorithm can provide the result and @adist * will be kept as it is. */ int mt_calc_adistance(int node, int *adist) { return blocking_notifier_call_chain(&mt_adistance_algorithms, node, adist); } EXPORT_SYMBOL_GPL(mt_calc_adistance); static int __meminit memtier_hotplug_callback(struct notifier_block *self, unsigned long action, void *_arg) { struct memory_tier *memtier; struct memory_notify *arg = _arg; /* * Only update the node migration order when a node is * changing status, like online->offline. */ if (arg->status_change_nid < 0) return notifier_from_errno(0); switch (action) { case MEM_OFFLINE: mutex_lock(&memory_tier_lock); if (clear_node_memory_tier(arg->status_change_nid)) establish_demotion_targets(); mutex_unlock(&memory_tier_lock); break; case MEM_ONLINE: mutex_lock(&memory_tier_lock); memtier = set_node_memory_tier(arg->status_change_nid); if (!IS_ERR(memtier)) establish_demotion_targets(); mutex_unlock(&memory_tier_lock); break; } return notifier_from_errno(0); } static int __init memory_tier_init(void) { int ret; ret = subsys_virtual_register(&memory_tier_subsys, NULL); if (ret) panic("%s() failed to register memory tier subsystem\n", __func__); #ifdef CONFIG_MIGRATION node_demotion = kcalloc(nr_node_ids, sizeof(struct demotion_nodes), GFP_KERNEL); WARN_ON(!node_demotion); #endif mutex_lock(&memory_tier_lock); /* * For now we can have 4 faster memory tiers with smaller adistance * than default DRAM tier. 
*/ default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM, &default_memory_types); mutex_unlock(&memory_tier_lock); if (IS_ERR(default_dram_type)) panic("%s() failed to allocate default DRAM tier\n", __func__); /* Record nodes with memory and CPU to set default DRAM performance. */ nodes_and(default_dram_nodes, node_states[N_MEMORY], node_states[N_CPU]); hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI); return 0; } subsys_initcall(memory_tier_init); bool numa_demotion_enabled = false; #ifdef CONFIG_MIGRATION #ifdef CONFIG_SYSFS static ssize_t demotion_enabled_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return sysfs_emit(buf, "%s\n", str_true_false(numa_demotion_enabled)); } static ssize_t demotion_enabled_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { ssize_t ret; ret = kstrtobool(buf, &numa_demotion_enabled); if (ret) return ret; return count; } static struct kobj_attribute numa_demotion_enabled_attr = __ATTR_RW(demotion_enabled); static struct attribute *numa_attrs[] = { &numa_demotion_enabled_attr.attr, NULL, }; static const struct attribute_group numa_attr_group = { .attrs = numa_attrs, }; static int __init numa_init_sysfs(void) { int err; struct kobject *numa_kobj; numa_kobj = kobject_create_and_add("numa", mm_kobj); if (!numa_kobj) { pr_err("failed to create numa kobject\n"); return -ENOMEM; } err = sysfs_create_group(numa_kobj, &numa_attr_group); if (err) { pr_err("failed to register numa group\n"); goto delete_obj; } return 0; delete_obj: kobject_put(numa_kobj); return err; } subsys_initcall(numa_init_sysfs); #endif /* CONFIG_SYSFS */ #endif |
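The abstract-distance formula used by mt_perf_to_adistance() above can be tried outside the kernel. What follows is a minimal userspace sketch under stated assumptions, not the kernel implementation: struct perf_sketch, SKETCH_ADISTANCE_DRAM and perf_to_adistance_sketch() are hypothetical names, the base value 576 is an assumed stand-in for MEMTIER_ADISTANCE_DRAM, and the reference DRAM figures are passed in explicitly instead of being read from a locked global. The intermediate value is widened to 64 bits, which the kernel code does not do.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct access_coordinate. */
struct perf_sketch {
	unsigned int read_latency;
	unsigned int write_latency;
	unsigned int read_bandwidth;
	unsigned int write_bandwidth;
};

/* Assumed base distance for illustration; the kernel defines its own value. */
#define SKETCH_ADISTANCE_DRAM 576

/*
 * Scale the base distance up with latency and down with bandwidth,
 * relative to the reference DRAM node. Returns 0 on success and -1 if
 * any of the sums used as a divisor or as an input check is zero.
 */
int perf_to_adistance_sketch(const struct perf_sketch *dram,
			     const struct perf_sketch *perf, int *adist)
{
	uint64_t v;

	if (perf->read_latency + perf->write_latency == 0 ||
	    perf->read_bandwidth + perf->write_bandwidth == 0)
		return -1;
	if (dram->read_latency + dram->write_latency == 0)
		return -1;

	/* Same left-to-right order of operations as the formula above. */
	v = SKETCH_ADISTANCE_DRAM;
	v = v * (perf->read_latency + perf->write_latency);
	v = v / (dram->read_latency + dram->write_latency);
	v = v * (dram->read_bandwidth + dram->write_bandwidth);
	v = v / (perf->read_bandwidth + perf->write_bandwidth);

	*adist = (int)v;
	return 0;
}
```

With these assumptions, a node whose latency is twice the reference and whose bandwidth is half the reference ends up at four times the base distance, while a node identical to the reference keeps the base distance.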
/* * linux/fs/hfs/part_tbl.c * * Copyright (C) 1996-1997 Paul H. Hargrove * (C) 2003 Ardis Technologies <roman@ardistech.com> * This file may be distributed under the terms of the GNU General Public License. * * Original code to handle the new style Mac partition table based on * a patch contributed by Holger Schemel (aeglos@valinor.owl.de). */ #include "hfs_fs.h" /* * The new style Mac partition map * * For each partition on the media there is a physical block (512-byte * block) containing one of these structures. These blocks are * contiguous starting at block 1. */ struct new_pmap { __be16 pmSig; /* signature */ __be16 reSigPad; /* padding */ __be32 pmMapBlkCnt; /* partition blocks count */ __be32 pmPyPartStart; /* physical block start of partition */ __be32 pmPartBlkCnt; /* physical block count of partition */ u8 pmPartName[32]; /* (null terminated?) string giving the name of this partition */ u8 pmPartType[32]; /* (null terminated?) string giving the type of this partition */ /* a bunch more stuff we don't need */ } __packed; /* * The old style Mac partition map * * The partition map consists of a 2-byte signature followed by an * array of these structures. The map is terminated with an all-zero * one of these. */ struct old_pmap { __be16 pdSig; /* Signature bytes */ struct old_pmap_entry { __be32 pdStart; __be32 pdSize; __be32 pdFSID; } pdEntry[42]; } __packed; /* * hfs_part_find() * * Parse the partition map looking for the * start and length of the 'part'th HFS partition. 
*/ int hfs_part_find(struct super_block *sb, sector_t *part_start, sector_t *part_size) { struct buffer_head *bh; __be16 *data; int i, size, res; res = -ENOENT; bh = sb_bread512(sb, *part_start + HFS_PMAP_BLK, data); if (!bh) return -EIO; switch (be16_to_cpu(*data)) { case HFS_OLD_PMAP_MAGIC: { struct old_pmap *pm; struct old_pmap_entry *p; pm = (struct old_pmap *)bh->b_data; p = pm->pdEntry; size = 42; for (i = 0; i < size; p++, i++) { if (p->pdStart && p->pdSize && p->pdFSID == cpu_to_be32(0x54465331)/*"TFS1"*/ && (HFS_SB(sb)->part < 0 || HFS_SB(sb)->part == i)) { *part_start += be32_to_cpu(p->pdStart); *part_size = be32_to_cpu(p->pdSize); res = 0; } } break; } case HFS_NEW_PMAP_MAGIC: { struct new_pmap *pm; pm = (struct new_pmap *)bh->b_data; size = be32_to_cpu(pm->pmMapBlkCnt); for (i = 0; i < size;) { if (!memcmp(pm->pmPartType,"Apple_HFS", 9) && (HFS_SB(sb)->part < 0 || HFS_SB(sb)->part == i)) { *part_start += be32_to_cpu(pm->pmPyPartStart); *part_size = be32_to_cpu(pm->pmPartBlkCnt); res = 0; break; } brelse(bh); bh = sb_bread512(sb, *part_start + HFS_PMAP_BLK + ++i, pm); if (!bh) return -EIO; if (pm->pmSig != cpu_to_be16(HFS_NEW_PMAP_MAGIC)) break; } break; } } brelse(bh); return res; } |
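The new-style map walk in hfs_part_find() above can be illustrated against a plain in-memory image. This is a minimal userspace sketch, not the kernel code path: find_hfs_partition_sketch(), be16_at() and be32_at() are hypothetical helpers, the signature value 0x504D is an assumption about HFS_NEW_PMAP_MAGIC, the map is assumed to start at block 1 of the image, and the 'part' index selection from the superblock is left out. Field offsets follow struct new_pmap as declared above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE 512
#define SKETCH_PMAP_MAGIC 0x504D /* assumed value of HFS_NEW_PMAP_MAGIC */

/* Read big-endian fields from a byte buffer without alignment concerns. */
uint16_t be16_at(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

uint32_t be32_at(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Scan map entries starting at block 1 of 'image' (nblocks blocks long)
 * and report the first entry whose type begins with "Apple_HFS".
 * Returns 0 and fills *start / *count on success, -1 otherwise.
 */
int find_hfs_partition_sketch(const uint8_t *image, size_t nblocks,
			      uint32_t *start, uint32_t *count)
{
	const uint8_t *entry;
	uint32_t map_blocks, i;

	if (nblocks < 2)
		return -1;

	entry = image + SKETCH_BLOCK_SIZE;
	if (be16_at(entry) != SKETCH_PMAP_MAGIC)
		return -1;
	map_blocks = be32_at(entry + 4);		/* pmMapBlkCnt */

	for (i = 0; i < map_blocks && (size_t)i + 1 < nblocks; i++) {
		entry = image + ((size_t)i + 1) * SKETCH_BLOCK_SIZE;
		if (be16_at(entry) != SKETCH_PMAP_MAGIC)
			break;
		if (!memcmp(entry + 48, "Apple_HFS", 9)) {	/* pmPartType */
			*start = be32_at(entry + 8);	/* pmPyPartStart */
			*count = be32_at(entry + 12);	/* pmPartBlkCnt */
			return 0;
		}
	}
	return -1;
}
```

Bounding the loop by both the map's own block count and the image size is a choice made for the sketch, so that a corrupt count cannot walk past the buffer.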
/* SPDX-License-Identifier: GPL-2.0 */ #ifndef __UDF_DECL_H #define __UDF_DECL_H #define pr_fmt(fmt) "UDF-fs: " fmt #include "ecma_167.h" #include "osta_udf.h" #include <linux/fs.h> #include <linux/types.h> #include <linux/buffer_head.h> #include <linux/udf_fs_i.h> #include "udf_sb.h" #include "udfend.h" #include "udf_i.h" #define UDF_DEFAULT_PREALLOC_BLOCKS 8 extern __printf(3, 4) void _udf_err(struct super_block *sb, const char *function, const char *fmt, ...); #define udf_err(sb, fmt, ...) \ _udf_err(sb, __func__, fmt, ##__VA_ARGS__) extern __printf(3, 4) void _udf_warn(struct super_block *sb, const char *function, const char *fmt, ...); #define udf_warn(sb, fmt, ...) \ _udf_warn(sb, __func__, fmt, ##__VA_ARGS__) #define udf_info(fmt, ...) \ pr_info("INFO " fmt, ##__VA_ARGS__) #define udf_debug(fmt, ...) 
\ pr_debug("%s:%d:%s: " fmt, __FILE__, __LINE__, __func__, ##__VA_ARGS__) #define UDF_EXTENT_LENGTH_MASK 0x3FFFFFFF #define UDF_EXTENT_FLAG_MASK 0xC0000000 #define UDF_INVALID_ID ((uint32_t)-1) #define UDF_NAME_PAD 4 #define UDF_NAME_LEN 254 #define UDF_NAME_LEN_CS0 255 static inline size_t udf_file_entry_alloc_offset(struct inode *inode) { struct udf_inode_info *iinfo = UDF_I(inode); if (iinfo->i_use) return sizeof(struct unallocSpaceEntry); else if (iinfo->i_efe) return sizeof(struct extendedFileEntry) + iinfo->i_lenEAttr; else return sizeof(struct fileEntry) + iinfo->i_lenEAttr; } static inline size_t udf_ext0_offset(struct inode *inode) { if (UDF_I(inode)->i_alloc_type == ICBTAG_FLAG_AD_IN_ICB) return udf_file_entry_alloc_offset(inode); else return 0; } /* computes tag checksum */ u8 udf_tag_checksum(const struct tag *t); typedef uint32_t udf_pblk_t; struct dentry; struct inode; struct task_struct; struct buffer_head; struct super_block; extern const struct export_operations udf_export_ops; extern const struct inode_operations udf_dir_inode_operations; extern const struct file_operations udf_dir_operations; extern const struct inode_operations udf_file_inode_operations; extern const struct file_operations udf_file_operations; extern const struct inode_operations udf_symlink_inode_operations; extern const struct address_space_operations udf_aops; extern const struct address_space_operations udf_symlink_aops; struct udf_fileident_iter { struct inode *dir; /* Directory we are working with */ loff_t pos; /* Logical position in a dir */ struct buffer_head *bh[2]; /* Buffer containing 'pos' and possibly * next buffer if entry straddles * blocks */ struct kernel_lb_addr eloc; /* Start of extent containing 'pos' */ uint32_t elen; /* Length of extent containing 'pos' */ sector_t loffset; /* Block offset of 'pos' within above * extent */ struct extent_position epos; /* Position after the above extent */ struct fileIdentDesc fi; /* Copied directory entry */ uint8_t *name; 
/* Pointer to entry name */ uint8_t *namebuf; /* Storage for entry name in case * the name is split between two blocks */ }; struct udf_vds_record { uint32_t block; uint32_t volDescSeqNum; }; struct generic_desc { struct tag descTag; __le32 volDescSeqNum; }; /* super.c */ static inline void udf_updated_lvid(struct super_block *sb) { struct buffer_head *bh = UDF_SB(sb)->s_lvid_bh; BUG_ON(!bh); WARN_ON_ONCE(((struct logicalVolIntegrityDesc *) bh->b_data)->integrityType != cpu_to_le32(LVID_INTEGRITY_TYPE_OPEN)); UDF_SB(sb)->s_lvid_dirty = 1; } extern u64 lvid_get_unique_id(struct super_block *sb); struct inode *udf_find_metadata_inode_efe(struct super_block *sb, u32 meta_file_loc, u32 partition_num); /* namei.c */ static inline unsigned int udf_dir_entry_len(struct fileIdentDesc *cfi) { return ALIGN(sizeof(struct fileIdentDesc) + le16_to_cpu(cfi->lengthOfImpUse) + cfi->lengthFileIdent, UDF_NAME_PAD); } /* file.c */ extern long udf_ioctl(struct file *, unsigned int, unsigned long); /* inode.c */ extern struct inode *__udf_iget(struct super_block *, struct kernel_lb_addr *, bool hidden_inode); static inline struct inode *udf_iget_special(struct super_block *sb, struct kernel_lb_addr *ino) { return __udf_iget(sb, ino, true); } static inline struct inode *udf_iget(struct super_block *sb, struct kernel_lb_addr *ino) { return __udf_iget(sb, ino, false); } extern int udf_expand_file_adinicb(struct inode *); extern struct buffer_head *udf_bread(struct inode *inode, udf_pblk_t block, int create, int *err); extern int udf_setsize(struct inode *, loff_t); extern void udf_evict_inode(struct inode *); extern int udf_write_inode(struct inode *, struct writeback_control *wbc); extern int inode_bmap(struct inode *inode, sector_t block, struct extent_position *pos, struct kernel_lb_addr *eloc, uint32_t *elen, sector_t *offset, int8_t *etype); int udf_get_block(struct inode *, sector_t, struct buffer_head *, int); extern int udf_setup_indirect_aext(struct inode *inode, udf_pblk_t 
block, struct extent_position *epos); extern int __udf_add_aext(struct inode *inode, struct extent_position *epos, struct kernel_lb_addr *eloc, uint32_t elen, int inc); extern int udf_add_aext(struct inode *, struct extent_position *, struct kernel_lb_addr *, uint32_t, int); extern void udf_write_aext(struct inode *, struct extent_position *, struct kernel_lb_addr *, uint32_t, int); extern int8_t udf_delete_aext(struct inode *, struct extent_position); extern int udf_next_aext(struct inode *inode, struct extent_position *epos, struct kernel_lb_addr *eloc, uint32_t *elen, int8_t *etype, int inc); extern int udf_current_aext(struct inode *inode, struct extent_position *epos, struct kernel_lb_addr *eloc, uint32_t *elen, int8_t *etype, int inc); extern void udf_update_extra_perms(struct inode *inode, umode_t mode); /* misc.c */ extern struct genericFormat *udf_add_extendedattr(struct inode *, uint32_t, uint32_t, uint8_t); extern struct genericFormat *udf_get_extendedattr(struct inode *, uint32_t, uint8_t); extern struct buffer_head *udf_read_tagged(struct super_block *, uint32_t, uint32_t, uint16_t *); extern struct buffer_head *udf_read_ptagged(struct super_block *, struct kernel_lb_addr *, uint32_t, uint16_t *); extern void udf_update_tag(char *, int); extern void udf_new_tag(char *, uint16_t, uint16_t, uint16_t, uint32_t, int); /* lowlevel.c */ extern unsigned int udf_get_last_session(struct super_block *); udf_pblk_t udf_get_last_block(struct super_block *); /* partition.c */ extern uint32_t udf_get_pblock(struct super_block *, uint32_t, uint16_t, uint32_t); extern uint32_t udf_get_pblock_virt15(struct super_block *, uint32_t, uint16_t, uint32_t); extern uint32_t udf_get_pblock_virt20(struct super_block *, uint32_t, uint16_t, uint32_t); extern uint32_t udf_get_pblock_spar15(struct super_block *, uint32_t, uint16_t, uint32_t); extern uint32_t udf_get_pblock_meta25(struct super_block *, uint32_t, uint16_t, uint32_t); extern int udf_relocate_blocks(struct super_block 
*, long, long *); static inline uint32_t udf_get_lb_pblock(struct super_block *sb, struct kernel_lb_addr *loc, uint32_t offset) { return udf_get_pblock(sb, loc->logicalBlockNum, loc->partitionReferenceNum, offset); } /* unicode.c */ extern int udf_get_filename(struct super_block *, const uint8_t *, int, uint8_t *, int); extern int udf_put_filename(struct super_block *, const uint8_t *, int, uint8_t *, int); extern int udf_dstrCS0toChar(struct super_block *, uint8_t *, int, const uint8_t *, int); /* ialloc.c */ extern void udf_free_inode(struct inode *); extern struct inode *udf_new_inode(struct inode *, umode_t); /* truncate.c */ extern void udf_truncate_tail_extent(struct inode *); extern void udf_discard_prealloc(struct inode *); extern int udf_truncate_extents(struct inode *); /* balloc.c */ extern void udf_free_blocks(struct super_block *, struct inode *, struct kernel_lb_addr *, uint32_t, uint32_t); extern int udf_prealloc_blocks(struct super_block *, struct inode *, uint16_t, uint32_t, uint32_t); extern udf_pblk_t udf_new_block(struct super_block *sb, struct inode *inode, uint16_t partition, uint32_t goal, int *err); /* directory.c */ int udf_fiiter_init(struct udf_fileident_iter *iter, struct inode *dir, loff_t pos); int udf_fiiter_advance(struct udf_fileident_iter *iter); void udf_fiiter_release(struct udf_fileident_iter *iter); void udf_fiiter_write_fi(struct udf_fileident_iter *iter, uint8_t *impuse); void udf_fiiter_update_elen(struct udf_fileident_iter *iter, uint32_t new_elen); int udf_fiiter_append_blk(struct udf_fileident_iter *iter); extern struct long_ad *udf_get_filelongad(uint8_t *, int, uint32_t *, int); extern struct short_ad *udf_get_fileshortad(uint8_t *, int, uint32_t *, int); /* udftime.c */ extern void udf_disk_stamp_to_time(struct timespec64 *dest, struct timestamp src); extern void udf_time_to_disk_stamp(struct timestamp *dest, struct timespec64 src); #endif /* __UDF_DECL_H */ |
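Two small pieces of arithmetic from the header above, the UDF_EXTENT_LENGTH_MASK / UDF_EXTENT_FLAG_MASK split and the padding done by udf_dir_entry_len(), can be sketched in userspace as follows. All sketch_* names are hypothetical. The sketch assumes that the two bits selected by the flag mask carry the extent type and that the fixed part of struct fileIdentDesc is 38 bytes; neither is stated in the header itself.

```c
#include <stdint.h>

#define SKETCH_EXTENT_LENGTH_MASK 0x3FFFFFFFu	/* mirrors UDF_EXTENT_LENGTH_MASK */
#define SKETCH_EXTENT_FLAG_MASK   0xC0000000u	/* mirrors UDF_EXTENT_FLAG_MASK */
#define SKETCH_NAME_PAD           4u		/* mirrors UDF_NAME_PAD */
#define SKETCH_FID_FIXED_SIZE     38u		/* assumed sizeof(struct fileIdentDesc) */

/* Low 30 bits of the raw on-disk field: the extent length in bytes. */
uint32_t sketch_extent_length(uint32_t raw)
{
	return raw & SKETCH_EXTENT_LENGTH_MASK;
}

/* Top 2 bits of the raw field, shifted down to a small integer. */
unsigned int sketch_extent_type(uint32_t raw)
{
	return (raw & SKETCH_EXTENT_FLAG_MASK) >> 30;
}

/* Directory entry size: fixed part + implementation use + name, padded to 4. */
unsigned int sketch_dir_entry_len(unsigned int imp_use_len, unsigned int ident_len)
{
	unsigned int len = SKETCH_FID_FIXED_SIZE + imp_use_len + ident_len;

	return (len + SKETCH_NAME_PAD - 1) & ~(SKETCH_NAME_PAD - 1);
}
```

Under the assumed 38-byte fixed part, an entry with no implementation-use bytes and a 5-byte identifier occupies 44 bytes after padding.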
// SPDX-License-Identifier: GPL-2.0+ /* * Copyright (C) 2003-2008 Takahiro Hirofuchi * Copyright (C) 2015-2016 Nobuo Iwata */ #include <linux/init.h> #include <linux/file.h> #include <linux/kernel.h> #include <linux/kthread.h> #include <linux/module.h> #include <linux/platform_device.h> #include <linux/slab.h> #include <linux/string_choices.h> #include "usbip_common.h" #include "vhci.h" #define DRIVER_AUTHOR "Takahiro Hirofuchi" #define DRIVER_DESC "USB/IP 'Virtual' Host Controller (VHCI) Driver" /* * TODO * - update root hub emulation * - move the emulation code to userland ? 
* porting to other operating systems * minimize kernel code * - add suspend/resume code * - clean up everything */ /* See usb gadget dummy hcd */ static int vhci_hub_status(struct usb_hcd *hcd, char *buff); static int vhci_hub_control(struct usb_hcd *hcd, u16 typeReq, u16 wValue, u16 wIndex, char *buff, u16 wLength); static int vhci_urb_enqueue(struct usb_hcd *hcd, struct urb *urb, gfp_t mem_flags); static int vhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status); static int vhci_start(struct usb_hcd *vhci_hcd); static void vhci_stop(struct usb_hcd *hcd); static int vhci_get_frame_number(struct usb_hcd *hcd); static const char driver_name[] = "vhci_hcd"; static const char driver_desc[] = "USB/IP Virtual Host Controller"; int vhci_num_controllers = VHCI_NR_HCS; struct vhci *vhcis; static const char * const bit_desc[] = { "CONNECTION", /*0*/ "ENABLE", /*1*/ "SUSPEND", /*2*/ "OVER_CURRENT", /*3*/ "RESET", /*4*/ "L1", /*5*/ "R6", /*6*/ "R7", /*7*/ "POWER", /*8*/ "LOWSPEED", /*9*/ "HIGHSPEED", /*10*/ "PORT_TEST", /*11*/ "INDICATOR", /*12*/ "R13", /*13*/ "R14", /*14*/ "R15", /*15*/ "C_CONNECTION", /*16*/ "C_ENABLE", /*17*/ "C_SUSPEND", /*18*/ "C_OVER_CURRENT", /*19*/ "C_RESET", /*20*/ "C_L1", /*21*/ "R22", /*22*/ "R23", /*23*/ "R24", /*24*/ "R25", /*25*/ "R26", /*26*/ "R27", /*27*/ "R28", /*28*/ "R29", /*29*/ "R30", /*30*/ "R31", /*31*/ }; static const char * const bit_desc_ss[] = { "CONNECTION", /*0*/ "ENABLE", /*1*/ "SUSPEND", /*2*/ "OVER_CURRENT", /*3*/ "RESET", /*4*/ "L1", /*5*/ "R6", /*6*/ "R7", /*7*/ "R8", /*8*/ "POWER", /*9*/ "HIGHSPEED", /*10*/ "PORT_TEST", /*11*/ "INDICATOR", /*12*/ "R13", /*13*/ "R14", /*14*/ "R15", /*15*/ "C_CONNECTION", /*16*/ "C_ENABLE", /*17*/ "C_SUSPEND", /*18*/ "C_OVER_CURRENT", /*19*/ "C_RESET", /*20*/ "C_BH_RESET", /*21*/ "C_LINK_STATE", /*22*/ "C_CONFIG_ERROR", /*23*/ "R24", /*24*/ "R25", /*25*/ "R26", /*26*/ "R27", /*27*/ "R28", /*28*/ "R29", /*29*/ "R30", /*30*/ "R31", /*31*/ }; static void dump_port_status_diff(u32 
prev_status, u32 new_status, bool usb3) { int i = 0; u32 bit = 1; const char * const *desc = bit_desc; if (usb3) desc = bit_desc_ss; pr_debug("status prev -> new: %08x -> %08x\n", prev_status, new_status); while (bit) { u32 prev = prev_status & bit; u32 new = new_status & bit; char change; if (!prev && new) change = '+'; else if (prev && !new) change = '-'; else change = ' '; if (prev || new) { pr_debug(" %c%s\n", change, desc[i]); if (bit == 1) /* USB_PORT_STAT_CONNECTION */ pr_debug(" %c%s\n", change, "USB_PORT_STAT_SPEED_5GBPS"); } bit <<= 1; i++; } pr_debug("\n"); } void rh_port_connect(struct vhci_device *vdev, enum usb_device_speed speed) { struct vhci_hcd *vhci_hcd = vdev_to_vhci_hcd(vdev); struct vhci *vhci = vhci_hcd->vhci; int rhport = vdev->rhport; u32 status; unsigned long flags; usbip_dbg_vhci_rh("rh_port_connect %d\n", rhport); spin_lock_irqsave(&vhci->lock, flags); status = vhci_hcd->port_status[rhport]; status |= USB_PORT_STAT_CONNECTION | (1 << USB_PORT_FEAT_C_CONNECTION); switch (speed) { case USB_SPEED_HIGH: status |= USB_PORT_STAT_HIGH_SPEED; break; case USB_SPEED_LOW: status |= USB_PORT_STAT_LOW_SPEED; break; default: break; } vhci_hcd->port_status[rhport] = status; spin_unlock_irqrestore(&vhci->lock, flags); usb_hcd_poll_rh_status(vhci_hcd_to_hcd(vhci_hcd)); } static void rh_port_disconnect(struct vhci_device *vdev) { struct vhci_hcd *vhci_hcd = vdev_to_vhci_hcd(vdev); struct vhci *vhci = vhci_hcd->vhci; int rhport = vdev->rhport; u32 status; unsigned long flags; usbip_dbg_vhci_rh("rh_port_disconnect %d\n", rhport); spin_lock_irqsave(&vhci->lock, flags); status = vhci_hcd->port_status[rhport]; status &= ~USB_PORT_STAT_CONNECTION; status |= (1 << USB_PORT_FEAT_C_CONNECTION); vhci_hcd->port_status[rhport] = status; spin_unlock_irqrestore(&vhci->lock, flags); usb_hcd_poll_rh_status(vhci_hcd_to_hcd(vhci_hcd)); } #define PORT_C_MASK \ ((USB_PORT_STAT_C_CONNECTION \ | USB_PORT_STAT_C_ENABLE \ | USB_PORT_STAT_C_SUSPEND \ | USB_PORT_STAT_C_OVERCURRENT 
\ | USB_PORT_STAT_C_RESET) << 16) /* * Returns 0 if the status hasn't changed, or the number of bytes in buf. * Ports are 0-indexed from the HCD point of view, * and 1-indexed from the USB core point of view. * * @buf: a bitmap showing which ports have a changed status. * bit 0: reserved * bit 1: the status of port 0 has been changed. * bit 2: the status of port 1 has been changed. * ... */ static int vhci_hub_status(struct usb_hcd *hcd, char *buf) { struct vhci_hcd *vhci_hcd = hcd_to_vhci_hcd(hcd); struct vhci *vhci = vhci_hcd->vhci; int retval = DIV_ROUND_UP(VHCI_HC_PORTS + 1, 8); int rhport; int changed = 0; unsigned long flags; memset(buf, 0, retval); spin_lock_irqsave(&vhci->lock, flags); if (!HCD_HW_ACCESSIBLE(hcd)) { usbip_dbg_vhci_rh("hw accessible flag not on?\n"); goto done; } /* check pseudo status register for each port */ for (rhport = 0; rhport < VHCI_HC_PORTS; rhport++) { if ((vhci_hcd->port_status[rhport] & PORT_C_MASK)) { /* The status of a port has been changed, */ usbip_dbg_vhci_rh("port %d status changed\n", rhport); buf[(rhport + 1) / 8] |= 1 << (rhport + 1) % 8; changed = 1; } } if ((hcd->state == HC_STATE_SUSPENDED) && (changed == 1)) usb_hcd_resume_root_hub(hcd); done: spin_unlock_irqrestore(&vhci->lock, flags); return changed ? 
retval : 0; } /* usb 3.0 root hub device descriptor */ static struct { struct usb_bos_descriptor bos; struct usb_ss_cap_descriptor ss_cap; } __packed usb3_bos_desc = { .bos = { .bLength = USB_DT_BOS_SIZE, .bDescriptorType = USB_DT_BOS, .wTotalLength = cpu_to_le16(sizeof(usb3_bos_desc)), .bNumDeviceCaps = 1, }, .ss_cap = { .bLength = USB_DT_USB_SS_CAP_SIZE, .bDescriptorType = USB_DT_DEVICE_CAPABILITY, .bDevCapabilityType = USB_SS_CAP_TYPE, .wSpeedSupported = cpu_to_le16(USB_5GBPS_OPERATION), .bFunctionalitySupport = ilog2(USB_5GBPS_OPERATION), }, }; static inline void ss_hub_descriptor(struct usb_hub_descriptor *desc) { memset(desc, 0, sizeof *desc); desc->bDescriptorType = USB_DT_SS_HUB; desc->bDescLength = 12; desc->wHubCharacteristics = cpu_to_le16( HUB_CHAR_INDV_PORT_LPSM | HUB_CHAR_COMMON_OCPM); desc->bNbrPorts = VHCI_HC_PORTS; desc->u.ss.bHubHdrDecLat = 0x04; /* Worst case: 0.4 micro sec*/ desc->u.ss.DeviceRemovable = 0xffff; } static inline void hub_descriptor(struct usb_hub_descriptor *desc) { int width; memset(desc, 0, sizeof(*desc)); desc->bDescriptorType = USB_DT_HUB; desc->wHubCharacteristics = cpu_to_le16( HUB_CHAR_INDV_PORT_LPSM | HUB_CHAR_COMMON_OCPM); desc->bNbrPorts = VHCI_HC_PORTS; BUILD_BUG_ON(VHCI_HC_PORTS > USB_MAXCHILDREN); width = desc->bNbrPorts / 8 + 1; desc->bDescLength = USB_DT_HUB_NONVAR_SIZE + 2 * width; memset(&desc->u.hs.DeviceRemovable[0], 0, width); memset(&desc->u.hs.DeviceRemovable[width], 0xff, width); } static int vhci_hub_control(struct usb_hcd *hcd, u16 typeReq, u16 wValue, u16 wIndex, char *buf, u16 wLength) { struct vhci_hcd *vhci_hcd; struct vhci *vhci; int retval = 0; int rhport = -1; unsigned long flags; bool invalid_rhport = false; u32 prev_port_status[VHCI_HC_PORTS]; if (!HCD_HW_ACCESSIBLE(hcd)) return -ETIMEDOUT; /* * NOTE: * wIndex (bits 0-7) shows the port number and begins from 1? 
*/ wIndex = ((__u8)(wIndex & 0x00ff)); usbip_dbg_vhci_rh("typeReq %x wValue %x wIndex %x\n", typeReq, wValue, wIndex); /* * wIndex can be 0 for some request types (typeReq). rhport is * in valid range when wIndex >= 1 and <= VHCI_HC_PORTS. * * Reference port_status[] only with valid rhport when * invalid_rhport is false. */ if (wIndex < 1 || wIndex > VHCI_HC_PORTS) { invalid_rhport = true; if (wIndex > VHCI_HC_PORTS) pr_err("invalid port number %d\n", wIndex); } else rhport = wIndex - 1; vhci_hcd = hcd_to_vhci_hcd(hcd); vhci = vhci_hcd->vhci; spin_lock_irqsave(&vhci->lock, flags); /* store old status and compare now and old later */ if (usbip_dbg_flag_vhci_rh) { if (!invalid_rhport) memcpy(prev_port_status, vhci_hcd->port_status, sizeof(prev_port_status)); } switch (typeReq) { case ClearHubFeature: usbip_dbg_vhci_rh(" ClearHubFeature\n"); break; case ClearPortFeature: if (invalid_rhport) { pr_err("invalid port number %d\n", wIndex); goto error; } switch (wValue) { case USB_PORT_FEAT_SUSPEND: if (hcd->speed >= HCD_USB3) { pr_err(" ClearPortFeature: USB_PORT_FEAT_SUSPEND req not " "supported for USB 3.0 roothub\n"); goto error; } usbip_dbg_vhci_rh( " ClearPortFeature: USB_PORT_FEAT_SUSPEND\n"); if (vhci_hcd->port_status[rhport] & USB_PORT_STAT_SUSPEND) { /* 20msec signaling */ vhci_hcd->resuming = 1; vhci_hcd->re_timeout = jiffies + msecs_to_jiffies(20); } break; case USB_PORT_FEAT_POWER: usbip_dbg_vhci_rh( " ClearPortFeature: USB_PORT_FEAT_POWER\n"); if (hcd->speed >= HCD_USB3) vhci_hcd->port_status[rhport] &= ~USB_SS_PORT_STAT_POWER; else vhci_hcd->port_status[rhport] &= ~USB_PORT_STAT_POWER; break; default: usbip_dbg_vhci_rh(" ClearPortFeature: default %x\n", wValue); if (wValue >= 32) goto error; vhci_hcd->port_status[rhport] &= ~(1 << wValue); break; } break; case GetHubDescriptor: usbip_dbg_vhci_rh(" GetHubDescriptor\n"); if (hcd->speed >= HCD_USB3 && (wLength < USB_DT_SS_HUB_SIZE || wValue != (USB_DT_SS_HUB << 8))) { pr_err("Wrong hub descriptor type for USB 
3.0 roothub.\n"); goto error; } if (hcd->speed >= HCD_USB3) ss_hub_descriptor((struct usb_hub_descriptor *) buf); else hub_descriptor((struct usb_hub_descriptor *) buf); break; case DeviceRequest | USB_REQ_GET_DESCRIPTOR: if (hcd->speed < HCD_USB3) goto error; if ((wValue >> 8) != USB_DT_BOS) goto error; memcpy(buf, &usb3_bos_desc, sizeof(usb3_bos_desc)); retval = sizeof(usb3_bos_desc); break; case GetHubStatus: usbip_dbg_vhci_rh(" GetHubStatus\n"); *(__le32 *) buf = cpu_to_le32(0); break; case GetPortStatus: usbip_dbg_vhci_rh(" GetPortStatus port %x\n", wIndex); if (invalid_rhport) { pr_err("invalid port number %d\n", wIndex); retval = -EPIPE; goto error; } /* we do not care about resume. */ /* whoever resets or resumes must GetPortStatus to * complete it!! */ if (vhci_hcd->resuming && time_after(jiffies, vhci_hcd->re_timeout)) { vhci_hcd->port_status[rhport] |= (1 << USB_PORT_FEAT_C_SUSPEND); vhci_hcd->port_status[rhport] &= ~(1 << USB_PORT_FEAT_SUSPEND); vhci_hcd->resuming = 0; vhci_hcd->re_timeout = 0; } if ((vhci_hcd->port_status[rhport] & (1 << USB_PORT_FEAT_RESET)) != 0 && time_after(jiffies, vhci_hcd->re_timeout)) { vhci_hcd->port_status[rhport] |= (1 << USB_PORT_FEAT_C_RESET); vhci_hcd->port_status[rhport] &= ~(1 << USB_PORT_FEAT_RESET); vhci_hcd->re_timeout = 0; /* * A few drivers do usb reset during probe when * the device could be in VDEV_ST_USED state */ if (vhci_hcd->vdev[rhport].ud.status == VDEV_ST_NOTASSIGNED || vhci_hcd->vdev[rhport].ud.status == VDEV_ST_USED) { usbip_dbg_vhci_rh( " enable rhport %d (status %u)\n", rhport, vhci_hcd->vdev[rhport].ud.status); vhci_hcd->port_status[rhport] |= USB_PORT_STAT_ENABLE; } if (hcd->speed < HCD_USB3) { switch (vhci_hcd->vdev[rhport].speed) { case USB_SPEED_HIGH: vhci_hcd->port_status[rhport] |= USB_PORT_STAT_HIGH_SPEED; break; case USB_SPEED_LOW: vhci_hcd->port_status[rhport] |= USB_PORT_STAT_LOW_SPEED; break; default: pr_err("vhci_device speed not set\n"); break; } } } ((__le16 *) buf)[0] = 
cpu_to_le16(vhci_hcd->port_status[rhport]); ((__le16 *) buf)[1] = cpu_to_le16(vhci_hcd->port_status[rhport] >> 16); usbip_dbg_vhci_rh(" GetPortStatus bye %x %x\n", ((u16 *)buf)[0], ((u16 *)buf)[1]); break; case SetHubFeature: usbip_dbg_vhci_rh(" SetHubFeature\n"); retval = -EPIPE; break; case SetPortFeature: switch (wValue) { case USB_PORT_FEAT_LINK_STATE: usbip_dbg_vhci_rh( " SetPortFeature: USB_PORT_FEAT_LINK_STATE\n"); if (hcd->speed < HCD_USB3) { pr_err("USB_PORT_FEAT_LINK_STATE req not " "supported for USB 2.0 roothub\n"); goto error; } /* * Since this is dummy we don't have an actual link so * there is nothing to do for the SET_LINK_STATE cmd */ break; case USB_PORT_FEAT_U1_TIMEOUT: usbip_dbg_vhci_rh( " SetPortFeature: USB_PORT_FEAT_U1_TIMEOUT\n"); fallthrough; case USB_PORT_FEAT_U2_TIMEOUT: usbip_dbg_vhci_rh( " SetPortFeature: USB_PORT_FEAT_U2_TIMEOUT\n"); /* TODO: add suspend/resume support! */ if (hcd->speed < HCD_USB3) { pr_err("USB_PORT_FEAT_U1/2_TIMEOUT req not " "supported for USB 2.0 roothub\n"); goto error; } break; case USB_PORT_FEAT_SUSPEND: usbip_dbg_vhci_rh( " SetPortFeature: USB_PORT_FEAT_SUSPEND\n"); /* Applicable only for USB2.0 hub */ if (hcd->speed >= HCD_USB3) { pr_err("USB_PORT_FEAT_SUSPEND req not " "supported for USB 3.0 roothub\n"); goto error; } if (invalid_rhport) { pr_err("invalid port number %d\n", wIndex); goto error; } vhci_hcd->port_status[rhport] |= USB_PORT_STAT_SUSPEND; break; case USB_PORT_FEAT_POWER: usbip_dbg_vhci_rh( " SetPortFeature: USB_PORT_FEAT_POWER\n"); if (invalid_rhport) { pr_err("invalid port number %d\n", wIndex); goto error; } if (hcd->speed >= HCD_USB3) vhci_hcd->port_status[rhport] |= USB_SS_PORT_STAT_POWER; else vhci_hcd->port_status[rhport] |= USB_PORT_STAT_POWER; break; case USB_PORT_FEAT_BH_PORT_RESET: usbip_dbg_vhci_rh( " SetPortFeature: USB_PORT_FEAT_BH_PORT_RESET\n"); if (invalid_rhport) { pr_err("invalid port number %d\n", wIndex); goto error; } /* Applicable only for USB3.0 hub */ if (hcd->speed < 
HCD_USB3) { pr_err("USB_PORT_FEAT_BH_PORT_RESET req not " "supported for USB 2.0 roothub\n"); goto error; } fallthrough; case USB_PORT_FEAT_RESET: usbip_dbg_vhci_rh( " SetPortFeature: USB_PORT_FEAT_RESET\n"); if (invalid_rhport) { pr_err("invalid port number %d\n", wIndex); goto error; } /* if it's already enabled, disable */ if (hcd->speed >= HCD_USB3) { vhci_hcd->port_status[rhport] = 0; vhci_hcd->port_status[rhport] = (USB_SS_PORT_STAT_POWER | USB_PORT_STAT_CONNECTION | USB_PORT_STAT_RESET); } else if (vhci_hcd->port_status[rhport] & USB_PORT_STAT_ENABLE) { vhci_hcd->port_status[rhport] &= ~(USB_PORT_STAT_ENABLE | USB_PORT_STAT_LOW_SPEED | USB_PORT_STAT_HIGH_SPEED); } /* 50msec reset signaling */ vhci_hcd->re_timeout = jiffies + msecs_to_jiffies(50); fallthrough; default: usbip_dbg_vhci_rh(" SetPortFeature: default %d\n", wValue); if (invalid_rhport) { pr_err("invalid port number %d\n", wIndex); goto error; } if (wValue >= 32) goto error; if (hcd->speed >= HCD_USB3) { if ((vhci_hcd->port_status[rhport] & USB_SS_PORT_STAT_POWER) != 0) { vhci_hcd->port_status[rhport] |= (1 << wValue); } } else if ((vhci_hcd->port_status[rhport] & USB_PORT_STAT_POWER) != 0) { vhci_hcd->port_status[rhport] |= (1 << wValue); } } break; case GetPortErrorCount: usbip_dbg_vhci_rh(" GetPortErrorCount\n"); if (hcd->speed < HCD_USB3) { pr_err("GetPortErrorCount req not " "supported for USB 2.0 roothub\n"); goto error; } /* We'll always return 0 since this is a dummy hub */ *(__le32 *) buf = cpu_to_le32(0); break; case SetHubDepth: usbip_dbg_vhci_rh(" SetHubDepth\n"); if (hcd->speed < HCD_USB3) { pr_err("SetHubDepth req not supported for " "USB 2.0 roothub\n"); goto error; } break; default: pr_err("default hub control req: %04x v%04x i%04x l%d\n", typeReq, wValue, wIndex, wLength); error: /* "protocol stall" on error */ retval = -EPIPE; } if (usbip_dbg_flag_vhci_rh) { pr_debug("port %d\n", rhport); /* Only dump valid port status */ if (!invalid_rhport) { 
dump_port_status_diff(prev_port_status[rhport], vhci_hcd->port_status[rhport], hcd->speed >= HCD_USB3); } } usbip_dbg_vhci_rh(" bye\n"); spin_unlock_irqrestore(&vhci->lock, flags); if (!invalid_rhport && (vhci_hcd->port_status[rhport] & PORT_C_MASK) != 0) { usb_hcd_poll_rh_status(hcd); } return retval; } static void vhci_tx_urb(struct urb *urb, struct vhci_device *vdev) { struct vhci_priv *priv; struct vhci_hcd *vhci_hcd = vdev_to_vhci_hcd(vdev); unsigned long flags; priv = kzalloc(sizeof(struct vhci_priv), GFP_ATOMIC); if (!priv) { usbip_event_add(&vdev->ud, VDEV_EVENT_ERROR_MALLOC); return; } spin_lock_irqsave(&vdev->priv_lock, flags); priv->seqnum = (u32)atomic_inc_return(&vhci_hcd->seqnum); if (priv->seqnum == 0xffff) dev_info(&urb->dev->dev, "seqnum max\n"); priv->vdev = vdev; priv->urb = urb; urb->hcpriv = (void *) priv; list_add_tail(&priv->list, &vdev->priv_tx); wake_up(&vdev->waitq_tx); spin_unlock_irqrestore(&vdev->priv_lock, flags); } static int vhci_urb_enqueue(struct usb_hcd *hcd, struct urb *urb, gfp_t mem_flags) { struct vhci_hcd *vhci_hcd = hcd_to_vhci_hcd(hcd); struct vhci *vhci = vhci_hcd->vhci; struct device *dev = &urb->dev->dev; u8 portnum = urb->dev->portnum; int ret = 0; struct vhci_device *vdev; unsigned long flags; if (portnum > VHCI_HC_PORTS) { pr_err("invalid port number %d\n", portnum); return -ENODEV; } vdev = &vhci_hcd->vdev[portnum-1]; if (!urb->transfer_buffer && !urb->num_sgs && urb->transfer_buffer_length) { dev_dbg(dev, "Null URB transfer buffer\n"); return -EINVAL; } spin_lock_irqsave(&vhci->lock, flags); if (urb->status != -EINPROGRESS) { dev_err(dev, "URB already unlinked!, status %d\n", urb->status); spin_unlock_irqrestore(&vhci->lock, flags); return urb->status; } /* refuse enqueue for dead connection */ spin_lock(&vdev->ud.lock); if (vdev->ud.status == VDEV_ST_NULL || vdev->ud.status == VDEV_ST_ERROR) { dev_err(dev, "enqueue for inactive port %d\n", vdev->rhport); spin_unlock(&vdev->ud.lock); 
spin_unlock_irqrestore(&vhci->lock, flags); return -ENODEV; } spin_unlock(&vdev->ud.lock); ret = usb_hcd_link_urb_to_ep(hcd, urb); if (ret) goto no_need_unlink; /* * The enumeration process is as follows; * * 1. Get_Descriptor request to DevAddrs(0) EndPoint(0) * to get max packet length of default pipe * * 2. Set_Address request to DevAddr(0) EndPoint(0) * */ if (usb_pipedevice(urb->pipe) == 0) { struct usb_device *old; __u8 type = usb_pipetype(urb->pipe); struct usb_ctrlrequest *ctrlreq = (struct usb_ctrlrequest *) urb->setup_packet; if (type != PIPE_CONTROL || !ctrlreq) { dev_err(dev, "invalid request to devnum 0\n"); ret = -EINVAL; goto no_need_xmit; } old = vdev->udev; switch (ctrlreq->bRequest) { case USB_REQ_SET_ADDRESS: /* set_address may come when a device is reset */ dev_info(dev, "SetAddress Request (%d) to port %d\n", ctrlreq->wValue, vdev->rhport); vdev->udev = usb_get_dev(urb->dev); usb_put_dev(old); spin_lock(&vdev->ud.lock); vdev->ud.status = VDEV_ST_USED; spin_unlock(&vdev->ud.lock); if (urb->status == -EINPROGRESS) { /* This request is successfully completed. */ /* If not -EINPROGRESS, possibly unlinked. 
*/ urb->status = 0; } goto no_need_xmit; case USB_REQ_GET_DESCRIPTOR: if (ctrlreq->wValue == cpu_to_le16(USB_DT_DEVICE << 8)) usbip_dbg_vhci_hc( "Not yet?:Get_Descriptor to device 0 (get max pipe size)\n"); vdev->udev = usb_get_dev(urb->dev); usb_put_dev(old); goto out; default: /* NOT REACHED */ dev_err(dev, "invalid request to devnum 0 bRequest %u, wValue %u\n", ctrlreq->bRequest, ctrlreq->wValue); ret = -EINVAL; goto no_need_xmit; } } out: vhci_tx_urb(urb, vdev); spin_unlock_irqrestore(&vhci->lock, flags); return 0; no_need_xmit: usb_hcd_unlink_urb_from_ep(hcd, urb); no_need_unlink: spin_unlock_irqrestore(&vhci->lock, flags); if (!ret) { /* usb_hcd_giveback_urb() should be called with * irqs disabled */ local_irq_disable(); usb_hcd_giveback_urb(hcd, urb, urb->status); local_irq_enable(); } return ret; } /* * vhci_rx gives back the urb after receiving the reply of the urb. If an * unlink pdu is sent or not, vhci_rx receives a normal return pdu and gives * back its urb. For the driver unlinking the urb, the content of the urb is * not important, but the calling to its completion handler is important; the * completion of unlinking is notified by the completion handler. * * * CLIENT SIDE * * - When vhci_hcd receives RET_SUBMIT, * * - case 1a). the urb of the pdu is not unlinking. * - normal case * => just give back the urb * * - case 1b). the urb of the pdu is unlinking. * - usbip.ko will return a reply of the unlinking request. * => give back the urb now and go to case 2b). * * - When vhci_hcd receives RET_UNLINK, * * - case 2a). a submit request is still pending in vhci_hcd. * - urb was really pending in usbip.ko and urb_unlink_urb() was * completed there. * => free a pending submit request * => notify unlink completeness by giving back the urb * * - case 2b). a submit request is *not* pending in vhci_hcd. * - urb was already given back to the core driver. * => do not give back the urb * * * SERVER SIDE * * - When usbip receives CMD_UNLINK, * * - case 3a). 
the urb of the unlink request is now in submission. * => do usb_unlink_urb(). * => after the unlink is completed, send RET_UNLINK. * * - case 3b). the urb of the unlink request is not in submission. * - may be already completed or never be received * => send RET_UNLINK * */ static int vhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status) { struct vhci_hcd *vhci_hcd = hcd_to_vhci_hcd(hcd); struct vhci *vhci = vhci_hcd->vhci; struct vhci_priv *priv; struct vhci_device *vdev; unsigned long flags; spin_lock_irqsave(&vhci->lock, flags); priv = urb->hcpriv; if (!priv) { /* URB was never linked! or will be soon given back by * vhci_rx. */ spin_unlock_irqrestore(&vhci->lock, flags); return -EIDRM; } { int ret = 0; ret = usb_hcd_check_unlink_urb(hcd, urb, status); if (ret) { spin_unlock_irqrestore(&vhci->lock, flags); return ret; } } /* send unlink request here? */ vdev = priv->vdev; if (!vdev->ud.tcp_socket) { /* tcp connection is closed */ spin_lock(&vdev->priv_lock); list_del(&priv->list); kfree(priv); urb->hcpriv = NULL; spin_unlock(&vdev->priv_lock); /* * If tcp connection is alive, we have sent CMD_UNLINK. * vhci_rx will receive RET_UNLINK and give back the URB. * Otherwise, we give back it here. 
*/ usb_hcd_unlink_urb_from_ep(hcd, urb); spin_unlock_irqrestore(&vhci->lock, flags); usb_hcd_giveback_urb(hcd, urb, urb->status); spin_lock_irqsave(&vhci->lock, flags); } else { /* tcp connection is alive */ struct vhci_unlink *unlink; spin_lock(&vdev->priv_lock); /* setup CMD_UNLINK pdu */ unlink = kzalloc(sizeof(struct vhci_unlink), GFP_ATOMIC); if (!unlink) { spin_unlock(&vdev->priv_lock); spin_unlock_irqrestore(&vhci->lock, flags); usbip_event_add(&vdev->ud, VDEV_EVENT_ERROR_MALLOC); return -ENOMEM; } unlink->seqnum = atomic_inc_return(&vhci_hcd->seqnum); if (unlink->seqnum == 0xffff) pr_info("seqnum max\n"); unlink->unlink_seqnum = priv->seqnum; /* send cmd_unlink and try to cancel the pending URB in the * peer */ list_add_tail(&unlink->list, &vdev->unlink_tx); wake_up(&vdev->waitq_tx); spin_unlock(&vdev->priv_lock); } spin_unlock_irqrestore(&vhci->lock, flags); usbip_dbg_vhci_hc("leave\n"); return 0; } static void vhci_cleanup_unlink_list(struct vhci_device *vdev, struct list_head *unlink_list) { struct vhci_hcd *vhci_hcd = vdev_to_vhci_hcd(vdev); struct usb_hcd *hcd = vhci_hcd_to_hcd(vhci_hcd); struct vhci *vhci = vhci_hcd->vhci; struct vhci_unlink *unlink, *tmp; unsigned long flags; spin_lock_irqsave(&vhci->lock, flags); spin_lock(&vdev->priv_lock); list_for_each_entry_safe(unlink, tmp, unlink_list, list) { struct urb *urb; urb = pickup_urb_and_free_priv(vdev, unlink->unlink_seqnum); if (!urb) { list_del(&unlink->list); kfree(unlink); continue; } urb->status = -ENODEV; usb_hcd_unlink_urb_from_ep(hcd, urb); list_del(&unlink->list); spin_unlock(&vdev->priv_lock); spin_unlock_irqrestore(&vhci->lock, flags); usb_hcd_giveback_urb(hcd, urb, urb->status); spin_lock_irqsave(&vhci->lock, flags); spin_lock(&vdev->priv_lock); kfree(unlink); } spin_unlock(&vdev->priv_lock); spin_unlock_irqrestore(&vhci->lock, flags); } static void vhci_device_unlink_cleanup(struct vhci_device *vdev) { /* give back URB of unsent unlink request */ vhci_cleanup_unlink_list(vdev, 
&vdev->unlink_tx); /* give back URB of unanswered unlink request */ vhci_cleanup_unlink_list(vdev, &vdev->unlink_rx); } /* * The important thing is that only one context begins cleanup. * This is why error handling and cleanup become simple. * We do not want to consider race condition as possible. */ static void vhci_shutdown_connection(struct usbip_device *ud) { struct vhci_device *vdev = container_of(ud, struct vhci_device, ud); /* need this? see stub_dev.c */ if (ud->tcp_socket) { pr_debug("shutdown tcp_socket %d\n", ud->sockfd); kernel_sock_shutdown(ud->tcp_socket, SHUT_RDWR); } /* kill threads related to this sdev */ if (vdev->ud.tcp_rx) { kthread_stop_put(vdev->ud.tcp_rx); vdev->ud.tcp_rx = NULL; } if (vdev->ud.tcp_tx) { kthread_stop_put(vdev->ud.tcp_tx); vdev->ud.tcp_tx = NULL; } pr_info("stop threads\n"); /* active connection is closed */ if (vdev->ud.tcp_socket) { sockfd_put(vdev->ud.tcp_socket); vdev->ud.tcp_socket = NULL; vdev->ud.sockfd = -1; } pr_info("release socket\n"); vhci_device_unlink_cleanup(vdev); /* * rh_port_disconnect() is a trigger of ... * usb_disable_device(): * disable all the endpoints for a USB device. * usb_disable_endpoint(): * disable endpoints. pending urbs are unlinked(dequeued). * * NOTE: After calling rh_port_disconnect(), the USB device drivers of a * detached device should release used urbs in a cleanup function (i.e. * xxx_disconnect()). Therefore, vhci_hcd does not need to release * pushed urbs and their private data in this function. * * NOTE: vhci_dequeue() must be considered carefully. When shutting down * a connection, vhci_shutdown_connection() expects vhci_dequeue() * gives back pushed urbs and frees their private data by request of * the cleanup function of a USB driver. When unlinking a urb with an * active connection, vhci_dequeue() does not give back the urb which * is actually given back by vhci_rx after receiving its return pdu. 
* */ rh_port_disconnect(vdev); pr_info("disconnect device\n"); } static void vhci_device_reset(struct usbip_device *ud) { struct vhci_device *vdev = container_of(ud, struct vhci_device, ud); struct usb_device *old = vdev->udev; unsigned long flags; spin_lock_irqsave(&ud->lock, flags); vdev->speed = 0; vdev->devid = 0; vdev->udev = NULL; usb_put_dev(old); if (ud->tcp_socket) { sockfd_put(ud->tcp_socket); ud->tcp_socket = NULL; ud->sockfd = -1; } ud->status = VDEV_ST_NULL; spin_unlock_irqrestore(&ud->lock, flags); } static void vhci_device_unusable(struct usbip_device *ud) { unsigned long flags; spin_lock_irqsave(&ud->lock, flags); ud->status = VDEV_ST_ERROR; spin_unlock_irqrestore(&ud->lock, flags); } static void vhci_device_init(struct vhci_device *vdev) { memset(vdev, 0, sizeof(struct vhci_device)); vdev->ud.side = USBIP_VHCI; vdev->ud.status = VDEV_ST_NULL; spin_lock_init(&vdev->ud.lock); mutex_init(&vdev->ud.sysfs_lock); INIT_LIST_HEAD(&vdev->priv_rx); INIT_LIST_HEAD(&vdev->priv_tx); INIT_LIST_HEAD(&vdev->unlink_tx); INIT_LIST_HEAD(&vdev->unlink_rx); spin_lock_init(&vdev->priv_lock); init_waitqueue_head(&vdev->waitq_tx); vdev->ud.eh_ops.shutdown = vhci_shutdown_connection; vdev->ud.eh_ops.reset = vhci_device_reset; vdev->ud.eh_ops.unusable = vhci_device_unusable; usbip_start_eh(&vdev->ud); } static int hcd_name_to_id(const char *name) { char *c; long val; int ret; c = strchr(name, '.'); if (c == NULL) return 0; ret = kstrtol(c+1, 10, &val); if (ret < 0) return ret; return val; } static int vhci_setup(struct usb_hcd *hcd) { struct vhci *vhci = *((void **)dev_get_platdata(hcd->self.controller)); if (usb_hcd_is_primary_hcd(hcd)) { vhci->vhci_hcd_hs = hcd_to_vhci_hcd(hcd); vhci->vhci_hcd_hs->vhci = vhci; /* * Mark the first roothub as being USB 2.0. 
* The USB 3.0 roothub will be registered later by * vhci_hcd_probe() */ hcd->speed = HCD_USB2; hcd->self.root_hub->speed = USB_SPEED_HIGH; } else { vhci->vhci_hcd_ss = hcd_to_vhci_hcd(hcd); vhci->vhci_hcd_ss->vhci = vhci; hcd->speed = HCD_USB31; hcd->self.root_hub->speed = USB_SPEED_SUPER_PLUS; } /* accept arbitrarily long scatter-gather lists */ hcd->self.sg_tablesize = ~0; hcd->self.no_sg_constraint = 1; return 0; } static int vhci_start(struct usb_hcd *hcd) { struct vhci_hcd *vhci_hcd = hcd_to_vhci_hcd(hcd); int id, rhport; int err; usbip_dbg_vhci_hc("enter vhci_start\n"); if (usb_hcd_is_primary_hcd(hcd)) spin_lock_init(&vhci_hcd->vhci->lock); /* initialize private data of usb_hcd */ for (rhport = 0; rhport < VHCI_HC_PORTS; rhport++) { struct vhci_device *vdev = &vhci_hcd->vdev[rhport]; vhci_device_init(vdev); vdev->rhport = rhport; } atomic_set(&vhci_hcd->seqnum, 0); hcd->power_budget = 0; /* no limit */ hcd->uses_new_polling = 1; #ifdef CONFIG_USB_OTG hcd->self.otg_port = 1; #endif id = hcd_name_to_id(hcd_name(hcd)); if (id < 0) { pr_err("invalid vhci name %s\n", hcd_name(hcd)); return -EINVAL; } /* vhci_hcd is now ready to be controlled through sysfs */ if (id == 0 && usb_hcd_is_primary_hcd(hcd)) { err = vhci_init_attr_group(); if (err) { dev_err(hcd_dev(hcd), "init attr group failed, err = %d\n", err); return err; } err = sysfs_create_group(&hcd_dev(hcd)->kobj, &vhci_attr_group); if (err) { dev_err(hcd_dev(hcd), "create sysfs files failed, err = %d\n", err); vhci_finish_attr_group(); return err; } pr_info("created sysfs %s\n", hcd_name(hcd)); } return 0; } static void vhci_stop(struct usb_hcd *hcd) { struct vhci_hcd *vhci_hcd = hcd_to_vhci_hcd(hcd); int id, rhport; usbip_dbg_vhci_hc("stop VHCI controller\n"); /* 1. remove the userland interface of vhci_hcd */ id = hcd_name_to_id(hcd_name(hcd)); if (id == 0 && usb_hcd_is_primary_hcd(hcd)) { sysfs_remove_group(&hcd_dev(hcd)->kobj, &vhci_attr_group); vhci_finish_attr_group(); } /* 2. 
shutdown all the ports of vhci_hcd */ for (rhport = 0; rhport < VHCI_HC_PORTS; rhport++) { struct vhci_device *vdev = &vhci_hcd->vdev[rhport]; usbip_event_add(&vdev->ud, VDEV_EVENT_REMOVED); usbip_stop_eh(&vdev->ud); } } static int vhci_get_frame_number(struct usb_hcd *hcd) { dev_err_ratelimited(&hcd->self.root_hub->dev, "Not yet implemented\n"); return 0; } #ifdef CONFIG_PM /* FIXME: suspend/resume */ static int vhci_bus_suspend(struct usb_hcd *hcd) { struct vhci *vhci = *((void **)dev_get_platdata(hcd->self.controller)); unsigned long flags; dev_dbg(&hcd->self.root_hub->dev, "%s\n", __func__); spin_lock_irqsave(&vhci->lock, flags); hcd->state = HC_STATE_SUSPENDED; spin_unlock_irqrestore(&vhci->lock, flags); return 0; } static int vhci_bus_resume(struct usb_hcd *hcd) { struct vhci *vhci = *((void **)dev_get_platdata(hcd->self.controller)); int rc = 0; unsigned long flags; dev_dbg(&hcd->self.root_hub->dev, "%s\n", __func__); spin_lock_irqsave(&vhci->lock, flags); if (!HCD_HW_ACCESSIBLE(hcd)) rc = -ESHUTDOWN; else hcd->state = HC_STATE_RUNNING; spin_unlock_irqrestore(&vhci->lock, flags); return rc; } #else #define vhci_bus_suspend NULL #define vhci_bus_resume NULL #endif /* Change a group of bulk endpoints to support multiple stream IDs */ static int vhci_alloc_streams(struct usb_hcd *hcd, struct usb_device *udev, struct usb_host_endpoint **eps, unsigned int num_eps, unsigned int num_streams, gfp_t mem_flags) { dev_dbg(&hcd->self.root_hub->dev, "vhci_alloc_streams not implemented\n"); return 0; } /* Reverts a group of bulk endpoints back to not using stream IDs. 
*/ static int vhci_free_streams(struct usb_hcd *hcd, struct usb_device *udev, struct usb_host_endpoint **eps, unsigned int num_eps, gfp_t mem_flags) { dev_dbg(&hcd->self.root_hub->dev, "vhci_free_streams not implemented\n"); return 0; } static const struct hc_driver vhci_hc_driver = { .description = driver_name, .product_desc = driver_desc, .hcd_priv_size = sizeof(struct vhci_hcd), .flags = HCD_USB31 | HCD_SHARED, .reset = vhci_setup, .start = vhci_start, .stop = vhci_stop, .urb_enqueue = vhci_urb_enqueue, .urb_dequeue = vhci_urb_dequeue, .get_frame_number = vhci_get_frame_number, .hub_status_data = vhci_hub_status, .hub_control = vhci_hub_control, .bus_suspend = vhci_bus_suspend, .bus_resume = vhci_bus_resume, .alloc_streams = vhci_alloc_streams, .free_streams = vhci_free_streams, }; static int vhci_hcd_probe(struct platform_device *pdev) { struct vhci *vhci = *((void **)dev_get_platdata(&pdev->dev)); struct usb_hcd *hcd_hs; struct usb_hcd *hcd_ss; int ret; usbip_dbg_vhci_hc("name %s id %d\n", pdev->name, pdev->id); /* * Allocate and initialize hcd. * Our private data is also allocated automatically. */ hcd_hs = usb_create_hcd(&vhci_hc_driver, &pdev->dev, dev_name(&pdev->dev)); if (!hcd_hs) { pr_err("create primary hcd failed\n"); return -ENOMEM; } hcd_hs->has_tt = 1; /* * Finish generic HCD structure initialization and register. * Call the driver's reset() and start() routines. 
*/ ret = usb_add_hcd(hcd_hs, 0, 0); if (ret != 0) { pr_err("usb_add_hcd hs failed %d\n", ret); goto put_usb2_hcd; } hcd_ss = usb_create_shared_hcd(&vhci_hc_driver, &pdev->dev, dev_name(&pdev->dev), hcd_hs); if (!hcd_ss) { ret = -ENOMEM; pr_err("create shared hcd failed\n"); goto remove_usb2_hcd; } ret = usb_add_hcd(hcd_ss, 0, 0); if (ret) { pr_err("usb_add_hcd ss failed %d\n", ret); goto put_usb3_hcd; } usbip_dbg_vhci_hc("bye\n"); return 0; put_usb3_hcd: usb_put_hcd(hcd_ss); remove_usb2_hcd: usb_remove_hcd(hcd_hs); put_usb2_hcd: usb_put_hcd(hcd_hs); vhci->vhci_hcd_hs = NULL; vhci->vhci_hcd_ss = NULL; return ret; } static void vhci_hcd_remove(struct platform_device *pdev) { struct vhci *vhci = *((void **)dev_get_platdata(&pdev->dev)); /* * Disconnects the root hub, * then reverses the effects of usb_add_hcd(), * invoking the HCD's stop() methods. */ usb_remove_hcd(vhci_hcd_to_hcd(vhci->vhci_hcd_ss)); usb_put_hcd(vhci_hcd_to_hcd(vhci->vhci_hcd_ss)); usb_remove_hcd(vhci_hcd_to_hcd(vhci->vhci_hcd_hs)); usb_put_hcd(vhci_hcd_to_hcd(vhci->vhci_hcd_hs)); vhci->vhci_hcd_hs = NULL; vhci->vhci_hcd_ss = NULL; } #ifdef CONFIG_PM /* what should happen for USB/IP under suspend/resume? */ static int vhci_hcd_suspend(struct platform_device *pdev, pm_message_t state) { struct usb_hcd *hcd; struct vhci *vhci; int rhport; int connected = 0; int ret = 0; unsigned long flags; dev_dbg(&pdev->dev, "%s\n", __func__); hcd = platform_get_drvdata(pdev); if (!hcd) return 0; vhci = *((void **)dev_get_platdata(hcd->self.controller)); spin_lock_irqsave(&vhci->lock, flags); for (rhport = 0; rhport < VHCI_HC_PORTS; rhport++) { if (vhci->vhci_hcd_hs->port_status[rhport] & USB_PORT_STAT_CONNECTION) connected += 1; if (vhci->vhci_hcd_ss->port_status[rhport] & USB_PORT_STAT_CONNECTION) connected += 1; } spin_unlock_irqrestore(&vhci->lock, flags); if (connected > 0) { dev_info(&pdev->dev, "We have %d active connection%s. 
Do not suspend.\n", connected, str_plural(connected)); ret = -EBUSY; } else { dev_info(&pdev->dev, "suspend vhci_hcd"); clear_bit(HCD_FLAG_HW_ACCESSIBLE, &hcd->flags); } return ret; } static int vhci_hcd_resume(struct platform_device *pdev) { struct usb_hcd *hcd; dev_dbg(&pdev->dev, "%s\n", __func__); hcd = platform_get_drvdata(pdev); if (!hcd) return 0; set_bit(HCD_FLAG_HW_ACCESSIBLE, &hcd->flags); usb_hcd_poll_rh_status(hcd); return 0; } #else #define vhci_hcd_suspend NULL #define vhci_hcd_resume NULL #endif static struct platform_driver vhci_driver = { .probe = vhci_hcd_probe, .remove = vhci_hcd_remove, .suspend = vhci_hcd_suspend, .resume = vhci_hcd_resume, .driver = { .name = driver_name, }, }; static void del_platform_devices(void) { int i; for (i = 0; i < vhci_num_controllers; i++) { platform_device_unregister(vhcis[i].pdev); vhcis[i].pdev = NULL; } sysfs_remove_link(&platform_bus.kobj, driver_name); } static int __init vhci_hcd_init(void) { int i, ret; if (usb_disabled()) return -ENODEV; if (vhci_num_controllers < 1) vhci_num_controllers = 1; vhcis = kcalloc(vhci_num_controllers, sizeof(struct vhci), GFP_KERNEL); if (vhcis == NULL) return -ENOMEM; ret = platform_driver_register(&vhci_driver); if (ret) goto err_driver_register; for (i = 0; i < vhci_num_controllers; i++) { void *vhci = &vhcis[i]; struct platform_device_info pdevinfo = { .name = driver_name, .id = i, .data = &vhci, .size_data = sizeof(void *), }; vhcis[i].pdev = platform_device_register_full(&pdevinfo); ret = PTR_ERR_OR_ZERO(vhcis[i].pdev); if (ret < 0) { while (i--) platform_device_unregister(vhcis[i].pdev); goto err_add_hcd; } } return 0; err_add_hcd: platform_driver_unregister(&vhci_driver); err_driver_register: kfree(vhcis); return ret; } static void __exit vhci_hcd_exit(void) { del_platform_devices(); platform_driver_unregister(&vhci_driver); kfree(vhcis); } module_init(vhci_hcd_init); module_exit(vhci_hcd_exit); MODULE_AUTHOR(DRIVER_AUTHOR); MODULE_DESCRIPTION(DRIVER_DESC); 
MODULE_LICENSE("GPL");
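/*
 * Illustration only, not part of the driver: hcd_name_to_id() above derives
 * the controller index from the numeric suffix after the '.' in the HCD name
 * and returns 0 when there is no '.' at all. A minimal user-space sketch of
 * the same idea follows. The name parse_hcd_id() is hypothetical, and
 * strtol() stands in for the kernel's kstrtol(). As an assumption of this
 * sketch, it is stricter than kstrtol(), which tolerates one trailing
 * newline.
 */

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space analogue of hcd_name_to_id(). */
int parse_hcd_id(const char *name)
{
	const char *dot = strchr(name, '.');
	char *end;
	long val;

	/* No '.' suffix: treat as controller 0, as the driver does. */
	if (dot == NULL)
		return 0;

	errno = 0;
	val = strtol(dot + 1, &end, 10);

	/* Reject an empty suffix or trailing non-digit characters. */
	if (end == dot + 1 || *end != '\0')
		return -EINVAL;

	/* Reject values that do not fit in an int. */
	if (errno == ERANGE || val < INT_MIN || val > INT_MAX)
		return -ERANGE;

	return (int)val;
}
```

/*
 * With this sketch, "vhci_hcd.3" yields 3 and a bare "vhci_hcd" yields 0.
 * A negative result signals a malformed name, mirroring the id < 0 check
 * in vhci_start().
 */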
// SPDX-License-Identifier: GPL-2.0-only
/*
 * GENEVE: Generic Network Virtualization Encapsulation
 *
 * Copyright
(c) 2015 Red Hat, Inc. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/ethtool.h> #include <linux/kernel.h> #include <linux/module.h> #include <linux/etherdevice.h> #include <linux/hash.h> #include <net/ipv6_stubs.h> #include <net/dst_metadata.h> #include <net/gro_cells.h> #include <net/rtnetlink.h> #include <net/geneve.h> #include <net/gro.h> #include <net/netdev_lock.h> #include <net/protocol.h> #define GENEVE_NETDEV_VER "0.6" #define GENEVE_N_VID (1u << 24) #define GENEVE_VID_MASK (GENEVE_N_VID - 1) #define VNI_HASH_BITS 10 #define VNI_HASH_SIZE (1<<VNI_HASH_BITS) static bool log_ecn_error = true; module_param(log_ecn_error, bool, 0644); MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN"); #define GENEVE_VER 0 #define GENEVE_BASE_HLEN (sizeof(struct udphdr) + sizeof(struct genevehdr)) #define GENEVE_IPV4_HLEN (ETH_HLEN + sizeof(struct iphdr) + GENEVE_BASE_HLEN) #define GENEVE_IPV6_HLEN (ETH_HLEN + sizeof(struct ipv6hdr) + GENEVE_BASE_HLEN) /* per-network namespace private data for this module */ struct geneve_net { struct list_head geneve_list; struct list_head sock_list; }; static unsigned int geneve_net_id; struct geneve_dev_node { struct hlist_node hlist; struct geneve_dev *geneve; }; struct geneve_config { struct ip_tunnel_info info; bool collect_md; bool use_udp6_rx_checksums; bool ttl_inherit; enum ifla_geneve_df df; bool inner_proto_inherit; u16 port_min; u16 port_max; }; /* Pseudo network device */ struct geneve_dev { struct geneve_dev_node hlist4; /* vni hash table for IPv4 socket */ #if IS_ENABLED(CONFIG_IPV6) struct geneve_dev_node hlist6; /* vni hash table for IPv6 socket */ #endif struct net *net; /* netns for packet i/o */ struct net_device *dev; /* netdev for geneve tunnel */ struct geneve_sock __rcu *sock4; /* IPv4 socket used for geneve tunnel */ #if IS_ENABLED(CONFIG_IPV6) struct geneve_sock __rcu *sock6; /* IPv6 socket used for geneve tunnel */ #endif struct list_head next; /* geneve's per namespace list 
*/ struct gro_cells gro_cells; struct geneve_config cfg; }; struct geneve_sock { bool collect_md; struct list_head list; struct socket *sock; struct rcu_head rcu; int refcnt; struct hlist_head vni_list[VNI_HASH_SIZE]; }; static inline __u32 geneve_net_vni_hash(u8 vni[3]) { __u32 vnid; vnid = (vni[0] << 16) | (vni[1] << 8) | vni[2]; return hash_32(vnid, VNI_HASH_BITS); } static __be64 vni_to_tunnel_id(const __u8 *vni) { #ifdef __BIG_ENDIAN return (vni[0] << 16) | (vni[1] << 8) | vni[2]; #else return (__force __be64)(((__force u64)vni[0] << 40) | ((__force u64)vni[1] << 48) | ((__force u64)vni[2] << 56)); #endif } /* Convert 64 bit tunnel ID to 24 bit VNI. */ static void tunnel_id_to_vni(__be64 tun_id, __u8 *vni) { #ifdef __BIG_ENDIAN vni[0] = (__force __u8)(tun_id >> 16); vni[1] = (__force __u8)(tun_id >> 8); vni[2] = (__force __u8)tun_id; #else vni[0] = (__force __u8)((__force u64)tun_id >> 40); vni[1] = (__force __u8)((__force u64)tun_id >> 48); vni[2] = (__force __u8)((__force u64)tun_id >> 56); #endif } static bool eq_tun_id_and_vni(u8 *tun_id, u8 *vni) { return !memcmp(vni, &tun_id[5], 3); } static sa_family_t geneve_get_sk_family(struct geneve_sock *gs) { return gs->sock->sk->sk_family; } static struct geneve_dev *geneve_lookup(struct geneve_sock *gs, __be32 addr, u8 vni[]) { struct hlist_head *vni_list_head; struct geneve_dev_node *node; __u32 hash; /* Find the device for this VNI */ hash = geneve_net_vni_hash(vni); vni_list_head = &gs->vni_list[hash]; hlist_for_each_entry_rcu(node, vni_list_head, hlist) { if (eq_tun_id_and_vni((u8 *)&node->geneve->cfg.info.key.tun_id, vni) && addr == node->geneve->cfg.info.key.u.ipv4.dst) return node->geneve; } return NULL; } #if IS_ENABLED(CONFIG_IPV6) static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs, struct in6_addr addr6, u8 vni[]) { struct hlist_head *vni_list_head; struct geneve_dev_node *node; __u32 hash; /* Find the device for this VNI */ hash = geneve_net_vni_hash(vni); vni_list_head = 
&gs->vni_list[hash]; hlist_for_each_entry_rcu(node, vni_list_head, hlist) { if (eq_tun_id_and_vni((u8 *)&node->geneve->cfg.info.key.tun_id, vni) && ipv6_addr_equal(&addr6, &node->geneve->cfg.info.key.u.ipv6.dst)) return node->geneve; } return NULL; } #endif static inline struct genevehdr *geneve_hdr(const struct sk_buff *skb) { return (struct genevehdr *)(udp_hdr(skb) + 1); } static struct geneve_dev *geneve_lookup_skb(struct geneve_sock *gs, struct sk_buff *skb) { static u8 zero_vni[3]; u8 *vni; if (geneve_get_sk_family(gs) == AF_INET) { struct iphdr *iph; __be32 addr; iph = ip_hdr(skb); /* outer IP header... */ if (gs->collect_md) { vni = zero_vni; addr = 0; } else { vni = geneve_hdr(skb)->vni; addr = iph->saddr; } return geneve_lookup(gs, addr, vni); #if IS_ENABLED(CONFIG_IPV6) } else if (geneve_get_sk_family(gs) == AF_INET6) { static struct in6_addr zero_addr6; struct ipv6hdr *ip6h; struct in6_addr addr6; ip6h = ipv6_hdr(skb); /* outer IPv6 header... */ if (gs->collect_md) { vni = zero_vni; addr6 = zero_addr6; } else { vni = geneve_hdr(skb)->vni; addr6 = ip6h->saddr; } return geneve6_lookup(gs, addr6, vni); #endif } return NULL; } /* geneve receive/decap routine */ static void geneve_rx(struct geneve_dev *geneve, struct geneve_sock *gs, struct sk_buff *skb) { struct genevehdr *gnvh = geneve_hdr(skb); struct metadata_dst *tun_dst = NULL; unsigned int len; int nh, err = 0; void *oiph; if (ip_tunnel_collect_metadata() || gs->collect_md) { IP_TUNNEL_DECLARE_FLAGS(flags) = { }; __set_bit(IP_TUNNEL_KEY_BIT, flags); __assign_bit(IP_TUNNEL_OAM_BIT, flags, gnvh->oam); __assign_bit(IP_TUNNEL_CRIT_OPT_BIT, flags, gnvh->critical); tun_dst = udp_tun_rx_dst(skb, geneve_get_sk_family(gs), flags, vni_to_tunnel_id(gnvh->vni), gnvh->opt_len * 4); if (!tun_dst) { dev_dstats_rx_dropped(geneve->dev); goto drop; } /* Update tunnel dst according to Geneve options. 
*/ ip_tunnel_flags_zero(flags); __set_bit(IP_TUNNEL_GENEVE_OPT_BIT, flags); ip_tunnel_info_opts_set(&tun_dst->u.tun_info, gnvh->options, gnvh->opt_len * 4, flags); } else { /* Drop packets w/ critical options, * since we don't support any... */ if (gnvh->critical) { DEV_STATS_INC(geneve->dev, rx_frame_errors); DEV_STATS_INC(geneve->dev, rx_errors); goto drop; } } if (tun_dst) skb_dst_set(skb, &tun_dst->dst); if (gnvh->proto_type == htons(ETH_P_TEB)) { skb_reset_mac_header(skb); skb->protocol = eth_type_trans(skb, geneve->dev); skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); /* Ignore packet loops (and multicast echo) */ if (ether_addr_equal(eth_hdr(skb)->h_source, geneve->dev->dev_addr)) { DEV_STATS_INC(geneve->dev, rx_errors); goto drop; } } else { skb_reset_mac_header(skb); skb->dev = geneve->dev; skb->pkt_type = PACKET_HOST; } /* Save offset of outer header relative to skb->head, * because we are going to reset the network header to the inner header * and might change skb->head. */ nh = skb_network_header(skb) - skb->head; skb_reset_network_header(skb); if (!pskb_inet_may_pull(skb)) { DEV_STATS_INC(geneve->dev, rx_length_errors); DEV_STATS_INC(geneve->dev, rx_errors); goto drop; } /* Get the outer header. 
*/ oiph = skb->head + nh; if (geneve_get_sk_family(gs) == AF_INET) err = IP_ECN_decapsulate(oiph, skb); #if IS_ENABLED(CONFIG_IPV6) else err = IP6_ECN_decapsulate(oiph, skb); #endif if (unlikely(err)) { if (log_ecn_error) { if (geneve_get_sk_family(gs) == AF_INET) net_info_ratelimited("non-ECT from %pI4 " "with TOS=%#x\n", &((struct iphdr *)oiph)->saddr, ((struct iphdr *)oiph)->tos); #if IS_ENABLED(CONFIG_IPV6) else net_info_ratelimited("non-ECT from %pI6\n", &((struct ipv6hdr *)oiph)->saddr); #endif } if (err > 1) { DEV_STATS_INC(geneve->dev, rx_frame_errors); DEV_STATS_INC(geneve->dev, rx_errors); goto drop; } } len = skb->len; err = gro_cells_receive(&geneve->gro_cells, skb); if (likely(err == NET_RX_SUCCESS)) dev_dstats_rx_add(geneve->dev, len); return; drop: /* Consume bad packet */ kfree_skb(skb); } /* Setup stats when device is created */ static int geneve_init(struct net_device *dev) { struct geneve_dev *geneve = netdev_priv(dev); int err; err = gro_cells_init(&geneve->gro_cells, dev); if (err) return err; err = dst_cache_init(&geneve->cfg.info.dst_cache, GFP_KERNEL); if (err) { gro_cells_destroy(&geneve->gro_cells); return err; } netdev_lockdep_set_classes(dev); return 0; } static void geneve_uninit(struct net_device *dev) { struct geneve_dev *geneve = netdev_priv(dev); dst_cache_destroy(&geneve->cfg.info.dst_cache); gro_cells_destroy(&geneve->gro_cells); } /* Callback from net/ipv4/udp.c to receive packets */ static int geneve_udp_encap_recv(struct sock *sk, struct sk_buff *skb) { struct genevehdr *geneveh; struct geneve_dev *geneve; struct geneve_sock *gs; __be16 inner_proto; int opts_len; /* Need UDP and Geneve header to be present */ if (unlikely(!pskb_may_pull(skb, GENEVE_BASE_HLEN))) goto drop; /* Return packets with reserved bits set */ geneveh = geneve_hdr(skb); if (unlikely(geneveh->ver != GENEVE_VER)) goto drop; gs = rcu_dereference_sk_user_data(sk); if (!gs) goto drop; geneve = geneve_lookup_skb(gs, skb); if (!geneve) goto drop; inner_proto = 
geneveh->proto_type; if (unlikely((!geneve->cfg.inner_proto_inherit && inner_proto != htons(ETH_P_TEB)))) { dev_dstats_rx_dropped(geneve->dev); goto drop; } opts_len = geneveh->opt_len * 4; if (iptunnel_pull_header(skb, GENEVE_BASE_HLEN + opts_len, inner_proto, !net_eq(geneve->net, dev_net(geneve->dev)))) { dev_dstats_rx_dropped(geneve->dev); goto drop; } geneve_rx(geneve, gs, skb); return 0; drop: /* Consume bad packet */ kfree_skb(skb); return 0; } /* Callback from net/ipv{4,6}/udp.c to check that we have a tunnel for errors */ static int geneve_udp_encap_err_lookup(struct sock *sk, struct sk_buff *skb) { struct genevehdr *geneveh; struct geneve_sock *gs; u8 zero_vni[3] = { 0 }; u8 *vni = zero_vni; if (!pskb_may_pull(skb, skb_transport_offset(skb) + GENEVE_BASE_HLEN)) return -EINVAL; geneveh = geneve_hdr(skb); if (geneveh->ver != GENEVE_VER) return -EINVAL; if (geneveh->proto_type != htons(ETH_P_TEB)) return -EINVAL; gs = rcu_dereference_sk_user_data(sk); if (!gs) return -ENOENT; if (geneve_get_sk_family(gs) == AF_INET) { struct iphdr *iph = ip_hdr(skb); __be32 addr4 = 0; if (!gs->collect_md) { vni = geneve_hdr(skb)->vni; addr4 = iph->daddr; } return geneve_lookup(gs, addr4, vni) ? 0 : -ENOENT; } #if IS_ENABLED(CONFIG_IPV6) if (geneve_get_sk_family(gs) == AF_INET6) { struct ipv6hdr *ip6h = ipv6_hdr(skb); struct in6_addr addr6; memset(&addr6, 0, sizeof(struct in6_addr)); if (!gs->collect_md) { vni = geneve_hdr(skb)->vni; addr6 = ip6h->daddr; } return geneve6_lookup(gs, addr6, vni) ? 
0 : -ENOENT; } #endif return -EPFNOSUPPORT; } static struct socket *geneve_create_sock(struct net *net, bool ipv6, __be16 port, bool ipv6_rx_csum) { struct socket *sock; struct udp_port_cfg udp_conf; int err; memset(&udp_conf, 0, sizeof(udp_conf)); if (ipv6) { udp_conf.family = AF_INET6; udp_conf.ipv6_v6only = 1; udp_conf.use_udp6_rx_checksums = ipv6_rx_csum; } else { udp_conf.family = AF_INET; udp_conf.local_ip.s_addr = htonl(INADDR_ANY); } udp_conf.local_udp_port = port; /* Open UDP socket */ err = udp_sock_create(net, &udp_conf, &sock); if (err < 0) return ERR_PTR(err); udp_allow_gso(sock->sk); return sock; } static int geneve_hlen(struct genevehdr *gh) { return sizeof(*gh) + gh->opt_len * 4; } static struct sk_buff *geneve_gro_receive(struct sock *sk, struct list_head *head, struct sk_buff *skb) { struct sk_buff *pp = NULL; struct sk_buff *p; struct genevehdr *gh, *gh2; unsigned int hlen, gh_len, off_gnv; const struct packet_offload *ptype; __be16 type; int flush = 1; off_gnv = skb_gro_offset(skb); hlen = off_gnv + sizeof(*gh); gh = skb_gro_header(skb, hlen, off_gnv); if (unlikely(!gh)) goto out; if (gh->ver != GENEVE_VER || gh->oam) goto out; gh_len = geneve_hlen(gh); hlen = off_gnv + gh_len; if (!skb_gro_may_pull(skb, hlen)) { gh = skb_gro_header_slow(skb, hlen, off_gnv); if (unlikely(!gh)) goto out; } list_for_each_entry(p, head, list) { if (!NAPI_GRO_CB(p)->same_flow) continue; gh2 = (struct genevehdr *)(p->data + off_gnv); if (gh->opt_len != gh2->opt_len || memcmp(gh, gh2, gh_len)) { NAPI_GRO_CB(p)->same_flow = 0; continue; } } skb_gro_pull(skb, gh_len); skb_gro_postpull_rcsum(skb, gh, gh_len); type = gh->proto_type; if (likely(type == htons(ETH_P_TEB))) return call_gro_receive(eth_gro_receive, head, skb); ptype = gro_find_receive_by_type(type); if (!ptype) goto out; pp = call_gro_receive(ptype->callbacks.gro_receive, head, skb); flush = 0; out: skb_gro_flush_final(skb, pp, flush); return pp; } static int geneve_gro_complete(struct sock *sk, struct sk_buff 
*skb, int nhoff) { struct genevehdr *gh; struct packet_offload *ptype; __be16 type; int gh_len; int err = -ENOSYS; gh = (struct genevehdr *)(skb->data + nhoff); gh_len = geneve_hlen(gh); type = gh->proto_type; /* since skb->encapsulation is set, eth_gro_complete() sets the inner mac header */ if (likely(type == htons(ETH_P_TEB))) return eth_gro_complete(skb, nhoff + gh_len); ptype = gro_find_complete_by_type(type); if (ptype) err = ptype->callbacks.gro_complete(skb, nhoff + gh_len); skb_set_inner_mac_header(skb, nhoff + gh_len); return err; } /* Create new listen socket if needed */ static struct geneve_sock *geneve_socket_create(struct net *net, __be16 port, bool ipv6, bool ipv6_rx_csum) { struct geneve_net *gn = net_generic(net, geneve_net_id); struct geneve_sock *gs; struct socket *sock; struct udp_tunnel_sock_cfg tunnel_cfg; int h; gs = kzalloc(sizeof(*gs), GFP_KERNEL); if (!gs) return ERR_PTR(-ENOMEM); sock = geneve_create_sock(net, ipv6, port, ipv6_rx_csum); if (IS_ERR(sock)) { kfree(gs); return ERR_CAST(sock); } gs->sock = sock; gs->refcnt = 1; for (h = 0; h < VNI_HASH_SIZE; ++h) INIT_HLIST_HEAD(&gs->vni_list[h]); /* Initialize the geneve udp offloads structure */ udp_tunnel_notify_add_rx_port(gs->sock, UDP_TUNNEL_TYPE_GENEVE); /* Mark socket as an encapsulation socket */ memset(&tunnel_cfg, 0, sizeof(tunnel_cfg)); tunnel_cfg.sk_user_data = gs; tunnel_cfg.encap_type = 1; tunnel_cfg.gro_receive = geneve_gro_receive; tunnel_cfg.gro_complete = geneve_gro_complete; tunnel_cfg.encap_rcv = geneve_udp_encap_recv; tunnel_cfg.encap_err_lookup = geneve_udp_encap_err_lookup; tunnel_cfg.encap_destroy = NULL; setup_udp_tunnel_sock(net, sock, &tunnel_cfg); list_add(&gs->list, &gn->sock_list); return gs; } static void __geneve_sock_release(struct geneve_sock *gs) { if (!gs || --gs->refcnt) return; list_del(&gs->list); udp_tunnel_notify_del_rx_port(gs->sock, UDP_TUNNEL_TYPE_GENEVE); udp_tunnel_sock_release(gs->sock); kfree_rcu(gs, rcu); } static void 
geneve_sock_release(struct geneve_dev *geneve) { struct geneve_sock *gs4 = rtnl_dereference(geneve->sock4); #if IS_ENABLED(CONFIG_IPV6) struct geneve_sock *gs6 = rtnl_dereference(geneve->sock6); rcu_assign_pointer(geneve->sock6, NULL); #endif rcu_assign_pointer(geneve->sock4, NULL); synchronize_net(); __geneve_sock_release(gs4); #if IS_ENABLED(CONFIG_IPV6) __geneve_sock_release(gs6); #endif } static struct geneve_sock *geneve_find_sock(struct geneve_net *gn, sa_family_t family, __be16 dst_port) { struct geneve_sock *gs; list_for_each_entry(gs, &gn->sock_list, list) { if (inet_sk(gs->sock->sk)->inet_sport == dst_port && geneve_get_sk_family(gs) == family) { return gs; } } return NULL; } static int geneve_sock_add(struct geneve_dev *geneve, bool ipv6) { struct net *net = geneve->net; struct geneve_net *gn = net_generic(net, geneve_net_id); struct geneve_dev_node *node; struct geneve_sock *gs; __u8 vni[3]; __u32 hash; gs = geneve_find_sock(gn, ipv6 ? AF_INET6 : AF_INET, geneve->cfg.info.key.tp_dst); if (gs) { gs->refcnt++; goto out; } gs = geneve_socket_create(net, geneve->cfg.info.key.tp_dst, ipv6, geneve->cfg.use_udp6_rx_checksums); if (IS_ERR(gs)) return PTR_ERR(gs); out: gs->collect_md = geneve->cfg.collect_md; #if IS_ENABLED(CONFIG_IPV6) if (ipv6) { rcu_assign_pointer(geneve->sock6, gs); node = &geneve->hlist6; } else #endif { rcu_assign_pointer(geneve->sock4, gs); node = &geneve->hlist4; } node->geneve = geneve; tunnel_id_to_vni(geneve->cfg.info.key.tun_id, vni); hash = geneve_net_vni_hash(vni); hlist_add_head_rcu(&node->hlist, &gs->vni_list[hash]); return 0; } static int geneve_open(struct net_device *dev) { struct geneve_dev *geneve = netdev_priv(dev); bool metadata = geneve->cfg.collect_md; bool ipv4, ipv6; int ret = 0; ipv6 = geneve->cfg.info.mode & IP_TUNNEL_INFO_IPV6 || metadata; ipv4 = !ipv6 || metadata; #if IS_ENABLED(CONFIG_IPV6) if (ipv6) { ret = geneve_sock_add(geneve, true); if (ret < 0 && ret != -EAFNOSUPPORT) ipv4 = false; } #endif if (ipv4) ret = 
geneve_sock_add(geneve, false); if (ret < 0) geneve_sock_release(geneve); return ret; } static int geneve_stop(struct net_device *dev) { struct geneve_dev *geneve = netdev_priv(dev); hlist_del_init_rcu(&geneve->hlist4.hlist); #if IS_ENABLED(CONFIG_IPV6) hlist_del_init_rcu(&geneve->hlist6.hlist); #endif geneve_sock_release(geneve); return 0; } static void geneve_build_header(struct genevehdr *geneveh, const struct ip_tunnel_info *info, __be16 inner_proto) { geneveh->ver = GENEVE_VER; geneveh->opt_len = info->options_len / 4; geneveh->oam = test_bit(IP_TUNNEL_OAM_BIT, info->key.tun_flags); geneveh->critical = test_bit(IP_TUNNEL_CRIT_OPT_BIT, info->key.tun_flags); geneveh->rsvd1 = 0; tunnel_id_to_vni(info->key.tun_id, geneveh->vni); geneveh->proto_type = inner_proto; geneveh->rsvd2 = 0; if (test_bit(IP_TUNNEL_GENEVE_OPT_BIT, info->key.tun_flags)) ip_tunnel_info_opts_get(geneveh->options, info); } static int geneve_build_skb(struct dst_entry *dst, struct sk_buff *skb, const struct ip_tunnel_info *info, bool xnet, int ip_hdr_len, bool inner_proto_inherit) { bool udp_sum = test_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags); struct genevehdr *gnvh; __be16 inner_proto; int min_headroom; int err; skb_reset_mac_header(skb); skb_scrub_packet(skb, xnet); min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len + GENEVE_BASE_HLEN + info->options_len + ip_hdr_len; err = skb_cow_head(skb, min_headroom); if (unlikely(err)) goto free_dst; err = udp_tunnel_handle_offloads(skb, udp_sum); if (err) goto free_dst; gnvh = __skb_push(skb, sizeof(*gnvh) + info->options_len); inner_proto = inner_proto_inherit ? 
skb->protocol : htons(ETH_P_TEB); geneve_build_header(gnvh, info, inner_proto); skb_set_inner_protocol(skb, inner_proto); return 0; free_dst: dst_release(dst); return err; } static u8 geneve_get_dsfield(struct sk_buff *skb, struct net_device *dev, const struct ip_tunnel_info *info, bool *use_cache) { struct geneve_dev *geneve = netdev_priv(dev); u8 dsfield; dsfield = info->key.tos; if (dsfield == 1 && !geneve->cfg.collect_md) { dsfield = ip_tunnel_get_dsfield(ip_hdr(skb), skb); *use_cache = false; } return dsfield; } static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev, struct geneve_dev *geneve, const struct ip_tunnel_info *info) { bool inner_proto_inherit = geneve->cfg.inner_proto_inherit; bool xnet = !net_eq(geneve->net, dev_net(geneve->dev)); struct geneve_sock *gs4 = rcu_dereference(geneve->sock4); const struct ip_tunnel_key *key = &info->key; struct rtable *rt; bool use_cache; __u8 tos, ttl; __be16 df = 0; __be32 saddr; __be16 sport; int err; if (skb_vlan_inet_prepare(skb, inner_proto_inherit)) return -EINVAL; if (!gs4) return -EIO; use_cache = ip_tunnel_dst_cache_usable(skb, info); tos = geneve_get_dsfield(skb, dev, info, &use_cache); sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min, geneve->cfg.port_max, true); rt = udp_tunnel_dst_lookup(skb, dev, geneve->net, 0, &saddr, &info->key, sport, geneve->cfg.info.key.tp_dst, tos, use_cache ? 
(struct dst_cache *)&info->dst_cache : NULL); if (IS_ERR(rt)) return PTR_ERR(rt); err = skb_tunnel_check_pmtu(skb, &rt->dst, GENEVE_IPV4_HLEN + info->options_len, netif_is_any_bridge_port(dev)); if (err < 0) { dst_release(&rt->dst); return err; } else if (err) { struct ip_tunnel_info *info; info = skb_tunnel_info(skb); if (info) { struct ip_tunnel_info *unclone; unclone = skb_tunnel_info_unclone(skb); if (unlikely(!unclone)) { dst_release(&rt->dst); return -ENOMEM; } unclone->key.u.ipv4.dst = saddr; unclone->key.u.ipv4.src = info->key.u.ipv4.dst; } if (!pskb_may_pull(skb, ETH_HLEN)) { dst_release(&rt->dst); return -EINVAL; } skb->protocol = eth_type_trans(skb, geneve->dev); __netif_rx(skb); dst_release(&rt->dst); return -EMSGSIZE; } tos = ip_tunnel_ecn_encap(tos, ip_hdr(skb), skb); if (geneve->cfg.collect_md) { ttl = key->ttl; df = test_bit(IP_TUNNEL_DONT_FRAGMENT_BIT, key->tun_flags) ? htons(IP_DF) : 0; } else { if (geneve->cfg.ttl_inherit) ttl = ip_tunnel_get_ttl(ip_hdr(skb), skb); else ttl = key->ttl; ttl = ttl ? 
: ip4_dst_hoplimit(&rt->dst); if (geneve->cfg.df == GENEVE_DF_SET) { df = htons(IP_DF); } else if (geneve->cfg.df == GENEVE_DF_INHERIT) { struct ethhdr *eth = skb_eth_hdr(skb); if (ntohs(eth->h_proto) == ETH_P_IPV6) { df = htons(IP_DF); } else if (ntohs(eth->h_proto) == ETH_P_IP) { struct iphdr *iph = ip_hdr(skb); if (iph->frag_off & htons(IP_DF)) df = htons(IP_DF); } } } err = geneve_build_skb(&rt->dst, skb, info, xnet, sizeof(struct iphdr), inner_proto_inherit); if (unlikely(err)) return err; udp_tunnel_xmit_skb(rt, gs4->sock->sk, skb, saddr, info->key.u.ipv4.dst, tos, ttl, df, sport, geneve->cfg.info.key.tp_dst, !net_eq(geneve->net, dev_net(geneve->dev)), !test_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags)); return 0; } #if IS_ENABLED(CONFIG_IPV6) static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev, struct geneve_dev *geneve, const struct ip_tunnel_info *info) { bool inner_proto_inherit = geneve->cfg.inner_proto_inherit; bool xnet = !net_eq(geneve->net, dev_net(geneve->dev)); struct geneve_sock *gs6 = rcu_dereference(geneve->sock6); const struct ip_tunnel_key *key = &info->key; struct dst_entry *dst = NULL; struct in6_addr saddr; bool use_cache; __u8 prio, ttl; __be16 sport; int err; if (skb_vlan_inet_prepare(skb, inner_proto_inherit)) return -EINVAL; if (!gs6) return -EIO; use_cache = ip_tunnel_dst_cache_usable(skb, info); prio = geneve_get_dsfield(skb, dev, info, &use_cache); sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min, geneve->cfg.port_max, true); dst = udp_tunnel6_dst_lookup(skb, dev, geneve->net, gs6->sock, 0, &saddr, key, sport, geneve->cfg.info.key.tp_dst, prio, use_cache ? 
(struct dst_cache *)&info->dst_cache : NULL); if (IS_ERR(dst)) return PTR_ERR(dst); err = skb_tunnel_check_pmtu(skb, dst, GENEVE_IPV6_HLEN + info->options_len, netif_is_any_bridge_port(dev)); if (err < 0) { dst_release(dst); return err; } else if (err) { struct ip_tunnel_info *info = skb_tunnel_info(skb); if (info) { struct ip_tunnel_info *unclone; unclone = skb_tunnel_info_unclone(skb); if (unlikely(!unclone)) { dst_release(dst); return -ENOMEM; } unclone->key.u.ipv6.dst = saddr; unclone->key.u.ipv6.src = info->key.u.ipv6.dst; } if (!pskb_may_pull(skb, ETH_HLEN)) { dst_release(dst); return -EINVAL; } skb->protocol = eth_type_trans(skb, geneve->dev); __netif_rx(skb); dst_release(dst); return -EMSGSIZE; } prio = ip_tunnel_ecn_encap(prio, ip_hdr(skb), skb); if (geneve->cfg.collect_md) { ttl = key->ttl; } else { if (geneve->cfg.ttl_inherit) ttl = ip_tunnel_get_ttl(ip_hdr(skb), skb); else ttl = key->ttl; ttl = ttl ? : ip6_dst_hoplimit(dst); } err = geneve_build_skb(dst, skb, info, xnet, sizeof(struct ipv6hdr), inner_proto_inherit); if (unlikely(err)) return err; udp_tunnel6_xmit_skb(dst, gs6->sock->sk, skb, dev, &saddr, &key->u.ipv6.dst, prio, ttl, info->key.label, sport, geneve->cfg.info.key.tp_dst, !test_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags)); return 0; } #endif static netdev_tx_t geneve_xmit(struct sk_buff *skb, struct net_device *dev) { struct geneve_dev *geneve = netdev_priv(dev); struct ip_tunnel_info *info = NULL; int err; if (geneve->cfg.collect_md) { info = skb_tunnel_info(skb); if (unlikely(!info || !(info->mode & IP_TUNNEL_INFO_TX))) { netdev_dbg(dev, "no tunnel metadata\n"); dev_kfree_skb(skb); dev_dstats_tx_dropped(dev); return NETDEV_TX_OK; } } else { info = &geneve->cfg.info; } rcu_read_lock(); #if IS_ENABLED(CONFIG_IPV6) if (info->mode & IP_TUNNEL_INFO_IPV6) err = geneve6_xmit_skb(skb, dev, geneve, info); else #endif err = geneve_xmit_skb(skb, dev, geneve, info); rcu_read_unlock(); if (likely(!err)) return NETDEV_TX_OK; if (err != -EMSGSIZE) 
dev_kfree_skb(skb); if (err == -ELOOP) DEV_STATS_INC(dev, collisions); else if (err == -ENETUNREACH) DEV_STATS_INC(dev, tx_carrier_errors); DEV_STATS_INC(dev, tx_errors); return NETDEV_TX_OK; } static int geneve_change_mtu(struct net_device *dev, int new_mtu) { if (new_mtu > dev->max_mtu) new_mtu = dev->max_mtu; else if (new_mtu < dev->min_mtu) new_mtu = dev->min_mtu; WRITE_ONCE(dev->mtu, new_mtu); return 0; } static int geneve_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb) { struct ip_tunnel_info *info = skb_tunnel_info(skb); struct geneve_dev *geneve = netdev_priv(dev); __be16 sport; if (ip_tunnel_info_af(info) == AF_INET) { struct rtable *rt; struct geneve_sock *gs4 = rcu_dereference(geneve->sock4); bool use_cache; __be32 saddr; u8 tos; if (!gs4) return -EIO; use_cache = ip_tunnel_dst_cache_usable(skb, info); tos = geneve_get_dsfield(skb, dev, info, &use_cache); sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min, geneve->cfg.port_max, true); rt = udp_tunnel_dst_lookup(skb, dev, geneve->net, 0, &saddr, &info->key, sport, geneve->cfg.info.key.tp_dst, tos, use_cache ? &info->dst_cache : NULL); if (IS_ERR(rt)) return PTR_ERR(rt); ip_rt_put(rt); info->key.u.ipv4.src = saddr; #if IS_ENABLED(CONFIG_IPV6) } else if (ip_tunnel_info_af(info) == AF_INET6) { struct dst_entry *dst; struct geneve_sock *gs6 = rcu_dereference(geneve->sock6); struct in6_addr saddr; bool use_cache; u8 prio; if (!gs6) return -EIO; use_cache = ip_tunnel_dst_cache_usable(skb, info); prio = geneve_get_dsfield(skb, dev, info, &use_cache); sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min, geneve->cfg.port_max, true); dst = udp_tunnel6_dst_lookup(skb, dev, geneve->net, gs6->sock, 0, &saddr, &info->key, sport, geneve->cfg.info.key.tp_dst, prio, use_cache ? 
&info->dst_cache : NULL); if (IS_ERR(dst)) return PTR_ERR(dst); dst_release(dst); info->key.u.ipv6.src = saddr; #endif } else { return -EINVAL; } info->key.tp_src = sport; info->key.tp_dst = geneve->cfg.info.key.tp_dst; return 0; } static const struct net_device_ops geneve_netdev_ops = { .ndo_init = geneve_init, .ndo_uninit = geneve_uninit, .ndo_open = geneve_open, .ndo_stop = geneve_stop, .ndo_start_xmit = geneve_xmit, .ndo_change_mtu = geneve_change_mtu, .ndo_validate_addr = eth_validate_addr, .ndo_set_mac_address = eth_mac_addr, .ndo_fill_metadata_dst = geneve_fill_metadata_dst, }; static void geneve_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *drvinfo) { strscpy(drvinfo->version, GENEVE_NETDEV_VER, sizeof(drvinfo->version)); strscpy(drvinfo->driver, "geneve", sizeof(drvinfo->driver)); } static const struct ethtool_ops geneve_ethtool_ops = { .get_drvinfo = geneve_get_drvinfo, .get_link = ethtool_op_get_link, }; /* Info for udev, that this is a virtual tunnel endpoint */ static const struct device_type geneve_type = { .name = "geneve", }; /* Calls the ndo_udp_tunnel_add of the caller in order to * supply the listening GENEVE udp ports. Callers are expected * to implement the ndo_udp_tunnel_add. */ static void geneve_offload_rx_ports(struct net_device *dev, bool push) { struct net *net = dev_net(dev); struct geneve_net *gn = net_generic(net, geneve_net_id); struct geneve_sock *gs; rcu_read_lock(); list_for_each_entry_rcu(gs, &gn->sock_list, list) { if (push) { udp_tunnel_push_rx_port(dev, gs->sock, UDP_TUNNEL_TYPE_GENEVE); } else { udp_tunnel_drop_rx_port(dev, gs->sock, UDP_TUNNEL_TYPE_GENEVE); } } rcu_read_unlock(); } /* Initialize the device structure. 
*/ static void geneve_setup(struct net_device *dev) { ether_setup(dev); dev->netdev_ops = &geneve_netdev_ops; dev->ethtool_ops = &geneve_ethtool_ops; dev->needs_free_netdev = true; SET_NETDEV_DEVTYPE(dev, &geneve_type); dev->features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_FRAGLIST; dev->features |= NETIF_F_RXCSUM; dev->features |= NETIF_F_GSO_SOFTWARE; dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_FRAGLIST; dev->hw_features |= NETIF_F_RXCSUM; dev->hw_features |= NETIF_F_GSO_SOFTWARE; dev->pcpu_stat_type = NETDEV_PCPU_STAT_DSTATS; /* MTU range: 68 - (something less than 65535) */ dev->min_mtu = ETH_MIN_MTU; /* The max_mtu calculation does not take account of GENEVE * options, to avoid excluding potentially valid * configurations. This will be further reduced by IPvX hdr size. */ dev->max_mtu = IP_MAX_MTU - GENEVE_BASE_HLEN - dev->hard_header_len; netif_keep_dst(dev); dev->priv_flags &= ~IFF_TX_SKB_SHARING; dev->priv_flags |= IFF_LIVE_ADDR_CHANGE | IFF_NO_QUEUE; dev->lltx = true; eth_hw_addr_random(dev); } static const struct nla_policy geneve_policy[IFLA_GENEVE_MAX + 1] = { [IFLA_GENEVE_UNSPEC] = { .strict_start_type = IFLA_GENEVE_INNER_PROTO_INHERIT }, [IFLA_GENEVE_ID] = { .type = NLA_U32 }, [IFLA_GENEVE_REMOTE] = { .len = sizeof_field(struct iphdr, daddr) }, [IFLA_GENEVE_REMOTE6] = { .len = sizeof(struct in6_addr) }, [IFLA_GENEVE_TTL] = { .type = NLA_U8 }, [IFLA_GENEVE_TOS] = { .type = NLA_U8 }, [IFLA_GENEVE_LABEL] = { .type = NLA_U32 }, [IFLA_GENEVE_PORT] = { .type = NLA_U16 }, [IFLA_GENEVE_COLLECT_METADATA] = { .type = NLA_FLAG }, [IFLA_GENEVE_UDP_CSUM] = { .type = NLA_U8 }, [IFLA_GENEVE_UDP_ZERO_CSUM6_TX] = { .type = NLA_U8 }, [IFLA_GENEVE_UDP_ZERO_CSUM6_RX] = { .type = NLA_U8 }, [IFLA_GENEVE_TTL_INHERIT] = { .type = NLA_U8 }, [IFLA_GENEVE_DF] = { .type = NLA_U8 }, [IFLA_GENEVE_INNER_PROTO_INHERIT] = { .type = NLA_FLAG }, [IFLA_GENEVE_PORT_RANGE] = NLA_POLICY_EXACT_LEN(sizeof(struct ifla_geneve_port_range)), }; static int geneve_validate(struct 
nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack) { if (tb[IFLA_ADDRESS]) { if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN) { NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_ADDRESS], "Provided link layer address is not Ethernet"); return -EINVAL; } if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS]))) { NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_ADDRESS], "Provided Ethernet address is not unicast"); return -EADDRNOTAVAIL; } } if (!data) { NL_SET_ERR_MSG(extack, "Not enough attributes provided to perform the operation"); return -EINVAL; } if (data[IFLA_GENEVE_ID]) { __u32 vni = nla_get_u32(data[IFLA_GENEVE_ID]); if (vni >= GENEVE_N_VID) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_ID], "Geneve ID must be lower than 16777216"); return -ERANGE; } } if (data[IFLA_GENEVE_DF]) { enum ifla_geneve_df df = nla_get_u8(data[IFLA_GENEVE_DF]); if (df < 0 || df > GENEVE_DF_MAX) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_DF], "Invalid DF attribute"); return -EINVAL; } } if (data[IFLA_GENEVE_PORT_RANGE]) { const struct ifla_geneve_port_range *p; p = nla_data(data[IFLA_GENEVE_PORT_RANGE]); if (ntohs(p->high) < ntohs(p->low)) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_PORT_RANGE], "Invalid source port range"); return -EINVAL; } } return 0; } static struct geneve_dev *geneve_find_dev(struct geneve_net *gn, const struct ip_tunnel_info *info, bool *tun_on_same_port, bool *tun_collect_md) { struct geneve_dev *geneve, *t = NULL; *tun_on_same_port = false; *tun_collect_md = false; list_for_each_entry(geneve, &gn->geneve_list, next) { if (info->key.tp_dst == geneve->cfg.info.key.tp_dst) { *tun_collect_md = geneve->cfg.collect_md; *tun_on_same_port = true; } if (info->key.tun_id == geneve->cfg.info.key.tun_id && info->key.tp_dst == geneve->cfg.info.key.tp_dst && !memcmp(&info->key.u, &geneve->cfg.info.key.u, sizeof(info->key.u))) t = geneve; } return t; } static bool is_tnl_info_zero(const struct ip_tunnel_info *info) { return !(info->key.tun_id || info->key.tos || 
!ip_tunnel_flags_empty(info->key.tun_flags) || info->key.ttl || info->key.label || info->key.tp_src || memchr_inv(&info->key.u, 0, sizeof(info->key.u))); } static bool geneve_dst_addr_equal(struct ip_tunnel_info *a, struct ip_tunnel_info *b) { if (ip_tunnel_info_af(a) == AF_INET) return a->key.u.ipv4.dst == b->key.u.ipv4.dst; else return ipv6_addr_equal(&a->key.u.ipv6.dst, &b->key.u.ipv6.dst); } static int geneve_configure(struct net *net, struct net_device *dev, struct netlink_ext_ack *extack, const struct geneve_config *cfg) { struct geneve_net *gn = net_generic(net, geneve_net_id); struct geneve_dev *t, *geneve = netdev_priv(dev); const struct ip_tunnel_info *info = &cfg->info; bool tun_collect_md, tun_on_same_port; int err, encap_len; if (cfg->collect_md && !is_tnl_info_zero(info)) { NL_SET_ERR_MSG(extack, "Device is externally controlled, so attributes (VNI, Port, and so on) must not be specified"); return -EINVAL; } geneve->net = net; geneve->dev = dev; t = geneve_find_dev(gn, info, &tun_on_same_port, &tun_collect_md); if (t) return -EBUSY; /* make enough headroom for basic scenario */ encap_len = GENEVE_BASE_HLEN + ETH_HLEN; if (!cfg->collect_md && ip_tunnel_info_af(info) == AF_INET) { encap_len += sizeof(struct iphdr); dev->max_mtu -= sizeof(struct iphdr); } else { encap_len += sizeof(struct ipv6hdr); dev->max_mtu -= sizeof(struct ipv6hdr); } dev->needed_headroom = encap_len + ETH_HLEN; if (cfg->collect_md) { if (tun_on_same_port) { NL_SET_ERR_MSG(extack, "There can be only one externally controlled device on a destination port"); return -EPERM; } } else { if (tun_collect_md) { NL_SET_ERR_MSG(extack, "There already exists an externally controlled device on this destination port"); return -EPERM; } } dst_cache_reset(&geneve->cfg.info.dst_cache); memcpy(&geneve->cfg, cfg, sizeof(*cfg)); if (geneve->cfg.inner_proto_inherit) { dev->header_ops = NULL; dev->type = ARPHRD_NONE; dev->hard_header_len = 0; dev->addr_len = 0; dev->flags = IFF_POINTOPOINT | IFF_NOARP; 
} err = register_netdevice(dev); if (err) return err; list_add(&geneve->next, &gn->geneve_list); return 0; } static void init_tnl_info(struct ip_tunnel_info *info, __u16 dst_port) { memset(info, 0, sizeof(*info)); info->key.tp_dst = htons(dst_port); } static int geneve_nl2info(struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack, struct geneve_config *cfg, bool changelink) { struct ip_tunnel_info *info = &cfg->info; int attrtype; if (data[IFLA_GENEVE_REMOTE] && data[IFLA_GENEVE_REMOTE6]) { NL_SET_ERR_MSG(extack, "Cannot specify both IPv4 and IPv6 Remote addresses"); return -EINVAL; } if (data[IFLA_GENEVE_REMOTE]) { if (changelink && (ip_tunnel_info_af(info) == AF_INET6)) { attrtype = IFLA_GENEVE_REMOTE; goto change_notsup; } info->key.u.ipv4.dst = nla_get_in_addr(data[IFLA_GENEVE_REMOTE]); if (ipv4_is_multicast(info->key.u.ipv4.dst)) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_REMOTE], "Remote IPv4 address cannot be Multicast"); return -EINVAL; } } if (data[IFLA_GENEVE_REMOTE6]) { #if IS_ENABLED(CONFIG_IPV6) if (changelink && (ip_tunnel_info_af(info) == AF_INET)) { attrtype = IFLA_GENEVE_REMOTE6; goto change_notsup; } info->mode = IP_TUNNEL_INFO_IPV6; info->key.u.ipv6.dst = nla_get_in6_addr(data[IFLA_GENEVE_REMOTE6]); if (ipv6_addr_type(&info->key.u.ipv6.dst) & IPV6_ADDR_LINKLOCAL) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_REMOTE6], "Remote IPv6 address cannot be link-local"); return -EINVAL; } if (ipv6_addr_is_multicast(&info->key.u.ipv6.dst)) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_REMOTE6], "Remote IPv6 address cannot be Multicast"); return -EINVAL; } __set_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags); cfg->use_udp6_rx_checksums = true; #else NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_REMOTE6], "IPv6 support not enabled in the kernel"); return -EPFNOSUPPORT; #endif } if (data[IFLA_GENEVE_ID]) { __u32 vni; __u8 tvni[3]; __be64 tunid; vni = nla_get_u32(data[IFLA_GENEVE_ID]); tvni[0] = (vni & 0x00ff0000) >> 16; tvni[1] = (vni 
& 0x0000ff00) >> 8; tvni[2] = vni & 0x000000ff; tunid = vni_to_tunnel_id(tvni); if (changelink && (tunid != info->key.tun_id)) { attrtype = IFLA_GENEVE_ID; goto change_notsup; } info->key.tun_id = tunid; } if (data[IFLA_GENEVE_TTL_INHERIT]) { if (nla_get_u8(data[IFLA_GENEVE_TTL_INHERIT])) cfg->ttl_inherit = true; else cfg->ttl_inherit = false; } else if (data[IFLA_GENEVE_TTL]) { info->key.ttl = nla_get_u8(data[IFLA_GENEVE_TTL]); cfg->ttl_inherit = false; } if (data[IFLA_GENEVE_TOS]) info->key.tos = nla_get_u8(data[IFLA_GENEVE_TOS]); if (data[IFLA_GENEVE_DF]) cfg->df = nla_get_u8(data[IFLA_GENEVE_DF]); if (data[IFLA_GENEVE_LABEL]) { info->key.label = nla_get_be32(data[IFLA_GENEVE_LABEL]) & IPV6_FLOWLABEL_MASK; if (info->key.label && (!(info->mode & IP_TUNNEL_INFO_IPV6))) { NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_LABEL], "Label attribute only applies for IPv6 Geneve devices"); return -EINVAL; } } if (data[IFLA_GENEVE_PORT]) { if (changelink) { attrtype = IFLA_GENEVE_PORT; goto change_notsup; } info->key.tp_dst = nla_get_be16(data[IFLA_GENEVE_PORT]); } if (data[IFLA_GENEVE_PORT_RANGE]) { const struct ifla_geneve_port_range *p; if (changelink) { attrtype = IFLA_GENEVE_PORT_RANGE; goto change_notsup; } p = nla_data(data[IFLA_GENEVE_PORT_RANGE]); cfg->port_min = ntohs(p->low); cfg->port_max = ntohs(p->high); } if (data[IFLA_GENEVE_COLLECT_METADATA]) { if (changelink) { attrtype = IFLA_GENEVE_COLLECT_METADATA; goto change_notsup; } cfg->collect_md = true; } if (data[IFLA_GENEVE_UDP_CSUM]) { if (changelink) { attrtype = IFLA_GENEVE_UDP_CSUM; goto change_notsup; } if (nla_get_u8(data[IFLA_GENEVE_UDP_CSUM])) __set_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags); } if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX]) { #if IS_ENABLED(CONFIG_IPV6) if (changelink) { attrtype = IFLA_GENEVE_UDP_ZERO_CSUM6_TX; goto change_notsup; } if (nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX])) __clear_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags); #else NL_SET_ERR_MSG_ATTR(extack, 
data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX], "IPv6 support not enabled in the kernel"); return -EPFNOSUPPORT; #endif } if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_RX]) { #if IS_ENABLED(CONFIG_IPV6) if (changelink) { attrtype = IFLA_GENEVE_UDP_ZERO_CSUM6_RX; goto change_notsup; } if (nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_RX])) cfg->use_udp6_rx_checksums = false; #else NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_UDP_ZERO_CSUM6_RX], "IPv6 support not enabled in the kernel"); return -EPFNOSUPPORT; #endif } if (data[IFLA_GENEVE_INNER_PROTO_INHERIT]) { if (changelink) { attrtype = IFLA_GENEVE_INNER_PROTO_INHERIT; goto change_notsup; } cfg->inner_proto_inherit = true; } return 0; change_notsup: NL_SET_ERR_MSG_ATTR(extack, data[attrtype], "Changing VNI, Port, endpoint IP address family, external, inner_proto_inherit, and UDP checksum attributes are not supported"); return -EOPNOTSUPP; } static void geneve_link_config(struct net_device *dev, struct ip_tunnel_info *info, struct nlattr *tb[]) { struct geneve_dev *geneve = netdev_priv(dev); int ldev_mtu = 0; if (tb[IFLA_MTU]) { geneve_change_mtu(dev, nla_get_u32(tb[IFLA_MTU])); return; } switch (ip_tunnel_info_af(info)) { case AF_INET: { struct flowi4 fl4 = { .daddr = info->key.u.ipv4.dst }; struct rtable *rt = ip_route_output_key(geneve->net, &fl4); if (!IS_ERR(rt) && rt->dst.dev) { ldev_mtu = rt->dst.dev->mtu - GENEVE_IPV4_HLEN; ip_rt_put(rt); } break; } #if IS_ENABLED(CONFIG_IPV6) case AF_INET6: { struct rt6_info *rt; if (!__in6_dev_get(dev)) break; rt = rt6_lookup(geneve->net, &info->key.u.ipv6.dst, NULL, 0, NULL, 0); if (rt && rt->dst.dev) ldev_mtu = rt->dst.dev->mtu - GENEVE_IPV6_HLEN; ip6_rt_put(rt); break; } #endif } if (ldev_mtu <= 0) return; geneve_change_mtu(dev, ldev_mtu - info->options_len); } static int geneve_newlink(struct net_device *dev, struct rtnl_newlink_params *params, struct netlink_ext_ack *extack) { struct net *link_net = rtnl_newlink_link_net(params); struct nlattr **data = params->data; struct nlattr **tb = 
params->tb; struct geneve_config cfg = { .df = GENEVE_DF_UNSET, .use_udp6_rx_checksums = false, .ttl_inherit = false, .collect_md = false, .port_min = 1, .port_max = USHRT_MAX, }; int err; init_tnl_info(&cfg.info, GENEVE_UDP_PORT); err = geneve_nl2info(tb, data, extack, &cfg, false); if (err) return err; err = geneve_configure(link_net, dev, extack, &cfg); if (err) return err; geneve_link_config(dev, &cfg.info, tb); return 0; } /* Quiesces the geneve device data path for both TX and RX. * * On transmit, geneve checks for non-NULL geneve_sock before it proceeds. * So, if we set that socket to NULL under RCU and wait for synchronize_net() * to complete for the existing set of in-flight packets to be transmitted, * then we would have quiesced the transmit data path. All the future packets * will get dropped until we unquiesce the data path. * * On receive, geneve dereferences the geneve_sock stashed in the socket. So, * if we set that to NULL under RCU and wait for synchronize_net() to * complete, then we would have quiesced the receive data path. */ static void geneve_quiesce(struct geneve_dev *geneve, struct geneve_sock **gs4, struct geneve_sock **gs6) { *gs4 = rtnl_dereference(geneve->sock4); rcu_assign_pointer(geneve->sock4, NULL); if (*gs4) rcu_assign_sk_user_data((*gs4)->sock->sk, NULL); #if IS_ENABLED(CONFIG_IPV6) *gs6 = rtnl_dereference(geneve->sock6); rcu_assign_pointer(geneve->sock6, NULL); if (*gs6) rcu_assign_sk_user_data((*gs6)->sock->sk, NULL); #else *gs6 = NULL; #endif synchronize_net(); } /* Resumes the geneve device data path for both TX and RX. 
*/ static void geneve_unquiesce(struct geneve_dev *geneve, struct geneve_sock *gs4, struct geneve_sock __maybe_unused *gs6) { rcu_assign_pointer(geneve->sock4, gs4); if (gs4) rcu_assign_sk_user_data(gs4->sock->sk, gs4); #if IS_ENABLED(CONFIG_IPV6) rcu_assign_pointer(geneve->sock6, gs6); if (gs6) rcu_assign_sk_user_data(gs6->sock->sk, gs6); #endif synchronize_net(); } static int geneve_changelink(struct net_device *dev, struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack) { struct geneve_dev *geneve = netdev_priv(dev); struct geneve_sock *gs4, *gs6; struct geneve_config cfg; int err; /* If the geneve device is configured for metadata (or externally * controlled, for example, OVS), then nothing can be changed. */ if (geneve->cfg.collect_md) return -EOPNOTSUPP; /* Start with the existing info. */ memcpy(&cfg, &geneve->cfg, sizeof(cfg)); err = geneve_nl2info(tb, data, extack, &cfg, true); if (err) return err; if (!geneve_dst_addr_equal(&geneve->cfg.info, &cfg.info)) { dst_cache_reset(&cfg.info.dst_cache); geneve_link_config(dev, &cfg.info, tb); } geneve_quiesce(geneve, &gs4, &gs6); memcpy(&geneve->cfg, &cfg, sizeof(cfg)); geneve_unquiesce(geneve, gs4, gs6); return 0; } static void geneve_dellink(struct net_device *dev, struct list_head *head) { struct geneve_dev *geneve = netdev_priv(dev); list_del(&geneve->next); unregister_netdevice_queue(dev, head); } static size_t geneve_get_size(const struct net_device *dev) { return nla_total_size(sizeof(__u32)) + /* IFLA_GENEVE_ID */ nla_total_size(sizeof(struct in6_addr)) + /* IFLA_GENEVE_REMOTE{6} */ nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_TTL */ nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_TOS */ nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_DF */ nla_total_size(sizeof(__be32)) + /* IFLA_GENEVE_LABEL */ nla_total_size(sizeof(__be16)) + /* IFLA_GENEVE_PORT */ nla_total_size(0) + /* IFLA_GENEVE_COLLECT_METADATA */ nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_CSUM */ 
nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_ZERO_CSUM6_TX */ nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_ZERO_CSUM6_RX */ nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_TTL_INHERIT */ nla_total_size(0) + /* IFLA_GENEVE_INNER_PROTO_INHERIT */ nla_total_size(sizeof(struct ifla_geneve_port_range)) + /* IFLA_GENEVE_PORT_RANGE */ 0; } static int geneve_fill_info(struct sk_buff *skb, const struct net_device *dev) { struct geneve_dev *geneve = netdev_priv(dev); struct ip_tunnel_info *info = &geneve->cfg.info; bool ttl_inherit = geneve->cfg.ttl_inherit; bool metadata = geneve->cfg.collect_md; struct ifla_geneve_port_range ports = { .low = htons(geneve->cfg.port_min), .high = htons(geneve->cfg.port_max), }; __u8 tmp_vni[3]; __u32 vni; tunnel_id_to_vni(info->key.tun_id, tmp_vni); vni = (tmp_vni[0] << 16) | (tmp_vni[1] << 8) | tmp_vni[2]; if (nla_put_u32(skb, IFLA_GENEVE_ID, vni)) goto nla_put_failure; if (!metadata && ip_tunnel_info_af(info) == AF_INET) { if (nla_put_in_addr(skb, IFLA_GENEVE_REMOTE, info->key.u.ipv4.dst)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM, test_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags))) goto nla_put_failure; #if IS_ENABLED(CONFIG_IPV6) } else if (!metadata) { if (nla_put_in6_addr(skb, IFLA_GENEVE_REMOTE6, &info->key.u.ipv6.dst)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX, !test_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags))) goto nla_put_failure; #endif } if (nla_put_u8(skb, IFLA_GENEVE_TTL, info->key.ttl) || nla_put_u8(skb, IFLA_GENEVE_TOS, info->key.tos) || nla_put_be32(skb, IFLA_GENEVE_LABEL, info->key.label)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_GENEVE_DF, geneve->cfg.df)) goto nla_put_failure; if (nla_put_be16(skb, IFLA_GENEVE_PORT, info->key.tp_dst)) goto nla_put_failure; if (metadata && nla_put_flag(skb, IFLA_GENEVE_COLLECT_METADATA)) goto nla_put_failure; #if IS_ENABLED(CONFIG_IPV6) if (nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX, !geneve->cfg.use_udp6_rx_checksums)) 
goto nla_put_failure; #endif if (nla_put_u8(skb, IFLA_GENEVE_TTL_INHERIT, ttl_inherit)) goto nla_put_failure; if (geneve->cfg.inner_proto_inherit && nla_put_flag(skb, IFLA_GENEVE_INNER_PROTO_INHERIT)) goto nla_put_failure; if (nla_put(skb, IFLA_GENEVE_PORT_RANGE, sizeof(ports), &ports)) goto nla_put_failure; return 0; nla_put_failure: return -EMSGSIZE; } static struct rtnl_link_ops geneve_link_ops __read_mostly = { .kind = "geneve", .maxtype = IFLA_GENEVE_MAX, .policy = geneve_policy, .priv_size = sizeof(struct geneve_dev), .setup = geneve_setup, .validate = geneve_validate, .newlink = geneve_newlink, .changelink = geneve_changelink, .dellink = geneve_dellink, .get_size = geneve_get_size, .fill_info = geneve_fill_info, }; struct net_device *geneve_dev_create_fb(struct net *net, const char *name, u8 name_assign_type, u16 dst_port) { struct nlattr *tb[IFLA_MAX + 1]; struct net_device *dev; LIST_HEAD(list_kill); int err; struct geneve_config cfg = { .df = GENEVE_DF_UNSET, .use_udp6_rx_checksums = true, .ttl_inherit = false, .collect_md = true, .port_min = 1, .port_max = USHRT_MAX, }; memset(tb, 0, sizeof(tb)); dev = rtnl_create_link(net, name, name_assign_type, &geneve_link_ops, tb, NULL); if (IS_ERR(dev)) return dev; init_tnl_info(&cfg.info, dst_port); err = geneve_configure(net, dev, NULL, &cfg); if (err) { free_netdev(dev); return ERR_PTR(err); } /* openvswitch users expect packet sizes to be unrestricted, * so set the largest MTU we can. 
*/ err = geneve_change_mtu(dev, IP_MAX_MTU); if (err) goto err; err = rtnl_configure_link(dev, NULL, 0, NULL); if (err < 0) goto err; return dev; err: geneve_dellink(dev, &list_kill); unregister_netdevice_many(&list_kill); return ERR_PTR(err); } EXPORT_SYMBOL_GPL(geneve_dev_create_fb); static int geneve_netdevice_event(struct notifier_block *unused, unsigned long event, void *ptr) { struct net_device *dev = netdev_notifier_info_to_dev(ptr); if (event == NETDEV_UDP_TUNNEL_PUSH_INFO) geneve_offload_rx_ports(dev, true); else if (event == NETDEV_UDP_TUNNEL_DROP_INFO) geneve_offload_rx_ports(dev, false); return NOTIFY_DONE; } static struct notifier_block geneve_notifier_block __read_mostly = { .notifier_call = geneve_netdevice_event, }; static __net_init int geneve_init_net(struct net *net) { struct geneve_net *gn = net_generic(net, geneve_net_id); INIT_LIST_HEAD(&gn->geneve_list); INIT_LIST_HEAD(&gn->sock_list); return 0; } static void geneve_destroy_tunnels(struct net *net, struct list_head *head) { struct geneve_net *gn = net_generic(net, geneve_net_id); struct geneve_dev *geneve, *next; list_for_each_entry_safe(geneve, next, &gn->geneve_list, next) geneve_dellink(geneve->dev, head); } static void __net_exit geneve_exit_batch_rtnl(struct list_head *net_list, struct list_head *dev_to_kill) { struct net *net; list_for_each_entry(net, net_list, exit_list) geneve_destroy_tunnels(net, dev_to_kill); } static void __net_exit geneve_exit_net(struct net *net) { const struct geneve_net *gn = net_generic(net, geneve_net_id); WARN_ON_ONCE(!list_empty(&gn->sock_list)); } static struct pernet_operations geneve_net_ops = { .init = geneve_init_net, .exit_batch_rtnl = geneve_exit_batch_rtnl, .exit = geneve_exit_net, .id = &geneve_net_id, .size = sizeof(struct geneve_net), }; static int __init geneve_init_module(void) { int rc; rc = register_pernet_subsys(&geneve_net_ops); if (rc) goto out1; rc = register_netdevice_notifier(&geneve_notifier_block); if (rc) goto out2; rc = 
rtnl_link_register(&geneve_link_ops);
	if (rc)
		goto out3;

	return 0;
out3:
	unregister_netdevice_notifier(&geneve_notifier_block);
out2:
	unregister_pernet_subsys(&geneve_net_ops);
out1:
	return rc;
}
late_initcall(geneve_init_module);

static void __exit geneve_cleanup_module(void)
{
	rtnl_link_unregister(&geneve_link_ops);
	unregister_netdevice_notifier(&geneve_notifier_block);
	unregister_pernet_subsys(&geneve_net_ops);
}
module_exit(geneve_cleanup_module);

MODULE_LICENSE("GPL");
MODULE_VERSION(GENEVE_NETDEV_VER);
MODULE_AUTHOR("John W. Linville <linville@tuxdriver.com>");
MODULE_DESCRIPTION("Interface driver for GENEVE encapsulated traffic");
MODULE_ALIAS_RTNL_LINK("geneve");
|
// SPDX-License-Identifier:
GPL-2.0-only /* * linux/fs/adfs/super.c * * Copyright (C) 1997-1999 Russell King */ #include <linux/module.h> #include <linux/init.h> #include <linux/fs_parser.h> #include <linux/fs_context.h> #include <linux/mount.h> #include <linux/seq_file.h> #include <linux/slab.h> #include <linux/statfs.h> #include <linux/user_namespace.h> #include <linux/blkdev.h> #include "adfs.h" #include "dir_f.h" #include "dir_fplus.h" #define ADFS_SB_FLAGS SB_NOATIME #define ADFS_DEFAULT_OWNER_MASK S_IRWXU #define ADFS_DEFAULT_OTHER_MASK (S_IRWXG | S_IRWXO) void __adfs_error(struct super_block *sb, const char *function, const char *fmt, ...) { struct va_format vaf; va_list args; va_start(args, fmt); vaf.fmt = fmt; vaf.va = &args; printk(KERN_CRIT "ADFS-fs error (device %s)%s%s: %pV\n", sb->s_id, function ? ": " : "", function ? function : "", &vaf); va_end(args); } void adfs_msg(struct super_block *sb, const char *pfx, const char *fmt, ...) { struct va_format vaf; va_list args; va_start(args, fmt); vaf.fmt = fmt; vaf.va = &args; printk("%sADFS-fs (%s): %pV\n", pfx, sb->s_id, &vaf); va_end(args); } static int adfs_checkdiscrecord(struct adfs_discrecord *dr) { unsigned int max_idlen; int i; /* sector size must be 256, 512 or 1024 bytes */ if (dr->log2secsize != 8 && dr->log2secsize != 9 && dr->log2secsize != 10) return 1; /* idlen must be at least log2secsize + 3 */ if (dr->idlen < dr->log2secsize + 3) return 1; /* we cannot have such a large disc that we * are unable to represent sector offsets in * 32 bits. This works out at 2.0 TB. */ if (le32_to_cpu(dr->disc_size_high) >> dr->log2secsize) return 1; /* * Maximum idlen is limited to 16 bits for new directories by * the three-byte storage of an indirect disc address. For * big directories, idlen must be no greater than 19 v2 [1.0] */ max_idlen = dr->format_version ? 
19 : 16; if (dr->idlen > max_idlen) return 1; /* reserved bytes should be zero */ for (i = 0; i < sizeof(dr->unused52); i++) if (dr->unused52[i] != 0) return 1; return 0; } static void adfs_put_super(struct super_block *sb) { struct adfs_sb_info *asb = ADFS_SB(sb); adfs_free_map(sb); kfree_rcu(asb, rcu); } static int adfs_show_options(struct seq_file *seq, struct dentry *root) { struct adfs_sb_info *asb = ADFS_SB(root->d_sb); if (!uid_eq(asb->s_uid, GLOBAL_ROOT_UID)) seq_printf(seq, ",uid=%u", from_kuid_munged(&init_user_ns, asb->s_uid)); if (!gid_eq(asb->s_gid, GLOBAL_ROOT_GID)) seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, asb->s_gid)); if (asb->s_owner_mask != ADFS_DEFAULT_OWNER_MASK) seq_printf(seq, ",ownmask=%o", asb->s_owner_mask); if (asb->s_other_mask != ADFS_DEFAULT_OTHER_MASK) seq_printf(seq, ",othmask=%o", asb->s_other_mask); if (asb->s_ftsuffix != 0) seq_printf(seq, ",ftsuffix=%u", asb->s_ftsuffix); return 0; } enum {Opt_uid, Opt_gid, Opt_ownmask, Opt_othmask, Opt_ftsuffix}; static const struct fs_parameter_spec adfs_param_spec[] = { fsparam_uid ("uid", Opt_uid), fsparam_gid ("gid", Opt_gid), fsparam_u32oct ("ownmask", Opt_ownmask), fsparam_u32oct ("othmask", Opt_othmask), fsparam_u32 ("ftsuffix", Opt_ftsuffix), {} }; static int adfs_parse_param(struct fs_context *fc, struct fs_parameter *param) { struct adfs_sb_info *asb = fc->s_fs_info; struct fs_parse_result result; int opt; opt = fs_parse(fc, adfs_param_spec, param, &result); if (opt < 0) return opt; switch (opt) { case Opt_uid: asb->s_uid = result.uid; break; case Opt_gid: asb->s_gid = result.gid; break; case Opt_ownmask: asb->s_owner_mask = result.uint_32; break; case Opt_othmask: asb->s_other_mask = result.uint_32; break; case Opt_ftsuffix: asb->s_ftsuffix = result.uint_32; break; default: return -EINVAL; } return 0; } static int adfs_reconfigure(struct fs_context *fc) { struct adfs_sb_info *new_asb = fc->s_fs_info; struct adfs_sb_info *asb = ADFS_SB(fc->root->d_sb); 
sync_filesystem(fc->root->d_sb); fc->sb_flags |= ADFS_SB_FLAGS; /* Structure copy newly parsed options */ *asb = *new_asb; return 0; } static int adfs_statfs(struct dentry *dentry, struct kstatfs *buf) { struct super_block *sb = dentry->d_sb; struct adfs_sb_info *sbi = ADFS_SB(sb); u64 id = huge_encode_dev(sb->s_bdev->bd_dev); adfs_map_statfs(sb, buf); buf->f_type = ADFS_SUPER_MAGIC; buf->f_namelen = sbi->s_namelen; buf->f_bsize = sb->s_blocksize; buf->f_ffree = (long)(buf->f_bfree * buf->f_files) / (long)buf->f_blocks; buf->f_fsid = u64_to_fsid(id); return 0; } static struct kmem_cache *adfs_inode_cachep; static struct inode *adfs_alloc_inode(struct super_block *sb) { struct adfs_inode_info *ei; ei = alloc_inode_sb(sb, adfs_inode_cachep, GFP_KERNEL); if (!ei) return NULL; return &ei->vfs_inode; } static void adfs_free_inode(struct inode *inode) { kmem_cache_free(adfs_inode_cachep, ADFS_I(inode)); } static int adfs_drop_inode(struct inode *inode) { /* always drop inodes if we are read-only */ return !IS_ENABLED(CONFIG_ADFS_FS_RW) || IS_RDONLY(inode); } static void init_once(void *foo) { struct adfs_inode_info *ei = (struct adfs_inode_info *) foo; inode_init_once(&ei->vfs_inode); } static int __init init_inodecache(void) { adfs_inode_cachep = kmem_cache_create("adfs_inode_cache", sizeof(struct adfs_inode_info), 0, (SLAB_RECLAIM_ACCOUNT| SLAB_ACCOUNT), init_once); if (adfs_inode_cachep == NULL) return -ENOMEM; return 0; } static void destroy_inodecache(void) { /* * Make sure all delayed rcu free inodes are flushed before we * destroy cache. 
*/ rcu_barrier(); kmem_cache_destroy(adfs_inode_cachep); } static const struct super_operations adfs_sops = { .alloc_inode = adfs_alloc_inode, .free_inode = adfs_free_inode, .drop_inode = adfs_drop_inode, .write_inode = adfs_write_inode, .put_super = adfs_put_super, .statfs = adfs_statfs, .show_options = adfs_show_options, }; static int adfs_probe(struct super_block *sb, unsigned int offset, int silent, int (*validate)(struct super_block *sb, struct buffer_head *bh, struct adfs_discrecord **bhp)) { struct adfs_sb_info *asb = ADFS_SB(sb); struct adfs_discrecord *dr; struct buffer_head *bh; unsigned int blocksize = BLOCK_SIZE; int ret, try; for (try = 0; try < 2; try++) { /* try to set the requested block size */ if (sb->s_blocksize != blocksize && !sb_set_blocksize(sb, blocksize)) { if (!silent) adfs_msg(sb, KERN_ERR, "error: unsupported blocksize"); return -EINVAL; } /* read the buffer */ bh = sb_bread(sb, offset >> sb->s_blocksize_bits); if (!bh) { adfs_msg(sb, KERN_ERR, "error: unable to read block %u, try %d", offset >> sb->s_blocksize_bits, try); return -EIO; } /* validate it */ ret = validate(sb, bh, &dr); if (ret) { brelse(bh); return ret; } /* does the block size match the filesystem block size? 
*/ blocksize = 1 << dr->log2secsize; if (sb->s_blocksize == blocksize) { asb->s_map = adfs_read_map(sb, dr); brelse(bh); return PTR_ERR_OR_ZERO(asb->s_map); } brelse(bh); } return -EIO; } static int adfs_validate_bblk(struct super_block *sb, struct buffer_head *bh, struct adfs_discrecord **drp) { struct adfs_discrecord *dr; unsigned char *b_data; b_data = bh->b_data + (ADFS_DISCRECORD % sb->s_blocksize); if (adfs_checkbblk(b_data)) return -EILSEQ; /* Do some sanity checks on the ADFS disc record */ dr = (struct adfs_discrecord *)(b_data + ADFS_DR_OFFSET); if (adfs_checkdiscrecord(dr)) return -EILSEQ; *drp = dr; return 0; } static int adfs_validate_dr0(struct super_block *sb, struct buffer_head *bh, struct adfs_discrecord **drp) { struct adfs_discrecord *dr; /* Do some sanity checks on the ADFS disc record */ dr = (struct adfs_discrecord *)(bh->b_data + 4); if (adfs_checkdiscrecord(dr) || dr->nzones_high || dr->nzones != 1) return -EILSEQ; *drp = dr; return 0; } static int adfs_fill_super(struct super_block *sb, struct fs_context *fc) { struct adfs_discrecord *dr; struct object_info root_obj; struct adfs_sb_info *asb = sb->s_fs_info; struct inode *root; int ret = -EINVAL; int silent = fc->sb_flags & SB_SILENT; sb->s_flags |= ADFS_SB_FLAGS; sb->s_fs_info = asb; sb->s_magic = ADFS_SUPER_MAGIC; sb->s_time_gran = 10000000; /* Try to probe the filesystem boot block */ ret = adfs_probe(sb, ADFS_DISCRECORD, 1, adfs_validate_bblk); if (ret == -EILSEQ) ret = adfs_probe(sb, 0, silent, adfs_validate_dr0); if (ret == -EILSEQ) { if (!silent) adfs_msg(sb, KERN_ERR, "error: can't find an ADFS filesystem on dev %s.", sb->s_id); ret = -EINVAL; } if (ret) goto error; /* set up enough so that we can read an inode */ sb->s_op = &adfs_sops; dr = adfs_map_discrecord(asb->s_map); root_obj.parent_id = root_obj.indaddr = le32_to_cpu(dr->root); root_obj.name_len = 0; /* Set root object date as 01 Jan 1987 00:00:00 */ root_obj.loadaddr = 0xfff0003f; root_obj.execaddr = 0xec22c000; 
root_obj.size = ADFS_NEWDIR_SIZE; root_obj.attr = ADFS_NDA_DIRECTORY | ADFS_NDA_OWNER_READ | ADFS_NDA_OWNER_WRITE | ADFS_NDA_PUBLIC_READ; /* * If this is a F+ disk with variable length directories, * get the root_size from the disc record. */ if (dr->format_version) { root_obj.size = le32_to_cpu(dr->root_size); asb->s_dir = &adfs_fplus_dir_ops; asb->s_namelen = ADFS_FPLUS_NAME_LEN; } else { asb->s_dir = &adfs_f_dir_ops; asb->s_namelen = ADFS_F_NAME_LEN; } /* * ,xyz hex filetype suffix may be added by driver * to files that have valid RISC OS filetype */ if (asb->s_ftsuffix) asb->s_namelen += 4; sb->s_d_op = &adfs_dentry_operations; root = adfs_iget(sb, &root_obj); sb->s_root = d_make_root(root); if (!sb->s_root) { adfs_free_map(sb); adfs_error(sb, "get root inode failed\n"); ret = -EIO; goto error; } return 0; error: sb->s_fs_info = NULL; kfree(asb); return ret; } static int adfs_get_tree(struct fs_context *fc) { return get_tree_bdev(fc, adfs_fill_super); } static void adfs_free_fc(struct fs_context *fc) { struct adfs_context *asb = fc->s_fs_info; kfree(asb); } static const struct fs_context_operations adfs_context_ops = { .parse_param = adfs_parse_param, .get_tree = adfs_get_tree, .reconfigure = adfs_reconfigure, .free = adfs_free_fc, }; static int adfs_init_fs_context(struct fs_context *fc) { struct adfs_sb_info *asb; asb = kzalloc(sizeof(struct adfs_sb_info), GFP_KERNEL); if (!asb) return -ENOMEM; if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE) { struct super_block *sb = fc->root->d_sb; struct adfs_sb_info *old_asb = ADFS_SB(sb); /* structure copy existing options before parsing */ *asb = *old_asb; } else { /* set default options */ asb->s_uid = GLOBAL_ROOT_UID; asb->s_gid = GLOBAL_ROOT_GID; asb->s_owner_mask = ADFS_DEFAULT_OWNER_MASK; asb->s_other_mask = ADFS_DEFAULT_OTHER_MASK; asb->s_ftsuffix = 0; } fc->ops = &adfs_context_ops; fc->s_fs_info = asb; return 0; } static struct file_system_type adfs_fs_type = { .owner = THIS_MODULE, .name = "adfs", .kill_sb = 
kill_block_super,
	.fs_flags	= FS_REQUIRES_DEV,
	.init_fs_context = adfs_init_fs_context,
	.parameters	= adfs_param_spec,
};
MODULE_ALIAS_FS("adfs");

static int __init init_adfs_fs(void)
{
	int err = init_inodecache();
	if (err)
		goto out1;
	err = register_filesystem(&adfs_fs_type);
	if (err)
		goto out;
	return 0;
out:
	destroy_inodecache();
out1:
	return err;
}

static void __exit exit_adfs_fs(void)
{
	unregister_filesystem(&adfs_fs_type);
	destroy_inodecache();
}

module_init(init_adfs_fs)
module_exit(exit_adfs_fs)
MODULE_DESCRIPTION("Acorn Disc Filing System");
MODULE_LICENSE("GPL");
|
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_STRING_H_
#define _LINUX_STRING_H_

#include <linux/args.h>
#include <linux/array_size.h>
#include <linux/cleanup.h>	/* for DEFINE_FREE() */
#include <linux/compiler.h>	/* for inline */
#include <linux/types.h>	/* for size_t */
#include <linux/stddef.h>	/* for NULL */
#include <linux/err.h>		/* for ERR_PTR() */
#include <linux/errno.h>	/* for E2BIG */
#include <linux/overflow.h>	/* for check_mul_overflow() */
#include <linux/stdarg.h>
#include <uapi/linux/string.h>

extern char *strndup_user(const char __user *, long);
extern void *memdup_user(const void __user *, size_t) __realloc_size(2);
extern void *vmemdup_user(const void __user *, size_t) __realloc_size(2);
extern void *memdup_user_nul(const void __user *, size_t);

/**
 * memdup_array_user - duplicate array from user space
 * @src: source address in user space
 * @n: number of array members to copy
 * @size: size of one array member
 *
 * Return: an ERR_PTR() on failure. Result is physically
 * contiguous, to be freed by kfree().
 */
static inline __realloc_size(2, 3)
void *memdup_array_user(const void __user *src, size_t n, size_t size)
{
	size_t nbytes;

	if (check_mul_overflow(n, size, &nbytes))
		return ERR_PTR(-EOVERFLOW);

	return memdup_user(src, nbytes);
}

/**
 * vmemdup_array_user - duplicate array from user space
 * @src: source address in user space
 * @n: number of array members to copy
 * @size: size of one array member
 *
 * Return: an ERR_PTR() on failure. Result may be not
 * physically contiguous. Use kvfree() to free.
*/ static inline __realloc_size(2, 3) void *vmemdup_array_user(const void __user *src, size_t n, size_t size) { size_t nbytes; if (check_mul_overflow(n, size, &nbytes)) return ERR_PTR(-EOVERFLOW); return vmemdup_user(src, nbytes); } /* * Include machine specific inline routines */ #include <asm/string.h> #ifndef __HAVE_ARCH_STRCPY extern char * strcpy(char *,const char *); #endif #ifndef __HAVE_ARCH_STRNCPY extern char * strncpy(char *,const char *, __kernel_size_t); #endif ssize_t sized_strscpy(char *, const char *, size_t); /* * The 2 argument style can only be used when dst is an array with a * known size. */ #define __strscpy0(dst, src, ...) \ sized_strscpy(dst, src, sizeof(dst) + __must_be_array(dst) + \ __must_be_cstr(dst) + __must_be_cstr(src)) #define __strscpy1(dst, src, size) \ sized_strscpy(dst, src, size + __must_be_cstr(dst) + __must_be_cstr(src)) #define __strscpy_pad0(dst, src, ...) \ sized_strscpy_pad(dst, src, sizeof(dst) + __must_be_array(dst) + \ __must_be_cstr(dst) + __must_be_cstr(src)) #define __strscpy_pad1(dst, src, size) \ sized_strscpy_pad(dst, src, size + __must_be_cstr(dst) + __must_be_cstr(src)) /** * strscpy - Copy a C-string into a sized buffer * @dst: Where to copy the string to * @src: Where to copy the string from * @...: Size of destination buffer (optional) * * Copy the source string @src, or as much of it as fits, into the * destination @dst buffer. The behavior is undefined if the string * buffers overlap. The destination @dst buffer is always NUL terminated, * unless it's zero-sized. * * The size argument @... is only required when @dst is not an array, or * when the copy needs to be smaller than sizeof(@dst). * * Preferred to strncpy() since it always returns a valid string, and * doesn't unnecessarily force the tail of the destination buffer to be * zero padded. If padding is desired please use strscpy_pad(). 
* * Returns the number of characters copied in @dst (not including the * trailing %NUL) or -E2BIG if @size is 0 or the copy from @src was * truncated. */ #define strscpy(dst, src, ...) \ CONCATENATE(__strscpy, COUNT_ARGS(__VA_ARGS__))(dst, src, __VA_ARGS__) #define sized_strscpy_pad(dest, src, count) ({ \ char *__dst = (dest); \ const char *__src = (src); \ const size_t __count = (count); \ ssize_t __wrote; \ \ __wrote = sized_strscpy(__dst, __src, __count); \ if (__wrote >= 0 && __wrote < __count) \ memset(__dst + __wrote + 1, 0, __count - __wrote - 1); \ __wrote; \ }) /** * strscpy_pad() - Copy a C-string into a sized buffer * @dst: Where to copy the string to * @src: Where to copy the string from * @...: Size of destination buffer * * Copy the string, or as much of it as fits, into the dest buffer. The * behavior is undefined if the string buffers overlap. The destination * buffer is always %NUL terminated, unless it's zero-sized. * * If the source string is shorter than the destination buffer, the * remaining bytes in the buffer will be filled with %NUL bytes. * * For full explanation of why you may want to consider using the * 'strscpy' functions please see the function docstring for strscpy(). * * Returns: * * The number of characters copied (not including the trailing %NULs) * * -E2BIG if count is 0 or @src was truncated. */ #define strscpy_pad(dst, src, ...) 
\ CONCATENATE(__strscpy_pad, COUNT_ARGS(__VA_ARGS__))(dst, src, __VA_ARGS__) #ifndef __HAVE_ARCH_STRCAT extern char * strcat(char *, const char *); #endif #ifndef __HAVE_ARCH_STRNCAT extern char * strncat(char *, const char *, __kernel_size_t); #endif #ifndef __HAVE_ARCH_STRLCAT extern size_t strlcat(char *, const char *, __kernel_size_t); #endif #ifndef __HAVE_ARCH_STRCMP extern int strcmp(const char *,const char *); #endif #ifndef __HAVE_ARCH_STRNCMP extern int strncmp(const char *,const char *,__kernel_size_t); #endif #ifndef __HAVE_ARCH_STRCASECMP extern int strcasecmp(const char *s1, const char *s2); #endif #ifndef __HAVE_ARCH_STRNCASECMP extern int strncasecmp(const char *s1, const char *s2, size_t n); #endif #ifndef __HAVE_ARCH_STRCHR extern char * strchr(const char *,int); #endif #ifndef __HAVE_ARCH_STRCHRNUL extern char * strchrnul(const char *,int); #endif extern char * strnchrnul(const char *, size_t, int); #ifndef __HAVE_ARCH_STRNCHR extern char * strnchr(const char *, size_t, int); #endif #ifndef __HAVE_ARCH_STRRCHR extern char * strrchr(const char *,int); #endif extern char * __must_check skip_spaces(const char *); extern char *strim(char *); static inline __must_check char *strstrip(char *str) { return strim(str); } #ifndef __HAVE_ARCH_STRSTR extern char * strstr(const char *, const char *); #endif #ifndef __HAVE_ARCH_STRNSTR extern char * strnstr(const char *, const char *, size_t); #endif #ifndef __HAVE_ARCH_STRLEN extern __kernel_size_t strlen(const char *); #endif #ifndef __HAVE_ARCH_STRNLEN extern __kernel_size_t strnlen(const char *,__kernel_size_t); #endif #ifndef __HAVE_ARCH_STRPBRK extern char * strpbrk(const char *,const char *); #endif #ifndef __HAVE_ARCH_STRSEP extern char * strsep(char **,const char *); #endif #ifndef __HAVE_ARCH_STRSPN extern __kernel_size_t strspn(const char *,const char *); #endif #ifndef __HAVE_ARCH_STRCSPN extern __kernel_size_t strcspn(const char *,const char *); #endif #ifndef __HAVE_ARCH_MEMSET extern void * 
memset(void *,int,__kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMSET16 extern void *memset16(uint16_t *, uint16_t, __kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMSET32 extern void *memset32(uint32_t *, uint32_t, __kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMSET64 extern void *memset64(uint64_t *, uint64_t, __kernel_size_t); #endif static inline void *memset_l(unsigned long *p, unsigned long v, __kernel_size_t n) { if (BITS_PER_LONG == 32) return memset32((uint32_t *)p, v, n); else return memset64((uint64_t *)p, v, n); } static inline void *memset_p(void **p, void *v, __kernel_size_t n) { if (BITS_PER_LONG == 32) return memset32((uint32_t *)p, (uintptr_t)v, n); else return memset64((uint64_t *)p, (uintptr_t)v, n); } extern void **__memcat_p(void **a, void **b); #define memcat_p(a, b) ({ \ BUILD_BUG_ON_MSG(!__same_type(*(a), *(b)), \ "type mismatch in memcat_p()"); \ (typeof(*a) *)__memcat_p((void **)(a), (void **)(b)); \ }) #ifndef __HAVE_ARCH_MEMCPY extern void * memcpy(void *,const void *,__kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMMOVE extern void * memmove(void *,const void *,__kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMSCAN extern void * memscan(void *,int,__kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMCMP extern int memcmp(const void *,const void *,__kernel_size_t); #endif #ifndef __HAVE_ARCH_BCMP extern int bcmp(const void *,const void *,__kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMCHR extern void * memchr(const void *,int,__kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt) { memcpy(dst, src, cnt); } #endif void *memchr_inv(const void *s, int c, size_t n); char *strreplace(char *str, char old, char new); /** * mem_is_zero - Check if an area of memory is all 0's. * @s: The memory area * @n: The size of the area * * Return: True if the area of memory is all 0's. 
*/ static inline bool mem_is_zero(const void *s, size_t n) { return !memchr_inv(s, 0, n); } extern void kfree_const(const void *x); extern char *kstrdup(const char *s, gfp_t gfp) __malloc; extern const char *kstrdup_const(const char *s, gfp_t gfp); extern char *kstrndup(const char *s, size_t len, gfp_t gfp); extern void *kmemdup_noprof(const void *src, size_t len, gfp_t gfp) __realloc_size(2); #define kmemdup(...) alloc_hooks(kmemdup_noprof(__VA_ARGS__)) extern void *kvmemdup(const void *src, size_t len, gfp_t gfp) __realloc_size(2); extern char *kmemdup_nul(const char *s, size_t len, gfp_t gfp); extern void *kmemdup_array(const void *src, size_t count, size_t element_size, gfp_t gfp) __realloc_size(2, 3); /* lib/argv_split.c */ extern char **argv_split(gfp_t gfp, const char *str, int *argcp); extern void argv_free(char **argv); DEFINE_FREE(argv_free, char **, if (!IS_ERR_OR_NULL(_T)) argv_free(_T)) /* lib/cmdline.c */ extern int get_option(char **str, int *pint); extern char *get_options(const char *str, int nints, int *ints); extern unsigned long long memparse(const char *ptr, char **retptr); extern bool parse_option_str(const char *str, const char *option); extern char *next_arg(char *args, char **param, char **val); extern bool sysfs_streq(const char *s1, const char *s2); int match_string(const char * const *array, size_t n, const char *string); int __sysfs_match_string(const char * const *array, size_t n, const char *s); /** * sysfs_match_string - matches given string in an array * @_a: array of strings * @_s: string to match with * * Helper for __sysfs_match_string(). Calculates the size of @a automatically. 
*/ #define sysfs_match_string(_a, _s) __sysfs_match_string(_a, ARRAY_SIZE(_a), _s) #ifdef CONFIG_BINARY_PRINTF __printf(3, 0) int vbin_printf(u32 *bin_buf, size_t size, const char *fmt, va_list args); __printf(3, 0) int bstr_printf(char *buf, size_t size, const char *fmt, const u32 *bin_buf); #endif extern ssize_t memory_read_from_buffer(void *to, size_t count, loff_t *ppos, const void *from, size_t available); int ptr_to_hashval(const void *ptr, unsigned long *hashval_out); /** * strstarts - does @str start with @prefix? * @str: string to examine * @prefix: prefix to look for. */ static inline bool strstarts(const char *str, const char *prefix) { return strncmp(str, prefix, strlen(prefix)) == 0; } size_t memweight(const void *ptr, size_t bytes); /** * memzero_explicit - Fill a region of memory (e.g. sensitive * keying data) with 0s. * @s: Pointer to the start of the area. * @count: The size of the area. * * Note: usually using memset() is just fine (!), but in cases * where clearing out _local_ data at the end of a scope is * necessary, memzero_explicit() should be used instead in * order to prevent the compiler from optimising away zeroing. * * memzero_explicit() doesn't need an arch-specific version as * it just invokes the one of memset() implicitly. */ static inline void memzero_explicit(void *s, size_t count) { memset(s, 0, count); barrier_data(s); } /** * kbasename - return the last part of a pathname. * * @path: path to extract the filename from. */ static inline const char *kbasename(const char *path) { const char *tail = strrchr(path, '/'); return tail ? 
tail + 1 : path; } #if !defined(__NO_FORTIFY) && defined(__OPTIMIZE__) && defined(CONFIG_FORTIFY_SOURCE) #include <linux/fortify-string.h> #endif #ifndef unsafe_memcpy #define unsafe_memcpy(dst, src, bytes, justification) \ memcpy(dst, src, bytes) #endif void memcpy_and_pad(void *dest, size_t dest_len, const void *src, size_t count, int pad); /** * strtomem_pad - Copy NUL-terminated string to non-NUL-terminated buffer * * @dest: Pointer of destination character array (marked as __nonstring) * @src: Pointer to NUL-terminated string * @pad: Padding character to fill any remaining bytes of @dest after copy * * This is a replacement for strncpy() uses where the destination is not * a NUL-terminated string, but with bounds checking on the source size, and * an explicit padding character. If padding is not required, use strtomem(). * * Note that the size of @dest is not an argument, as the length of @dest * must be discoverable by the compiler. */ #define strtomem_pad(dest, src, pad) do { \ const size_t _dest_len = __must_be_byte_array(dest) + \ __must_be_noncstr(dest) + \ ARRAY_SIZE(dest); \ const size_t _src_len = __must_be_cstr(src) + \ __builtin_object_size(src, 1); \ \ BUILD_BUG_ON(!__builtin_constant_p(_dest_len) || \ _dest_len == (size_t)-1); \ memcpy_and_pad(dest, _dest_len, src, \ strnlen(src, min(_src_len, _dest_len)), pad); \ } while (0) /** * strtomem - Copy NUL-terminated string to non-NUL-terminated buffer * * @dest: Pointer of destination character array (marked as __nonstring) * @src: Pointer to NUL-terminated string * * This is a replacement for strncpy() uses where the destination is not * a NUL-terminated string, but with bounds checking on the source size, and * without trailing padding. If padding is required, use strtomem_pad(). * * Note that the size of @dest is not an argument, as the length of @dest * must be discoverable by the compiler. 
*/ #define strtomem(dest, src) do { \ const size_t _dest_len = __must_be_byte_array(dest) + \ __must_be_noncstr(dest) + \ ARRAY_SIZE(dest); \ const size_t _src_len = __must_be_cstr(src) + \ __builtin_object_size(src, 1); \ \ BUILD_BUG_ON(!__builtin_constant_p(_dest_len) || \ _dest_len == (size_t)-1); \ memcpy(dest, src, strnlen(src, min(_src_len, _dest_len))); \ } while (0) /** * memtostr - Copy a possibly non-NUL-term string to a NUL-term string * @dest: Pointer to destination NUL-terminates string * @src: Pointer to character array (likely marked as __nonstring) * * This is a replacement for strncpy() uses where the source is not * a NUL-terminated string. * * Note that sizes of @dest and @src must be known at compile-time. */ #define memtostr(dest, src) do { \ const size_t _dest_len = __must_be_byte_array(dest) + \ __must_be_cstr(dest) + \ ARRAY_SIZE(dest); \ const size_t _src_len = __must_be_noncstr(src) + \ __builtin_object_size(src, 1); \ const size_t _src_chars = strnlen(src, _src_len); \ const size_t _copy_len = min(_dest_len - 1, _src_chars); \ \ BUILD_BUG_ON(!__builtin_constant_p(_dest_len) || \ !__builtin_constant_p(_src_len) || \ _dest_len == 0 || _dest_len == (size_t)-1 || \ _src_len == 0 || _src_len == (size_t)-1); \ memcpy(dest, src, _copy_len); \ dest[_copy_len] = '\0'; \ } while (0) /** * memtostr_pad - Copy a possibly non-NUL-term string to a NUL-term string * with NUL padding in the destination * @dest: Pointer to destination NUL-terminates string * @src: Pointer to character array (likely marked as __nonstring) * * This is a replacement for strncpy() uses where the source is not * a NUL-terminated string. * * Note that sizes of @dest and @src must be known at compile-time. 
*/ #define memtostr_pad(dest, src) do { \ const size_t _dest_len = __must_be_byte_array(dest) + \ __must_be_cstr(dest) + \ ARRAY_SIZE(dest); \ const size_t _src_len = __must_be_noncstr(src) + \ __builtin_object_size(src, 1); \ const size_t _src_chars = strnlen(src, _src_len); \ const size_t _copy_len = min(_dest_len - 1, _src_chars); \ \ BUILD_BUG_ON(!__builtin_constant_p(_dest_len) || \ !__builtin_constant_p(_src_len) || \ _dest_len == 0 || _dest_len == (size_t)-1 || \ _src_len == 0 || _src_len == (size_t)-1); \ memcpy(dest, src, _copy_len); \ memset(&dest[_copy_len], 0, _dest_len - _copy_len); \ } while (0) /** * memset_after - Set a value after a struct member to the end of a struct * * @obj: Address of target struct instance * @v: Byte value to repeatedly write * @member: after which struct member to start writing bytes * * This is good for clearing padding following the given member. */ #define memset_after(obj, v, member) \ ({ \ u8 *__ptr = (u8 *)(obj); \ typeof(v) __val = (v); \ memset(__ptr + offsetofend(typeof(*(obj)), member), __val, \ sizeof(*(obj)) - offsetofend(typeof(*(obj)), member)); \ }) /** * memset_startat - Set a value starting at a member to the end of a struct * * @obj: Address of target struct instance * @v: Byte value to repeatedly write * @member: struct member to start writing at * * Note that if there is padding between the prior member and the target * member, memset_after() should be used to clear the prior padding. 
*/ #define memset_startat(obj, v, member) \ ({ \ u8 *__ptr = (u8 *)(obj); \ typeof(v) __val = (v); \ memset(__ptr + offsetof(typeof(*(obj)), member), __val, \ sizeof(*(obj)) - offsetof(typeof(*(obj)), member)); \ }) /** * str_has_prefix - Test if a string has a given prefix * @str: The string to test * @prefix: The string to see if @str starts with * * A common way to test a prefix of a string is to do: * strncmp(str, prefix, sizeof(prefix) - 1) * * But this can lead to bugs due to typos, or if prefix is a pointer * and not a constant. Instead use str_has_prefix(). * * Returns: * * strlen(@prefix) if @str starts with @prefix * * 0 if @str does not start with @prefix */ static __always_inline size_t str_has_prefix(const char *str, const char *prefix) { size_t len = strlen(prefix); return strncmp(str, prefix, len) == 0 ? len : 0; } #endif /* _LINUX_STRING_H_ */ |
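The kernel-doc for str_has_prefix() above says it returns strlen(@prefix) on a match and 0 otherwise, which lets a caller test for a prefix and skip past it in one step. A minimal userspace sketch of that usage follows. It assumes a plain libc build rather than the kernel; option_value() and option_value_selfcheck() are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace restatement of the helper shown above: returns strlen(prefix)
 * when str starts with prefix, 0 otherwise.
 */
static inline size_t str_has_prefix(const char *str, const char *prefix)
{
	size_t len = strlen(prefix);

	return strncmp(str, prefix, len) == 0 ? len : 0;
}

/*
 * Hypothetical helper (not part of the kernel header): return the text
 * following "key" in "arg", or NULL when "arg" does not start with "key".
 */
const char *option_value(const char *arg, const char *key)
{
	size_t len = str_has_prefix(arg, key);

	return len ? arg + len : NULL;
}

/* Hypothetical self-check exercising the return-value contract. */
void option_value_selfcheck(void)
{
	assert(strcmp(option_value("mode=fast", "mode="), "fast") == 0);
	assert(option_value("speed=9", "mode=") == NULL);
	/* An empty prefix has length 0, so the result looks like "no match". */
	assert(str_has_prefix("abc", "") == 0);
}
```

Because the return value doubles as the match flag, an empty prefix is reported as 0. Callers that need to tell "empty prefix" from "no match" would have to check for that case themselves.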
/* * Copyright (c) 2004 Topspin Communications. All rights reserved. * * This software is available to you under a choice of one of two * licenses.
You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. */ #ifndef _CORE_PRIV_H #define _CORE_PRIV_H #include <linux/list.h> #include <linux/spinlock.h> #include <linux/cgroup_rdma.h> #include <net/net_namespace.h> #include <net/netns/generic.h> #include <rdma/ib_verbs.h> #include <rdma/opa_addr.h> #include <rdma/ib_mad.h> #include <rdma/restrack.h> #include "mad_priv.h" #include "restrack.h" /* Total number of ports combined across all struct ib_devices's */ #define RDMA_MAX_PORTS 8192 struct pkey_index_qp_list { struct list_head pkey_index_list; u16 pkey_index; /* Lock to hold while iterating the qp_list. 
*/ spinlock_t qp_list_lock; struct list_head qp_list; }; /** * struct rdma_dev_net - rdma net namespace metadata for a net * @nl_sock: Pointer to netlink socket * @net: Pointer to owner net namespace * @id: xarray id to identify the net namespace. */ struct rdma_dev_net { struct sock *nl_sock; possible_net_t net; u32 id; }; extern const struct attribute_group ib_dev_attr_group; extern bool ib_devices_shared_netns; extern unsigned int rdma_dev_net_id; static inline struct rdma_dev_net *rdma_net_to_dev_net(struct net *net) { return net_generic(net, rdma_dev_net_id); } int ib_device_rename(struct ib_device *ibdev, const char *name); int ib_device_set_dim(struct ib_device *ibdev, u8 use_dim); typedef void (*roce_netdev_callback)(struct ib_device *device, u32 port, struct net_device *idev, void *cookie); typedef bool (*roce_netdev_filter)(struct ib_device *device, u32 port, struct net_device *idev, void *cookie); struct net_device *ib_device_get_netdev(struct ib_device *ib_dev, u32 port); void ib_enum_roce_netdev(struct ib_device *ib_dev, roce_netdev_filter filter, void *filter_cookie, roce_netdev_callback cb, void *cookie); void ib_enum_all_roce_netdevs(roce_netdev_filter filter, void *filter_cookie, roce_netdev_callback cb, void *cookie); typedef int (*nldev_callback)(struct ib_device *device, struct sk_buff *skb, struct netlink_callback *cb, unsigned int idx); int ib_enum_all_devs(nldev_callback nldev_cb, struct sk_buff *skb, struct netlink_callback *cb); struct ib_client_nl_info { struct sk_buff *nl_msg; struct device *cdev; u32 port; u64 abi; }; int ib_get_client_nl_info(struct ib_device *ibdev, const char *client_name, struct ib_client_nl_info *res); enum ib_cache_gid_default_mode { IB_CACHE_GID_DEFAULT_MODE_SET, IB_CACHE_GID_DEFAULT_MODE_DELETE }; int ib_cache_gid_parse_type_str(const char *buf); const char *ib_cache_gid_type_str(enum ib_gid_type gid_type); void ib_cache_gid_set_default_gid(struct ib_device *ib_dev, u32 port, struct net_device *ndev, unsigned 
long gid_type_mask, enum ib_cache_gid_default_mode mode); int ib_cache_gid_add(struct ib_device *ib_dev, u32 port, union ib_gid *gid, struct ib_gid_attr *attr); int ib_cache_gid_del(struct ib_device *ib_dev, u32 port, union ib_gid *gid, struct ib_gid_attr *attr); int ib_cache_gid_del_all_netdev_gids(struct ib_device *ib_dev, u32 port, struct net_device *ndev); int roce_gid_mgmt_init(void); void roce_gid_mgmt_cleanup(void); unsigned long roce_gid_type_mask_support(struct ib_device *ib_dev, u32 port); int ib_cache_setup_one(struct ib_device *device); void ib_cache_cleanup_one(struct ib_device *device); void ib_cache_release_one(struct ib_device *device); void ib_dispatch_event_clients(struct ib_event *event); #ifdef CONFIG_CGROUP_RDMA void ib_device_register_rdmacg(struct ib_device *device); void ib_device_unregister_rdmacg(struct ib_device *device); int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj, struct ib_device *device, enum rdmacg_resource_type resource_index); void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj, struct ib_device *device, enum rdmacg_resource_type resource_index); #else static inline void ib_device_register_rdmacg(struct ib_device *device) { } static inline void ib_device_unregister_rdmacg(struct ib_device *device) { } static inline int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj, struct ib_device *device, enum rdmacg_resource_type resource_index) { return 0; } static inline void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj, struct ib_device *device, enum rdmacg_resource_type resource_index) { } #endif static inline bool rdma_is_upper_dev_rcu(struct net_device *dev, struct net_device *upper) { return netdev_has_upper_dev_all_rcu(dev, upper); } int addr_init(void); void addr_cleanup(void); int ib_mad_init(void); void ib_mad_cleanup(void); int ib_sa_init(void); void ib_sa_cleanup(void); void rdma_nl_init(void); void rdma_nl_exit(void); int ib_nl_handle_resolve_resp(struct sk_buff *skb, struct nlmsghdr *nlh, struct 
netlink_ext_ack *extack); int ib_nl_handle_set_timeout(struct sk_buff *skb, struct nlmsghdr *nlh, struct netlink_ext_ack *extack); int ib_nl_handle_ip_res_resp(struct sk_buff *skb, struct nlmsghdr *nlh, struct netlink_ext_ack *extack); void ib_get_cached_subnet_prefix(struct ib_device *device, u32 port_num, u64 *sn_pfx); #ifdef CONFIG_SECURITY_INFINIBAND void ib_security_release_port_pkey_list(struct ib_device *device); void ib_security_cache_change(struct ib_device *device, u32 port_num, u64 subnet_prefix); int ib_security_modify_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_udata *udata); int ib_create_qp_security(struct ib_qp *qp, struct ib_device *dev); void ib_destroy_qp_security_begin(struct ib_qp_security *sec); void ib_destroy_qp_security_abort(struct ib_qp_security *sec); void ib_destroy_qp_security_end(struct ib_qp_security *sec); int ib_open_shared_qp_security(struct ib_qp *qp, struct ib_device *dev); void ib_close_shared_qp_security(struct ib_qp_security *sec); int ib_mad_agent_security_setup(struct ib_mad_agent *agent, enum ib_qp_type qp_type); void ib_mad_agent_security_cleanup(struct ib_mad_agent *agent); int ib_mad_enforce_security(struct ib_mad_agent_private *map, u16 pkey_index); void ib_mad_agent_security_change(void); #else static inline void ib_security_release_port_pkey_list(struct ib_device *device) { } static inline void ib_security_cache_change(struct ib_device *device, u32 port_num, u64 subnet_prefix) { } static inline int ib_security_modify_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_udata *udata) { return qp->device->ops.modify_qp(qp->real_qp, qp_attr, qp_attr_mask, udata); } static inline int ib_create_qp_security(struct ib_qp *qp, struct ib_device *dev) { return 0; } static inline void ib_destroy_qp_security_begin(struct ib_qp_security *sec) { } static inline void ib_destroy_qp_security_abort(struct ib_qp_security *sec) { } static inline void 
ib_destroy_qp_security_end(struct ib_qp_security *sec) { } static inline int ib_open_shared_qp_security(struct ib_qp *qp, struct ib_device *dev) { return 0; } static inline void ib_close_shared_qp_security(struct ib_qp_security *sec) { } static inline int ib_mad_agent_security_setup(struct ib_mad_agent *agent, enum ib_qp_type qp_type) { return 0; } static inline void ib_mad_agent_security_cleanup(struct ib_mad_agent *agent) { } static inline int ib_mad_enforce_security(struct ib_mad_agent_private *map, u16 pkey_index) { return 0; } static inline void ib_mad_agent_security_change(void) { } #endif struct ib_device *ib_device_get_by_index(const struct net *net, u32 index); /* RDMA device netlink */ void nldev_init(void); void nldev_exit(void); struct ib_qp *ib_create_qp_user(struct ib_device *dev, struct ib_pd *pd, struct ib_qp_init_attr *attr, struct ib_udata *udata, struct ib_uqp_object *uobj, const char *caller); void ib_qp_usecnt_inc(struct ib_qp *qp); void ib_qp_usecnt_dec(struct ib_qp *qp); struct rdma_dev_addr; int rdma_addr_find_l2_eth_by_grh(const union ib_gid *sgid, const union ib_gid *dgid, u8 *dmac, const struct ib_gid_attr *sgid_attr, int *hoplimit); void rdma_copy_src_l2_addr(struct rdma_dev_addr *dev_addr, const struct net_device *dev); struct sa_path_rec; int roce_resolve_route_from_path(struct sa_path_rec *rec, const struct ib_gid_attr *attr); struct net_device *rdma_read_gid_attr_ndev_rcu(const struct ib_gid_attr *attr); void ib_free_port_attrs(struct ib_core_device *coredev); int ib_setup_port_attrs(struct ib_core_device *coredev); struct rdma_hw_stats *ib_get_hw_stats_port(struct ib_device *ibdev, u32 port_num); void ib_device_release_hw_stats(struct hw_stats_device_data *data); int ib_setup_device_attrs(struct ib_device *ibdev); int rdma_compatdev_set(u8 enable); int ib_port_register_client_groups(struct ib_device *ibdev, u32 port_num, const struct attribute_group **groups); void ib_port_unregister_client_groups(struct ib_device *ibdev, u32 
port_num, const struct attribute_group **groups); int ib_device_set_netns_put(struct sk_buff *skb, struct ib_device *dev, u32 ns_fd); int rdma_nl_net_init(struct rdma_dev_net *rnet); void rdma_nl_net_exit(struct rdma_dev_net *rnet); struct rdma_umap_priv { struct vm_area_struct *vma; struct list_head list; struct rdma_user_mmap_entry *entry; }; void rdma_umap_priv_init(struct rdma_umap_priv *priv, struct vm_area_struct *vma, struct rdma_user_mmap_entry *entry); void ib_cq_pool_cleanup(struct ib_device *dev); bool rdma_nl_get_privileged_qkey(void); #endif /* _CORE_PRIV_H */ |
// SPDX-License-Identifier: GPL-2.0-only #include <linux/net_tstamp.h> #include <linux/ptp_clock_kernel.h> #include "netlink.h" #include "common.h" #include "bitset.h" #include "../core/dev.h" #include "ts.h" struct tsconfig_req_info { struct ethnl_req_info base; };
struct tsconfig_reply_data { struct ethnl_reply_data base; struct hwtstamp_provider_desc hwprov_desc; struct { u32 tx_type; u32 rx_filter; u32 flags; } hwtst_config; }; #define TSCONFIG_REPDATA(__reply_base) \ container_of(__reply_base, struct tsconfig_reply_data, base) const struct nla_policy ethnl_tsconfig_get_policy[ETHTOOL_A_TSCONFIG_HEADER + 1] = { [ETHTOOL_A_TSCONFIG_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), }; static int tsconfig_prepare_data(const struct ethnl_req_info *req_base, struct ethnl_reply_data *reply_base, const struct genl_info *info) { struct tsconfig_reply_data *data = TSCONFIG_REPDATA(reply_base); struct hwtstamp_provider *hwprov = NULL; struct net_device *dev = reply_base->dev; struct kernel_hwtstamp_config cfg = {}; int ret; if (!dev->netdev_ops->ndo_hwtstamp_get) return -EOPNOTSUPP; ret = ethnl_ops_begin(dev); if (ret < 0) return ret; ret = dev_get_hwtstamp_phylib(dev, &cfg); if (ret) goto out; data->hwtst_config.tx_type = BIT(cfg.tx_type); data->hwtst_config.rx_filter = BIT(cfg.rx_filter); data->hwtst_config.flags = cfg.flags; data->hwprov_desc.index = -1; hwprov = rtnl_dereference(dev->hwprov); if (hwprov) { data->hwprov_desc.index = hwprov->desc.index; data->hwprov_desc.qualifier = hwprov->desc.qualifier; } else { struct kernel_ethtool_ts_info ts_info = {}; ts_info.phc_index = -1; ret = __ethtool_get_ts_info(dev, &ts_info); if (ret) goto out; /* Leave through "out" so ethnl_ops_complete() pairs with ethnl_ops_begin() */ if (ts_info.phc_index == -1) { ret = -ENODEV; goto out; } data->hwprov_desc.index = ts_info.phc_index; data->hwprov_desc.qualifier = ts_info.phc_qualifier; } out: ethnl_ops_complete(dev); return ret; } static int tsconfig_reply_size(const struct ethnl_req_info *req_base, const struct ethnl_reply_data *reply_base) { const struct tsconfig_reply_data *data = TSCONFIG_REPDATA(reply_base); bool compact = req_base->flags & ETHTOOL_FLAG_COMPACT_BITSETS; int len = 0; int ret; BUILD_BUG_ON(__HWTSTAMP_TX_CNT > 32); BUILD_BUG_ON(__HWTSTAMP_FILTER_CNT > 32); BUILD_BUG_ON(__HWTSTAMP_FLAG_CNT > 32); if
(data->hwtst_config.flags) { ret = ethnl_bitset32_size(&data->hwtst_config.flags, NULL, __HWTSTAMP_FLAG_CNT, ts_flags_names, compact); if (ret < 0) return ret; len += ret; /* _TSCONFIG_HWTSTAMP_FLAGS */ } if (data->hwtst_config.tx_type) { ret = ethnl_bitset32_size(&data->hwtst_config.tx_type, NULL, __HWTSTAMP_TX_CNT, ts_tx_type_names, compact); if (ret < 0) return ret; len += ret; /* _TSCONFIG_TX_TYPES */ } if (data->hwtst_config.rx_filter) { ret = ethnl_bitset32_size(&data->hwtst_config.rx_filter, NULL, __HWTSTAMP_FILTER_CNT, ts_rx_filter_names, compact); if (ret < 0) return ret; len += ret; /* _TSCONFIG_RX_FILTERS */ } if (data->hwprov_desc.index >= 0) /* _TSCONFIG_HWTSTAMP_PROVIDER */ len += nla_total_size(0) + 2 * nla_total_size(sizeof(u32)); return len; } static int tsconfig_fill_reply(struct sk_buff *skb, const struct ethnl_req_info *req_base, const struct ethnl_reply_data *reply_base) { const struct tsconfig_reply_data *data = TSCONFIG_REPDATA(reply_base); bool compact = req_base->flags & ETHTOOL_FLAG_COMPACT_BITSETS; int ret; if (data->hwtst_config.flags) { ret = ethnl_put_bitset32(skb, ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS, &data->hwtst_config.flags, NULL, __HWTSTAMP_FLAG_CNT, ts_flags_names, compact); if (ret < 0) return ret; } if (data->hwtst_config.tx_type) { ret = ethnl_put_bitset32(skb, ETHTOOL_A_TSCONFIG_TX_TYPES, &data->hwtst_config.tx_type, NULL, __HWTSTAMP_TX_CNT, ts_tx_type_names, compact); if (ret < 0) return ret; } if (data->hwtst_config.rx_filter) { ret = ethnl_put_bitset32(skb, ETHTOOL_A_TSCONFIG_RX_FILTERS, &data->hwtst_config.rx_filter, NULL, __HWTSTAMP_FILTER_CNT, ts_rx_filter_names, compact); if (ret < 0) return ret; } if (data->hwprov_desc.index >= 0) { struct nlattr *nest; nest = nla_nest_start(skb, ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER); if (!nest) return -EMSGSIZE; if (nla_put_u32(skb, ETHTOOL_A_TS_HWTSTAMP_PROVIDER_INDEX, data->hwprov_desc.index) || nla_put_u32(skb, ETHTOOL_A_TS_HWTSTAMP_PROVIDER_QUALIFIER, data->hwprov_desc.qualifier)) { 
nla_nest_cancel(skb, nest); return -EMSGSIZE; } nla_nest_end(skb, nest); } return 0; } /* TSCONFIG_SET */ const struct nla_policy ethnl_tsconfig_set_policy[ETHTOOL_A_TSCONFIG_MAX + 1] = { [ETHTOOL_A_TSCONFIG_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), [ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER] = NLA_POLICY_NESTED(ethnl_ts_hwtst_prov_policy), [ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS] = { .type = NLA_NESTED }, [ETHTOOL_A_TSCONFIG_RX_FILTERS] = { .type = NLA_NESTED }, [ETHTOOL_A_TSCONFIG_TX_TYPES] = { .type = NLA_NESTED }, }; static int tsconfig_send_reply(struct net_device *dev, struct genl_info *info) { struct tsconfig_reply_data *reply_data; struct tsconfig_req_info *req_info; struct sk_buff *rskb; void *reply_payload; int reply_len = 0; int ret; req_info = kzalloc(sizeof(*req_info), GFP_KERNEL); if (!req_info) return -ENOMEM; reply_data = kmalloc(sizeof(*reply_data), GFP_KERNEL); if (!reply_data) { kfree(req_info); return -ENOMEM; } ASSERT_RTNL(); reply_data->base.dev = dev; ret = tsconfig_prepare_data(&req_info->base, &reply_data->base, info); if (ret < 0) goto err_cleanup; ret = tsconfig_reply_size(&req_info->base, &reply_data->base); if (ret < 0) goto err_cleanup; reply_len = ret + ethnl_reply_header_size(); rskb = ethnl_reply_init(reply_len, dev, ETHTOOL_MSG_TSCONFIG_SET_REPLY, ETHTOOL_A_TSCONFIG_HEADER, info, &reply_payload); /* ret still holds the reply size here; report the allocation failure instead */ if (!rskb) { ret = -ENOMEM; goto err_cleanup; } ret = tsconfig_fill_reply(rskb, &req_info->base, &reply_data->base); if (ret < 0) goto err_cleanup; genlmsg_end(rskb, reply_payload); ret = genlmsg_reply(rskb, info); err_cleanup: kfree(reply_data); kfree(req_info); return ret; } static int ethnl_set_tsconfig_validate(struct ethnl_req_info *req_base, struct genl_info *info) { const struct net_device_ops *ops = req_base->dev->netdev_ops; if (!ops->ndo_hwtstamp_set || !ops->ndo_hwtstamp_get) return -EOPNOTSUPP; return 1; } static struct hwtstamp_provider * tsconfig_set_hwprov_from_desc(struct net_device *dev, struct genl_info *info, struct hwtstamp_provider_desc
*hwprov_desc) { struct kernel_ethtool_ts_info ts_info; struct hwtstamp_provider *hwprov; struct nlattr **tb = info->attrs; struct phy_device *phy = NULL; enum hwtstamp_source source; int ret; ret = ethtool_net_get_ts_info_by_phc(dev, &ts_info, hwprov_desc); if (!ret) { /* Found */ source = HWTSTAMP_SOURCE_NETDEV; } else { phy = ethtool_phy_get_ts_info_by_phc(dev, &ts_info, hwprov_desc); if (IS_ERR(phy)) { if (PTR_ERR(phy) == -ENODEV) NL_SET_ERR_MSG_ATTR(info->extack, tb[ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER], "phc not in this net device topology"); return ERR_CAST(phy); } source = HWTSTAMP_SOURCE_PHYLIB; } hwprov = kzalloc(sizeof(*hwprov), GFP_KERNEL); if (!hwprov) return ERR_PTR(-ENOMEM); hwprov->desc.index = hwprov_desc->index; hwprov->desc.qualifier = hwprov_desc->qualifier; hwprov->source = source; hwprov->phydev = phy; return hwprov; } static int ethnl_set_tsconfig(struct ethnl_req_info *req_base, struct genl_info *info) { struct kernel_hwtstamp_config hwtst_config = {0}; bool hwprov_mod = false, config_mod = false; struct hwtstamp_provider *hwprov = NULL; struct net_device *dev = req_base->dev; struct nlattr **tb = info->attrs; int ret; BUILD_BUG_ON(__HWTSTAMP_TX_CNT >= 32); BUILD_BUG_ON(__HWTSTAMP_FILTER_CNT >= 32); BUILD_BUG_ON(__HWTSTAMP_FLAG_CNT > 32); if (!netif_device_present(dev)) return -ENODEV; if (tb[ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER]) { struct hwtstamp_provider_desc __hwprov_desc = {.index = -1}; struct hwtstamp_provider *__hwprov; __hwprov = rtnl_dereference(dev->hwprov); if (__hwprov) { __hwprov_desc.index = __hwprov->desc.index; __hwprov_desc.qualifier = __hwprov->desc.qualifier; } ret = ts_parse_hwtst_provider(tb[ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER], &__hwprov_desc, info->extack, &hwprov_mod); if (ret < 0) return ret; if (hwprov_mod) { hwprov = tsconfig_set_hwprov_from_desc(dev, info, &__hwprov_desc); if (IS_ERR(hwprov)) return PTR_ERR(hwprov); } } /* Get current hwtstamp config if we are not changing the * hwtstamp source. 
It will be zeroed in the other case. */ if (!hwprov_mod) { ret = dev_get_hwtstamp_phylib(dev, &hwtst_config); if (ret < 0 && ret != -EOPNOTSUPP) goto err_free_hwprov; } /* Get the hwtstamp config from netlink */ if (tb[ETHTOOL_A_TSCONFIG_TX_TYPES]) { u32 req_tx_type; req_tx_type = BIT(hwtst_config.tx_type); ret = ethnl_update_bitset32(&req_tx_type, __HWTSTAMP_TX_CNT, tb[ETHTOOL_A_TSCONFIG_TX_TYPES], ts_tx_type_names, info->extack, &config_mod); if (ret < 0) goto err_free_hwprov; /* Select only one tx type at a time */ if (ffs(req_tx_type) != fls(req_tx_type)) { ret = -EINVAL; goto err_free_hwprov; } hwtst_config.tx_type = ffs(req_tx_type) - 1; } if (tb[ETHTOOL_A_TSCONFIG_RX_FILTERS]) { u32 req_rx_filter; req_rx_filter = BIT(hwtst_config.rx_filter); ret = ethnl_update_bitset32(&req_rx_filter, __HWTSTAMP_FILTER_CNT, tb[ETHTOOL_A_TSCONFIG_RX_FILTERS], ts_rx_filter_names, info->extack, &config_mod); if (ret < 0) goto err_free_hwprov; /* Select only one rx filter at a time */ if (ffs(req_rx_filter) != fls(req_rx_filter)) { ret = -EINVAL; goto err_free_hwprov; } hwtst_config.rx_filter = ffs(req_rx_filter) - 1; } if (tb[ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS]) { ret = ethnl_update_bitset32(&hwtst_config.flags, __HWTSTAMP_FLAG_CNT, tb[ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS], ts_flags_names, info->extack, &config_mod); if (ret < 0) goto err_free_hwprov; } ret = net_hwtstamp_validate(&hwtst_config); if (ret) goto err_free_hwprov; if (hwprov_mod) { struct kernel_hwtstamp_config zero_config = {0}; struct hwtstamp_provider *__hwprov; /* Disable current time stamping if we try to enable * another one */ ret = dev_set_hwtstamp_phylib(dev, &zero_config, info->extack); if (ret < 0) goto err_free_hwprov; /* Change the selected hwtstamp source */ __hwprov = rcu_replace_pointer_rtnl(dev->hwprov, hwprov); if (__hwprov) kfree_rcu(__hwprov, rcu_head); } if (config_mod) { ret = dev_set_hwtstamp_phylib(dev, &hwtst_config, info->extack); if (ret < 0) return ret; } if (hwprov_mod || config_mod) { ret 
= tsconfig_send_reply(dev, info); if (ret && ret != -EOPNOTSUPP) { NL_SET_ERR_MSG(info->extack, "error while reading the new configuration set"); return ret; } } /* tsconfig has no notification */ return 0; err_free_hwprov: kfree(hwprov); return ret; } const struct ethnl_request_ops ethnl_tsconfig_request_ops = { .request_cmd = ETHTOOL_MSG_TSCONFIG_GET, .reply_cmd = ETHTOOL_MSG_TSCONFIG_GET_REPLY, .hdr_attr = ETHTOOL_A_TSCONFIG_HEADER, .req_info_size = sizeof(struct tsconfig_req_info), .reply_data_size = sizeof(struct tsconfig_reply_data), .prepare_data = tsconfig_prepare_data, .reply_size = tsconfig_reply_size, .fill_reply = tsconfig_fill_reply, .set_validate = ethnl_set_tsconfig_validate, .set = ethnl_set_tsconfig, }; |
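The "select only one tx type at a time" and "select only one rx filter at a time" checks in ethnl_set_tsconfig() compare ffs() with fls(): the lowest and highest set bit coincide only when at most one bit is set. The sketch below restates that test in userspace C. It assumes a GCC- or Clang-compatible compiler for the `__builtin_ffs` and `__builtin_clz` builtins; ffs32(), fls32() and single_selection() are hypothetical names, and the explicit rejection of an empty mask is an addition of this sketch, not something the code above does at that point.

```c
#include <assert.h>
#include <stdint.h>

/* 1-based index of the lowest set bit, 0 when no bit is set. */
static int ffs32(uint32_t x)
{
	return __builtin_ffs((int)x);
}

/* 1-based index of the highest set bit, 0 when no bit is set. */
static int fls32(uint32_t x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

/*
 * Hypothetical helper: map a bitmask to the index of its only set bit.
 * Returns -1 when the mask is empty (assumption of this sketch) or when
 * more than one bit is set (the ffs != fls test used above).
 */
int single_selection(uint32_t mask)
{
	if (!mask || ffs32(mask) != fls32(mask))
		return -1;

	return ffs32(mask) - 1;
}

/* Hypothetical self-check. */
void single_selection_selfcheck(void)
{
	assert(single_selection(0x4) == 2);	/* only bit 2 set */
	assert(single_selection(0x6) == -1);	/* bits 1 and 2 set */
	assert(single_selection(0x0) == -1);	/* nothing selected */
}
```

Note that ffs(0) and fls(0) are both 0, so the comparison alone does not reject an empty mask. That is why this sketch tests for it separately.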
// SPDX-License-Identifier: GPL-2.0 /* * Tag allocation using scalable bitmaps. Uses active queue tracking to support * fairer distribution of tags between multiple submitters when a shared tag map * is used. * * Copyright (C) 2013-2014 Jens Axboe */ #include <linux/kernel.h> #include <linux/module.h> #include <linux/delay.h> #include "blk.h" #include "blk-mq.h" #include "blk-mq-sched.h" /* * Recalculate wakeup batch when tag is shared by hctx. */ static void blk_mq_update_wake_batch(struct blk_mq_tags *tags, unsigned int users) { if (!users) return; sbitmap_queue_recalculate_wake_batch(&tags->bitmap_tags, users); sbitmap_queue_recalculate_wake_batch(&tags->breserved_tags, users); } /* * If a previously inactive queue goes active, bump the active user count. * We need to do this before try to allocate driver tag, then even if fail * to get tag when first time, the other shared-tag users could reserve * budget for it. */ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx) { unsigned int users; unsigned long flags; struct blk_mq_tags *tags = hctx->tags; /* * calling test_bit() prior to test_and_set_bit() is intentional, * it avoids dirtying the cacheline if the queue is already active.
	 */
	if (blk_mq_is_shared_tags(hctx->flags)) {
		struct request_queue *q = hctx->queue;

		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
			return;
	} else {
		if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) ||
		    test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
			return;
	}

	spin_lock_irqsave(&tags->lock, flags);
	users = tags->active_queues + 1;
	WRITE_ONCE(tags->active_queues, users);
	blk_mq_update_wake_batch(tags, users);
	spin_unlock_irqrestore(&tags->lock, flags);
}

/*
 * Wakeup all potentially sleeping on tags
 */
void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
{
	sbitmap_queue_wake_all(&tags->bitmap_tags);
	if (include_reserve)
		sbitmap_queue_wake_all(&tags->breserved_tags);
}

/*
 * If a previously busy queue goes inactive, potential waiters could now
 * be allowed to queue. Wake them up and check.
 */
void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
{
	struct blk_mq_tags *tags = hctx->tags;
	unsigned int users;

	if (blk_mq_is_shared_tags(hctx->flags)) {
		struct request_queue *q = hctx->queue;

		if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE,
					&q->queue_flags))
			return;
	} else {
		if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
			return;
	}

	spin_lock_irq(&tags->lock);
	users = tags->active_queues - 1;
	WRITE_ONCE(tags->active_queues, users);
	blk_mq_update_wake_batch(tags, users);
	spin_unlock_irq(&tags->lock);

	blk_mq_tag_wakeup_all(tags, false);
}

static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
			    struct sbitmap_queue *bt)
{
	if (!data->q->elevator && !(data->flags & BLK_MQ_REQ_RESERVED) &&
			!hctx_may_queue(data->hctx, bt))
		return BLK_MQ_NO_TAG;

	if (data->shallow_depth)
		return sbitmap_queue_get_shallow(bt, data->shallow_depth);
	else
		return __sbitmap_queue_get(bt);
}

unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
			      unsigned int *offset)
{
	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
	struct sbitmap_queue *bt = &tags->bitmap_tags;
	unsigned long ret;
	if (data->shallow_depth || data->flags & BLK_MQ_REQ_RESERVED ||
	    data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
		return 0;
	ret = __sbitmap_queue_get_batch(bt, nr_tags, offset);
	*offset += tags->nr_reserved_tags;
	return ret;
}

unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
{
	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
	struct sbitmap_queue *bt;
	struct sbq_wait_state *ws;
	DEFINE_SBQ_WAIT(wait);
	unsigned int tag_offset;
	int tag;

	if (data->flags & BLK_MQ_REQ_RESERVED) {
		if (unlikely(!tags->nr_reserved_tags)) {
			WARN_ON_ONCE(1);
			return BLK_MQ_NO_TAG;
		}
		bt = &tags->breserved_tags;
		tag_offset = 0;
	} else {
		bt = &tags->bitmap_tags;
		tag_offset = tags->nr_reserved_tags;
	}

	tag = __blk_mq_get_tag(data, bt);
	if (tag != BLK_MQ_NO_TAG)
		goto found_tag;

	if (data->flags & BLK_MQ_REQ_NOWAIT)
		return BLK_MQ_NO_TAG;

	ws = bt_wait_ptr(bt, data->hctx);
	do {
		struct sbitmap_queue *bt_prev;

		/*
		 * We're out of tags on this hardware queue, kick any
		 * pending IO submits before going to sleep waiting for
		 * some to complete.
		 */
		blk_mq_run_hw_queue(data->hctx, false);

		/*
		 * Retry tag allocation after running the hardware queue,
		 * as running the queue may also have found completions.
		 */
		tag = __blk_mq_get_tag(data, bt);
		if (tag != BLK_MQ_NO_TAG)
			break;

		sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);

		tag = __blk_mq_get_tag(data, bt);
		if (tag != BLK_MQ_NO_TAG)
			break;

		bt_prev = bt;
		io_schedule();

		sbitmap_finish_wait(bt, ws, &wait);

		data->ctx = blk_mq_get_ctx(data->q);
		data->hctx = blk_mq_map_queue(data->cmd_flags, data->ctx);
		tags = blk_mq_tags_from_data(data);
		if (data->flags & BLK_MQ_REQ_RESERVED)
			bt = &tags->breserved_tags;
		else
			bt = &tags->bitmap_tags;

		/*
		 * If destination hw queue is changed, fake wake up on
		 * previous queue for compensating the wake up miss, so
		 * other allocations on previous queue won't be starved.
		 */
		if (bt != bt_prev)
			sbitmap_queue_wake_up(bt_prev, 1);

		ws = bt_wait_ptr(bt, data->hctx);
	} while (1);

	sbitmap_finish_wait(bt, ws, &wait);

found_tag:
	/*
	 * Give up this allocation if the hctx is inactive.  The caller will
	 * retry on an active hctx.
	 */
	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
		return BLK_MQ_NO_TAG;
	}
	return tag + tag_offset;
}

void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
		    unsigned int tag)
{
	if (!blk_mq_tag_is_reserved(tags, tag)) {
		const int real_tag = tag - tags->nr_reserved_tags;

		BUG_ON(real_tag >= tags->nr_tags);
		sbitmap_queue_clear(&tags->bitmap_tags, real_tag, ctx->cpu);
	} else {
		sbitmap_queue_clear(&tags->breserved_tags, tag, ctx->cpu);
	}
}

void blk_mq_put_tags(struct blk_mq_tags *tags, int *tag_array, int nr_tags)
{
	sbitmap_queue_clear_batch(&tags->bitmap_tags, tags->nr_reserved_tags,
					tag_array, nr_tags);
}

struct bt_iter_data {
	struct blk_mq_hw_ctx *hctx;
	struct request_queue *q;
	busy_tag_iter_fn *fn;
	void *data;
	bool reserved;
};

static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
		unsigned int bitnr)
{
	struct request *rq;
	unsigned long flags;

	spin_lock_irqsave(&tags->lock, flags);
	rq = tags->rqs[bitnr];
	if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
		rq = NULL;
	spin_unlock_irqrestore(&tags->lock, flags);
	return rq;
}

static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
{
	struct bt_iter_data *iter_data = data;
	struct blk_mq_hw_ctx *hctx = iter_data->hctx;
	struct request_queue *q = iter_data->q;
	struct blk_mq_tag_set *set = q->tag_set;
	struct blk_mq_tags *tags;
	struct request *rq;
	bool ret = true;

	if (blk_mq_is_shared_tags(set->flags))
		tags = set->shared_tags;
	else
		tags = hctx->tags;

	if (!iter_data->reserved)
		bitnr += tags->nr_reserved_tags;
	/*
	 * We can hit rq == NULL here, because the tagging functions
	 * test and set the bit before assigning ->rqs[].
	 */
	rq = blk_mq_find_and_get_req(tags, bitnr);
	if (!rq)
		return true;

	if (rq->q == q && (!hctx || rq->mq_hctx == hctx))
		ret = iter_data->fn(rq, iter_data->data);
	blk_mq_put_rq_ref(rq);
	return ret;
}

/**
 * bt_for_each - iterate over the requests associated with a hardware queue
 * @hctx:	Hardware queue to examine.
 * @q:		Request queue to examine.
 * @bt:		sbitmap to examine. This is either the breserved_tags member
 *		or the bitmap_tags member of struct blk_mq_tags.
 * @fn:		Pointer to the function that will be called for each request
 *		associated with @hctx that has been assigned a driver tag.
 *		@fn will be called as follows: @fn(rq, @data) where rq is a
 *		pointer to a request. Return true to continue iterating tags,
 *		false to stop.
 * @data:	Will be passed as second argument to @fn.
 * @reserved:	Indicates whether @bt is the breserved_tags member or the
 *		bitmap_tags member of struct blk_mq_tags.
 */
static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct request_queue *q,
			struct sbitmap_queue *bt, busy_tag_iter_fn *fn,
			void *data, bool reserved)
{
	struct bt_iter_data iter_data = {
		.hctx = hctx,
		.fn = fn,
		.data = data,
		.reserved = reserved,
		.q = q,
	};

	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
}

struct bt_tags_iter_data {
	struct blk_mq_tags *tags;
	busy_tag_iter_fn *fn;
	void *data;
	unsigned int flags;
};

#define BT_TAG_ITER_RESERVED		(1 << 0)
#define BT_TAG_ITER_STARTED		(1 << 1)
#define BT_TAG_ITER_STATIC_RQS		(1 << 2)

static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
{
	struct bt_tags_iter_data *iter_data = data;
	struct blk_mq_tags *tags = iter_data->tags;
	struct request *rq;
	bool ret = true;
	bool iter_static_rqs = !!(iter_data->flags & BT_TAG_ITER_STATIC_RQS);

	if (!(iter_data->flags & BT_TAG_ITER_RESERVED))
		bitnr += tags->nr_reserved_tags;

	/*
	 * We can hit rq == NULL here, because the tagging functions
	 * test and set the bit before assigning ->rqs[].
	 */
	if (iter_static_rqs)
		rq = tags->static_rqs[bitnr];
	else
		rq = blk_mq_find_and_get_req(tags, bitnr);
	if (!rq)
		return true;

	if (!(iter_data->flags & BT_TAG_ITER_STARTED) ||
	    blk_mq_request_started(rq))
		ret = iter_data->fn(rq, iter_data->data);
	if (!iter_static_rqs)
		blk_mq_put_rq_ref(rq);
	return ret;
}

/**
 * bt_tags_for_each - iterate over the requests in a tag map
 * @tags:	Tag map to iterate over.
 * @bt:		sbitmap to examine. This is either the breserved_tags member
 *		or the bitmap_tags member of struct blk_mq_tags.
 * @fn:		Pointer to the function that will be called for each started
 *		request. @fn will be called as follows: @fn(rq, @data) where
 *		rq is a pointer to a request. Return true to continue
 *		iterating tags, false to stop.
 * @data:	Will be passed as second argument to @fn.
 * @flags:	BT_TAG_ITER_*
 */
static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
			     busy_tag_iter_fn *fn, void *data, unsigned int flags)
{
	struct bt_tags_iter_data iter_data = {
		.tags = tags,
		.fn = fn,
		.data = data,
		.flags = flags,
	};

	if (tags->rqs)
		sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
}

static void __blk_mq_all_tag_iter(struct blk_mq_tags *tags,
		busy_tag_iter_fn *fn, void *priv, unsigned int flags)
{
	WARN_ON_ONCE(flags & BT_TAG_ITER_RESERVED);

	if (tags->nr_reserved_tags)
		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv,
				 flags | BT_TAG_ITER_RESERVED);
	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, flags);
}

/**
 * blk_mq_all_tag_iter - iterate over all requests in a tag map
 * @tags:	Tag map to iterate over.
 * @fn:		Pointer to the function that will be called for each
 *		request. @fn will be called as follows: @fn(rq, @priv) where
 *		rq is a pointer to a request. Return true to continue
 *		iterating tags, false to stop.
 * @priv:	Will be passed as second argument to @fn.
 *
 * Caller has to pass the tag map from which requests are allocated.
 */
void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
		void *priv)
{
	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
}

/**
 * blk_mq_tagset_busy_iter - iterate over all started requests in a tag set
 * @tagset:	Tag set to iterate over.
 * @fn:		Pointer to the function that will be called for each started
 *		request. @fn will be called as follows: @fn(rq, @priv) where
 *		rq is a pointer to a request. Return true to continue
 *		iterating tags, false to stop.
 * @priv:	Will be passed as second argument to @fn.
 *
 * We grab one request reference before calling @fn and release it after
 * @fn returns.
 */
void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
		busy_tag_iter_fn *fn, void *priv)
{
	unsigned int flags = tagset->flags;
	int i, nr_tags;

	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;

	for (i = 0; i < nr_tags; i++) {
		if (tagset->tags && tagset->tags[i])
			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
					      BT_TAG_ITER_STARTED);
	}
}
EXPORT_SYMBOL(blk_mq_tagset_busy_iter);

static bool blk_mq_tagset_count_completed_rqs(struct request *rq, void *data)
{
	unsigned *count = data;

	if (blk_mq_request_completed(rq))
		(*count)++;
	return true;
}

/**
 * blk_mq_tagset_wait_completed_request - Wait until all scheduled request
 * completions have finished.
 * @tagset:	Tag set to drain completed request
 *
 * Note: This function has to be run after all IO queues are shutdown
 */
void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset)
{
	while (true) {
		unsigned count = 0;

		blk_mq_tagset_busy_iter(tagset,
				blk_mq_tagset_count_completed_rqs, &count);
		if (!count)
			break;
		msleep(5);
	}
}
EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);

/**
 * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag
 * @q:		Request queue to examine.
 * @fn:		Pointer to the function that will be called for each request
 *		on @q.
 *		@fn will be called as follows: @fn(rq, @priv) where rq is a
 *		pointer to a request.
 * @priv:	Will be passed as second argument to @fn.
 *
 * Note: if @q->tag_set is shared with other request queues then @fn will be
 * called for all requests on all queues that share that tag set and not only
 * for requests associated with @q.
 */
void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
		void *priv)
{
	/*
	 * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and hctx_table
	 * while the queue is frozen. So we can use q_usage_counter to avoid
	 * racing with it.
	 */
	if (!percpu_ref_tryget(&q->q_usage_counter))
		return;

	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
		struct blk_mq_tags *tags = q->tag_set->shared_tags;
		struct sbitmap_queue *bresv = &tags->breserved_tags;
		struct sbitmap_queue *btags = &tags->bitmap_tags;

		if (tags->nr_reserved_tags)
			bt_for_each(NULL, q, bresv, fn, priv, true);
		bt_for_each(NULL, q, btags, fn, priv, false);
	} else {
		struct blk_mq_hw_ctx *hctx;
		unsigned long i;

		queue_for_each_hw_ctx(q, hctx, i) {
			struct blk_mq_tags *tags = hctx->tags;
			struct sbitmap_queue *bresv = &tags->breserved_tags;
			struct sbitmap_queue *btags = &tags->bitmap_tags;

			/*
			 * If no software queues are currently mapped to this
			 * hardware queue, there's nothing to check
			 */
			if (!blk_mq_hw_queue_mapped(hctx))
				continue;

			if (tags->nr_reserved_tags)
				bt_for_each(hctx, q, bresv, fn, priv, true);
			bt_for_each(hctx, q, btags, fn, priv, false);
		}
	}
	blk_queue_exit(q);
}

static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
		    bool round_robin, int node)
{
	return sbitmap_queue_init_node(bt, depth, -1, round_robin, GFP_KERNEL,
				       node);
}

struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
		unsigned int reserved_tags, unsigned int flags, int node)
{
	unsigned int depth = total_tags - reserved_tags;
	bool round_robin = flags & BLK_MQ_F_TAG_RR;
	struct blk_mq_tags *tags;

	if (total_tags > BLK_MQ_TAG_MAX) {
		pr_err("blk-mq: tag depth too large\n");
		return NULL;
	}

	tags = kzalloc_node(sizeof(*tags), GFP_KERNEL, node);
	if (!tags)
		return NULL;

	tags->nr_tags = total_tags;
	tags->nr_reserved_tags = reserved_tags;
	spin_lock_init(&tags->lock);
	if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
		goto out_free_tags;
	if (bt_alloc(&tags->breserved_tags, reserved_tags, round_robin, node))
		goto out_free_bitmap_tags;

	return tags;

out_free_bitmap_tags:
	sbitmap_queue_free(&tags->bitmap_tags);
out_free_tags:
	kfree(tags);
	return NULL;
}

void blk_mq_free_tags(struct blk_mq_tags *tags)
{
	sbitmap_queue_free(&tags->bitmap_tags);
	sbitmap_queue_free(&tags->breserved_tags);
	kfree(tags);
}

int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
			    struct blk_mq_tags **tagsptr, unsigned int tdepth,
			    bool can_grow)
{
	struct blk_mq_tags *tags = *tagsptr;

	if (tdepth <= tags->nr_reserved_tags)
		return -EINVAL;

	/*
	 * If we are allowed to grow beyond the original size, allocate
	 * a new set of tags before freeing the old one.
	 */
	if (tdepth > tags->nr_tags) {
		struct blk_mq_tag_set *set = hctx->queue->tag_set;
		struct blk_mq_tags *new;

		if (!can_grow)
			return -EINVAL;

		/*
		 * We need some sort of upper limit, set it high enough that
		 * no valid use cases should require more.
		 */
		if (tdepth > MAX_SCHED_RQ)
			return -EINVAL;

		/*
		 * Only the sbitmap needs resizing since we allocated the max
		 * initially.
		 */
		if (blk_mq_is_shared_tags(set->flags))
			return 0;

		new = blk_mq_alloc_map_and_rqs(set, hctx->queue_num, tdepth);
		if (!new)
			return -ENOMEM;

		blk_mq_free_map_and_rqs(set, *tagsptr, hctx->queue_num);
		*tagsptr = new;
	} else {
		/*
		 * Don't need (or can't) update reserved tags here, they
		 * remain static and should never need resizing.
		 */
		sbitmap_queue_resize(&tags->bitmap_tags,
				tdepth - tags->nr_reserved_tags);
	}

	return 0;
}

void blk_mq_tag_resize_shared_tags(struct blk_mq_tag_set *set, unsigned int size)
{
	struct blk_mq_tags *tags = set->shared_tags;

	sbitmap_queue_resize(&tags->bitmap_tags, size - set->reserved_tags);
}

void blk_mq_tag_update_sched_shared_tags(struct request_queue *q)
{
	sbitmap_queue_resize(&q->sched_shared_tags->bitmap_tags,
			     q->nr_requests - q->tag_set->reserved_tags);
}

/**
 * blk_mq_unique_tag() - return a tag that is unique queue-wide
 * @rq: request for which to compute a unique tag
 *
 * The tag field in struct request is unique per hardware queue but not over
 * all hardware queues. Hence this function that returns a tag with the
 * hardware context index in the upper bits and the per hardware queue tag in
 * the lower bits.
 *
 * Note: When called for a request that is queued on a non-multiqueue request
 * queue, the hardware context index is set to zero.
 */
u32 blk_mq_unique_tag(struct request *rq)
{
	return (rq->mq_hctx->queue_num << BLK_MQ_UNIQUE_TAG_BITS) |
		(rq->tag & BLK_MQ_UNIQUE_TAG_MASK);
}
EXPORT_SYMBOL(blk_mq_unique_tag);
/*
 * \author Rickard E. (Rik) Faith <faith@valinux.com>
 * \author Daryll Strauss <daryll@valinux.com>
 * \author Gareth Hughes <gareth@valinux.com>
 */

/*
 * Created: Mon Jan 4 08:58:31 1999 by faith@valinux.com
 *
 * Copyright 1999 Precision Insight, Inc., Cedar Park, Texas.
 * Copyright 2000 VA Linux Systems, Inc., Sunnyvale, California.
 * All Rights Reserved.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice (including the next
 * paragraph) shall be included in all copies or substantial portions of the
 * Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
 * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
 * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
 * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
 * OTHER DEALINGS IN THE SOFTWARE.
 */

#include <linux/anon_inodes.h>
#include <linux/dma-fence.h>
#include <linux/file.h>
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/poll.h>
#include <linux/slab.h>
#include <linux/vga_switcheroo.h>

#include <drm/drm_client_event.h>
#include <drm/drm_drv.h>
#include <drm/drm_file.h>
#include <drm/drm_gem.h>
#include <drm/drm_print.h>

#include "drm_crtc_internal.h"
#include "drm_internal.h"

/* from BKL pushdown */
DEFINE_MUTEX(drm_global_mutex);

bool drm_dev_needs_global_mutex(struct drm_device *dev)
{
	/*
	 * The deprecated ->load callback must be called after the driver is
	 * already registered. This means such drivers rely on the BKL to make
	 * sure an open can't proceed until the driver is actually fully set up.
	 * Similar hilarity holds for the unload callback.
	 */
	if (dev->driver->load || dev->driver->unload)
		return true;

	return false;
}

/**
 * DOC: file operations
 *
 * Drivers must define the file operations structure that forms the DRM
 * userspace API entry point, even though most of those operations are
 * implemented in the DRM core. The resulting &struct file_operations must be
 * stored in the &drm_driver.fops field. The mandatory functions are drm_open(),
 * drm_read(), drm_ioctl() and drm_compat_ioctl() if CONFIG_COMPAT is enabled.
 * Note that drm_compat_ioctl will be NULL if CONFIG_COMPAT=n, so there's no
 * need to sprinkle #ifdef into the code. Drivers which implement private ioctls
 * that require 32/64 bit compatibility support must provide their own
 * &file_operations.compat_ioctl handler that processes private ioctls and calls
 * drm_compat_ioctl() for core ioctls.
 *
 * In addition drm_read() and drm_poll() provide support for DRM events. DRM
 * events are a generic and extensible means to send asynchronous events to
 * userspace through the file descriptor. They are used by the KMS API to send
 * vblank events and page flip completions. But drivers can also use them for
 * their own needs, e.g. to signal completion of rendering.
 *
 * For the driver-side event interface see drm_event_reserve_init() and
 * drm_send_event() as the main starting points.
 *
 * The memory mapping implementation will vary depending on how the driver
 * manages memory. For GEM-based drivers this is drm_gem_mmap().
 *
 * No other file operations are supported by the DRM userspace API. Overall the
 * following is an example &file_operations structure::
 *
 *     static const struct file_operations example_drm_fops = {
 *             .owner = THIS_MODULE,
 *             .open = drm_open,
 *             .release = drm_release,
 *             .unlocked_ioctl = drm_ioctl,
 *             .compat_ioctl = drm_compat_ioctl, // NULL if CONFIG_COMPAT=n
 *             .poll = drm_poll,
 *             .read = drm_read,
 *             .mmap = drm_gem_mmap,
 *     };
 *
 * For plain GEM based drivers there is the DEFINE_DRM_GEM_FOPS() macro, and for
 * DMA based drivers there is the DEFINE_DRM_GEM_DMA_FOPS() macro to make this
 * simpler.
 *
 * The driver's &file_operations must be stored in &drm_driver.fops.
 *
 * For driver-private IOCTL handling see the more detailed discussion in
 * :ref:`IOCTL support in the userland interfaces chapter<drm_driver_ioctl>`.
 */

/**
 * drm_file_alloc - allocate file context
 * @minor: minor to allocate on
 *
 * This allocates a new DRM file context. It is not linked into any context and
 * can be used by the caller freely. Note that the context keeps a pointer to
 * @minor, so it must be freed before @minor is.
 *
 * RETURNS:
 * Pointer to newly allocated context, ERR_PTR on failure.
 */
struct drm_file *drm_file_alloc(struct drm_minor *minor)
{
	static atomic64_t ident = ATOMIC64_INIT(0);
	struct drm_device *dev = minor->dev;
	struct drm_file *file;
	int ret;

	file = kzalloc(sizeof(*file), GFP_KERNEL);
	if (!file)
		return ERR_PTR(-ENOMEM);

	/* Get a unique identifier for fdinfo: */
	file->client_id = atomic64_inc_return(&ident);
	rcu_assign_pointer(file->pid, get_pid(task_tgid(current)));
	file->minor = minor;

	/* for compatibility root is always authenticated */
	file->authenticated = capable(CAP_SYS_ADMIN);

	INIT_LIST_HEAD(&file->lhead);
	INIT_LIST_HEAD(&file->fbs);
	mutex_init(&file->fbs_lock);
	INIT_LIST_HEAD(&file->blobs);
	INIT_LIST_HEAD(&file->pending_event_list);
	INIT_LIST_HEAD(&file->event_list);
	init_waitqueue_head(&file->event_wait);
	file->event_space = 4096; /* set aside 4k for event buffer */

	spin_lock_init(&file->master_lookup_lock);
	mutex_init(&file->event_read_lock);
	mutex_init(&file->client_name_lock);

	if (drm_core_check_feature(dev, DRIVER_GEM))
		drm_gem_open(dev, file);

	if (drm_core_check_feature(dev, DRIVER_SYNCOBJ))
		drm_syncobj_open(file);

	drm_prime_init_file_private(&file->prime);

	if (dev->driver->open) {
		ret = dev->driver->open(dev, file);
		if (ret < 0)
			goto out_prime_destroy;
	}

	return file;

out_prime_destroy:
	drm_prime_destroy_file_private(&file->prime);
	if (drm_core_check_feature(dev, DRIVER_SYNCOBJ))
		drm_syncobj_release(file);
	if (drm_core_check_feature(dev, DRIVER_GEM))
		drm_gem_release(dev, file);
	put_pid(rcu_access_pointer(file->pid));
	kfree(file);

	return ERR_PTR(ret);
}

static void drm_events_release(struct drm_file *file_priv)
{
	struct drm_device *dev = file_priv->minor->dev;
	struct drm_pending_event *e, *et;
	unsigned long flags;

	spin_lock_irqsave(&dev->event_lock, flags);

	/* Unlink pending events */
	list_for_each_entry_safe(e, et, &file_priv->pending_event_list,
				 pending_link) {
		list_del(&e->pending_link);
		e->file_priv = NULL;
	}

	/* Remove unconsumed events */
	list_for_each_entry_safe(e, et, &file_priv->event_list, link) {
		list_del(&e->link);
		kfree(e);
	}

	spin_unlock_irqrestore(&dev->event_lock, flags);
}

/**
 * drm_file_free - free file context
 * @file: context to free, or NULL
 *
 * This destroys and deallocates a DRM file context previously allocated via
 * drm_file_alloc(). The caller must make sure to unlink it from any contexts
 * before calling this.
 *
 * If NULL is passed, this is a no-op.
 */
void drm_file_free(struct drm_file *file)
{
	struct drm_device *dev;

	if (!file)
		return;

	dev = file->minor->dev;

	drm_dbg_core(dev, "comm=\"%s\", pid=%d, dev=0x%lx, open_count=%d\n",
		     current->comm, task_pid_nr(current),
		     (long)old_encode_dev(file->minor->kdev->devt),
		     atomic_read(&dev->open_count));

	drm_events_release(file);

	if (drm_core_check_feature(dev, DRIVER_MODESET)) {
		drm_fb_release(file);
		drm_property_destroy_user_blobs(dev, file);
	}

	if (drm_core_check_feature(dev, DRIVER_SYNCOBJ))
		drm_syncobj_release(file);

	if (drm_core_check_feature(dev, DRIVER_GEM))
		drm_gem_release(dev, file);

	if (drm_is_primary_client(file))
		drm_master_release(file);

	if (dev->driver->postclose)
		dev->driver->postclose(dev, file);

	drm_prime_destroy_file_private(&file->prime);

	WARN_ON(!list_empty(&file->event_list));

	put_pid(rcu_access_pointer(file->pid));

	mutex_destroy(&file->client_name_lock);
	kfree(file->client_name);

	kfree(file);
}

static void drm_close_helper(struct file *filp)
{
	struct drm_file *file_priv = filp->private_data;
	struct drm_device *dev = file_priv->minor->dev;

	mutex_lock(&dev->filelist_mutex);
	list_del(&file_priv->lhead);
	mutex_unlock(&dev->filelist_mutex);

	drm_file_free(file_priv);
}

/*
 * Check whether DRI will run on this CPU.
 *
 * \return non-zero if the DRI will run on this CPU, or zero otherwise.
 */
static int drm_cpu_valid(void)
{
#if defined(__sparc__) && !defined(__sparc_v9__)
	return 0;		/* No cmpxchg before v9 sparc. */
#endif
	return 1;
}

/*
 * Called whenever a process opens a drm node
 *
 * \param filp file pointer.
 * \param minor acquired minor-object.
 * \return zero on success or a negative number on failure.
 *
 * Creates and initializes a drm_file structure for the file private data in \p
 * filp and adds it to the doubly linked list in \p dev.
 */
int drm_open_helper(struct file *filp, struct drm_minor *minor)
{
	struct drm_device *dev = minor->dev;
	struct drm_file *priv;
	int ret;

	if (filp->f_flags & O_EXCL)
		return -EBUSY;	/* No exclusive opens */
	if (!drm_cpu_valid())
		return -EINVAL;
	if (dev->switch_power_state != DRM_SWITCH_POWER_ON &&
	    dev->switch_power_state != DRM_SWITCH_POWER_DYNAMIC_OFF)
		return -EINVAL;
	if (WARN_ON_ONCE(!(filp->f_op->fop_flags & FOP_UNSIGNED_OFFSET)))
		return -EINVAL;

	drm_dbg_core(dev, "comm=\"%s\", pid=%d, minor=%d\n",
		     current->comm, task_pid_nr(current), minor->index);

	priv = drm_file_alloc(minor);
	if (IS_ERR(priv))
		return PTR_ERR(priv);

	if (drm_is_primary_client(priv)) {
		ret = drm_master_open(priv);
		if (ret) {
			drm_file_free(priv);
			return ret;
		}
	}

	filp->private_data = priv;
	priv->filp = filp;

	mutex_lock(&dev->filelist_mutex);
	list_add(&priv->lhead, &dev->filelist);
	mutex_unlock(&dev->filelist_mutex);

	return 0;
}

/**
 * drm_open - open method for DRM file
 * @inode: device inode
 * @filp: file pointer.
 *
 * This function must be used by drivers as their &file_operations.open method.
 * It looks up the correct DRM device and instantiates all the per-file
 * resources for it. It also calls the &drm_driver.open driver callback.
 *
 * RETURNS:
 * 0 on success or negative errno value on failure.
 */
int drm_open(struct inode *inode, struct file *filp)
{
	struct drm_device *dev;
	struct drm_minor *minor;
	int retcode;

	minor = drm_minor_acquire(&drm_minors_xa, iminor(inode));
	if (IS_ERR(minor))
		return PTR_ERR(minor);

	dev = minor->dev;
	if (drm_dev_needs_global_mutex(dev))
		mutex_lock(&drm_global_mutex);

	atomic_fetch_inc(&dev->open_count);

	/* share address_space across all char-devs of a single device */
	filp->f_mapping = dev->anon_inode->i_mapping;

	retcode = drm_open_helper(filp, minor);
	if (retcode)
		goto err_undo;

	if (drm_dev_needs_global_mutex(dev))
		mutex_unlock(&drm_global_mutex);

	return 0;

err_undo:
	atomic_dec(&dev->open_count);
	if (drm_dev_needs_global_mutex(dev))
		mutex_unlock(&drm_global_mutex);
	drm_minor_release(minor);
	return retcode;
}
EXPORT_SYMBOL(drm_open);

static void drm_lastclose(struct drm_device *dev)
{
	drm_client_dev_restore(dev);

	if (dev_is_pci(dev->dev))
		vga_switcheroo_process_delayed_switch();
}

/**
 * drm_release - release method for DRM file
 * @inode: device inode
 * @filp: file pointer.
 *
 * This function must be used by drivers as their &file_operations.release
 * method. It frees any resources associated with the open file. If this
 * is the last open file for the DRM device, it also restores the active
 * in-kernel DRM client.
 *
 * RETURNS:
 * Always succeeds and returns 0.
*/ int drm_release(struct inode *inode, struct file *filp) { struct drm_file *file_priv = filp->private_data; struct drm_minor *minor = file_priv->minor; struct drm_device *dev = minor->dev; if (drm_dev_needs_global_mutex(dev)) mutex_lock(&drm_global_mutex); drm_dbg_core(dev, "open_count = %d\n", atomic_read(&dev->open_count)); drm_close_helper(filp); if (atomic_dec_and_test(&dev->open_count)) drm_lastclose(dev); if (drm_dev_needs_global_mutex(dev)) mutex_unlock(&drm_global_mutex); drm_minor_release(minor); return 0; } EXPORT_SYMBOL(drm_release); void drm_file_update_pid(struct drm_file *filp) { struct drm_device *dev; struct pid *pid, *old; /* * Master nodes need to keep the original ownership in order for * drm_master_check_perm to keep working correctly. (See comment in * drm_auth.c.) */ if (filp->was_master) return; pid = task_tgid(current); /* * Quick unlocked check since the model is a single handover followed by * exclusive repeated use. */ if (pid == rcu_access_pointer(filp->pid)) return; dev = filp->minor->dev; mutex_lock(&dev->filelist_mutex); get_pid(pid); old = rcu_replace_pointer(filp->pid, pid, 1); mutex_unlock(&dev->filelist_mutex); synchronize_rcu(); put_pid(old); } /** * drm_release_noglobal - release method for DRM file * @inode: device inode * @filp: file pointer. * * This function may be used by drivers as their &file_operations.release * method. It frees any resources associated with the open file prior to taking * the drm_global_mutex. If this is the last open file for the DRM device, it * then restores the active in-kernel DRM client. * * RETURNS: * Always succeeds and returns 0. 
*/ int drm_release_noglobal(struct inode *inode, struct file *filp) { struct drm_file *file_priv = filp->private_data; struct drm_minor *minor = file_priv->minor; struct drm_device *dev = minor->dev; drm_close_helper(filp); if (atomic_dec_and_mutex_lock(&dev->open_count, &drm_global_mutex)) { drm_lastclose(dev); mutex_unlock(&drm_global_mutex); } drm_minor_release(minor); return 0; } EXPORT_SYMBOL(drm_release_noglobal); /** * drm_read - read method for DRM file * @filp: file pointer * @buffer: userspace destination pointer for the read * @count: count in bytes to read * @offset: offset to read * * This function must be used by drivers as their &file_operations.read * method if they use DRM events for asynchronous signalling to userspace. * Since events are used by the KMS API for vblank and page flip completion this * means all modern display drivers must use it. * * @offset is ignored, DRM events are read like a pipe. Polling support is * provided by drm_poll(). * * This function will only ever read a full event. Therefore userspace must * supply a big enough buffer to fit any event to ensure forward progress. Since * the maximum event space is currently 4K it's recommended to just use that for * safety. * * RETURNS: * Number of bytes read (always aligned to full events, and can be 0) or a * negative error code on failure. 
 */
ssize_t drm_read(struct file *filp, char __user *buffer,
		 size_t count, loff_t *offset)
{
	struct drm_file *file_priv = filp->private_data;
	struct drm_device *dev = file_priv->minor->dev;
	ssize_t ret;

	ret = mutex_lock_interruptible(&file_priv->event_read_lock);
	if (ret)
		return ret;

	for (;;) {
		struct drm_pending_event *e = NULL;

		/* Dequeue the oldest event, if any, and return its space. */
		spin_lock_irq(&dev->event_lock);
		if (!list_empty(&file_priv->event_list)) {
			e = list_first_entry(&file_priv->event_list,
					struct drm_pending_event, link);
			file_priv->event_space += e->event->length;
			list_del(&e->link);
		}
		spin_unlock_irq(&dev->event_lock);

		if (e == NULL) {
			/* Something was already copied: return it. */
			if (ret)
				break;

			if (filp->f_flags & O_NONBLOCK) {
				ret = -EAGAIN;
				break;
			}

			mutex_unlock(&file_priv->event_read_lock);
			ret = wait_event_interruptible(file_priv->event_wait,
						       !list_empty(&file_priv->event_list));
			if (ret >= 0)
				ret = mutex_lock_interruptible(&file_priv->event_read_lock);
			if (ret)
				return ret;
		} else {
			unsigned length = e->event->length;

			/* Only full events are delivered; requeue if it does not fit. */
			if (length > count - ret) {
put_back_event:
				spin_lock_irq(&dev->event_lock);
				file_priv->event_space -= length;
				list_add(&e->link, &file_priv->event_list);
				spin_unlock_irq(&dev->event_lock);
				wake_up_interruptible_poll(&file_priv->event_wait,
							   EPOLLIN | EPOLLRDNORM);
				break;
			}

			if (copy_to_user(buffer + ret, e->event, length)) {
				if (ret == 0)
					ret = -EFAULT;
				goto put_back_event;
			}

			ret += length;
			kfree(e);
		}
	}
	mutex_unlock(&file_priv->event_read_lock);

	return ret;
}
EXPORT_SYMBOL(drm_read);

/**
 * drm_poll - poll method for DRM file
 * @filp: file pointer
 * @wait: poll waiter table
 *
 * This function must be used by drivers as their &file_operations.poll method
 * if they use DRM events for asynchronous signalling to userspace. Since
 * events are used by the KMS API for vblank and page flip completion this means
 * all modern display drivers must use it.
 *
 * See also drm_read().
 *
 * RETURNS:
 * Mask of POLL flags indicating the current status of the file.
 */
__poll_t drm_poll(struct file *filp, struct poll_table_struct *wait)
{
	struct drm_file *file_priv = filp->private_data;
	__poll_t mask = 0;

	poll_wait(filp, &file_priv->event_wait, wait);

	if (!list_empty(&file_priv->event_list))
		mask |= EPOLLIN | EPOLLRDNORM;

	return mask;
}
EXPORT_SYMBOL(drm_poll);

/**
 * drm_event_reserve_init_locked - init a DRM event and reserve space for it
 * @dev: DRM device
 * @file_priv: DRM file private data
 * @p: tracking structure for the pending event
 * @e: actual event data to deliver to userspace
 *
 * This function prepares the passed in event for eventual delivery. If the event
 * doesn't get delivered (because the IOCTL fails later on, before queuing up
 * anything) then the event must be cancelled and freed using
 * drm_event_cancel_free(). Successfully initialized events should be sent out
 * using drm_send_event() or drm_send_event_locked() to signal completion of the
 * asynchronous event to userspace.
 *
 * If callers embedded @p into a larger structure it must be allocated with
 * kmalloc and @p must be the first member element.
 *
 * This is the locked version of drm_event_reserve_init() for callers which
 * already hold &drm_device.event_lock.
 *
 * RETURNS:
 * 0 on success or a negative error code on failure.
 */
int drm_event_reserve_init_locked(struct drm_device *dev,
				  struct drm_file *file_priv,
				  struct drm_pending_event *p,
				  struct drm_event *e)
{
	if (file_priv->event_space < e->length)
		return -ENOMEM;

	file_priv->event_space -= e->length;

	p->event = e;
	list_add(&p->pending_link, &file_priv->pending_event_list);
	p->file_priv = file_priv;

	return 0;
}
EXPORT_SYMBOL(drm_event_reserve_init_locked);

/**
 * drm_event_reserve_init - init a DRM event and reserve space for it
 * @dev: DRM device
 * @file_priv: DRM file private data
 * @p: tracking structure for the pending event
 * @e: actual event data to deliver to userspace
 *
 * This function prepares the passed in event for eventual delivery. If the event
 * doesn't get delivered (because the IOCTL fails later on, before queuing up
 * anything) then the event must be cancelled and freed using
 * drm_event_cancel_free(). Successfully initialized events should be sent out
 * using drm_send_event() or drm_send_event_locked() to signal completion of the
 * asynchronous event to userspace.
 *
 * If callers embedded @p into a larger structure it must be allocated with
 * kmalloc and @p must be the first member element.
 *
 * Callers which already hold &drm_device.event_lock should use
 * drm_event_reserve_init_locked() instead.
 *
 * RETURNS:
 * 0 on success or a negative error code on failure.
 */
int drm_event_reserve_init(struct drm_device *dev,
			   struct drm_file *file_priv,
			   struct drm_pending_event *p,
			   struct drm_event *e)
{
	unsigned long flags;
	int ret;

	spin_lock_irqsave(&dev->event_lock, flags);
	ret = drm_event_reserve_init_locked(dev, file_priv, p, e);
	spin_unlock_irqrestore(&dev->event_lock, flags);

	return ret;
}
EXPORT_SYMBOL(drm_event_reserve_init);

/**
 * drm_event_cancel_free - free a DRM event and release its space
 * @dev: DRM device
 * @p: tracking structure for the pending event
 *
 * This function frees the event @p initialized with drm_event_reserve_init()
 * and releases any allocated space. It is used to cancel an event when the
 * nonblocking operation could not be submitted and needed to be aborted.
*/ void drm_event_cancel_free(struct drm_device *dev, struct drm_pending_event *p) { unsigned long flags; spin_lock_irqsave(&dev->event_lock, flags); if (p->file_priv) { p->file_priv->event_space += p->event->length; list_del(&p->pending_link); } spin_unlock_irqrestore(&dev->event_lock, flags); if (p->fence) dma_fence_put(p->fence); kfree(p); } EXPORT_SYMBOL(drm_event_cancel_free); static void drm_send_event_helper(struct drm_device *dev, struct drm_pending_event *e, ktime_t timestamp) { assert_spin_locked(&dev->event_lock); if (e->completion) { complete_all(e->completion); e->completion_release(e->completion); e->completion = NULL; } if (e->fence) { if (timestamp) dma_fence_signal_timestamp(e->fence, timestamp); else dma_fence_signal(e->fence); dma_fence_put(e->fence); } if (!e->file_priv) { kfree(e); return; } list_del(&e->pending_link); list_add_tail(&e->link, &e->file_priv->event_list); wake_up_interruptible_poll(&e->file_priv->event_wait, EPOLLIN | EPOLLRDNORM); } /** * drm_send_event_timestamp_locked - send DRM event to file descriptor * @dev: DRM device * @e: DRM event to deliver * @timestamp: timestamp to set for the fence event in kernel's CLOCK_MONOTONIC * time domain * * This function sends the event @e, initialized with drm_event_reserve_init(), * to its associated userspace DRM file. Callers must already hold * &drm_device.event_lock. * * Note that the core will take care of unlinking and disarming events when the * corresponding DRM file is closed. Drivers need not worry about whether the * DRM file for this event still exists and can call this function upon * completion of the asynchronous work unconditionally. 
*/ void drm_send_event_timestamp_locked(struct drm_device *dev, struct drm_pending_event *e, ktime_t timestamp) { drm_send_event_helper(dev, e, timestamp); } EXPORT_SYMBOL(drm_send_event_timestamp_locked); /** * drm_send_event_locked - send DRM event to file descriptor * @dev: DRM device * @e: DRM event to deliver * * This function sends the event @e, initialized with drm_event_reserve_init(), * to its associated userspace DRM file. Callers must already hold * &drm_device.event_lock, see drm_send_event() for the unlocked version. * * Note that the core will take care of unlinking and disarming events when the * corresponding DRM file is closed. Drivers need not worry about whether the * DRM file for this event still exists and can call this function upon * completion of the asynchronous work unconditionally. */ void drm_send_event_locked(struct drm_device *dev, struct drm_pending_event *e) { drm_send_event_helper(dev, e, 0); } EXPORT_SYMBOL(drm_send_event_locked); /** * drm_send_event - send DRM event to file descriptor * @dev: DRM device * @e: DRM event to deliver * * This function sends the event @e, initialized with drm_event_reserve_init(), * to its associated userspace DRM file. This function acquires * &drm_device.event_lock, see drm_send_event_locked() for callers which already * hold this lock. * * Note that the core will take care of unlinking and disarming events when the * corresponding DRM file is closed. Drivers need not worry about whether the * DRM file for this event still exists and can call this function upon * completion of the asynchronous work unconditionally. 
*/ void drm_send_event(struct drm_device *dev, struct drm_pending_event *e) { unsigned long irqflags; spin_lock_irqsave(&dev->event_lock, irqflags); drm_send_event_helper(dev, e, 0); spin_unlock_irqrestore(&dev->event_lock, irqflags); } EXPORT_SYMBOL(drm_send_event); void drm_fdinfo_print_size(struct drm_printer *p, const char *prefix, const char *stat, const char *region, u64 sz) { const char *units[] = {"", " KiB", " MiB"}; unsigned u; for (u = 0; u < ARRAY_SIZE(units) - 1; u++) { if (sz == 0 || !IS_ALIGNED(sz, SZ_1K)) break; sz = div_u64(sz, SZ_1K); } drm_printf(p, "%s-%s-%s:\t%llu%s\n", prefix, stat, region, sz, units[u]); } EXPORT_SYMBOL(drm_fdinfo_print_size); int drm_memory_stats_is_zero(const struct drm_memory_stats *stats) { return (stats->shared == 0 && stats->private == 0 && stats->resident == 0 && stats->purgeable == 0 && stats->active == 0); } EXPORT_SYMBOL(drm_memory_stats_is_zero); /** * drm_print_memory_stats - A helper to print memory stats * @p: The printer to print output to * @stats: The collected memory stats * @supported_status: Bitmask of optional stats which are available * @region: The memory region * */ void drm_print_memory_stats(struct drm_printer *p, const struct drm_memory_stats *stats, enum drm_gem_object_status supported_status, const char *region) { const char *prefix = "drm"; drm_fdinfo_print_size(p, prefix, "total", region, stats->private + stats->shared); drm_fdinfo_print_size(p, prefix, "shared", region, stats->shared); if (supported_status & DRM_GEM_OBJECT_ACTIVE) drm_fdinfo_print_size(p, prefix, "active", region, stats->active); if (supported_status & DRM_GEM_OBJECT_RESIDENT) drm_fdinfo_print_size(p, prefix, "resident", region, stats->resident); if (supported_status & DRM_GEM_OBJECT_PURGEABLE) drm_fdinfo_print_size(p, prefix, "purgeable", region, stats->purgeable); } EXPORT_SYMBOL(drm_print_memory_stats); /** * drm_show_memory_stats - Helper to collect and show standard fdinfo memory stats * @p: the printer to print output to 
* @file: the DRM file * * Helper to iterate over GEM objects with a handle allocated in the specified * file. */ void drm_show_memory_stats(struct drm_printer *p, struct drm_file *file) { struct drm_gem_object *obj; struct drm_memory_stats status = {}; enum drm_gem_object_status supported_status = 0; int id; spin_lock(&file->table_lock); idr_for_each_entry (&file->object_idr, obj, id) { enum drm_gem_object_status s = 0; size_t add_size = (obj->funcs && obj->funcs->rss) ? obj->funcs->rss(obj) : obj->size; if (obj->funcs && obj->funcs->status) { s = obj->funcs->status(obj); supported_status |= s; } if (drm_gem_object_is_shared_for_memory_stats(obj)) status.shared += obj->size; else status.private += obj->size; if (s & DRM_GEM_OBJECT_RESIDENT) { status.resident += add_size; } else { /* If already purged or not yet backed by pages, don't * count it as purgeable: */ s &= ~DRM_GEM_OBJECT_PURGEABLE; } if (!dma_resv_test_signaled(obj->resv, dma_resv_usage_rw(true))) { status.active += add_size; supported_status |= DRM_GEM_OBJECT_ACTIVE; /* If still active, don't count as purgeable: */ s &= ~DRM_GEM_OBJECT_PURGEABLE; } if (s & DRM_GEM_OBJECT_PURGEABLE) status.purgeable += add_size; } spin_unlock(&file->table_lock); drm_print_memory_stats(p, &status, supported_status, "memory"); } EXPORT_SYMBOL(drm_show_memory_stats); /** * drm_show_fdinfo - helper for drm file fops * @m: output stream * @f: the device file instance * * Helper to implement fdinfo, for userspace to query usage stats, etc, of a * process using the GPU. See also &drm_driver.show_fdinfo. 
 *
 * For text output format description please see Documentation/gpu/drm-usage-stats.rst
 */
void drm_show_fdinfo(struct seq_file *m, struct file *f)
{
	struct drm_file *file = f->private_data;
	struct drm_device *dev = file->minor->dev;
	struct drm_printer p = drm_seq_file_printer(m);
	int idx;

	if (!drm_dev_enter(dev, &idx))
		return;

	drm_printf(&p, "drm-driver:\t%s\n", dev->driver->name);
	drm_printf(&p, "drm-client-id:\t%llu\n", file->client_id);

	if (dev_is_pci(dev->dev)) {
		struct pci_dev *pdev = to_pci_dev(dev->dev);

		drm_printf(&p, "drm-pdev:\t%04x:%02x:%02x.%d\n",
			   pci_domain_nr(pdev->bus), pdev->bus->number,
			   PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
	}

	mutex_lock(&file->client_name_lock);
	if (file->client_name)
		drm_printf(&p, "drm-client-name:\t%s\n", file->client_name);
	mutex_unlock(&file->client_name_lock);

	if (dev->driver->show_fdinfo)
		dev->driver->show_fdinfo(&p, file);

	drm_dev_exit(idx);
}
EXPORT_SYMBOL(drm_show_fdinfo);

/**
 * mock_drm_getfile - Create a new struct file for the drm device
 * @minor: drm minor to wrap (e.g. #drm_device.primary)
 * @flags: file creation mode (O_RDWR etc)
 *
 * This creates a new struct file that wraps a DRM file context around a
 * DRM minor. This mimics userspace opening e.g. /dev/dri/card0, but without
 * invoking userspace. The struct file may be operated on using its f_op
 * (the drm_device.driver.fops) to mimic userspace operations, or be supplied
 * to userspace facing functions as an internal/anonymous client.
 *
 * RETURNS:
 * Pointer to newly created struct file, ERR_PTR on failure.
*/ struct file *mock_drm_getfile(struct drm_minor *minor, unsigned int flags) { struct drm_device *dev = minor->dev; struct drm_file *priv; struct file *file; priv = drm_file_alloc(minor); if (IS_ERR(priv)) return ERR_CAST(priv); file = anon_inode_getfile("drm", dev->driver->fops, priv, flags); if (IS_ERR(file)) { drm_file_free(priv); return file; } /* Everyone shares a single global address space */ file->f_mapping = dev->anon_inode->i_mapping; drm_dev_get(dev); priv->filp = file; return file; } EXPORT_SYMBOL_FOR_TESTS_ONLY(mock_drm_getfile); |
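The unit scaling performed by drm_fdinfo_print_size() above can be tried outside the kernel. The following is a minimal userspace sketch, assuming a hypothetical helper name (format_fdinfo_size) and using snprintf in place of drm_printf; it only mirrors the loop that divides by 1024 while the value stays a non-zero multiple of 1024, and it is not a kernel API.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Userspace sketch of the scaling loop in drm_fdinfo_print_size().
 * format_fdinfo_size() is a hypothetical name used for illustration only.
 */
const char *format_fdinfo_size(uint64_t sz, char *buf, size_t len)
{
	static const char *const units[] = { "", " KiB", " MiB" };
	size_t u;

	/* Scale down only while the value is a non-zero multiple of 1024. */
	for (u = 0; u < sizeof(units) / sizeof(units[0]) - 1; u++) {
		if (sz == 0 || (sz & 1023) != 0)
			break;
		sz /= 1024;
	}

	/* The largest unit is MiB, so bigger sizes stay expressed in MiB. */
	snprintf(buf, len, "%" PRIu64 "%s", sz, units[u]);
	return buf;
}
```

Because a value is only scaled when it divides evenly, sizes that are not multiples of 1024 are printed as plain byte counts, which keeps the output exact at the cost of mixed units.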
// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright (c) 2010 Werner Fink, Jiri Slaby
 */

#include <linux/console.h>
#include <linux/kernel.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/tty_driver.h>

/*
 * This is the handler for /proc/consoles
 */
static int show_console_dev(struct seq_file *m, void *v)
{
	static const struct {
		short flag;
		char name;
	} con_flags[] = {
		{ CON_ENABLED,		'E' },
		{ CON_CONSDEV,		'C' },
		{ CON_BOOT,		'B' },
		{ CON_NBCON,		'N' },
		{ CON_PRINTBUFFER,	'p' },
		{ CON_BRL,		'b' },
		{ CON_ANYTIME,		'a' },
	};
	char flags[ARRAY_SIZE(con_flags) + 1];
	struct console *con = v;
	unsigned int a;
	dev_t dev = 0;

	if (con->device) {
		const struct tty_driver *driver;
		int index;

		/*
		 * Take console_lock to serialize device() callback with
		 * other console operations. For example, fg_console is
		 * modified under console_lock when switching vt.
		 */
		console_lock();
		driver = con->device(con, &index);
		console_unlock();

		if (driver) {
			dev = MKDEV(driver->major, driver->minor_start);
			dev += index;
		}
	}

	/* One column per known flag: its letter if set, a space otherwise. */
	for (a = 0; a < ARRAY_SIZE(con_flags); a++)
		flags[a] = (con->flags & con_flags[a].flag) ?
			con_flags[a].name : ' ';
	flags[a] = 0;

	seq_setwidth(m, 21 - 1);
	seq_printf(m, "%s%d", con->name, con->index);
	seq_pad(m, ' ');
	seq_printf(m, "%c%c%c (%s)", con->read ? 'R' : '-',
		   ((con->flags & CON_NBCON) || con->write) ? 'W' : '-',
		   con->unblank ? 'U' : '-', flags);
	if (dev)
		seq_printf(m, " %4d:%d", MAJOR(dev), MINOR(dev));

	seq_putc(m, '\n');
	return 0;
}

static void *c_start(struct seq_file *m, loff_t *pos)
	__acquires(&console_mutex)
{
	struct console *con;
	loff_t off = 0;

	/*
	 * Hold the console_list_lock to guarantee safe traversal of the
	 * console list. SRCU cannot be used because there is no
	 * place to store the SRCU cookie.
	 */
	console_list_lock();
	for_each_console(con)
		if (off++ == *pos)
			break;

	return con;
}

static void *c_next(struct seq_file *m, void *v, loff_t *pos)
{
	struct console *con = v;

	++*pos;
	return hlist_entry_safe(con->node.next, struct console, node);
}

static void c_stop(struct seq_file *m, void *v)
	__releases(&console_mutex)
{
	console_list_unlock();
}

static const struct seq_operations consoles_op = {
	.start	= c_start,
	.next	= c_next,
	.stop	= c_stop,
	.show	= show_console_dev
};

static int __init proc_consoles_init(void)
{
	proc_create_seq("consoles", 0, NULL, &consoles_op);
	return 0;
}
fs_initcall(proc_consoles_init);
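The fixed-width flag column that show_console_dev() builds can be illustrated in userspace. This is a minimal sketch under stated assumptions: the DEMO_CON_* bit values, the demo_flags table and demo_flag_string() are hypothetical stand-ins, not the kernel's CON_* definitions from <linux/console.h>, and only four of the seven flags are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits for illustration; not the kernel's CON_* values. */
enum {
	DEMO_CON_PRINTBUFFER = 1 << 0,
	DEMO_CON_CONSDEV     = 1 << 1,
	DEMO_CON_ENABLED     = 1 << 2,
	DEMO_CON_BOOT        = 1 << 3,
};

struct demo_flag {
	int flag;
	char name;
};

/* Column order follows the table, not the numeric order of the bits. */
static const struct demo_flag demo_flags[] = {
	{ DEMO_CON_ENABLED,     'E' },
	{ DEMO_CON_CONSDEV,     'C' },
	{ DEMO_CON_BOOT,        'B' },
	{ DEMO_CON_PRINTBUFFER, 'p' },
};

#define DEMO_NFLAGS (sizeof(demo_flags) / sizeof(demo_flags[0]))

/* out must have room for DEMO_NFLAGS + 1 bytes. */
const char *demo_flag_string(int flags, char *out)
{
	size_t a;

	/* Emit the flag letter when the bit is set, a space otherwise. */
	for (a = 0; a < DEMO_NFLAGS; a++)
		out[a] = (flags & demo_flags[a].flag) ? demo_flags[a].name : ' ';
	out[a] = '\0';
	return out;
}
```

Keeping a space for every unset flag means each letter always lands in the same column, so the output stays easy to parse by position.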
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * net/sched/sch_prio.c	Simple 3-band priority "scheduler".
* * Authors: Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru> * Fixes: 19990609: J Hadi Salim <hadi@nortelnetworks.com>: * Init -- EINVAL when opt undefined */ #include <linux/module.h> #include <linux/slab.h> #include <linux/types.h> #include <linux/kernel.h> #include <linux/string.h> #include <linux/errno.h> #include <linux/skbuff.h> #include <net/netlink.h> #include <net/pkt_sched.h> #include <net/pkt_cls.h> struct prio_sched_data { int bands; struct tcf_proto __rcu *filter_list; struct tcf_block *block; u8 prio2band[TC_PRIO_MAX+1]; struct Qdisc *queues[TCQ_PRIO_BANDS]; }; static struct Qdisc * prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr) { struct prio_sched_data *q = qdisc_priv(sch); u32 band = skb->priority; struct tcf_result res; struct tcf_proto *fl; int err; *qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS; if (TC_H_MAJ(skb->priority) != sch->handle) { fl = rcu_dereference_bh(q->filter_list); err = tcf_classify(skb, NULL, fl, &res, false); #ifdef CONFIG_NET_CLS_ACT switch (err) { case TC_ACT_STOLEN: case TC_ACT_QUEUED: case TC_ACT_TRAP: *qerr = NET_XMIT_SUCCESS | __NET_XMIT_STOLEN; fallthrough; case TC_ACT_SHOT: return NULL; } #endif if (!fl || err < 0) { if (TC_H_MAJ(band)) band = 0; return q->queues[q->prio2band[band & TC_PRIO_MAX]]; } band = res.classid; } band = TC_H_MIN(band) - 1; if (band >= q->bands) return q->queues[q->prio2band[0]]; return q->queues[band]; } static int prio_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free) { unsigned int len = qdisc_pkt_len(skb); struct Qdisc *qdisc; int ret; qdisc = prio_classify(skb, sch, &ret); #ifdef CONFIG_NET_CLS_ACT if (qdisc == NULL) { if (ret & __NET_XMIT_BYPASS) qdisc_qstats_drop(sch); __qdisc_drop(skb, to_free); return ret; } #endif ret = qdisc_enqueue(skb, qdisc, to_free); if (ret == NET_XMIT_SUCCESS) { sch->qstats.backlog += len; sch->q.qlen++; return NET_XMIT_SUCCESS; } if (net_xmit_drop_count(ret)) qdisc_qstats_drop(sch); return ret; } static struct sk_buff 
*prio_peek(struct Qdisc *sch) { struct prio_sched_data *q = qdisc_priv(sch); int prio; for (prio = 0; prio < q->bands; prio++) { struct Qdisc *qdisc = q->queues[prio]; struct sk_buff *skb = qdisc->ops->peek(qdisc); if (skb) return skb; } return NULL; } static struct sk_buff *prio_dequeue(struct Qdisc *sch) { struct prio_sched_data *q = qdisc_priv(sch); int prio; for (prio = 0; prio < q->bands; prio++) { struct Qdisc *qdisc = q->queues[prio]; struct sk_buff *skb = qdisc_dequeue_peeked(qdisc); if (skb) { qdisc_bstats_update(sch, skb); qdisc_qstats_backlog_dec(sch, skb); sch->q.qlen--; return skb; } } return NULL; } static void prio_reset(struct Qdisc *sch) { int prio; struct prio_sched_data *q = qdisc_priv(sch); for (prio = 0; prio < q->bands; prio++) qdisc_reset(q->queues[prio]); } static int prio_offload(struct Qdisc *sch, struct tc_prio_qopt *qopt) { struct net_device *dev = qdisc_dev(sch); struct tc_prio_qopt_offload opt = { .handle = sch->handle, .parent = sch->parent, }; if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc) return -EOPNOTSUPP; if (qopt) { opt.command = TC_PRIO_REPLACE; opt.replace_params.bands = qopt->bands; memcpy(&opt.replace_params.priomap, qopt->priomap, TC_PRIO_MAX + 1); opt.replace_params.qstats = &sch->qstats; } else { opt.command = TC_PRIO_DESTROY; } return dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_PRIO, &opt); } static void prio_destroy(struct Qdisc *sch) { int prio; struct prio_sched_data *q = qdisc_priv(sch); tcf_block_put(q->block); prio_offload(sch, NULL); for (prio = 0; prio < q->bands; prio++) qdisc_put(q->queues[prio]); } static int prio_tune(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack) { struct prio_sched_data *q = qdisc_priv(sch); struct Qdisc *queues[TCQ_PRIO_BANDS]; int oldbands = q->bands, i; struct tc_prio_qopt *qopt; if (nla_len(opt) < sizeof(*qopt)) return -EINVAL; qopt = nla_data(opt); if (qopt->bands > TCQ_PRIO_BANDS || qopt->bands < TCQ_MIN_PRIO_BANDS) return -EINVAL; for (i = 
0; i <= TC_PRIO_MAX; i++) { if (qopt->priomap[i] >= qopt->bands) return -EINVAL; } /* Before commit, make sure we can allocate all new qdiscs */ for (i = oldbands; i < qopt->bands; i++) { queues[i] = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops, TC_H_MAKE(sch->handle, i + 1), extack); if (!queues[i]) { while (i > oldbands) qdisc_put(queues[--i]); return -ENOMEM; } } prio_offload(sch, qopt); sch_tree_lock(sch); q->bands = qopt->bands; memcpy(q->prio2band, qopt->priomap, TC_PRIO_MAX+1); for (i = q->bands; i < oldbands; i++) qdisc_tree_flush_backlog(q->queues[i]); for (i = oldbands; i < q->bands; i++) { q->queues[i] = queues[i]; if (q->queues[i] != &noop_qdisc) qdisc_hash_add(q->queues[i], true); } sch_tree_unlock(sch); for (i = q->bands; i < oldbands; i++) qdisc_put(q->queues[i]); return 0; } static int prio_init(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack) { struct prio_sched_data *q = qdisc_priv(sch); int err; if (!opt) return -EINVAL; err = tcf_block_get(&q->block, &q->filter_list, sch, extack); if (err) return err; return prio_tune(sch, opt, extack); } static int prio_dump_offload(struct Qdisc *sch) { struct tc_prio_qopt_offload hw_stats = { .command = TC_PRIO_STATS, .handle = sch->handle, .parent = sch->parent, { .stats = { .bstats = &sch->bstats, .qstats = &sch->qstats, }, }, }; return qdisc_offload_dump_helper(sch, TC_SETUP_QDISC_PRIO, &hw_stats); } static int prio_dump(struct Qdisc *sch, struct sk_buff *skb) { struct prio_sched_data *q = qdisc_priv(sch); unsigned char *b = skb_tail_pointer(skb); struct tc_prio_qopt opt; int err; opt.bands = q->bands; memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX + 1); err = prio_dump_offload(sch); if (err) goto nla_put_failure; if (nla_put(skb, TCA_OPTIONS, sizeof(opt), &opt)) goto nla_put_failure; return skb->len; nla_put_failure: nlmsg_trim(skb, b); return -1; } static int prio_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new, struct Qdisc **old, struct netlink_ext_ack *extack) 
{ struct prio_sched_data *q = qdisc_priv(sch); struct tc_prio_qopt_offload graft_offload; unsigned long band = arg - 1; if (!new) { new = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops, TC_H_MAKE(sch->handle, arg), extack); if (!new) new = &noop_qdisc; else qdisc_hash_add(new, true); } *old = qdisc_replace(sch, new, &q->queues[band]); graft_offload.handle = sch->handle; graft_offload.parent = sch->parent; graft_offload.graft_params.band = band; graft_offload.graft_params.child_handle = new->handle; graft_offload.command = TC_PRIO_GRAFT; qdisc_offload_graft_helper(qdisc_dev(sch), sch, new, *old, TC_SETUP_QDISC_PRIO, &graft_offload, extack); return 0; } static struct Qdisc * prio_leaf(struct Qdisc *sch, unsigned long arg) { struct prio_sched_data *q = qdisc_priv(sch); unsigned long band = arg - 1; return q->queues[band]; } static unsigned long prio_find(struct Qdisc *sch, u32 classid) { struct prio_sched_data *q = qdisc_priv(sch); unsigned long band = TC_H_MIN(classid); if (band - 1 >= q->bands) return 0; return band; } static unsigned long prio_bind(struct Qdisc *sch, unsigned long parent, u32 classid) { return prio_find(sch, classid); } static void prio_unbind(struct Qdisc *q, unsigned long cl) { } static int prio_dump_class(struct Qdisc *sch, unsigned long cl, struct sk_buff *skb, struct tcmsg *tcm) { struct prio_sched_data *q = qdisc_priv(sch); tcm->tcm_handle |= TC_H_MIN(cl); tcm->tcm_info = q->queues[cl-1]->handle; return 0; } static int prio_dump_class_stats(struct Qdisc *sch, unsigned long cl, struct gnet_dump *d) { struct prio_sched_data *q = qdisc_priv(sch); struct Qdisc *cl_q; cl_q = q->queues[cl - 1]; if (gnet_stats_copy_basic(d, cl_q->cpu_bstats, &cl_q->bstats, true) < 0 || qdisc_qstats_copy(d, cl_q) < 0) return -1; return 0; } static void prio_walk(struct Qdisc *sch, struct qdisc_walker *arg) { struct prio_sched_data *q = qdisc_priv(sch); int prio; if (arg->stop) return; for (prio = 0; prio < q->bands; prio++) { if (!tc_qdisc_stats_dump(sch, prio + 
1, arg)) break; } } static struct tcf_block *prio_tcf_block(struct Qdisc *sch, unsigned long cl, struct netlink_ext_ack *extack) { struct prio_sched_data *q = qdisc_priv(sch); if (cl) return NULL; return q->block; } static const struct Qdisc_class_ops prio_class_ops = { .graft = prio_graft, .leaf = prio_leaf, .find = prio_find, .walk = prio_walk, .tcf_block = prio_tcf_block, .bind_tcf = prio_bind, .unbind_tcf = prio_unbind, .dump = prio_dump_class, .dump_stats = prio_dump_class_stats, }; static struct Qdisc_ops prio_qdisc_ops __read_mostly = { .next = NULL, .cl_ops = &prio_class_ops, .id = "prio", .priv_size = sizeof(struct prio_sched_data), .enqueue = prio_enqueue, .dequeue = prio_dequeue, .peek = prio_peek, .init = prio_init, .reset = prio_reset, .destroy = prio_destroy, .change = prio_tune, .dump = prio_dump, .owner = THIS_MODULE, }; MODULE_ALIAS_NET_SCH("prio"); static int __init prio_module_init(void) { return register_qdisc(&prio_qdisc_ops); } static void __exit prio_module_exit(void) { unregister_qdisc(&prio_qdisc_ops); } module_init(prio_module_init) module_exit(prio_module_exit) MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("Simple 3-band priority qdisc"); |
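The band selection in prio_classify() above, for the case where no filter is attached or classification fails, can be sketched in userspace. This is a hedged illustration: demo_prio_band() and the DEMO_* macros are hypothetical names, the masks copy the usual 16-bit major/minor split of a tc handle, and the function returns a band index instead of a struct Qdisc pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for TC_PRIO_MAX, TC_H_MAJ() and TC_H_MIN(). */
#define DEMO_TC_PRIO_MAX 15
#define DEMO_TC_H_MAJ(h) ((h) & 0xFFFF0000U)
#define DEMO_TC_H_MIN(h) ((h) & 0x0000FFFFU)

/*
 * Sketch of the no-filter path of prio_classify(): returns the index of
 * the band a packet with the given skb->priority would be queued on.
 */
uint32_t demo_prio_band(uint32_t priority, uint32_t handle, uint32_t bands,
			const uint8_t prio2band[DEMO_TC_PRIO_MAX + 1])
{
	uint32_t band = priority;

	if (DEMO_TC_H_MAJ(priority) != handle) {
		/* Priority does not name a class of this qdisc: use the priomap. */
		if (DEMO_TC_H_MAJ(band))
			band = 0;
		return prio2band[band & DEMO_TC_PRIO_MAX];
	}

	/* Priority names class handle:minor; minors are 1-based. */
	band = DEMO_TC_H_MIN(band) - 1;
	if (band >= bands)
		return prio2band[0];	/* out of range, including minor 0 */
	return band;
}
```

Since band is unsigned, a minor of 0 wraps to a very large value after the subtraction and therefore takes the same out-of-range fallback as a minor beyond the configured number of bands.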
/*
   BlueZ - Bluetooth protocol stack for Linux
   Copyright (C) 2000-2001 Qualcomm Incorporated
   Copyright (C) 2009-2010 Gustavo F.
Padovan <gustavo@padovan.org> Copyright (C) 2010 Google Inc. Written 2000,2001 by Maxim Krasnyansky <maxk@qualcomm.com> This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation; THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER(S) AND AUTHOR(S) BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. ALL LIABILITY, INCLUDING LIABILITY FOR INFRINGEMENT OF ANY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS, RELATING TO USE OF THIS SOFTWARE IS DISCLAIMED. 
*/

#ifndef __L2CAP_H
#define __L2CAP_H

#include <linux/unaligned.h>
#include <linux/atomic.h>

/* L2CAP defaults */
#define L2CAP_DEFAULT_MTU		672
#define L2CAP_DEFAULT_MIN_MTU		48
#define L2CAP_DEFAULT_FLUSH_TO		0xFFFF
#define L2CAP_EFS_DEFAULT_FLUSH_TO	0xFFFFFFFF
#define L2CAP_DEFAULT_TX_WINDOW		63
#define L2CAP_DEFAULT_EXT_WINDOW	0x3FFF
#define L2CAP_DEFAULT_MAX_TX		3
#define L2CAP_DEFAULT_RETRANS_TO	2	/* seconds */
#define L2CAP_DEFAULT_MONITOR_TO	12	/* seconds */
#define L2CAP_DEFAULT_MAX_PDU_SIZE	1492	/* Sized for AMP packet */
#define L2CAP_DEFAULT_ACK_TO		200
#define L2CAP_DEFAULT_MAX_SDU_SIZE	0xFFFF
#define L2CAP_DEFAULT_SDU_ITIME		0xFFFFFFFF
#define L2CAP_DEFAULT_ACC_LAT		0xFFFFFFFF
#define L2CAP_BREDR_MAX_PAYLOAD		1019	/* 3-DH5 packet */
#define L2CAP_LE_MIN_MTU		23
#define L2CAP_ECRED_CONN_SCID_MAX	5

#define L2CAP_DISC_TIMEOUT		msecs_to_jiffies(100)
#define L2CAP_DISC_REJ_TIMEOUT		msecs_to_jiffies(5000)
#define L2CAP_ENC_TIMEOUT		msecs_to_jiffies(5000)
#define L2CAP_CONN_TIMEOUT		msecs_to_jiffies(40000)
#define L2CAP_INFO_TIMEOUT		msecs_to_jiffies(4000)
#define L2CAP_MOVE_TIMEOUT		msecs_to_jiffies(4000)
#define L2CAP_MOVE_ERTX_TIMEOUT		msecs_to_jiffies(60000)
#define L2CAP_WAIT_ACK_POLL_PERIOD	msecs_to_jiffies(200)
#define L2CAP_WAIT_ACK_TIMEOUT		msecs_to_jiffies(10000)

/* L2CAP socket address */
struct sockaddr_l2 {
	sa_family_t	l2_family;
	__le16		l2_psm;
	bdaddr_t	l2_bdaddr;
	__le16		l2_cid;
	__u8		l2_bdaddr_type;
};

/* L2CAP socket options */
#define L2CAP_OPTIONS	0x01
struct l2cap_options {
	__u16 omtu;
	__u16 imtu;
	__u16 flush_to;
	__u8  mode;
	__u8  fcs;
	__u8  max_tx;
	__u16 txwin_size;
};

#define L2CAP_CONNINFO	0x02
struct l2cap_conninfo {
	__u16 hci_handle;
	__u8  dev_class[3];
};

#define L2CAP_LM	0x03
#define L2CAP_LM_MASTER		0x0001
#define L2CAP_LM_AUTH		0x0002
#define L2CAP_LM_ENCRYPT	0x0004
#define L2CAP_LM_TRUSTED	0x0008
#define L2CAP_LM_RELIABLE	0x0010
#define L2CAP_LM_SECURE		0x0020
#define L2CAP_LM_FIPS		0x0040

/* L2CAP command codes */
#define L2CAP_COMMAND_REJ		0x01
#define L2CAP_CONN_REQ			0x02
#define L2CAP_CONN_RSP			0x03
#define L2CAP_CONF_REQ			0x04
#define L2CAP_CONF_RSP			0x05
#define L2CAP_DISCONN_REQ		0x06
#define L2CAP_DISCONN_RSP		0x07
#define L2CAP_ECHO_REQ			0x08
#define L2CAP_ECHO_RSP			0x09
#define L2CAP_INFO_REQ			0x0a
#define L2CAP_INFO_RSP			0x0b
#define L2CAP_CONN_PARAM_UPDATE_REQ	0x12
#define L2CAP_CONN_PARAM_UPDATE_RSP	0x13
#define L2CAP_LE_CONN_REQ		0x14
#define L2CAP_LE_CONN_RSP		0x15
#define L2CAP_LE_CREDITS		0x16
#define L2CAP_ECRED_CONN_REQ		0x17
#define L2CAP_ECRED_CONN_RSP		0x18
#define L2CAP_ECRED_RECONF_REQ		0x19
#define L2CAP_ECRED_RECONF_RSP		0x1a

/* L2CAP extended feature mask */
#define L2CAP_FEAT_FLOWCTL	0x00000001
#define L2CAP_FEAT_RETRANS	0x00000002
#define L2CAP_FEAT_BIDIR_QOS	0x00000004
#define L2CAP_FEAT_ERTM		0x00000008
#define L2CAP_FEAT_STREAMING	0x00000010
#define L2CAP_FEAT_FCS		0x00000020
#define L2CAP_FEAT_EXT_FLOW	0x00000040
#define L2CAP_FEAT_FIXED_CHAN	0x00000080
#define L2CAP_FEAT_EXT_WINDOW	0x00000100
#define L2CAP_FEAT_UCD		0x00000200

/* L2CAP checksum option */
#define L2CAP_FCS_NONE		0x00
#define L2CAP_FCS_CRC16		0x01

/* L2CAP fixed channels */
#define L2CAP_FC_SIG_BREDR	0x02
#define L2CAP_FC_CONNLESS	0x04
#define L2CAP_FC_ATT		0x10
#define L2CAP_FC_SIG_LE		0x20
#define L2CAP_FC_SMP_LE		0x40
#define L2CAP_FC_SMP_BREDR	0x80

/* L2CAP Control Field bit masks */
#define L2CAP_CTRL_SAR			0xC000
#define L2CAP_CTRL_REQSEQ		0x3F00
#define L2CAP_CTRL_TXSEQ		0x007E
#define L2CAP_CTRL_SUPERVISE		0x000C

#define L2CAP_CTRL_RETRANS		0x0080
#define L2CAP_CTRL_FINAL		0x0080
#define L2CAP_CTRL_POLL			0x0010
#define L2CAP_CTRL_FRAME_TYPE		0x0001 /* I- or S-Frame */

#define L2CAP_CTRL_TXSEQ_SHIFT		1
#define L2CAP_CTRL_SUPER_SHIFT		2
#define L2CAP_CTRL_POLL_SHIFT		4
#define L2CAP_CTRL_FINAL_SHIFT		7
#define L2CAP_CTRL_REQSEQ_SHIFT		8
#define L2CAP_CTRL_SAR_SHIFT		14

/* L2CAP Extended Control Field bit mask */
#define L2CAP_EXT_CTRL_TXSEQ		0xFFFC0000
#define L2CAP_EXT_CTRL_SAR		0x00030000
#define L2CAP_EXT_CTRL_SUPERVISE	0x00030000
#define L2CAP_EXT_CTRL_REQSEQ		0x0000FFFC

#define L2CAP_EXT_CTRL_POLL		0x00040000
#define L2CAP_EXT_CTRL_FINAL		0x00000002
#define L2CAP_EXT_CTRL_FRAME_TYPE	0x00000001 /* I- or S-Frame */

#define L2CAP_EXT_CTRL_FINAL_SHIFT	1
#define L2CAP_EXT_CTRL_REQSEQ_SHIFT	2
#define L2CAP_EXT_CTRL_SAR_SHIFT	16
#define L2CAP_EXT_CTRL_SUPER_SHIFT	16
#define L2CAP_EXT_CTRL_POLL_SHIFT	18
#define L2CAP_EXT_CTRL_TXSEQ_SHIFT	18

/* L2CAP Supervisory Function */
#define L2CAP_SUPER_RR		0x00
#define L2CAP_SUPER_REJ		0x01
#define L2CAP_SUPER_RNR		0x02
#define L2CAP_SUPER_SREJ	0x03

/* L2CAP Segmentation and Reassembly */
#define L2CAP_SAR_UNSEGMENTED	0x00
#define L2CAP_SAR_START		0x01
#define L2CAP_SAR_END		0x02
#define L2CAP_SAR_CONTINUE	0x03

/* L2CAP Command rej. reasons */
#define L2CAP_REJ_NOT_UNDERSTOOD	0x0000
#define L2CAP_REJ_MTU_EXCEEDED		0x0001
#define L2CAP_REJ_INVALID_CID		0x0002

/* L2CAP structures */
struct l2cap_hdr {
	__le16 len;
	__le16 cid;
} __packed;
#define L2CAP_LEN_SIZE		2
#define L2CAP_HDR_SIZE		4
#define L2CAP_ENH_HDR_SIZE	6
#define L2CAP_EXT_HDR_SIZE	8

#define L2CAP_FCS_SIZE		2
#define L2CAP_SDULEN_SIZE	2
#define L2CAP_PSMLEN_SIZE	2
#define L2CAP_ENH_CTRL_SIZE	2
#define L2CAP_EXT_CTRL_SIZE	4

struct l2cap_cmd_hdr {
	__u8   code;
	__u8   ident;
	__le16 len;
} __packed;
#define L2CAP_CMD_HDR_SIZE	4

struct l2cap_cmd_rej_unk {
	__le16 reason;
} __packed;

struct l2cap_cmd_rej_mtu {
	__le16 reason;
	__le16 max_mtu;
} __packed;

struct l2cap_cmd_rej_cid {
	__le16 reason;
	__le16 scid;
	__le16 dcid;
} __packed;

struct l2cap_conn_req {
	__le16 psm;
	__le16 scid;
} __packed;

struct l2cap_conn_rsp {
	__le16 dcid;
	__le16 scid;
	__le16 result;
	__le16 status;
} __packed;

/* protocol/service multiplexer (PSM) */
#define L2CAP_PSM_SDP		0x0001
#define L2CAP_PSM_RFCOMM	0x0003
#define L2CAP_PSM_3DSP		0x0021
#define L2CAP_PSM_IPSP		0x0023 /* 6LoWPAN */

#define L2CAP_PSM_DYN_START	0x1001
#define L2CAP_PSM_DYN_END	0xffff
#define L2CAP_PSM_AUTO_END	0x10ff
#define L2CAP_PSM_LE_DYN_START	0x0080
#define L2CAP_PSM_LE_DYN_END	0x00ff

/* channel identifier */
#define L2CAP_CID_SIGNALING	0x0001
#define L2CAP_CID_CONN_LESS	0x0002
#define L2CAP_CID_ATT		0x0004
#define L2CAP_CID_LE_SIGNALING	0x0005
#define L2CAP_CID_SMP		0x0006
#define L2CAP_CID_SMP_BREDR	0x0007
#define L2CAP_CID_DYN_START	0x0040
#define L2CAP_CID_DYN_END	0xffff
#define L2CAP_CID_LE_DYN_END	0x007f

/* connect/create channel results */
#define L2CAP_CR_SUCCESS	0x0000
#define L2CAP_CR_PEND		0x0001
#define L2CAP_CR_BAD_PSM	0x0002
#define L2CAP_CR_SEC_BLOCK	0x0003
#define L2CAP_CR_NO_MEM		0x0004
#define L2CAP_CR_INVALID_SCID	0x0006
#define L2CAP_CR_SCID_IN_USE	0x0007

/* credit based connect results */
#define L2CAP_CR_LE_SUCCESS		0x0000
#define L2CAP_CR_LE_BAD_PSM		0x0002
#define L2CAP_CR_LE_NO_MEM		0x0004
#define L2CAP_CR_LE_AUTHENTICATION	0x0005
#define L2CAP_CR_LE_AUTHORIZATION	0x0006
#define L2CAP_CR_LE_BAD_KEY_SIZE	0x0007
#define L2CAP_CR_LE_ENCRYPTION		0x0008
#define L2CAP_CR_LE_INVALID_SCID	0x0009
#define L2CAP_CR_LE_SCID_IN_USE		0x000A
#define L2CAP_CR_LE_UNACCEPT_PARAMS	0x000B
#define L2CAP_CR_LE_INVALID_PARAMS	0x000C

/* connect/create channel status */
#define L2CAP_CS_NO_INFO	0x0000
#define L2CAP_CS_AUTHEN_PEND	0x0001
#define L2CAP_CS_AUTHOR_PEND	0x0002

struct l2cap_conf_req {
	__le16 dcid;
	__le16 flags;
	__u8   data[];
} __packed;

struct l2cap_conf_rsp {
	__le16 scid;
	__le16 flags;
	__le16 result;
	__u8   data[];
} __packed;

#define L2CAP_CONF_SUCCESS	0x0000
#define L2CAP_CONF_UNACCEPT	0x0001
#define L2CAP_CONF_REJECT	0x0002
#define L2CAP_CONF_UNKNOWN	0x0003
#define L2CAP_CONF_PENDING	0x0004
#define L2CAP_CONF_EFS_REJECT	0x0005

/* configuration req/rsp continuation flag */
#define L2CAP_CONF_FLAG_CONTINUATION	0x0001

struct l2cap_conf_opt {
	__u8 type;
	__u8 len;
	__u8 val[];
} __packed;
#define L2CAP_CONF_OPT_SIZE	2

#define L2CAP_CONF_HINT		0x80
#define L2CAP_CONF_MASK		0x7f

#define L2CAP_CONF_MTU		0x01
#define L2CAP_CONF_FLUSH_TO	0x02
#define L2CAP_CONF_QOS		0x03
#define L2CAP_CONF_RFC		0x04
#define L2CAP_CONF_FCS		0x05
#define L2CAP_CONF_EFS		0x06
#define L2CAP_CONF_EWS		0x07

#define L2CAP_CONF_MAX_SIZE	22

struct l2cap_conf_rfc {
	__u8   mode;
	__u8   txwin_size;
	__u8   max_transmit;
	__le16 retrans_timeout;
	__le16 monitor_timeout;
	__le16 max_pdu_size;
} __packed;

#define L2CAP_MODE_BASIC	0x00
#define L2CAP_MODE_RETRANS	0x01
#define L2CAP_MODE_FLOWCTL	0x02
#define L2CAP_MODE_ERTM		0x03
#define L2CAP_MODE_STREAMING	0x04

/* Unlike the above this one doesn't actually map to anything that would
 * ever be sent over the air. Therefore, use a value that's unlikely to
 * ever be used in the BR/EDR configuration phase.
 */
#define L2CAP_MODE_LE_FLOWCTL	0x80
#define L2CAP_MODE_EXT_FLOWCTL	0x81

struct l2cap_conf_efs {
	__u8   id;
	__u8   stype;
	__le16 msdu;
	__le32 sdu_itime;
	__le32 acc_lat;
	__le32 flush_to;
} __packed;

#define L2CAP_SERV_NOTRAFIC	0x00
#define L2CAP_SERV_BESTEFFORT	0x01
#define L2CAP_SERV_GUARANTEED	0x02

#define L2CAP_BESTEFFORT_ID	0x01

struct l2cap_disconn_req {
	__le16 dcid;
	__le16 scid;
} __packed;

struct l2cap_disconn_rsp {
	__le16 dcid;
	__le16 scid;
} __packed;

struct l2cap_info_req {
	__le16 type;
} __packed;

struct l2cap_info_rsp {
	__le16 type;
	__le16 result;
	__u8   data[];
} __packed;

#define L2CAP_MR_SUCCESS	0x0000
#define L2CAP_MR_PEND		0x0001
#define L2CAP_MR_BAD_ID		0x0002
#define L2CAP_MR_SAME_ID	0x0003
#define L2CAP_MR_NOT_SUPP	0x0004
#define L2CAP_MR_COLLISION	0x0005
#define L2CAP_MR_NOT_ALLOWED	0x0006

struct l2cap_move_chan_cfm {
	__le16 icid;
	__le16 result;
} __packed;

#define L2CAP_MC_CONFIRMED	0x0000
#define L2CAP_MC_UNCONFIRMED	0x0001

struct l2cap_move_chan_cfm_rsp {
	__le16 icid;
} __packed;

/* info type */
#define L2CAP_IT_CL_MTU		0x0001
#define L2CAP_IT_FEAT_MASK	0x0002
#define L2CAP_IT_FIXED_CHAN	0x0003

/* info result */
#define L2CAP_IR_SUCCESS	0x0000
#define L2CAP_IR_NOTSUPP	0x0001

struct l2cap_conn_param_update_req {
	__le16 min;
	__le16 max;
	__le16 latency;
	__le16 to_multiplier;
} __packed;

struct l2cap_conn_param_update_rsp {
	__le16 result;
} __packed;

/* Connection Parameters result */
#define L2CAP_CONN_PARAM_ACCEPTED	0x0000
#define L2CAP_CONN_PARAM_REJECTED	0x0001

struct l2cap_le_conn_req {
	__le16 psm;
	__le16 scid;
	__le16 mtu;
	__le16 mps;
	__le16 credits;
} __packed;

struct l2cap_le_conn_rsp {
	__le16 dcid;
	__le16 mtu;
	__le16 mps;
	__le16 credits;
	__le16 result;
} __packed;

struct l2cap_le_credits {
	__le16 cid;
	__le16 credits;
} __packed;

#define L2CAP_ECRED_MIN_MTU	64
#define L2CAP_ECRED_MIN_MPS	64
#define L2CAP_ECRED_MAX_CID	5

struct l2cap_ecred_conn_req {
	/* New members must be added within the struct_group() macro below. */
	__struct_group(l2cap_ecred_conn_req_hdr, hdr, __packed,
		__le16 psm;
		__le16 mtu;
		__le16 mps;
		__le16 credits;
	);
	__le16 scid[];
} __packed;

struct l2cap_ecred_conn_rsp {
	/* New members must be added within the struct_group() macro below. */
	struct_group_tagged(l2cap_ecred_conn_rsp_hdr, hdr,
		__le16 mtu;
		__le16 mps;
		__le16 credits;
		__le16 result;
	);
	__le16 dcid[];
};

struct l2cap_ecred_reconf_req {
	__le16 mtu;
	__le16 mps;
	__le16 scid[];
} __packed;

#define L2CAP_RECONF_SUCCESS		0x0000
#define L2CAP_RECONF_INVALID_MTU	0x0001
#define L2CAP_RECONF_INVALID_MPS	0x0002

struct l2cap_ecred_reconf_rsp {
	__le16 result;
} __packed;

/* ----- L2CAP channels and connections ----- */
struct l2cap_seq_list {
	__u16 head;
	__u16 tail;
	__u16 mask;
	__u16 *list;
};

#define L2CAP_SEQ_LIST_CLEAR	0xFFFF
#define L2CAP_SEQ_LIST_TAIL	0x8000

struct l2cap_chan {
	struct l2cap_conn *conn;
	struct kref kref;
	atomic_t nesting;

	__u8 state;

	bdaddr_t dst;
	__u8 dst_type;
	bdaddr_t src;
	__u8 src_type;
	__le16 psm;
	__le16 sport;
	__u16 dcid;
	__u16 scid;

	__u16 imtu;
	__u16 omtu;
	__u16 flush_to;
	__u8 mode;
	__u8 chan_type;
	__u8 chan_policy;

	__u8 sec_level;

	__u8 ident;

	__u8 conf_req[64];
	__u8 conf_len;
	__u8 num_conf_req;
	__u8 num_conf_rsp;

	__u8 fcs;

	__u16 tx_win;
	__u16 tx_win_max;
	__u16 ack_win;
	__u8 max_tx;
	__u16 retrans_timeout;
	__u16 monitor_timeout;
	__u16 mps;

	__u16 tx_credits;
	__u16 rx_credits;

	/* estimated available receive buffer space or -1 if unknown */
	ssize_t rx_avail;

	__u8 tx_state;
	__u8 rx_state;

	unsigned long conf_state;
	unsigned long conn_state;
	unsigned long flags;

	__u16 next_tx_seq;
	__u16 expected_ack_seq;
	__u16 expected_tx_seq;
	__u16 buffer_seq;
	__u16 srej_save_reqseq;
	__u16 last_acked_seq;
	__u16 frames_sent;
	__u16 unacked_frames;
	__u8 retry_count;
	__u16 sdu_len;
	struct sk_buff *sdu;
	struct sk_buff *sdu_last_frag;

	__u16 remote_tx_win;
	__u8 remote_max_tx;
	__u16 remote_mps;

	__u8 local_id;
	__u8 local_stype;
	__u16 local_msdu;
	__u32 local_sdu_itime;
	__u32 local_acc_lat;
	__u32 local_flush_to;

	__u8 remote_id;
	__u8 remote_stype;
	__u16 remote_msdu;
	__u32 remote_sdu_itime;
	__u32 remote_acc_lat;
	__u32 remote_flush_to;

	struct delayed_work chan_timer;
	struct delayed_work retrans_timer;
	struct delayed_work monitor_timer;
	struct delayed_work ack_timer;

	struct sk_buff *tx_send_head;
	struct sk_buff_head tx_q;
	struct sk_buff_head srej_q;
	struct l2cap_seq_list srej_list;
	struct l2cap_seq_list retrans_list;

	struct list_head list;
	struct list_head global_l;

	void *data;
	const struct l2cap_ops *ops;
	struct mutex lock;
};

struct l2cap_ops {
	char *name;

	struct l2cap_chan *(*new_connection)(struct l2cap_chan *chan);
	int (*recv)(struct l2cap_chan *chan, struct sk_buff *skb);
	void (*teardown)(struct l2cap_chan *chan, int err);
	void (*close)(struct l2cap_chan *chan);
	void (*state_change)(struct l2cap_chan *chan, int state, int err);
	void (*ready)(struct l2cap_chan *chan);
	void (*defer)(struct l2cap_chan *chan);
	void (*resume)(struct l2cap_chan *chan);
	void (*suspend)(struct l2cap_chan *chan);
	void (*set_shutdown)(struct l2cap_chan *chan);
	long (*get_sndtimeo)(struct l2cap_chan *chan);
	struct pid *(*get_peer_pid)(struct l2cap_chan *chan);
	struct sk_buff *(*alloc_skb)(struct l2cap_chan *chan,
				     unsigned long hdr_len,
				     unsigned long len, int nb);
	int (*filter)(struct l2cap_chan *chan, struct sk_buff *skb);
};

struct l2cap_conn {
	struct hci_conn *hcon;
	struct hci_chan *hchan;

	unsigned int mtu;

	__u32 feat_mask;
	__u8 remote_fixed_chan;
	__u8 local_fixed_chan;

	__u8 info_state;
	__u8 info_ident;

	struct delayed_work info_timer;

	struct sk_buff *rx_skb;
	__u32 rx_len;
	__u8 tx_ident;
	struct mutex ident_lock;

	struct sk_buff_head pending_rx;
	struct work_struct pending_rx_work;

	struct delayed_work id_addr_timer;

	__u8 disc_reason;

	struct l2cap_chan *smp;

	struct list_head chan_l;
	struct mutex lock;
	struct kref ref;
	struct list_head users;
};

struct l2cap_user {
	struct list_head list;
	int (*probe)(struct l2cap_conn *conn, struct l2cap_user *user);
	void (*remove)(struct l2cap_conn *conn, struct l2cap_user *user);
};

#define L2CAP_INFO_CL_MTU_REQ_SENT	0x01
#define L2CAP_INFO_FEAT_MASK_REQ_SENT	0x04
#define L2CAP_INFO_FEAT_MASK_REQ_DONE	0x08

#define L2CAP_CHAN_RAW			1
#define L2CAP_CHAN_CONN_LESS		2
#define L2CAP_CHAN_CONN_ORIENTED	3
#define L2CAP_CHAN_FIXED		4

/* ----- L2CAP socket info ----- */
#define l2cap_pi(sk) ((struct l2cap_pinfo *) sk)

struct l2cap_rx_busy {
	struct list_head list;
	struct sk_buff *skb;
};

struct l2cap_pinfo {
	struct bt_sock bt;
	struct l2cap_chan *chan;
	struct list_head rx_busy;
};

enum {
	CONF_REQ_SENT,
	CONF_INPUT_DONE,
	CONF_OUTPUT_DONE,
	CONF_MTU_DONE,
	CONF_MODE_DONE,
	CONF_CONNECT_PEND,
	CONF_RECV_NO_FCS,
	CONF_STATE2_DEVICE,
	CONF_EWS_RECV,
	CONF_LOC_CONF_PEND,
	CONF_REM_CONF_PEND,
	CONF_NOT_COMPLETE,
};

#define L2CAP_CONF_MAX_CONF_REQ	2
#define L2CAP_CONF_MAX_CONF_RSP	2

enum {
	CONN_SREJ_SENT,
	CONN_WAIT_F,
	CONN_SREJ_ACT,
	CONN_SEND_PBIT,
	CONN_REMOTE_BUSY,
	CONN_LOCAL_BUSY,
	CONN_REJ_ACT,
	CONN_SEND_FBIT,
	CONN_RNR_SENT,
};

/* Definitions for flags in l2cap_chan */
enum {
	FLAG_ROLE_SWITCH,
	FLAG_FORCE_ACTIVE,
	FLAG_FORCE_RELIABLE,
	FLAG_FLUSHABLE,
	FLAG_EXT_CTRL,
	FLAG_EFS_ENABLE,
	FLAG_DEFER_SETUP,
	FLAG_LE_CONN_REQ_SENT,
	FLAG_ECRED_CONN_REQ_SENT,
	FLAG_PENDING_SECURITY,
	FLAG_HOLD_HCI_CONN,
};

/* Lock nesting levels for L2CAP channels. We need these because lockdep
 * otherwise considers all channels equal and will e.g. complain about a
 * connection oriented channel triggering SMP procedures or a listening
 * channel creating and locking a child channel.
 */
enum {
	L2CAP_NESTING_SMP,
	L2CAP_NESTING_NORMAL,
	L2CAP_NESTING_PARENT,
};

enum {
	L2CAP_TX_STATE_XMIT,
	L2CAP_TX_STATE_WAIT_F,
};

enum {
	L2CAP_RX_STATE_RECV,
	L2CAP_RX_STATE_SREJ_SENT,
	L2CAP_RX_STATE_MOVE,
	L2CAP_RX_STATE_WAIT_P,
	L2CAP_RX_STATE_WAIT_F,
};

enum {
	L2CAP_TXSEQ_EXPECTED,
	L2CAP_TXSEQ_EXPECTED_SREJ,
	L2CAP_TXSEQ_UNEXPECTED,
	L2CAP_TXSEQ_UNEXPECTED_SREJ,
	L2CAP_TXSEQ_DUPLICATE,
	L2CAP_TXSEQ_DUPLICATE_SREJ,
	L2CAP_TXSEQ_INVALID,
	L2CAP_TXSEQ_INVALID_IGNORE,
};

enum {
	L2CAP_EV_DATA_REQUEST,
	L2CAP_EV_LOCAL_BUSY_DETECTED,
	L2CAP_EV_LOCAL_BUSY_CLEAR,
	L2CAP_EV_RECV_REQSEQ_AND_FBIT,
	L2CAP_EV_RECV_FBIT,
	L2CAP_EV_RETRANS_TO,
	L2CAP_EV_MONITOR_TO,
	L2CAP_EV_EXPLICIT_POLL,
	L2CAP_EV_RECV_IFRAME,
	L2CAP_EV_RECV_RR,
	L2CAP_EV_RECV_REJ,
	L2CAP_EV_RECV_RNR,
	L2CAP_EV_RECV_SREJ,
	L2CAP_EV_RECV_FRAME,
};

enum {
	L2CAP_MOVE_ROLE_NONE,
	L2CAP_MOVE_ROLE_INITIATOR,
	L2CAP_MOVE_ROLE_RESPONDER,
};

enum {
	L2CAP_MOVE_STABLE,
	L2CAP_MOVE_WAIT_REQ,
	L2CAP_MOVE_WAIT_RSP,
	L2CAP_MOVE_WAIT_RSP_SUCCESS,
	L2CAP_MOVE_WAIT_CONFIRM,
	L2CAP_MOVE_WAIT_CONFIRM_RSP,
	L2CAP_MOVE_WAIT_LOGICAL_COMP,
	L2CAP_MOVE_WAIT_LOGICAL_CFM,
	L2CAP_MOVE_WAIT_LOCAL_BUSY,
	L2CAP_MOVE_WAIT_PREPARE,
};

void l2cap_chan_hold(struct l2cap_chan *c);
struct l2cap_chan *l2cap_chan_hold_unless_zero(struct l2cap_chan *c);
void l2cap_chan_put(struct l2cap_chan *c);

static inline void l2cap_chan_lock(struct l2cap_chan *chan)
{
	mutex_lock_nested(&chan->lock, atomic_read(&chan->nesting));
}

static inline void l2cap_chan_unlock(struct l2cap_chan *chan)
{
	mutex_unlock(&chan->lock);
}

static inline void l2cap_set_timer(struct l2cap_chan *chan,
				   struct delayed_work *work, long timeout)
{
	BT_DBG("chan %p state %s timeout %ld", chan,
	       state_to_string(chan->state), timeout);

	/* If delayed work cancelled do not hold(chan)
	   since it is already done with previous set_timer */
	if (!cancel_delayed_work(work))
		l2cap_chan_hold(chan);
	schedule_delayed_work(work, timeout);
}

static inline bool l2cap_clear_timer(struct l2cap_chan *chan,
				     struct delayed_work *work)
{
	bool ret;

	/* put(chan) if delayed work cancelled otherwise it
	   is done in delayed work function */
	ret = cancel_delayed_work(work);
	if (ret)
		l2cap_chan_put(chan);

	return ret;
}

#define __set_chan_timer(c, t) l2cap_set_timer(c, &c->chan_timer, (t))
#define __clear_chan_timer(c) l2cap_clear_timer(c, &c->chan_timer)
#define __clear_retrans_timer(c) l2cap_clear_timer(c, &c->retrans_timer)
#define __clear_monitor_timer(c) l2cap_clear_timer(c, &c->monitor_timer)
#define __set_ack_timer(c) l2cap_set_timer(c, &c->ack_timer, \
		msecs_to_jiffies(L2CAP_DEFAULT_ACK_TO))
#define __clear_ack_timer(c) l2cap_clear_timer(c, &c->ack_timer)

static inline int __seq_offset(struct l2cap_chan *chan, __u16 seq1, __u16 seq2)
{
	if (seq1 >= seq2)
		return seq1 - seq2;
	else
		return chan->tx_win_max + 1 - seq2 + seq1;
}

static inline __u16 __next_seq(struct l2cap_chan *chan, __u16 seq)
{
	return (seq + 1) % (chan->tx_win_max + 1);
}

static inline struct l2cap_chan *l2cap_chan_no_new_connection(struct l2cap_chan *chan)
{
	return NULL;
}

static inline int l2cap_chan_no_recv(struct l2cap_chan *chan, struct sk_buff *skb)
{
	return -ENOSYS;
}

static inline struct sk_buff *l2cap_chan_no_alloc_skb(struct l2cap_chan *chan,
						      unsigned long hdr_len,
						      unsigned long len, int nb)
{
	return ERR_PTR(-ENOSYS);
}

static inline void l2cap_chan_no_teardown(struct l2cap_chan *chan, int err)
{
}

static inline void l2cap_chan_no_close(struct l2cap_chan *chan)
{
}

static inline void l2cap_chan_no_ready(struct l2cap_chan *chan)
{
}

static inline void l2cap_chan_no_state_change(struct l2cap_chan *chan,
					      int state, int err)
{
}

static inline void l2cap_chan_no_defer(struct l2cap_chan *chan)
{
}

static inline void l2cap_chan_no_suspend(struct l2cap_chan *chan)
{
}

static inline void l2cap_chan_no_resume(struct l2cap_chan *chan)
{
}

static inline void l2cap_chan_no_set_shutdown(struct l2cap_chan *chan)
{
}

static inline long l2cap_chan_no_get_sndtimeo(struct l2cap_chan *chan)
{
	return 0;
}

extern bool disable_ertm;
extern bool enable_ecred;

int l2cap_init_sockets(void);
void l2cap_cleanup_sockets(void);
bool l2cap_is_socket(struct socket *sock);

void __l2cap_le_connect_rsp_defer(struct l2cap_chan *chan);
void __l2cap_ecred_conn_rsp_defer(struct l2cap_chan *chan);
void __l2cap_connect_rsp_defer(struct l2cap_chan *chan);

int l2cap_add_psm(struct l2cap_chan *chan, bdaddr_t *src, __le16 psm);
int l2cap_add_scid(struct l2cap_chan *chan, __u16 scid);

struct l2cap_chan *l2cap_chan_create(void);
void l2cap_chan_close(struct l2cap_chan *chan, int reason);
int l2cap_chan_connect(struct l2cap_chan *chan, __le16 psm, u16 cid,
		       bdaddr_t *dst, u8 dst_type, u16 timeout);
int l2cap_chan_reconfigure(struct l2cap_chan *chan, __u16 mtu);
int l2cap_chan_send(struct l2cap_chan *chan, struct msghdr *msg, size_t len,
		    const struct sockcm_cookie *sockc);
void l2cap_chan_busy(struct l2cap_chan *chan, int busy);
void l2cap_chan_rx_avail(struct l2cap_chan *chan, ssize_t rx_avail);
int l2cap_chan_check_security(struct l2cap_chan *chan, bool initiator);
void l2cap_chan_set_defaults(struct l2cap_chan *chan);
int l2cap_ertm_init(struct l2cap_chan *chan);
void l2cap_chan_add(struct l2cap_conn *conn, struct l2cap_chan *chan);
void __l2cap_chan_add(struct l2cap_conn *conn, struct l2cap_chan *chan);
typedef void (*l2cap_chan_func_t)(struct l2cap_chan *chan, void *data);
void l2cap_chan_list(struct l2cap_conn *conn, l2cap_chan_func_t func,
		     void *data);
void l2cap_chan_del(struct l2cap_chan *chan, int err);
void l2cap_send_conn_req(struct l2cap_chan *chan);

struct l2cap_conn *l2cap_conn_get(struct l2cap_conn *conn);
struct l2cap_conn *l2cap_conn_hold_unless_zero(struct l2cap_conn *conn);
void l2cap_conn_put(struct l2cap_conn *conn);

int l2cap_register_user(struct l2cap_conn *conn, struct l2cap_user *user);
void l2cap_unregister_user(struct l2cap_conn *conn, struct l2cap_user *user);

#endif /* __L2CAP_H */
// SPDX-License-Identifier: GPL-2.0
/*
 *  Copyright (C) 1992 obz under the linux copyright
 *
 *  Dynamic diacritical handling - aeb@cwi.nl - Dec 1993
 *  Dynamic keymap and string allocation - aeb@cwi.nl - May 1994
 *  Restrict VT switching via ioctl() - grif@cs.ucr.edu - Dec 1995
 *  Some code moved for less code duplication - Andi Kleen - Mar 1997
 *  Check put/get_user, cleanups - acme@conectiva.com.br - Jun 2001
 */

#include <linux/types.h>
#include <linux/errno.h>
#include <linux/sched/signal.h>
#include <linux/tty.h>
#include <linux/timer.h>
#include <linux/kernel.h>
#include <linux/compat.h>
#include <linux/module.h>
#include <linux/kd.h>
#include <linux/vt.h>
#include <linux/string.h>
#include <linux/slab.h>
#include <linux/major.h>
#include <linux/fs.h>
#include <linux/console.h>
#include <linux/consolemap.h>
#include <linux/signal.h>
#include <linux/suspend.h>
#include <linux/timex.h>

#include <asm/io.h>
#include <linux/uaccess.h>

#include <linux/nospec.h>

#include <linux/kbd_kern.h>
#include <linux/vt_kern.h>
#include <linux/kbd_diacr.h>
#include <linux/selection.h>

bool vt_dont_switch;

static inline bool vt_in_use(unsigned int i)
{
	const struct vc_data *vc = vc_cons[i].d;

	/*
	 * console_lock must be held to prevent the vc from being deallocated
	 * while we're checking whether it's in-use.
	 */
	WARN_CONSOLE_UNLOCKED();

	return vc && kref_read(&vc->port.kref) > 1;
}

static inline bool vt_busy(int i)
{
	if (vt_in_use(i))
		return true;
	if (i == fg_console)
		return true;
	if (vc_is_sel(vc_cons[i].d))
		return true;

	return false;
}

/*
 * Console (vt and kd) routines, as defined by USL SVR4 manual, and by
 * experimentation and study of X386 SYSV handling.
 *
 * One point of difference: SYSV vt's are /dev/vtX, with X >= 0, and
 * /dev/console is a separate ttyp. Under Linux, /dev/tty0 is /dev/console,
 * and the vc start at /dev/ttyX, X >= 1. We maintain that here, so we will
 * always treat our set of vt as numbered 1..MAX_NR_CONSOLES (corresponding to
 * ttys 0..MAX_NR_CONSOLES-1).
 * Explicitly naming VT 0 is illegal, but using
 * /dev/tty0 (fg_console) as a target is legal, since an implicit aliasing
 * to the current console is done by the main ioctl code.
 */

#ifdef CONFIG_X86
#include <asm/syscalls.h>
#endif

static void complete_change_console(struct vc_data *vc);

/*
 * User space VT_EVENT handlers
 */

struct vt_event_wait {
	struct list_head list;
	struct vt_event event;
	int done;
};

static LIST_HEAD(vt_events);
static DEFINE_SPINLOCK(vt_event_lock);
static DECLARE_WAIT_QUEUE_HEAD(vt_event_waitqueue);

/**
 * vt_event_post
 * @event: the event that occurred
 * @old: old console
 * @new: new console
 *
 * Post a VT event to interested VT handlers
 */
void vt_event_post(unsigned int event, unsigned int old, unsigned int new)
{
	struct list_head *pos, *head;
	unsigned long flags;
	int wake = 0;

	spin_lock_irqsave(&vt_event_lock, flags);
	head = &vt_events;

	list_for_each(pos, head) {
		struct vt_event_wait *ve = list_entry(pos,
						struct vt_event_wait, list);
		if (!(ve->event.event & event))
			continue;
		ve->event.event = event;
		/* kernel view is consoles 0..n-1, user space view is
		   console 1..n with 0 meaning current, so we must bias */
		ve->event.oldev = old + 1;
		ve->event.newev = new + 1;
		wake = 1;
		ve->done = 1;
	}
	spin_unlock_irqrestore(&vt_event_lock, flags);
	if (wake)
		wake_up_interruptible(&vt_event_waitqueue);
}

static void __vt_event_queue(struct vt_event_wait *vw)
{
	unsigned long flags;
	/* Prepare the event */
	INIT_LIST_HEAD(&vw->list);
	vw->done = 0;
	/* Queue our event */
	spin_lock_irqsave(&vt_event_lock, flags);
	list_add(&vw->list, &vt_events);
	spin_unlock_irqrestore(&vt_event_lock, flags);
}

static void __vt_event_wait(struct vt_event_wait *vw)
{
	/* Wait for it to pass */
	wait_event_interruptible(vt_event_waitqueue, vw->done);
}

static void __vt_event_dequeue(struct vt_event_wait *vw)
{
	unsigned long flags;

	/* Dequeue it */
	spin_lock_irqsave(&vt_event_lock, flags);
	list_del(&vw->list);
	spin_unlock_irqrestore(&vt_event_lock, flags);
}

/**
 * vt_event_wait - wait
 * for an event
 * @vw: our event
 *
 * Waits for an event to occur which completes our vt_event_wait
 * structure. On return the structure has vw->done set to 1 for success
 * or 0 if some event such as a signal ended the wait.
 */
static void vt_event_wait(struct vt_event_wait *vw)
{
	__vt_event_queue(vw);
	__vt_event_wait(vw);
	__vt_event_dequeue(vw);
}

/**
 * vt_event_wait_ioctl - event ioctl handler
 * @event: argument to ioctl (the event)
 *
 * Implement the VT_WAITEVENT ioctl using the VT event interface
 */
static int vt_event_wait_ioctl(struct vt_event __user *event)
{
	struct vt_event_wait vw;

	if (copy_from_user(&vw.event, event, sizeof(struct vt_event)))
		return -EFAULT;
	/* Highest supported event for now */
	if (vw.event.event & ~VT_MAX_EVENT)
		return -EINVAL;

	vt_event_wait(&vw);
	/* If it occurred report it */
	if (vw.done) {
		if (copy_to_user(event, &vw.event, sizeof(struct vt_event)))
			return -EFAULT;
		return 0;
	}
	return -EINTR;
}

/**
 * vt_waitactive - active console wait
 * @n: new console
 *
 * Helper for event waits. Used to implement the legacy
 * event waiting ioctls in terms of events
 */
int vt_waitactive(int n)
{
	struct vt_event_wait vw;
	do {
		vw.event.event = VT_EVENT_SWITCH;
		__vt_event_queue(&vw);
		if (n == fg_console + 1) {
			__vt_event_dequeue(&vw);
			break;
		}
		__vt_event_wait(&vw);
		__vt_event_dequeue(&vw);
		if (vw.done == 0)
			return -EINTR;
	} while (vw.event.newev != n);
	return 0;
}

/*
 * these are the valid i/o ports we're allowed to change. they map all the
 * video ports
 */
#define GPFIRST 0x3b4
#define GPLAST 0x3df
#define GPNUM (GPLAST - GPFIRST + 1)

/*
 * currently, setting the mode from KD_TEXT to KD_GRAPHICS doesn't do a whole
 * lot. i'm not sure if it should do any restoration of modes or what...
 *
 * XXX It should at least call into the driver, fbdev's definitely need to
 * restore their engine state. --BenH
 *
 * Called with the console lock held.
 */
static int vt_kdsetmode(struct vc_data *vc, unsigned long mode)
{
	switch (mode) {
	case KD_GRAPHICS:
		break;
	case KD_TEXT0:
	case KD_TEXT1:
		mode = KD_TEXT;
		fallthrough;
	case KD_TEXT:
		break;
	default:
		return -EINVAL;
	}

	if (vc->vc_mode == mode)
		return 0;

	vc->vc_mode = mode;
	if (vc->vc_num != fg_console)
		return 0;

	/* explicitly blank/unblank the screen if switching modes */
	if (mode == KD_TEXT)
		do_unblank_screen(1);
	else
		do_blank_screen(1);

	return 0;
}

static int vt_k_ioctl(struct tty_struct *tty, unsigned int cmd,
		unsigned long arg, bool perm)
{
	struct vc_data *vc = tty->driver_data;
	void __user *up = (void __user *)arg;
	unsigned int console = vc->vc_num;
	int ret;

	switch (cmd) {
	case KIOCSOUND:
		if (!perm)
			return -EPERM;
		/*
		 * The use of PIT_TICK_RATE is historic, it used to be
		 * the platform-dependent CLOCK_TICK_RATE between 2.6.12
		 * and 2.6.36, which was a minor but unfortunate ABI
		 * change. kd_mksound is locked by the input layer.
		 */
		if (arg)
			arg = PIT_TICK_RATE / arg;
		kd_mksound(arg, 0);
		break;

	case KDMKTONE:
		if (!perm)
			return -EPERM;
	{
		unsigned int ticks, count;

		/*
		 * Generate the tone for the appropriate number of ticks.
		 * If the time is zero, turn off sound ourselves.
		 */
		ticks = msecs_to_jiffies((arg >> 16) & 0xffff);
		count = ticks ? (arg & 0xffff) : 0;
		if (count)
			count = PIT_TICK_RATE / count;
		kd_mksound(count, ticks);
		break;
	}

	case KDGKBTYPE:
		/*
		 * this is naïve.
		 */
		return put_user(KB_101, (char __user *)arg);

		/*
		 * These cannot be implemented on any machine that implements
		 * ioperm() in user level (such as Alpha PCs) or not at all.
		 *
		 * XXX: you should never use these, just call ioperm directly..
		 */
#ifdef CONFIG_X86
	case KDADDIO:
	case KDDELIO:
		/*
		 * KDADDIO and KDDELIO may be able to add ports beyond what
		 * we reject here, but to be safe...
		 *
		 * These are locked internally via sys_ioperm
		 */
		if (arg < GPFIRST || arg > GPLAST)
			return -EINVAL;

		return ksys_ioperm(arg, 1, (cmd == KDADDIO)) ?
			-ENXIO : 0;

	case KDENABIO:
	case KDDISABIO:
		return ksys_ioperm(GPFIRST, GPNUM,
				  (cmd == KDENABIO)) ? -ENXIO : 0;
#endif

	/* Linux m68k/i386 interface for setting the keyboard delay/repeat rate */

	case KDKBDREP:
	{
		struct kbd_repeat kbrep;

		if (!capable(CAP_SYS_TTY_CONFIG))
			return -EPERM;

		if (copy_from_user(&kbrep, up, sizeof(struct kbd_repeat)))
			return -EFAULT;

		ret = kbd_rate(&kbrep);
		if (ret)
			return ret;
		if (copy_to_user(up, &kbrep, sizeof(struct kbd_repeat)))
			return -EFAULT;
		break;
	}

	case KDSETMODE:
		if (!perm)
			return -EPERM;

		console_lock();
		ret = vt_kdsetmode(vc, arg);
		console_unlock();
		return ret;

	case KDGETMODE:
		return put_user(vc->vc_mode, (int __user *)arg);

	case KDMAPDISP:
	case KDUNMAPDISP:
		/*
		 * these work like a combination of mmap and KDENABIO.
		 * this could be easily finished.
		 */
		return -EINVAL;

	case KDSKBMODE:
		if (!perm)
			return -EPERM;
		ret = vt_do_kdskbmode(console, arg);
		if (ret)
			return ret;
		tty_ldisc_flush(tty);
		break;

	case KDGKBMODE:
		return put_user(vt_do_kdgkbmode(console), (int __user *)arg);

	/* this could be folded into KDSKBMODE, but for compatibility
	   reasons it is not so easy to fold KDGKBMETA into KDGKBMODE */
	case KDSKBMETA:
		return vt_do_kdskbmeta(console, arg);

	case KDGKBMETA:
		/* FIXME: should review whether this is worth locking */
		return put_user(vt_do_kdgkbmeta(console), (int __user *)arg);

	case KDGETKEYCODE:
	case KDSETKEYCODE:
		if (!capable(CAP_SYS_TTY_CONFIG))
			perm = 0;
		return vt_do_kbkeycode_ioctl(cmd, up, perm);

	case KDGKBENT:
	case KDSKBENT:
		return vt_do_kdsk_ioctl(cmd, up, perm, console);

	case KDGKBSENT:
	case KDSKBSENT:
		return vt_do_kdgkb_ioctl(cmd, up, perm);

	/* Diacritical processing.
	   Handled in keyboard.c as it has to operate on the keyboard
	   locks and structures */
	case KDGKBDIACR:
	case KDGKBDIACRUC:
	case KDSKBDIACR:
	case KDSKBDIACRUC:
		return vt_do_diacrit(cmd, up, perm);

	/* the ioctls below read/set the flags usually shown in the leds */
	/* don't use them - they will go away without warning */
	case KDGKBLED:
	case KDSKBLED:
	case KDGETLED:
	case KDSETLED:
		return vt_do_kdskled(console, cmd, arg, perm);

	/*
	 * A process can indicate its willingness to accept signals
	 * generated by pressing an appropriate key combination.
	 * Thus, one can have a daemon that e.g. spawns a new console
	 * upon a keypress and then changes to it.
	 * See also the kbrequest field of inittab(5).
	 */
	case KDSIGACCEPT:
		if (!perm || !capable(CAP_KILL))
			return -EPERM;
		if (!valid_signal(arg) || arg < 1 || arg == SIGKILL)
			return -EINVAL;

		spin_lock_irq(&vt_spawn_con.lock);
		put_pid(vt_spawn_con.pid);
		vt_spawn_con.pid = get_pid(task_pid(current));
		vt_spawn_con.sig = arg;
		spin_unlock_irq(&vt_spawn_con.lock);
		break;

	case KDFONTOP: {
		struct console_font_op op;

		if (copy_from_user(&op, up, sizeof(op)))
			return -EFAULT;
		if (!perm && op.op != KD_FONT_OP_GET)
			return -EPERM;
		ret = con_font_op(vc, &op);
		if (ret)
			return ret;
		if (copy_to_user(up, &op, sizeof(op)))
			return -EFAULT;
		break;
	}

	default:
		return -ENOIOCTLCMD;
	}

	return 0;
}

static inline int do_unimap_ioctl(int cmd, struct unimapdesc __user *user_ud,
		bool perm, struct vc_data *vc)
{
	struct unimapdesc tmp;

	if (copy_from_user(&tmp, user_ud, sizeof tmp))
		return -EFAULT;
	switch (cmd) {
	case PIO_UNIMAP:
		if (!perm)
			return -EPERM;
		return con_set_unimap(vc, tmp.entry_ct, tmp.entries);
	case GIO_UNIMAP:
		if (!perm && fg_console != vc->vc_num)
			return -EPERM;
		return con_get_unimap(vc, tmp.entry_ct, &(user_ud->entry_ct),
				tmp.entries);
	}
	return 0;
}

static int vt_io_ioctl(struct vc_data *vc, unsigned int cmd, void __user *up,
		bool perm)
{
	switch (cmd) {
	case PIO_CMAP:
		if (!perm)
			return -EPERM;
		return con_set_cmap(up);

	case GIO_CMAP:
		return con_get_cmap(up);

	case PIO_SCRNMAP:
		if (!perm)
			return -EPERM;
		return con_set_trans_old(up);

	case GIO_SCRNMAP:
		return con_get_trans_old(up);

	case PIO_UNISCRNMAP:
		if (!perm)
			return -EPERM;
		return con_set_trans_new(up);

	case GIO_UNISCRNMAP:
		return con_get_trans_new(up);

	case PIO_UNIMAPCLR:
		if (!perm)
			return -EPERM;
		con_clear_unimap(vc);
		break;

	case PIO_UNIMAP:
	case GIO_UNIMAP:
		return do_unimap_ioctl(cmd, up, perm, vc);

	default:
		return -ENOIOCTLCMD;
	}

	return 0;
}

static int vt_reldisp(struct vc_data *vc, unsigned int swtch)
{
	int newvt, ret;

	if (vc->vt_mode.mode != VT_PROCESS)
		return -EINVAL;

	/* Switched-to response */
	if (vc->vt_newvt < 0) {
		/* If it's just an ACK, ignore it */
		return swtch == VT_ACKACQ ? 0 : -EINVAL;
	}

	/* Switching-from response */
	if (swtch == 0) {
		/* Switch disallowed, so forget we were trying to do it. */
		vc->vt_newvt = -1;
		return 0;
	}

	/* The current vt has been released, so complete the switch. */
	newvt = vc->vt_newvt;
	vc->vt_newvt = -1;
	ret = vc_allocate(newvt);
	if (ret)
		return ret;

	/*
	 * When we actually do the console switch, make sure we are atomic with
	 * respect to other console switches..
	 */
	complete_change_console(vc_cons[newvt].d);

	return 0;
}

static int vt_setactivate(struct vt_setactivate __user *sa)
{
	struct vt_setactivate vsa;
	struct vc_data *nvc;
	int ret;

	if (copy_from_user(&vsa, sa, sizeof(vsa)))
		return -EFAULT;
	if (vsa.console == 0 || vsa.console > MAX_NR_CONSOLES)
		return -ENXIO;

	vsa.console--;
	vsa.console = array_index_nospec(vsa.console, MAX_NR_CONSOLES);
	console_lock();
	ret = vc_allocate(vsa.console);
	if (ret) {
		console_unlock();
		return ret;
	}

	/*
	 * This is safe providing we don't drop the console sem between
	 * vc_allocate and finishing referencing nvc.
	 */
	nvc = vc_cons[vsa.console].d;
	nvc->vt_mode = vsa.mode;
	nvc->vt_mode.frsig = 0;
	put_pid(nvc->vt_pid);
	nvc->vt_pid = get_pid(task_pid(current));
	console_unlock();

	/* Commence switch and lock */
	/* Review set_console locks */
	set_console(vsa.console);

	return 0;
}

/* deallocate a single console, if possible (leave 0) */
static int vt_disallocate(unsigned int vc_num)
{
	struct vc_data *vc = NULL;
	int ret = 0;

	console_lock();
	if (vt_busy(vc_num))
		ret = -EBUSY;
	else if (vc_num)
		vc = vc_deallocate(vc_num);
	console_unlock();

	if (vc && vc_num >= MIN_NR_CONSOLES)
		tty_port_put(&vc->port);

	return ret;
}

/* deallocate all unused consoles, but leave 0 */
static void vt_disallocate_all(void)
{
	struct vc_data *vc[MAX_NR_CONSOLES];
	int i;

	console_lock();
	for (i = 1; i < MAX_NR_CONSOLES; i++)
		if (!vt_busy(i))
			vc[i] = vc_deallocate(i);
		else
			vc[i] = NULL;
	console_unlock();

	for (i = 1; i < MAX_NR_CONSOLES; i++) {
		if (vc[i] && i >= MIN_NR_CONSOLES)
			tty_port_put(&vc[i]->port);
	}
}

static int vt_resizex(struct vc_data *vc, struct vt_consize __user *cs)
{
	struct vt_consize v;
	int i;

	if (copy_from_user(&v, cs, sizeof(struct vt_consize)))
		return -EFAULT;

	/* FIXME: Should check the copies properly */
	if (!v.v_vlin)
		v.v_vlin = vc->vc_scan_lines;

	if (v.v_clin) {
		int rows = v.v_vlin / v.v_clin;
		if (v.v_rows != rows) {
			if (v.v_rows) /* Parameters don't add up */
				return -EINVAL;
			v.v_rows = rows;
		}
	}

	if (v.v_vcol && v.v_ccol) {
		int cols = v.v_vcol / v.v_ccol;
		if (v.v_cols != cols) {
			if (v.v_cols)
				return -EINVAL;
			v.v_cols = cols;
		}
	}

	if (v.v_clin > 32)
		return -EINVAL;

	for (i = 0; i < MAX_NR_CONSOLES; i++) {
		struct vc_data *vcp;

		if (!vc_cons[i].d)
			continue;
		console_lock();
		vcp = vc_cons[i].d;
		if (vcp) {
			int ret;
			int save_scan_lines = vcp->vc_scan_lines;
			int save_cell_height = vcp->vc_cell_height;

			if (v.v_vlin)
				vcp->vc_scan_lines = v.v_vlin;
			if (v.v_clin)
				vcp->vc_cell_height = v.v_clin;
			ret = __vc_resize(vcp, v.v_cols, v.v_rows, true);
			if (ret) {
				vcp->vc_scan_lines = save_scan_lines;
				vcp->vc_cell_height = save_cell_height;
				console_unlock();
				return ret;
			}
		}
		console_unlock();
	}

	return 0;
}

/*
 * We handle the console-specific ioctl's here. We allow the
 * capability to modify any console, not just the fg_console.
 */
int vt_ioctl(struct tty_struct *tty,
	     unsigned int cmd, unsigned long arg)
{
	struct vc_data *vc = tty->driver_data;
	void __user *up = (void __user *)arg;
	int i, perm;
	int ret;

	/*
	 * To have permissions to do most of the vt ioctls, we either have
	 * to be the owner of the tty, or have CAP_SYS_TTY_CONFIG.
	 */
	perm = 0;
	if (current->signal->tty == tty || capable(CAP_SYS_TTY_CONFIG))
		perm = 1;

	ret = vt_k_ioctl(tty, cmd, arg, perm);
	if (ret != -ENOIOCTLCMD)
		return ret;

	ret = vt_io_ioctl(vc, cmd, up, perm);
	if (ret != -ENOIOCTLCMD)
		return ret;

	switch (cmd) {
	case TIOCLINUX:
		return tioclinux(tty, arg);
	case VT_SETMODE:
	{
		struct vt_mode tmp;

		if (!perm)
			return -EPERM;
		if (copy_from_user(&tmp, up, sizeof(struct vt_mode)))
			return -EFAULT;
		if (tmp.mode != VT_AUTO && tmp.mode != VT_PROCESS)
			return -EINVAL;

		console_lock();
		vc->vt_mode = tmp;
		/* the frsig is ignored, so we set it to 0 */
		vc->vt_mode.frsig = 0;
		put_pid(vc->vt_pid);
		vc->vt_pid = get_pid(task_pid(current));
		/* no switch is required -- saw@shade.msu.ru */
		vc->vt_newvt = -1;
		console_unlock();
		break;
	}

	case VT_GETMODE:
	{
		struct vt_mode tmp;
		int rc;

		console_lock();
		memcpy(&tmp, &vc->vt_mode, sizeof(struct vt_mode));
		console_unlock();

		rc = copy_to_user(up, &tmp, sizeof(struct vt_mode));
		if (rc)
			return -EFAULT;
		break;
	}

	/*
	 * Returns global vt state. Note that VT 0 is always open, since
	 * it's an alias for the current VT, and people can't use it here.
	 * We cannot return state for more than 16 VTs, since v_state is short.
	 */
	case VT_GETSTATE:
	{
		struct vt_stat __user *vtstat = up;
		unsigned short state, mask;

		if (put_user(fg_console + 1, &vtstat->v_active))
			return -EFAULT;

		state = 1;	/* /dev/tty0 is always open */
		console_lock(); /* required by vt_in_use() */
		for (i = 0, mask = 2; i < MAX_NR_CONSOLES && mask;
				++i, mask <<= 1)
			if (vt_in_use(i))
				state |= mask;
		console_unlock();
		return put_user(state, &vtstat->v_state);
	}

	/*
	 * Returns the first available (non-opened) console.
	 */
	case VT_OPENQRY:
		console_lock(); /* required by vt_in_use() */
		for (i = 0; i < MAX_NR_CONSOLES; ++i)
			if (!vt_in_use(i))
				break;
		console_unlock();
		i = i < MAX_NR_CONSOLES ? (i+1) : -1;
		return put_user(i, (int __user *)arg);

	/*
	 * ioctl(fd, VT_ACTIVATE, num) will cause us to switch to vt # num,
	 * with num >= 1 (switches to vt 0, our console, are not allowed, just
	 * to preserve sanity).
	 */
	case VT_ACTIVATE:
		if (!perm)
			return -EPERM;
		if (arg == 0 || arg > MAX_NR_CONSOLES)
			return -ENXIO;

		arg--;
		arg = array_index_nospec(arg, MAX_NR_CONSOLES);
		console_lock();
		ret = vc_allocate(arg);
		console_unlock();
		if (ret)
			return ret;
		set_console(arg);
		break;

	case VT_SETACTIVATE:
		if (!perm)
			return -EPERM;

		return vt_setactivate(up);

	/*
	 * wait until the specified VT has been activated
	 */
	case VT_WAITACTIVE:
		if (!perm)
			return -EPERM;
		if (arg == 0 || arg > MAX_NR_CONSOLES)
			return -ENXIO;
		return vt_waitactive(arg);

	/*
	 * If a vt is under process control, the kernel will not switch to it
	 * immediately, but postpone the operation until the process calls this
	 * ioctl, allowing the switch to complete.
	 *
	 * According to the X sources this is the behavior:
	 *	0:	pending switch-from not OK
	 *	1:	pending switch-from OK
	 *	2:	completed switch-to OK
	 */
	case VT_RELDISP:
		if (!perm)
			return -EPERM;

		console_lock();
		ret = vt_reldisp(vc, arg);
		console_unlock();

		return ret;

	/*
	 * Disallocate memory associated to VT (but leave VT1)
	 */
	case VT_DISALLOCATE:
		if (arg > MAX_NR_CONSOLES)
			return -ENXIO;

		if (arg == 0) {
			vt_disallocate_all();
			break;
		}

		arg = array_index_nospec(arg - 1, MAX_NR_CONSOLES);
		return vt_disallocate(arg);

	case VT_RESIZE:
	{
		struct vt_sizes __user *vtsizes = up;
		struct vc_data *vc;
		ushort ll, cc;

		if (!perm)
			return -EPERM;
		if (get_user(ll, &vtsizes->v_rows) ||
		    get_user(cc, &vtsizes->v_cols))
			return -EFAULT;

		console_lock();
		for (i = 0; i < MAX_NR_CONSOLES; i++) {
			vc = vc_cons[i].d;

			if (vc) {
				/* FIXME: review v tty lock */
				__vc_resize(vc_cons[i].d, cc, ll, true);
			}
		}
		console_unlock();
		break;
	}

	case VT_RESIZEX:
		if (!perm)
			return -EPERM;

		return vt_resizex(vc, up);

	case VT_LOCKSWITCH:
		if (!capable(CAP_SYS_TTY_CONFIG))
			return -EPERM;
		vt_dont_switch = true;
		break;
	case VT_UNLOCKSWITCH:
		if (!capable(CAP_SYS_TTY_CONFIG))
			return -EPERM;
		vt_dont_switch = false;
		break;
	case VT_GETHIFONTMASK:
		return put_user(vc->vc_hi_font_mask,
					(unsigned short __user *)arg);
	case VT_WAITEVENT:
		return vt_event_wait_ioctl((struct vt_event __user *)arg);
	default:
		return -ENOIOCTLCMD;
	}

	return 0;
}

void reset_vc(struct vc_data *vc)
{
	vc->vc_mode = KD_TEXT;
	vt_reset_unicode(vc->vc_num);
	vc->vt_mode.mode = VT_AUTO;
	vc->vt_mode.waitv = 0;
	vc->vt_mode.relsig = 0;
	vc->vt_mode.acqsig = 0;
	vc->vt_mode.frsig = 0;
	put_pid(vc->vt_pid);
	vc->vt_pid = NULL;
	vc->vt_newvt = -1;
	reset_palette(vc);
}

void vc_SAK(struct work_struct *work)
{
	struct vc *vc_con =
		container_of(work, struct vc, SAK_work);
	struct vc_data *vc;
	struct tty_struct *tty;

	console_lock();
	vc = vc_con->d;
	if (vc) {
		/* FIXME: review tty ref counting */
		tty = vc->port.tty;
		/*
		 * SAK should also work in all raw modes and reset
		 * them properly.
		 */
		if (tty)
			__do_SAK(tty);
		reset_vc(vc);
	}
	console_unlock();
}

#ifdef CONFIG_COMPAT

struct compat_console_font_op {
	compat_uint_t op;		/* operation code KD_FONT_OP_* */
	compat_uint_t flags;		/* KD_FONT_FLAG_* */
	compat_uint_t width, height;	/* font size */
	compat_uint_t charcount;
	compat_caddr_t data;		/* font data with height fixed to 32 */
};

static inline int
compat_kdfontop_ioctl(struct compat_console_font_op __user *fontop,
			 int perm, struct console_font_op *op, struct vc_data *vc)
{
	int i;

	if (copy_from_user(op, fontop, sizeof(struct compat_console_font_op)))
		return -EFAULT;
	if (!perm && op->op != KD_FONT_OP_GET)
		return -EPERM;
	op->data = compat_ptr(((struct compat_console_font_op *)op)->data);
	i = con_font_op(vc, op);
	if (i)
		return i;
	((struct compat_console_font_op *)op)->data = (unsigned long)op->data;
	if (copy_to_user(fontop, op, sizeof(struct compat_console_font_op)))
		return -EFAULT;
	return 0;
}

struct compat_unimapdesc {
	unsigned short entry_ct;
	compat_caddr_t entries;
};

static inline int
compat_unimap_ioctl(unsigned int cmd, struct compat_unimapdesc __user *user_ud,
			 int perm, struct vc_data *vc)
{
	struct compat_unimapdesc tmp;
	struct unipair __user *tmp_entries;

	if (copy_from_user(&tmp, user_ud, sizeof tmp))
		return -EFAULT;
	tmp_entries = compat_ptr(tmp.entries);
	switch (cmd) {
	case PIO_UNIMAP:
		if (!perm)
			return -EPERM;
		return con_set_unimap(vc, tmp.entry_ct, tmp_entries);
	case GIO_UNIMAP:
		if (!perm && fg_console != vc->vc_num)
			return -EPERM;
		return con_get_unimap(vc, tmp.entry_ct, &(user_ud->entry_ct), tmp_entries);
	}
	return 0;
}

long vt_compat_ioctl(struct tty_struct *tty,
	     unsigned int cmd, unsigned long arg)
{
	struct vc_data *vc = tty->driver_data;
	struct console_font_op op;	/* used in multiple places here */
	void __user *up = compat_ptr(arg);
	int perm;

	/*
	 * To have permissions to do most of the vt ioctls, we either have
	 * to be the owner of the tty, or have CAP_SYS_TTY_CONFIG.
	 */
	perm = 0;
	if (current->signal->tty == tty || capable(CAP_SYS_TTY_CONFIG))
		perm = 1;

	switch (cmd) {
	/*
	 * these need special handlers for incompatible data structures
	 */

	case KDFONTOP:
		return compat_kdfontop_ioctl(up, perm, &op, vc);

	case PIO_UNIMAP:
	case GIO_UNIMAP:
		return compat_unimap_ioctl(cmd, up, perm, vc);

	/*
	 * all these treat 'arg' as an integer
	 */
	case KIOCSOUND:
	case KDMKTONE:
#ifdef CONFIG_X86
	case KDADDIO:
	case KDDELIO:
#endif
	case KDSETMODE:
	case KDMAPDISP:
	case KDUNMAPDISP:
	case KDSKBMODE:
	case KDSKBMETA:
	case KDSKBLED:
	case KDSETLED:
	case KDSIGACCEPT:
	case VT_ACTIVATE:
	case VT_WAITACTIVE:
	case VT_RELDISP:
	case VT_DISALLOCATE:
	case VT_RESIZE:
	case VT_RESIZEX:
		return vt_ioctl(tty, cmd, arg);

	/*
	 * the rest has a compatible data structure behind arg,
	 * but we have to convert it to a proper 64 bit pointer.
	 */
	default:
		return vt_ioctl(tty, cmd, (unsigned long)up);
	}
}

#endif /* CONFIG_COMPAT */

/*
 * Performs the back end of a vt switch. Called under the console
 * semaphore.
 */
static void complete_change_console(struct vc_data *vc)
{
	unsigned char old_vc_mode;
	int old = fg_console;

	last_console = fg_console;

	/*
	 * If we're switching, we could be going from KD_GRAPHICS to
	 * KD_TEXT mode or vice versa, which means we need to blank or
	 * unblank the screen later.
	 */
	old_vc_mode = vc_cons[fg_console].d->vc_mode;
	switch_screen(vc);

	/*
	 * This can't appear below a successful kill_pid(). If it did,
	 * then the *blank_screen operation could occur while X, having
	 * received acqsig, is waking up on another processor. This
	 * condition can lead to overlapping accesses to the VGA range
	 * and the framebuffer (causing system lockups).
	 *
	 * To account for this we duplicate this code below only if the
	 * controlling process is gone and we've called reset_vc.
	 */
	if (old_vc_mode != vc->vc_mode) {
		if (vc->vc_mode == KD_TEXT)
			do_unblank_screen(1);
		else
			do_blank_screen(1);
	}

	/*
	 * If this new console is under process control, send it a signal
	 * telling it that it has acquired.
	 * Also check if it has died and
	 * clean up (similar to logic employed in change_console())
	 */
	if (vc->vt_mode.mode == VT_PROCESS) {
		/*
		 * Send the signal as privileged - kill_pid() will
		 * tell us if the process has gone or something else
		 * is awry
		 */
		if (kill_pid(vc->vt_pid, vc->vt_mode.acqsig, 1) != 0) {
			/*
			 * The controlling process has died, so we revert back to
			 * normal operation. In this case, we'll also change back
			 * to KD_TEXT mode. I'm not sure if this is strictly correct
			 * but it saves the agony when the X server dies and the screen
			 * remains blanked due to KD_GRAPHICS! It would be nice to do
			 * this outside of VT_PROCESS but there is no single process
			 * to account for and tracking tty count may be undesirable.
			 */
			reset_vc(vc);

			if (old_vc_mode != vc->vc_mode) {
				if (vc->vc_mode == KD_TEXT)
					do_unblank_screen(1);
				else
					do_blank_screen(1);
			}
		}
	}

	/*
	 * Wake anyone waiting for their VT to activate
	 */
	vt_event_post(VT_EVENT_SWITCH, old, vc->vc_num);
	return;
}

/*
 * Performs the front-end of a vt switch
 */
void change_console(struct vc_data *new_vc)
{
	struct vc_data *vc;

	if (!new_vc || new_vc->vc_num == fg_console || vt_dont_switch)
		return;

	/*
	 * If this vt is in process mode, then we need to handshake with
	 * that process before switching. Essentially, we store where that
	 * vt wants to switch to and wait for it to tell us when it's done
	 * (via VT_RELDISP ioctl).
	 *
	 * We also check to see if the controlling process still exists.
	 * If it doesn't, we reset this vt to auto mode and continue.
	 * This is a cheap way to track process control. The worst thing
	 * that can happen is: we send a signal to a process, it dies, and
	 * the switch gets "lost" waiting for a response; hopefully, the
	 * user will try again, we'll detect the process is gone (unless
	 * the user waits just the right amount of time :-) and revert the
	 * vt to auto control.
	 */
	vc = vc_cons[fg_console].d;
	if (vc->vt_mode.mode == VT_PROCESS) {
		/*
		 * Send the signal as privileged - kill_pid() will
		 * tell us if the process has gone or something else
		 * is awry.
		 *
		 * We need to set vt_newvt *before* sending the signal or we
		 * have a race.
		 */
		vc->vt_newvt = new_vc->vc_num;
		if (kill_pid(vc->vt_pid, vc->vt_mode.relsig, 1) == 0) {
			/*
			 * It worked. Mark the vt to switch to and
			 * return. The process needs to send us a
			 * VT_RELDISP ioctl to complete the switch.
			 */
			return;
		}

		/*
		 * The controlling process has died, so we revert back to
		 * normal operation. In this case, we'll also change back
		 * to KD_TEXT mode. I'm not sure if this is strictly correct
		 * but it saves the agony when the X server dies and the screen
		 * remains blanked due to KD_GRAPHICS! It would be nice to do
		 * this outside of VT_PROCESS but there is no single process
		 * to account for and tracking tty count may be undesirable.
		 */
		reset_vc(vc);

		/*
		 * Fall through to normal (VT_AUTO) handling of the switch...
		 */
	}

	/*
	 * Ignore all switches in KD_GRAPHICS+VT_AUTO mode
	 */
	if (vc->vc_mode == KD_GRAPHICS)
		return;

	complete_change_console(new_vc);
}

/* Perform a kernel triggered VT switch for suspend/resume */

static int disable_vt_switch;

int vt_move_to_console(unsigned int vt, int alloc)
{
	int prev;

	console_lock();
	/* Graphics mode - up to X */
	if (disable_vt_switch) {
		console_unlock();
		return 0;
	}
	prev = fg_console;

	if (alloc && vc_allocate(vt)) {
		/* we can't have a free VC for now. Too bad,
		 * we don't want to mess the screen for now. */
		console_unlock();
		return -ENOSPC;
	}

	if (set_console(vt)) {
		/*
		 * We're unable to switch to the SUSPEND_CONSOLE.
		 * Let the calling function know so it can decide
		 * what to do.
		 */
		console_unlock();
		return -EIO;
	}
	console_unlock();
	if (vt_waitactive(vt + 1)) {
		pr_debug("Suspend: Can't switch VCs.");
		return -EINTR;
	}
	return prev;
}

/*
 * Normally during a suspend, we allocate a new console and switch to it.
 * When we resume, we switch back to the original console.
 * This switch
 * can be slow, so on systems where the framebuffer can handle restoration
 * of video registers anyways, there's little point in doing the console
 * switch. This function allows you to disable it by passing it '0'.
 */
void pm_set_vt_switch(int do_switch)
{
	console_lock();
	disable_vt_switch = !do_switch;
	console_unlock();
}
EXPORT_SYMBOL(pm_set_vt_switch);
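
The VT_GETSTATE and VT_OPENQRY handlers in vt_ioctl.c bias kernel console indices (0..n-1) by one for user space and pack per-VT usage into the 16-bit v_state field: bit 0 stands for /dev/tty0 and bit n for VT n. A minimal user-space sketch of decoding such a mask follows. The names `vt_stat_sketch`, `vt_kernel_to_user`, `vt_state_in_use` and `vt_first_free` are hypothetical helpers for illustration, and the struct layout is assumed to mirror `struct vt_stat` from <linux/vt.h>.

```c
/*
 * Sketch only: decodes a v_state mask as filled in by VT_GETSTATE.
 * The struct is an assumed stand-in for struct vt_stat in <linux/vt.h>.
 */
struct vt_stat_sketch {
	unsigned short v_active;	/* active VT, 1-based */
	unsigned short v_signal;	/* not used in this sketch */
	unsigned short v_state;		/* bit 0: tty0, bit n: VT n in use */
};

/* Kernel console index (0-based) to user-space VT number (1-based). */
static unsigned int vt_kernel_to_user(unsigned int idx)
{
	return idx + 1;
}

/*
 * Returns 1 if VT number vt is marked in use, 0 if it is free, and -1 if
 * vt cannot be represented in the 16-bit mask (only VTs 1..15 fit).
 */
static int vt_state_in_use(const struct vt_stat_sketch *st, unsigned int vt)
{
	if (vt == 0 || vt > 15)
		return -1;
	return (st->v_state >> vt) & 1;
}

/* First VT in 1..15 whose bit is clear, or -1 if all are marked in use. */
static int vt_first_free(const struct vt_stat_sketch *st)
{
	unsigned int vt;

	for (vt = 1; vt <= 15; vt++)
		if (vt_state_in_use(st, vt) == 0)
			return (int)vt;
	return -1;
}
```

With v_state == 0x0007 (tty0, VT 1 and VT 2 open), `vt_first_free()` returns 3. VT_OPENQRY itself is not limited to 15 VTs, so a real program would normally ask the kernel rather than scan the mask.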
// SPDX-License-Identifier: GPL-2.0
/*
 *  Copyright (C) 1991, 1992  Linus Torvalds
 */

#include <linux/types.h>
#include <linux/errno.h>
#include <linux/signal.h>
#include <linux/sched/signal.h>
#include <linux/sched/task.h>
#include <linux/tty.h>
#include <linux/fcntl.h>
#include <linux/uaccess.h>
#include "tty.h"

static int is_ignored(int sig)
{
	return (sigismember(&current->blocked, sig) ||
		current->sighand->action[sig-1].sa.sa_handler == SIG_IGN);
}

/**
 * __tty_check_change - check for POSIX terminal changes
 * @tty: tty to check
 * @sig: signal to send
 *
 * If we try to write to, or set the state of, a terminal and we're
 * not in the foreground, send a SIGTTOU. If the signal is blocked or
 * ignored, go ahead and perform the operation.
 * (POSIX 7.2)
 *
 * Locking: ctrl.lock
 */
int __tty_check_change(struct tty_struct *tty, int sig)
{
	unsigned long flags;
	struct pid *pgrp, *tty_pgrp;
	int ret = 0;

	if (current->signal->tty != tty)
		return 0;

	rcu_read_lock();
	pgrp = task_pgrp(current);

	spin_lock_irqsave(&tty->ctrl.lock, flags);
	tty_pgrp = tty->ctrl.pgrp;
	spin_unlock_irqrestore(&tty->ctrl.lock, flags);

	if (tty_pgrp && pgrp != tty_pgrp) {
		if (is_ignored(sig)) {
			if (sig == SIGTTIN)
				ret = -EIO;
		} else if (is_current_pgrp_orphaned())
			ret = -EIO;
		else {
			kill_pgrp(pgrp, sig, 1);
			set_thread_flag(TIF_SIGPENDING);
			ret = -ERESTARTSYS;
		}
	}
	rcu_read_unlock();

	if (!tty_pgrp)
		tty_warn(tty, "sig=%d, tty->pgrp == NULL!\n", sig);

	return ret;
}

int tty_check_change(struct tty_struct *tty)
{
	return __tty_check_change(tty, SIGTTOU);
}
EXPORT_SYMBOL(tty_check_change);

void proc_clear_tty(struct task_struct *p)
{
	unsigned long flags;
	struct tty_struct *tty;

	spin_lock_irqsave(&p->sighand->siglock, flags);
	tty = p->signal->tty;
	p->signal->tty = NULL;
	spin_unlock_irqrestore(&p->sighand->siglock, flags);
	tty_kref_put(tty);
}

/**
 * __proc_set_tty - set the controlling terminal
 * @tty: tty structure
 *
 * Only callable by the session leader and only if it does not already have
 * a controlling terminal.
 *
 * Caller must hold: tty_lock()
 *		     a readlock on tasklist_lock
 *		     sighand lock
 */
static void __proc_set_tty(struct tty_struct *tty)
{
	unsigned long flags;

	spin_lock_irqsave(&tty->ctrl.lock, flags);
	/*
	 * The session and fg pgrp references will be non-NULL if
	 * tiocsctty() is stealing the controlling tty
	 */
	put_pid(tty->ctrl.session);
	put_pid(tty->ctrl.pgrp);
	tty->ctrl.pgrp = get_pid(task_pgrp(current));
	tty->ctrl.session = get_pid(task_session(current));
	spin_unlock_irqrestore(&tty->ctrl.lock, flags);
	if (current->signal->tty) {
		tty_debug(tty, "current tty %s not NULL!!\n",
			  current->signal->tty->name);
		tty_kref_put(current->signal->tty);
	}
	put_pid(current->signal->tty_old_pgrp);
	current->signal->tty = tty_kref_get(tty);
	current->signal->tty_old_pgrp = NULL;
}

static void proc_set_tty(struct tty_struct *tty)
{
	spin_lock_irq(&current->sighand->siglock);
	__proc_set_tty(tty);
	spin_unlock_irq(&current->sighand->siglock);
}

/*
 * Called by tty_open() to set the controlling tty if applicable.
 */
void tty_open_proc_set_tty(struct file *filp, struct tty_struct *tty)
{
	read_lock(&tasklist_lock);
	spin_lock_irq(&current->sighand->siglock);
	if (current->signal->leader &&
	    !current->signal->tty &&
	    tty->ctrl.session == NULL) {
		/*
		 * Don't let a process that only has write access to the tty
		 * obtain the privileges associated with having a tty as
		 * controlling terminal (being able to reopen it with full
		 * access through /dev/tty, being able to perform pushback).
		 * Many distributions set the group of all ttys to "tty" and
		 * grant write-only access to all terminals for setgid tty
		 * binaries, which should not imply full privileges on all ttys.
		 *
		 * This could theoretically break old code that performs open()
		 * on a write-only file descriptor. In that case, it might be
		 * necessary to also permit this if
		 * inode_permission(inode, MAY_READ) == 0.
*/ if (filp->f_mode & FMODE_READ) __proc_set_tty(tty); } spin_unlock_irq(&current->sighand->siglock); read_unlock(&tasklist_lock); } struct tty_struct *get_current_tty(void) { struct tty_struct *tty; unsigned long flags; spin_lock_irqsave(&current->sighand->siglock, flags); tty = tty_kref_get(current->signal->tty); spin_unlock_irqrestore(&current->sighand->siglock, flags); return tty; } EXPORT_SYMBOL_GPL(get_current_tty); /* * Called from tty_release(). */ void session_clear_tty(struct pid *session) { struct task_struct *p; do_each_pid_task(session, PIDTYPE_SID, p) { proc_clear_tty(p); } while_each_pid_task(session, PIDTYPE_SID, p); } /** * tty_signal_session_leader - sends SIGHUP to session leader * @tty: controlling tty * @exit_session: if non-zero, signal all foreground group processes * * Send SIGHUP and SIGCONT to the session leader and its process group. * Optionally, signal all processes in the foreground process group. * * Returns the number of processes in the session with this tty * as their controlling terminal. This value is used to drop * tty references for those processes. */ int tty_signal_session_leader(struct tty_struct *tty, int exit_session) { struct task_struct *p; int refs = 0; struct pid *tty_pgrp = NULL; read_lock(&tasklist_lock); if (tty->ctrl.session) { do_each_pid_task(tty->ctrl.session, PIDTYPE_SID, p) { spin_lock_irq(&p->sighand->siglock); if (p->signal->tty == tty) { p->signal->tty = NULL; /* * We defer the dereferences outside of * the tasklist lock.
*/ refs++; } if (!p->signal->leader) { spin_unlock_irq(&p->sighand->siglock); continue; } send_signal_locked(SIGHUP, SEND_SIG_PRIV, p, PIDTYPE_TGID); send_signal_locked(SIGCONT, SEND_SIG_PRIV, p, PIDTYPE_TGID); put_pid(p->signal->tty_old_pgrp); /* A noop */ spin_lock(&tty->ctrl.lock); tty_pgrp = get_pid(tty->ctrl.pgrp); if (tty->ctrl.pgrp) p->signal->tty_old_pgrp = get_pid(tty->ctrl.pgrp); spin_unlock(&tty->ctrl.lock); spin_unlock_irq(&p->sighand->siglock); } while_each_pid_task(tty->ctrl.session, PIDTYPE_SID, p); } read_unlock(&tasklist_lock); if (tty_pgrp) { if (exit_session) kill_pgrp(tty_pgrp, SIGHUP, exit_session); put_pid(tty_pgrp); } return refs; } /** * disassociate_ctty - disconnect controlling tty * @on_exit: true if exiting so need to "hang up" the session * * This function is typically called only by the session leader, when * it wants to disassociate itself from its controlling tty. * * It performs the following functions: * (1) Sends a SIGHUP and SIGCONT to the foreground process group * (2) Clears the tty from being controlling the session * (3) Clears the controlling tty for all processes in the * session group. * * The argument on_exit is set to 1 if called when a process is * exiting; it is 0 if called by the ioctl TIOCNOTTY. * * Locking: * BTM is taken for hysterical raisons, and held when * called from no_tty(). 
* tty_mutex is taken to protect tty * ->siglock is taken to protect ->signal/->sighand * tasklist_lock is taken to walk process list for sessions * ->siglock is taken to protect ->signal/->sighand */ void disassociate_ctty(int on_exit) { struct tty_struct *tty; if (!current->signal->leader) return; tty = get_current_tty(); if (tty) { if (on_exit && tty->driver->type != TTY_DRIVER_TYPE_PTY) { tty_vhangup_session(tty); } else { struct pid *tty_pgrp = tty_get_pgrp(tty); if (tty_pgrp) { kill_pgrp(tty_pgrp, SIGHUP, on_exit); if (!on_exit) kill_pgrp(tty_pgrp, SIGCONT, on_exit); put_pid(tty_pgrp); } } tty_kref_put(tty); } else if (on_exit) { struct pid *old_pgrp; spin_lock_irq(&current->sighand->siglock); old_pgrp = current->signal->tty_old_pgrp; current->signal->tty_old_pgrp = NULL; spin_unlock_irq(&current->sighand->siglock); if (old_pgrp) { kill_pgrp(old_pgrp, SIGHUP, on_exit); kill_pgrp(old_pgrp, SIGCONT, on_exit); put_pid(old_pgrp); } return; } tty = get_current_tty(); if (tty) { unsigned long flags; tty_lock(tty); spin_lock_irqsave(&tty->ctrl.lock, flags); put_pid(tty->ctrl.session); put_pid(tty->ctrl.pgrp); tty->ctrl.session = NULL; tty->ctrl.pgrp = NULL; spin_unlock_irqrestore(&tty->ctrl.lock, flags); tty_unlock(tty); tty_kref_put(tty); } /* If tty->ctrl.pgrp is not NULL, it may be assigned to * current->signal->tty_old_pgrp in a race condition, and * cause pid memleak. Release current->signal->tty_old_pgrp * after tty->ctrl.pgrp set to NULL. */ spin_lock_irq(&current->sighand->siglock); put_pid(current->signal->tty_old_pgrp); current->signal->tty_old_pgrp = NULL; spin_unlock_irq(&current->sighand->siglock); /* Now clear signal->tty under the lock */ read_lock(&tasklist_lock); session_clear_tty(task_session(current)); read_unlock(&tasklist_lock); } /* * * no_tty - Ensure the current process does not have a controlling tty */ void no_tty(void) { /* * FIXME: Review locking here.
The tty_lock never covered any race * between a new association and proc_clear_tty but possibly we need * to protect against this anyway. */ struct task_struct *tsk = current; disassociate_ctty(0); proc_clear_tty(tsk); } /** * tiocsctty - set controlling tty * @tty: tty structure * @file: file structure used to check permissions * @arg: user argument * * This ioctl is used to manage job control. It permits a session * leader to set this tty as the controlling tty for the session. * * Locking: * Takes tty_lock() to serialize proc_set_tty() for this tty * Takes tasklist_lock internally to walk sessions * Takes ->siglock() when updating signal->tty */ static int tiocsctty(struct tty_struct *tty, struct file *file, int arg) { int ret = 0; tty_lock(tty); read_lock(&tasklist_lock); if (current->signal->leader && task_session(current) == tty->ctrl.session) goto unlock; /* * The process must be a session leader and * not have a controlling tty already. */ if (!current->signal->leader || current->signal->tty) { ret = -EPERM; goto unlock; } if (tty->ctrl.session) { /* * This tty is already the controlling * tty for another session group! */ if (arg == 1 && capable(CAP_SYS_ADMIN)) { /* * Steal it away */ session_clear_tty(tty->ctrl.session); } else { ret = -EPERM; goto unlock; } } /* See the comment in tty_open_proc_set_tty(). */ if ((file->f_mode & FMODE_READ) == 0 && !capable(CAP_SYS_ADMIN)) { ret = -EPERM; goto unlock; } proc_set_tty(tty); unlock: read_unlock(&tasklist_lock); tty_unlock(tty); return ret; } /** * tty_get_pgrp - return a ref counted pgrp pid * @tty: tty to read * * Returns a refcounted instance of the pid struct for the process * group controlling the tty. 
*/ struct pid *tty_get_pgrp(struct tty_struct *tty) { unsigned long flags; struct pid *pgrp; spin_lock_irqsave(&tty->ctrl.lock, flags); pgrp = get_pid(tty->ctrl.pgrp); spin_unlock_irqrestore(&tty->ctrl.lock, flags); return pgrp; } EXPORT_SYMBOL_GPL(tty_get_pgrp); /* * This checks not only the pgrp, but falls back on the pid if no * satisfactory pgrp is found. I dunno - gdb doesn't work correctly * without this... * * The caller must hold rcu lock or the tasklist lock. */ static struct pid *session_of_pgrp(struct pid *pgrp) { struct task_struct *p; struct pid *sid = NULL; p = pid_task(pgrp, PIDTYPE_PGID); if (p == NULL) p = pid_task(pgrp, PIDTYPE_PID); if (p != NULL) sid = task_session(p); return sid; } /** * tiocgpgrp - get process group * @tty: tty passed by user * @real_tty: tty side of the tty passed by the user if a pty else the tty * @p: returned pid * * Obtain the process group of the tty. If there is no process group * return an error. * * Locking: none. Reference to current->signal->tty is safe. */ static int tiocgpgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p) { struct pid *pid; int ret; /* * (tty == real_tty) is a cheap way of * testing if the tty is NOT a master pty. */ if (tty == real_tty && current->signal->tty != real_tty) return -ENOTTY; pid = tty_get_pgrp(real_tty); ret = put_user(pid_vnr(pid), p); put_pid(pid); return ret; } /** * tiocspgrp - attempt to set process group * @tty: tty passed by user * @real_tty: tty side device matching tty passed by user * @p: pid pointer * * Set the process group of the tty to the session passed. Only * permitted where the tty session is our session. 
* * Locking: RCU, ctrl lock */ static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p) { struct pid *pgrp; pid_t pgrp_nr; int retval = tty_check_change(real_tty); if (retval == -EIO) return -ENOTTY; if (retval) return retval; if (get_user(pgrp_nr, p)) return -EFAULT; if (pgrp_nr < 0) return -EINVAL; spin_lock_irq(&real_tty->ctrl.lock); if (!current->signal->tty || (current->signal->tty != real_tty) || (real_tty->ctrl.session != task_session(current))) { retval = -ENOTTY; goto out_unlock_ctrl; } rcu_read_lock(); pgrp = find_vpid(pgrp_nr); retval = -ESRCH; if (!pgrp) goto out_unlock; retval = -EPERM; if (session_of_pgrp(pgrp) != task_session(current)) goto out_unlock; retval = 0; put_pid(real_tty->ctrl.pgrp); real_tty->ctrl.pgrp = get_pid(pgrp); out_unlock: rcu_read_unlock(); out_unlock_ctrl: spin_unlock_irq(&real_tty->ctrl.lock); return retval; } /** * tiocgsid - get session id * @tty: tty passed by user * @real_tty: tty side of the tty passed by the user if a pty else the tty * @p: pointer to returned session id * * Obtain the session id of the tty. If there is no session * return an error. */ static int tiocgsid(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p) { unsigned long flags; pid_t sid; /* * (tty == real_tty) is a cheap way of * testing if the tty is NOT a master pty. */ if (tty == real_tty && current->signal->tty != real_tty) return -ENOTTY; spin_lock_irqsave(&real_tty->ctrl.lock, flags); if (!real_tty->ctrl.session) goto err; sid = pid_vnr(real_tty->ctrl.session); spin_unlock_irqrestore(&real_tty->ctrl.lock, flags); return put_user(sid, p); err: spin_unlock_irqrestore(&real_tty->ctrl.lock, flags); return -ENOTTY; } /* * Called from tty_ioctl(). If tty is a pty then real_tty is the slave side, * if not then tty == real_tty. 
*/ long tty_jobctrl_ioctl(struct tty_struct *tty, struct tty_struct *real_tty, struct file *file, unsigned int cmd, unsigned long arg) { void __user *p = (void __user *)arg; switch (cmd) { case TIOCNOTTY: if (current->signal->tty != tty) return -ENOTTY; no_tty(); return 0; case TIOCSCTTY: return tiocsctty(real_tty, file, arg); case TIOCGPGRP: return tiocgpgrp(tty, real_tty, p); case TIOCSPGRP: return tiocspgrp(tty, real_tty, p); case TIOCGSID: return tiocgsid(tty, real_tty, p); } return -ENOIOCTLCMD; } |
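The background-access decision made by __tty_check_change() in the file above can be modeled in userspace as a pure function. The sketch below is illustrative only: `struct check_input` and `model_tty_check_change()` are hypothetical names, the kernel's locking and signal delivery are omitted, and `ERESTARTSYS` is defined locally because it is kernel-internal and not exported through userspace `<errno.h>`.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Kernel-internal errno; 512 mirrors the kernel value, used here for illustration. */
#ifndef ERESTARTSYS
#define ERESTARTSYS 512
#endif

/* Hypothetical snapshot of the state that __tty_check_change() inspects. */
struct check_input {
	bool is_ctty;        /* current->signal->tty == tty */
	bool has_tty_pgrp;   /* tty->ctrl.pgrp != NULL */
	bool same_pgrp;      /* task_pgrp(current) == tty->ctrl.pgrp */
	bool sig_ignored;    /* result of is_ignored(sig) */
	bool pgrp_orphaned;  /* result of is_current_pgrp_orphaned() */
};

/*
 * Returns the value the kernel function would return for this state.
 * In the -ERESTARTSYS case the kernel additionally sends sig to the
 * caller's process group; this model only reports the return code.
 */
static int model_tty_check_change(const struct check_input *in, int sig)
{
	assert(in != NULL);

	/* Not our controlling tty: no job-control check applies. */
	if (!in->is_ctty)
		return 0;

	/* No foreground group recorded, or caller is in the foreground. */
	if (!in->has_tty_pgrp || in->same_pgrp)
		return 0;

	/* Background caller that ignores the signal: only reads fail. */
	if (in->sig_ignored)
		return sig == SIGTTIN ? -EIO : 0;

	/* Orphaned background group cannot be stopped, so fail the access. */
	if (in->pgrp_orphaned)
		return -EIO;

	/* Otherwise the group is signalled and the syscall is restarted. */
	return -ERESTARTSYS;
}
```

A caller would fill in a `struct check_input` and pass `SIGTTOU` for write-like operations, matching what tty_check_change() does, or `SIGTTIN` for reads.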
/* SPDX-License-Identifier: GPL-2.0 */ /* * Generic RTC interface. * This version contains the part of the user interface to the Real Time Clock * service. It is used with both the legacy mc146818 and also EFI * Struct rtc_time and first 12 ioctl by Paul Gortmaker, 1996 - separated out * from <linux/mc146818rtc.h> to this file for 2.4 kernels. * * Copyright (C) 1999 Hewlett-Packard Co.
* Copyright (C) 1999 Stephane Eranian <eranian@hpl.hp.com> */ #ifndef _LINUX_RTC_H_ #define _LINUX_RTC_H_ #include <linux/types.h> #include <linux/interrupt.h> #include <linux/nvmem-provider.h> #include <uapi/linux/rtc.h> extern int rtc_month_days(unsigned int month, unsigned int year); extern int rtc_year_days(unsigned int day, unsigned int month, unsigned int year); extern int rtc_valid_tm(struct rtc_time *tm); extern time64_t rtc_tm_to_time64(struct rtc_time *tm); extern void rtc_time64_to_tm(time64_t time, struct rtc_time *tm); ktime_t rtc_tm_to_ktime(struct rtc_time tm); struct rtc_time rtc_ktime_to_tm(ktime_t kt); /* * rtc_tm_sub - Return the difference in seconds. */ static inline time64_t rtc_tm_sub(struct rtc_time *lhs, struct rtc_time *rhs) { return rtc_tm_to_time64(lhs) - rtc_tm_to_time64(rhs); } #include <linux/device.h> #include <linux/seq_file.h> #include <linux/cdev.h> #include <linux/poll.h> #include <linux/mutex.h> #include <linux/timerqueue.h> #include <linux/workqueue.h> extern const struct class rtc_class; /* * For these RTC methods the device parameter is the physical device * on whatever bus holds the hardware (I2C, Platform, SPI, etc), which * was passed to rtc_device_register(). Its driver_data normally holds * device state, including the rtc_device pointer for the RTC. * * Most of these methods are called with rtc_device.ops_lock held, * through the rtc_*(struct rtc_device *, ...) calls. 
* * The (current) exceptions are mostly filesystem hooks: * - the proc() hook for procfs */ struct rtc_class_ops { int (*ioctl)(struct device *, unsigned int, unsigned long); int (*read_time)(struct device *, struct rtc_time *); int (*set_time)(struct device *, struct rtc_time *); int (*read_alarm)(struct device *, struct rtc_wkalrm *); int (*set_alarm)(struct device *, struct rtc_wkalrm *); int (*proc)(struct device *, struct seq_file *); int (*alarm_irq_enable)(struct device *, unsigned int enabled); int (*read_offset)(struct device *, long *offset); int (*set_offset)(struct device *, long offset); int (*param_get)(struct device *, struct rtc_param *param); int (*param_set)(struct device *, struct rtc_param *param); }; struct rtc_device; struct rtc_timer { struct timerqueue_node node; ktime_t period; void (*func)(struct rtc_device *rtc); struct rtc_device *rtc; int enabled; }; /* flags */ #define RTC_DEV_BUSY 0 #define RTC_NO_CDEV 1 struct rtc_device { struct device dev; struct module *owner; int id; const struct rtc_class_ops *ops; struct mutex ops_lock; struct cdev char_dev; unsigned long flags; unsigned long irq_data; spinlock_t irq_lock; wait_queue_head_t irq_queue; struct fasync_struct *async_queue; int irq_freq; int max_user_freq; struct timerqueue_head timerqueue; struct rtc_timer aie_timer; struct rtc_timer uie_rtctimer; struct hrtimer pie_timer; /* sub second exp, so needs hrtimer */ int pie_enabled; struct work_struct irqwork; /* * This offset specifies the update timing of the RTC. * * tsched t1 write(t2.tv_sec - 1sec)) t2 RTC increments seconds * * The offset defines how tsched is computed so that the write to * the RTC (t2.tv_sec - 1sec) is correct versus the time required * for the transport of the write and the time which the RTC needs * to increment seconds the first time after the write (t2). * * For direct accessible RTCs tsched ~= t1 because the write time * is negligible. 
For RTCs behind slow busses the transport time is * significant and has to be taken into account. * * The time between the write (t1) and the first increment after * the write (t2) is RTC specific. For a MC146818 RTC it's 500ms, * for many others it's exactly 1 second. Consult the datasheet. * * The value of this offset is also used to calculate the to be * written value (t2.tv_sec - 1sec) at tsched. * * The default value for this is NSEC_PER_SEC + 10 msec default * transport time. The offset can be adjusted by drivers so the * calculation for the to be written value at tsched becomes * correct: * * newval = tsched + set_offset_nsec - NSEC_PER_SEC * and (tsched + set_offset_nsec) % NSEC_PER_SEC == 0 */ unsigned long set_offset_nsec; unsigned long features[BITS_TO_LONGS(RTC_FEATURE_CNT)]; time64_t range_min; timeu64_t range_max; timeu64_t alarm_offset_max; time64_t start_secs; time64_t offset_secs; bool set_start_time; #ifdef CONFIG_RTC_INTF_DEV_UIE_EMUL struct work_struct uie_task; struct timer_list uie_timer; /* Those fields are protected by rtc->irq_lock */ unsigned int oldsecs; unsigned int uie_irq_active:1; unsigned int stop_uie_polling:1; unsigned int uie_task_active:1; unsigned int uie_timer_active:1; #endif }; #define to_rtc_device(d) container_of(d, struct rtc_device, dev) #define rtc_lock(d) mutex_lock(&d->ops_lock) #define rtc_unlock(d) mutex_unlock(&d->ops_lock) /* useful timestamps */ #define RTC_TIMESTAMP_BEGIN_0000 -62167219200ULL /* 0000-01-01 00:00:00 */ #define RTC_TIMESTAMP_BEGIN_1900 -2208988800LL /* 1900-01-01 00:00:00 */ #define RTC_TIMESTAMP_EPOCH_GPS 315964800LL /* 1980-01-06 00:00:00 */ #define RTC_TIMESTAMP_BEGIN_2000 946684800LL /* 2000-01-01 00:00:00 */ #define RTC_TIMESTAMP_END_2063 2966371199LL /* 2063-12-31 23:59:59 */ #define RTC_TIMESTAMP_END_2079 3471292799LL /* 2079-12-31 23:59:59 */ #define RTC_TIMESTAMP_END_2099 4102444799LL /* 2099-12-31 23:59:59 */ #define RTC_TIMESTAMP_END_2199 7258118399LL /* 2199-12-31 23:59:59 */ #define 
RTC_TIMESTAMP_END_9999 253402300799LL /* 9999-12-31 23:59:59 */ extern struct rtc_device *devm_rtc_device_register(struct device *dev, const char *name, const struct rtc_class_ops *ops, struct module *owner); struct rtc_device *devm_rtc_allocate_device(struct device *dev); int __devm_rtc_register_device(struct module *owner, struct rtc_device *rtc); extern int rtc_read_time(struct rtc_device *rtc, struct rtc_time *tm); extern int rtc_set_time(struct rtc_device *rtc, struct rtc_time *tm); int __rtc_read_alarm(struct rtc_device *rtc, struct rtc_wkalrm *alarm); extern int rtc_read_alarm(struct rtc_device *rtc, struct rtc_wkalrm *alrm); extern int rtc_set_alarm(struct rtc_device *rtc, struct rtc_wkalrm *alrm); extern int rtc_initialize_alarm(struct rtc_device *rtc, struct rtc_wkalrm *alrm); extern void rtc_update_irq(struct rtc_device *rtc, unsigned long num, unsigned long events); extern struct rtc_device *rtc_class_open(const char *name); extern void rtc_class_close(struct rtc_device *rtc); extern int rtc_irq_set_state(struct rtc_device *rtc, int enabled); extern int rtc_irq_set_freq(struct rtc_device *rtc, int freq); extern int rtc_update_irq_enable(struct rtc_device *rtc, unsigned int enabled); extern int rtc_alarm_irq_enable(struct rtc_device *rtc, unsigned int enabled); extern int rtc_dev_update_irq_enable_emul(struct rtc_device *rtc, unsigned int enabled); void rtc_handle_legacy_irq(struct rtc_device *rtc, int num, int mode); void rtc_aie_update_irq(struct rtc_device *rtc); void rtc_uie_update_irq(struct rtc_device *rtc); enum hrtimer_restart rtc_pie_update_irq(struct hrtimer *timer); void rtc_timer_init(struct rtc_timer *timer, void (*f)(struct rtc_device *r), struct rtc_device *rtc); int rtc_timer_start(struct rtc_device *rtc, struct rtc_timer *timer, ktime_t expires, ktime_t period); void rtc_timer_cancel(struct rtc_device *rtc, struct rtc_timer *timer); int rtc_read_offset(struct rtc_device *rtc, long *offset); int rtc_set_offset(struct rtc_device *rtc, long 
offset); void rtc_timer_do_work(struct work_struct *work); static inline bool is_leap_year(unsigned int year) { return (!(year % 4) && (year % 100)) || !(year % 400); } /** * rtc_bound_alarmtime() - Return alarm time bound by rtc limit * @rtc: Pointer to rtc device structure * @requested: Requested alarm timeout * * Return: Alarm timeout bound by maximum alarm time supported by rtc. */ static inline ktime_t rtc_bound_alarmtime(struct rtc_device *rtc, ktime_t requested) { if (rtc->alarm_offset_max && rtc->alarm_offset_max * MSEC_PER_SEC < ktime_to_ms(requested)) return ms_to_ktime(rtc->alarm_offset_max * MSEC_PER_SEC); return requested; } #define devm_rtc_register_device(device) \ __devm_rtc_register_device(THIS_MODULE, device) #ifdef CONFIG_RTC_HCTOSYS_DEVICE extern int rtc_hctosys_ret; #else #define rtc_hctosys_ret -ENODEV #endif #ifdef CONFIG_RTC_NVMEM int devm_rtc_nvmem_register(struct rtc_device *rtc, struct nvmem_config *nvmem_config); #else static inline int devm_rtc_nvmem_register(struct rtc_device *rtc, struct nvmem_config *nvmem_config) { return 0; } #endif #ifdef CONFIG_RTC_INTF_SYSFS int rtc_add_group(struct rtc_device *rtc, const struct attribute_group *grp); int rtc_add_groups(struct rtc_device *rtc, const struct attribute_group **grps); #else static inline int rtc_add_group(struct rtc_device *rtc, const struct attribute_group *grp) { return 0; } static inline int rtc_add_groups(struct rtc_device *rtc, const struct attribute_group **grps) { return 0; } #endif #endif /* _LINUX_RTC_H_ */ |
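The set_offset_nsec comment in struct rtc_device above gives the rule newval = tsched + set_offset_nsec - NSEC_PER_SEC, valid when (tsched + set_offset_nsec) % NSEC_PER_SEC == 0. The following userspace sketch works that arithmetic through under stated assumptions: `rtc_value_to_write()` is a hypothetical helper, times are plain nanosecond counts in `int64_t`, and the result is returned in whole seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/*
 * Hypothetical helper: given the scheduled write time tsched_ns and the
 * driver's set_offset_nsec, compute the seconds value to write to the RTC.
 *
 * Returns false when tsched_ns + set_offset_nsec does not land on a
 * full-second boundary, i.e. when the precondition from the header
 * comment is not met.
 */
static bool rtc_value_to_write(int64_t tsched_ns, int64_t set_offset_nsec,
			       int64_t *newval_sec)
{
	/* t2: the moment the RTC first increments its seconds after the write. */
	int64_t t2_ns = tsched_ns + set_offset_nsec;

	assert(newval_sec != NULL);

	if (t2_ns % NSEC_PER_SEC != 0)
		return false;

	/* The value written is one second before t2 (t2.tv_sec - 1sec). */
	*newval_sec = (t2_ns - NSEC_PER_SEC) / NSEC_PER_SEC;
	return true;
}
```

For example, with an offset of 500 ms and a write scheduled at 99.5 s, t2 is 100 s and the value to write is 99 s; scheduling at 99.6 s instead violates the boundary condition and the helper reports failure.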
1 1 1 1 1 1 4 8 3 3 8 8 3 3 1 2 2 2 4 4 4 1 4 3 3 3 3 3 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 1 3 3 3 1 1 1 1 1 3 3 3 3 3 2 2 1 1 2 2 8 8 3 8 4 4 4 4 4 1 1 4 1 4 20 8 6 8 2 2 20 20 20 20 1 1 2 1 1 1 1 1 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 
471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 
971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 
1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 
1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 
2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * Copyright(c) 1999 - 2004 Intel Corporation. All rights reserved.
 */

#include <linux/skbuff.h>
#include <linux/if_ether.h>
#include <linux/netdevice.h>
#include <linux/spinlock.h>
#include <linux/ethtool.h>
#include <linux/etherdevice.h>
#include <linux/if_bonding.h>
#include <linux/pkt_sched.h>
#include <net/net_namespace.h>
#include <net/bonding.h>
#include <net/bond_3ad.h>
#include <net/netlink.h>

/* General definitions */
#define AD_SHORT_TIMEOUT           1
#define AD_LONG_TIMEOUT            0
#define AD_STANDBY                 0x2
#define AD_MAX_TX_IN_SECOND        3
#define AD_COLLECTOR_MAX_DELAY     0

/* Timer definitions (43.4.4 in the 802.3ad standard) */
#define AD_FAST_PERIODIC_TIME      1
#define AD_SLOW_PERIODIC_TIME      30
#define AD_SHORT_TIMEOUT_TIME      (3*AD_FAST_PERIODIC_TIME)
#define AD_LONG_TIMEOUT_TIME       (3*AD_SLOW_PERIODIC_TIME)
#define AD_CHURN_DETECTION_TIME    60
#define AD_AGGREGATE_WAIT_TIME     2

/* Port Variables definitions used by the State Machines (43.4.7 in the
 * 802.3ad standard)
 */
#define AD_PORT_BEGIN           0x1
#define AD_PORT_LACP_ENABLED    0x2
#define AD_PORT_ACTOR_CHURN     0x4
#define AD_PORT_PARTNER_CHURN   0x8
#define AD_PORT_READY           0x10
#define AD_PORT_READY_N         0x20
#define AD_PORT_MATCHED         0x40
#define AD_PORT_STANDBY         0x80
#define AD_PORT_SELECTED        0x100
#define AD_PORT_MOVED           0x200
#define AD_PORT_CHURNED         (AD_PORT_ACTOR_CHURN | AD_PORT_PARTNER_CHURN)

/* Port Key definitions
 * key is determined according to the link speed, duplex and
 * user key (which is yet not supported)
 *           --------------------------------------------------------------
 * Port key  | User key (10 bits)           | Speed (5 bits)      | Duplex|
 *           --------------------------------------------------------------
 *           |15                           6|5                   1|0
 */
#define  AD_DUPLEX_KEY_MASKS    0x1
#define  AD_SPEED_KEY_MASKS     0x3E
#define  AD_USER_KEY_MASKS      0xFFC0

enum ad_link_speed_type {
	AD_LINK_SPEED_1MBPS = 1,
	AD_LINK_SPEED_10MBPS,
	AD_LINK_SPEED_100MBPS,
	AD_LINK_SPEED_1000MBPS,
	AD_LINK_SPEED_2500MBPS,
	AD_LINK_SPEED_5000MBPS,
	AD_LINK_SPEED_10000MBPS,
	AD_LINK_SPEED_14000MBPS,
	AD_LINK_SPEED_20000MBPS,
	AD_LINK_SPEED_25000MBPS,
	AD_LINK_SPEED_40000MBPS,
	AD_LINK_SPEED_50000MBPS,
	AD_LINK_SPEED_56000MBPS,
	AD_LINK_SPEED_100000MBPS,
	AD_LINK_SPEED_200000MBPS,
	AD_LINK_SPEED_400000MBPS,
	AD_LINK_SPEED_800000MBPS,
};

/* compare MAC addresses */
#define MAC_ADDRESS_EQUAL(A, B)	\
	ether_addr_equal_64bits((const u8 *)A, (const u8 *)B)

static const u16 ad_ticks_per_sec = 1000 / AD_TIMER_INTERVAL;
static const int ad_delta_in_ticks = (AD_TIMER_INTERVAL * HZ) / 1000;

const u8 lacpdu_mcast_addr[ETH_ALEN + 2] __long_aligned = {
	0x01, 0x80, 0xC2, 0x00, 0x00, 0x02
};

/* ================= main 802.3ad protocol functions ================== */
static int ad_lacpdu_send(struct port *port);
static int ad_marker_send(struct port *port, struct bond_marker *marker);
static void ad_mux_machine(struct port *port, bool *update_slave_arr);
static void ad_rx_machine(struct lacpdu *lacpdu, struct port *port);
static void ad_tx_machine(struct port *port);
static void ad_periodic_machine(struct port *port, struct bond_params *bond_params);
static void ad_port_selection_logic(struct port *port, bool *update_slave_arr);
static void ad_agg_selection_logic(struct aggregator *aggregator,
				   bool *update_slave_arr);
static void ad_clear_agg(struct aggregator *aggregator);
static void ad_initialize_agg(struct aggregator *aggregator);
static void ad_initialize_port(struct port *port, int lacp_fast);
static void ad_enable_collecting(struct port *port);
static void ad_disable_distributing(struct port *port,
				    bool *update_slave_arr);
static void ad_enable_collecting_distributing(struct port *port,
					      bool *update_slave_arr);
static void ad_disable_collecting_distributing(struct port *port,
					       bool *update_slave_arr);
static void ad_marker_info_received(struct bond_marker *marker_info,
				    struct port *port);
static void ad_marker_response_received(struct bond_marker *marker,
					struct port *port);
static void ad_update_actor_keys(struct port *port, bool reset);


/* ================= api to bonding and kernel code ================== */

/**
 * __get_bond_by_port - get the port's
 * bonding struct
 * @port: the port we're looking at
 *
 * Return @port's bonding struct, or %NULL if it can't be found.
 */
static inline struct bonding *__get_bond_by_port(struct port *port)
{
	if (port->slave == NULL)
		return NULL;

	return bond_get_bond_by_slave(port->slave);
}

/**
 * __get_first_agg - get the first aggregator in the bond
 * @port: the port we're looking at
 *
 * Return the aggregator of the first slave in @port's bond, or %NULL if it
 * can't be found.
 * The caller must hold RCU or RTNL lock.
 */
static inline struct aggregator *__get_first_agg(struct port *port)
{
	struct bonding *bond = __get_bond_by_port(port);
	struct slave *first_slave;
	struct aggregator *agg;

	/* If there's no bond for this port, or bond has no slaves */
	if (bond == NULL)
		return NULL;

	rcu_read_lock();
	first_slave = bond_first_slave_rcu(bond);
	agg = first_slave ? &(SLAVE_AD_INFO(first_slave)->aggregator) : NULL;
	rcu_read_unlock();

	return agg;
}

/**
 * __agg_has_partner - see if we have a partner
 * @agg: the aggregator we're looking at
 *
 * Return nonzero if aggregator has a partner (denoted by a non-zero ether
 * address for the partner). Return 0 if not.
 */
static inline int __agg_has_partner(struct aggregator *agg)
{
	return !is_zero_ether_addr(agg->partner_system.mac_addr_value);
}

/**
 * __disable_distributing_port - disable the port's slave for distributing.
 * Port will still be able to collect.
 * @port: the port we're looking at
 *
 * This will disable only distributing on the port's slave.
 */
static void __disable_distributing_port(struct port *port)
{
	bond_set_slave_tx_disabled_flags(port->slave, BOND_SLAVE_NOTIFY_LATER);
}

/**
 * __enable_collecting_port - enable the port's slave for collecting,
 * if it's up
 * @port: the port we're looking at
 *
 * This will enable only collecting on the port's slave.
 */
static void __enable_collecting_port(struct port *port)
{
	struct slave *slave = port->slave;

	if (slave->link == BOND_LINK_UP && bond_slave_is_up(slave))
		bond_set_slave_rx_enabled_flags(slave, BOND_SLAVE_NOTIFY_LATER);
}

/**
 * __disable_port - disable the port's slave
 * @port: the port we're looking at
 *
 * This will disable both collecting and distributing on the port's slave.
 */
static inline void __disable_port(struct port *port)
{
	bond_set_slave_inactive_flags(port->slave, BOND_SLAVE_NOTIFY_LATER);
}

/**
 * __enable_port - enable the port's slave, if it's up
 * @port: the port we're looking at
 *
 * This will enable both collecting and distributing on the port's slave.
 */
static inline void __enable_port(struct port *port)
{
	struct slave *slave = port->slave;

	if ((slave->link == BOND_LINK_UP) && bond_slave_is_up(slave))
		bond_set_slave_active_flags(slave, BOND_SLAVE_NOTIFY_LATER);
}

/**
 * __port_move_to_attached_state - check if port should transition back to attached
 * state.
 * @port: the port we're looking at
 */
static bool __port_move_to_attached_state(struct port *port)
{
	if (!(port->sm_vars & AD_PORT_SELECTED) ||
	    (port->sm_vars & AD_PORT_STANDBY) ||
	    !(port->partner_oper.port_state & LACP_STATE_SYNCHRONIZATION) ||
	    !(port->actor_oper_port_state & LACP_STATE_SYNCHRONIZATION))
		port->sm_mux_state = AD_MUX_ATTACHED;

	return port->sm_mux_state == AD_MUX_ATTACHED;
}

/**
 * __port_is_collecting_distributing - check if the port's slave is in the
 * combined collecting/distributing state
 * @port: the port we're looking at
 */
static int __port_is_collecting_distributing(struct port *port)
{
	return bond_is_active_slave(port->slave);
}

/**
 * __get_agg_selection_mode - get the aggregator selection mode
 * @port: the port we're looking at
 *
 * Get the aggregator selection mode. Can be %STABLE, %BANDWIDTH or %COUNT.
 */
static inline u32 __get_agg_selection_mode(struct port *port)
{
	struct bonding *bond = __get_bond_by_port(port);

	if (bond == NULL)
		return BOND_AD_STABLE;

	return bond->params.ad_select;
}

/**
 * __check_agg_selection_timer - check if the selection timer has expired
 * @port: the port we're looking at
 */
static inline int __check_agg_selection_timer(struct port *port)
{
	struct bonding *bond = __get_bond_by_port(port);

	if (bond == NULL)
		return 0;

	return atomic_read(&BOND_AD_INFO(bond).agg_select_timer) ? 1 : 0;
}

/**
 * __get_link_speed - get a port's speed
 * @port: the port we're looking at
 *
 * Return @port's speed in 802.3ad enum format. i.e. one of:
 *     0,
 *     %AD_LINK_SPEED_10MBPS,
 *     %AD_LINK_SPEED_100MBPS,
 *     %AD_LINK_SPEED_1000MBPS,
 *     %AD_LINK_SPEED_2500MBPS,
 *     %AD_LINK_SPEED_5000MBPS,
 *     %AD_LINK_SPEED_10000MBPS
 *     %AD_LINK_SPEED_14000MBPS,
 *     %AD_LINK_SPEED_20000MBPS
 *     %AD_LINK_SPEED_25000MBPS
 *     %AD_LINK_SPEED_40000MBPS
 *     %AD_LINK_SPEED_50000MBPS
 *     %AD_LINK_SPEED_56000MBPS
 *     %AD_LINK_SPEED_100000MBPS
 *     %AD_LINK_SPEED_200000MBPS
 *     %AD_LINK_SPEED_400000MBPS
 *     %AD_LINK_SPEED_800000MBPS
 */
static u16 __get_link_speed(struct port *port)
{
	struct slave *slave = port->slave;
	u16 speed;

	/* this if covers only a special case: when the configuration starts
	 * with link down, it sets the speed to 0.
	 * This is done in spite of the fact that the e100 driver reports 0
	 * to be compatible with MVT in the future.
	 */
	if (slave->link != BOND_LINK_UP)
		speed = 0;
	else {
		switch (slave->speed) {
		case SPEED_10:
			speed = AD_LINK_SPEED_10MBPS;
			break;

		case SPEED_100:
			speed = AD_LINK_SPEED_100MBPS;
			break;

		case SPEED_1000:
			speed = AD_LINK_SPEED_1000MBPS;
			break;

		case SPEED_2500:
			speed = AD_LINK_SPEED_2500MBPS;
			break;

		case SPEED_5000:
			speed = AD_LINK_SPEED_5000MBPS;
			break;

		case SPEED_10000:
			speed = AD_LINK_SPEED_10000MBPS;
			break;

		case SPEED_14000:
			speed = AD_LINK_SPEED_14000MBPS;
			break;

		case SPEED_20000:
			speed = AD_LINK_SPEED_20000MBPS;
			break;

		case SPEED_25000:
			speed = AD_LINK_SPEED_25000MBPS;
			break;

		case SPEED_40000:
			speed = AD_LINK_SPEED_40000MBPS;
			break;

		case SPEED_50000:
			speed = AD_LINK_SPEED_50000MBPS;
			break;

		case SPEED_56000:
			speed = AD_LINK_SPEED_56000MBPS;
			break;

		case SPEED_100000:
			speed = AD_LINK_SPEED_100000MBPS;
			break;

		case SPEED_200000:
			speed = AD_LINK_SPEED_200000MBPS;
			break;

		case SPEED_400000:
			speed = AD_LINK_SPEED_400000MBPS;
			break;

		case SPEED_800000:
			speed = AD_LINK_SPEED_800000MBPS;
			break;

		default:
			/* unknown speed value from ethtool. shouldn't happen */
			if (slave->speed != SPEED_UNKNOWN)
				pr_err_once("%s: (slave %s): unknown ethtool speed (%d) for port %d (set it to 0)\n",
					    slave->bond->dev->name,
					    slave->dev->name, slave->speed,
					    port->actor_port_number);
			speed = 0;
			break;
		}
	}

	slave_dbg(slave->bond->dev, slave->dev, "Port %d Received link speed %d update from adapter\n",
		  port->actor_port_number, speed);
	return speed;
}

/**
 * __get_duplex - get a port's duplex
 * @port: the port we're looking at
 *
 * Return @port's duplex in 802.3ad bitmask format. i.e.:
 *     0x01 if in full duplex
 *     0x00 otherwise
 */
static u8 __get_duplex(struct port *port)
{
	struct slave *slave = port->slave;
	u8 retval = 0x0;

	/* handling a special case: when the configuration starts with
	 * link down, it sets the duplex to 0.
	 */
	if (slave->link == BOND_LINK_UP) {
		switch (slave->duplex) {
		case DUPLEX_FULL:
			retval = 0x1;
			slave_dbg(slave->bond->dev, slave->dev, "Port %d Received status full duplex update from adapter\n",
				  port->actor_port_number);
			break;
		case DUPLEX_HALF:
		default:
			retval = 0x0;
			slave_dbg(slave->bond->dev, slave->dev, "Port %d Received status NOT full duplex update from adapter\n",
				  port->actor_port_number);
			break;
		}
	}
	return retval;
}

static void __ad_actor_update_port(struct port *port)
{
	const struct bonding *bond = bond_get_bond_by_slave(port->slave);

	port->actor_system = BOND_AD_INFO(bond).system.sys_mac_addr;
	port->actor_system_priority = BOND_AD_INFO(bond).system.sys_priority;
}

/* Conversions */

/**
 * __ad_timer_to_ticks - convert a given timer type to AD module ticks
 * @timer_type:	which timer to operate
 * @par: timer parameter. see below
 *
 * If @timer_type is %current_while_timer, @par indicates long/short timer.
 * If @timer_type is %periodic_timer, @par is one of %FAST_PERIODIC_TIME,
 *						     %SLOW_PERIODIC_TIME.
 */
static u16 __ad_timer_to_ticks(u16 timer_type, u16 par)
{
	u16 retval = 0; /* to silence the compiler */

	switch (timer_type) {
	case AD_CURRENT_WHILE_TIMER:	/* for rx machine usage */
		if (par)
			retval = (AD_SHORT_TIMEOUT_TIME*ad_ticks_per_sec);
		else
			retval = (AD_LONG_TIMEOUT_TIME*ad_ticks_per_sec);
		break;
	case AD_ACTOR_CHURN_TIMER:	/* for local churn machine */
		retval = (AD_CHURN_DETECTION_TIME*ad_ticks_per_sec);
		break;
	case AD_PERIODIC_TIMER:		/* for periodic machine */
		retval = (par*ad_ticks_per_sec); /* long timeout */
		break;
	case AD_PARTNER_CHURN_TIMER:	/* for remote churn machine */
		retval = (AD_CHURN_DETECTION_TIME*ad_ticks_per_sec);
		break;
	case AD_WAIT_WHILE_TIMER:	/* for selection machine */
		retval = (AD_AGGREGATE_WAIT_TIME*ad_ticks_per_sec);
		break;
	}

	return retval;
}


/* ================= ad_rx_machine helper functions ================== */

/**
 * __choose_matched - update a port's matched variable from a received lacpdu
 * @lacpdu: the lacpdu we've received
 * @port: the port we're looking at
 *
 * Update the value of the matched variable, using parameter values from a
 * newly received lacpdu. Parameter values for the partner carried in the
 * received PDU are compared with the corresponding operational parameter
 * values for the actor. Matched is set to TRUE if all of these parameters
 * match and the PDU parameter partner_state.aggregation has the same value as
 * actor_oper_port_state.aggregation and lacp will actively maintain the link
 * in the aggregation. Matched is also set to TRUE if the value of
 * actor_state.aggregation in the received PDU is set to FALSE, i.e., indicates
 * an individual link and lacp will actively maintain the link. Otherwise,
 * matched is set to FALSE. LACP is considered to be actively maintaining the
 * link if either the PDU's actor_state.lacp_activity variable is TRUE or both
 * the actor's actor_oper_port_state.lacp_activity and the PDU's
 * partner_state.lacp_activity variables are TRUE.
 *
 * Note: the AD_PORT_MATCHED "variable" is not specified by 802.3ad; it is
 * used here to implement the language from 802.3ad 43.4.9 that requires
 * recordPDU to "match" the LACPDU parameters to the stored values.
 */
static void __choose_matched(struct lacpdu *lacpdu, struct port *port)
{
	/* check if all parameters are alike
	 * or this is individual link(aggregation == FALSE)
	 * then update the state machine Matched variable.
	 */
	if (((ntohs(lacpdu->partner_port) == port->actor_port_number) &&
	     (ntohs(lacpdu->partner_port_priority) == port->actor_port_priority) &&
	     MAC_ADDRESS_EQUAL(&(lacpdu->partner_system), &(port->actor_system)) &&
	     (ntohs(lacpdu->partner_system_priority) == port->actor_system_priority) &&
	     (ntohs(lacpdu->partner_key) == port->actor_oper_port_key) &&
	     ((lacpdu->partner_state & LACP_STATE_AGGREGATION) == (port->actor_oper_port_state & LACP_STATE_AGGREGATION))) ||
	    ((lacpdu->actor_state & LACP_STATE_AGGREGATION) == 0)
		) {
		port->sm_vars |= AD_PORT_MATCHED;
	} else {
		port->sm_vars &= ~AD_PORT_MATCHED;
	}
}

/**
 * __record_pdu - record parameters from a received lacpdu
 * @lacpdu: the lacpdu we've received
 * @port: the port we're looking at
 *
 * Record the parameter values for the Actor carried in a received lacpdu as
 * the current partner operational parameter values and sets
 * actor_oper_port_state.defaulted to FALSE.
 */
static void __record_pdu(struct lacpdu *lacpdu, struct port *port)
{
	if (lacpdu && port) {
		struct port_params *partner = &port->partner_oper;

		__choose_matched(lacpdu, port);
		/* record the new parameter values for the partner
		 * operational
		 */
		partner->port_number = ntohs(lacpdu->actor_port);
		partner->port_priority = ntohs(lacpdu->actor_port_priority);
		partner->system = lacpdu->actor_system;
		partner->system_priority = ntohs(lacpdu->actor_system_priority);
		partner->key = ntohs(lacpdu->actor_key);
		partner->port_state = lacpdu->actor_state;

		/* set actor_oper_port_state.defaulted to FALSE */
		port->actor_oper_port_state &= ~LACP_STATE_DEFAULTED;

		/* set the partner sync. to on if the partner is sync,
		 * and the port is matched
		 */
		if ((port->sm_vars & AD_PORT_MATCHED) &&
		    (lacpdu->actor_state & LACP_STATE_SYNCHRONIZATION)) {
			partner->port_state |= LACP_STATE_SYNCHRONIZATION;
			slave_dbg(port->slave->bond->dev, port->slave->dev,
				  "partner sync=1\n");
		} else {
			partner->port_state &= ~LACP_STATE_SYNCHRONIZATION;
			slave_dbg(port->slave->bond->dev, port->slave->dev,
				  "partner sync=0\n");
		}
	}
}

/**
 * __record_default - record default parameters
 * @port: the port we're looking at
 *
 * This function records the default parameter values for the partner carried
 * in the Partner Admin parameters as the current partner operational parameter
 * values and sets actor_oper_port_state.defaulted to TRUE.
 */
static void __record_default(struct port *port)
{
	if (port) {
		/* record the partner admin parameters */
		memcpy(&port->partner_oper, &port->partner_admin,
		       sizeof(struct port_params));

		/* set actor_oper_port_state.defaulted to true */
		port->actor_oper_port_state |= LACP_STATE_DEFAULTED;
	}
}

/**
 * __update_selected - update a port's Selected variable from a received lacpdu
 * @lacpdu: the lacpdu we've received
 * @port: the port we're looking at
 *
 * Update the value of the selected variable, using parameter values from a
 * newly received lacpdu.
 * The parameter values for the Actor carried in the
 * received PDU are compared with the corresponding operational parameter
 * values for the ports partner. If one or more of the comparisons shows that
 * the value(s) received in the PDU differ from the current operational values,
 * then selected is set to FALSE and actor_oper_port_state.synchronization is
 * set to out_of_sync. Otherwise, selected remains unchanged.
 */
static void __update_selected(struct lacpdu *lacpdu, struct port *port)
{
	if (lacpdu && port) {
		const struct port_params *partner = &port->partner_oper;

		/* check if any parameter is different then
		 * update the state machine selected variable.
		 */
		if (ntohs(lacpdu->actor_port) != partner->port_number ||
		    ntohs(lacpdu->actor_port_priority) != partner->port_priority ||
		    !MAC_ADDRESS_EQUAL(&lacpdu->actor_system, &partner->system) ||
		    ntohs(lacpdu->actor_system_priority) != partner->system_priority ||
		    ntohs(lacpdu->actor_key) != partner->key ||
		    (lacpdu->actor_state & LACP_STATE_AGGREGATION) != (partner->port_state & LACP_STATE_AGGREGATION)) {
			port->sm_vars &= ~AD_PORT_SELECTED;
		}
	}
}

/**
 * __update_default_selected - update a port's Selected variable from Partner
 * @port: the port we're looking at
 *
 * This function updates the value of the selected variable, using the partner
 * administrative parameter values. The administrative values are compared with
 * the corresponding operational parameter values for the partner. If one or
 * more of the comparisons shows that the administrative value(s) differ from
 * the current operational values, then Selected is set to FALSE and
 * actor_oper_port_state.synchronization is set to OUT_OF_SYNC. Otherwise,
 * Selected remains unchanged.
 */
static void __update_default_selected(struct port *port)
{
	if (port) {
		const struct port_params *admin = &port->partner_admin;
		const struct port_params *oper = &port->partner_oper;

		/* check if any parameter is different then
		 * update the state machine selected variable.
		 */
		if (admin->port_number != oper->port_number ||
		    admin->port_priority != oper->port_priority ||
		    !MAC_ADDRESS_EQUAL(&admin->system, &oper->system) ||
		    admin->system_priority != oper->system_priority ||
		    admin->key != oper->key ||
		    (admin->port_state & LACP_STATE_AGGREGATION)
			!= (oper->port_state & LACP_STATE_AGGREGATION)) {
			port->sm_vars &= ~AD_PORT_SELECTED;
		}
	}
}

/**
 * __update_ntt - update a port's ntt variable from a received lacpdu
 * @lacpdu: the lacpdu we've received
 * @port: the port we're looking at
 *
 * Updates the value of the ntt variable, using parameter values from a newly
 * received lacpdu. The parameter values for the partner carried in the
 * received PDU are compared with the corresponding operational parameter
 * values for the Actor. If one or more of the comparisons shows that the
 * value(s) received in the PDU differ from the current operational values,
 * then ntt is set to TRUE. Otherwise, ntt remains unchanged.
 */
static void __update_ntt(struct lacpdu *lacpdu, struct port *port)
{
	/* validate lacpdu and port */
	if (lacpdu && port) {
		/* check if any parameter is different then
		 * update the port->ntt.
		 */
		if ((ntohs(lacpdu->partner_port) != port->actor_port_number) ||
		    (ntohs(lacpdu->partner_port_priority) != port->actor_port_priority) ||
		    !MAC_ADDRESS_EQUAL(&(lacpdu->partner_system), &(port->actor_system)) ||
		    (ntohs(lacpdu->partner_system_priority) != port->actor_system_priority) ||
		    (ntohs(lacpdu->partner_key) != port->actor_oper_port_key) ||
		    ((lacpdu->partner_state & LACP_STATE_LACP_ACTIVITY) != (port->actor_oper_port_state & LACP_STATE_LACP_ACTIVITY)) ||
		    ((lacpdu->partner_state & LACP_STATE_LACP_TIMEOUT) != (port->actor_oper_port_state & LACP_STATE_LACP_TIMEOUT)) ||
		    ((lacpdu->partner_state & LACP_STATE_SYNCHRONIZATION) != (port->actor_oper_port_state & LACP_STATE_SYNCHRONIZATION)) ||
		    ((lacpdu->partner_state & LACP_STATE_AGGREGATION) != (port->actor_oper_port_state & LACP_STATE_AGGREGATION))
		   ) {
			port->ntt = true;
		}
	}
}

/**
 * __agg_ports_are_ready - check if all ports in an aggregator are ready
 * @aggregator: the aggregator we're looking at
 *
 */
static int __agg_ports_are_ready(struct aggregator *aggregator)
{
	struct port *port;
	int retval = 1;

	if (aggregator) {
		/* scan all ports in this aggregator to verify if they are
		 * all ready.
		 */
		for (port = aggregator->lag_ports;
		     port;
		     port = port->next_port_in_aggregator) {
			if (!(port->sm_vars & AD_PORT_READY_N)) {
				retval = 0;
				break;
			}
		}
	}

	return retval;
}

/**
 * __set_agg_ports_ready - set value of Ready bit in all ports of an aggregator
 * @aggregator: the aggregator we're looking at
 * @val: Should the ports' ready bit be set on or off
 *
 */
static void __set_agg_ports_ready(struct aggregator *aggregator, int val)
{
	struct port *port;

	for (port = aggregator->lag_ports; port;
	     port = port->next_port_in_aggregator) {
		if (val)
			port->sm_vars |= AD_PORT_READY;
		else
			port->sm_vars &= ~AD_PORT_READY;
	}
}

static int __agg_active_ports(struct aggregator *agg)
{
	struct port *port;
	int active = 0;

	for (port = agg->lag_ports; port;
	     port = port->next_port_in_aggregator) {
		if (port->is_enabled)
			active++;
	}

	return active;
}

/**
 * __get_agg_bandwidth - get the total bandwidth of an aggregator
 * @aggregator: the aggregator we're looking at
 *
 */
static u32 __get_agg_bandwidth(struct aggregator *aggregator)
{
	int nports = __agg_active_ports(aggregator);
	u32 bandwidth = 0;

	if (nports) {
		switch (__get_link_speed(aggregator->lag_ports)) {
		case AD_LINK_SPEED_1MBPS:
			bandwidth = nports;
			break;
		case AD_LINK_SPEED_10MBPS:
			bandwidth = nports * 10;
			break;
		case AD_LINK_SPEED_100MBPS:
			bandwidth = nports * 100;
			break;
		case AD_LINK_SPEED_1000MBPS:
			bandwidth = nports * 1000;
			break;
		case AD_LINK_SPEED_2500MBPS:
			bandwidth = nports * 2500;
			break;
		case AD_LINK_SPEED_5000MBPS:
			bandwidth = nports * 5000;
			break;
		case AD_LINK_SPEED_10000MBPS:
			bandwidth = nports * 10000;
			break;
		case AD_LINK_SPEED_14000MBPS:
			bandwidth = nports * 14000;
			break;
		case AD_LINK_SPEED_20000MBPS:
			bandwidth = nports * 20000;
			break;
		case AD_LINK_SPEED_25000MBPS:
			bandwidth = nports * 25000;
			break;
		case AD_LINK_SPEED_40000MBPS:
			bandwidth = nports * 40000;
			break;
		case AD_LINK_SPEED_50000MBPS:
			bandwidth = nports * 50000;
			break;
		case AD_LINK_SPEED_56000MBPS:
			bandwidth = nports * 56000;
			break;
		case AD_LINK_SPEED_100000MBPS:
			bandwidth = nports * 100000;
			break;
		case AD_LINK_SPEED_200000MBPS:
			bandwidth = nports * 200000;
			break;
		case AD_LINK_SPEED_400000MBPS:
			bandwidth = nports * 400000;
			break;
		case AD_LINK_SPEED_800000MBPS:
			bandwidth = nports * 800000;
			break;
		default:
			bandwidth = 0; /* to silence the compiler */
		}
	}
	return bandwidth;
}

/**
 * __get_active_agg - get the current active aggregator
 * @aggregator: the aggregator we're looking at
 *
 * Caller must hold RCU lock.
 */
static struct aggregator *__get_active_agg(struct aggregator *aggregator)
{
	struct bonding *bond = aggregator->slave->bond;
	struct list_head *iter;
	struct slave *slave;

	bond_for_each_slave_rcu(bond, slave, iter)
		if (SLAVE_AD_INFO(slave)->aggregator.is_active)
			return &(SLAVE_AD_INFO(slave)->aggregator);

	return NULL;
}

/**
 * __update_lacpdu_from_port - update a port's lacpdu fields
 * @port: the port we're looking at
 */
static inline void __update_lacpdu_from_port(struct port *port)
{
	struct lacpdu *lacpdu = &port->lacpdu;
	const struct port_params *partner = &port->partner_oper;

	/* update current actual Actor parameters
	 * lacpdu->subtype                   initialized
	 * lacpdu->version_number            initialized
	 * lacpdu->tlv_type_actor_info       initialized
	 * lacpdu->actor_information_length  initialized
	 */

	lacpdu->actor_system_priority = htons(port->actor_system_priority);
	lacpdu->actor_system = port->actor_system;
	lacpdu->actor_key = htons(port->actor_oper_port_key);
	lacpdu->actor_port_priority = htons(port->actor_port_priority);
	lacpdu->actor_port = htons(port->actor_port_number);
	lacpdu->actor_state = port->actor_oper_port_state;
	slave_dbg(port->slave->bond->dev, port->slave->dev,
		  "update lacpdu: actor port state %x\n",
		  port->actor_oper_port_state);

	/* lacpdu->reserved_3_1              initialized
	 * lacpdu->tlv_type_partner_info     initialized
	 * lacpdu->partner_information_length initialized
	 */

	lacpdu->partner_system_priority = htons(partner->system_priority);
	lacpdu->partner_system = partner->system;
	lacpdu->partner_key = htons(partner->key);
	lacpdu->partner_port_priority = htons(partner->port_priority);
	lacpdu->partner_port = htons(partner->port_number);
	lacpdu->partner_state = partner->port_state;

	/* lacpdu->reserved_3_2              initialized
	 * lacpdu->tlv_type_collector_info   initialized
	 * lacpdu->collector_information_length initialized
	 * collector_max_delay                initialized
	 * reserved_12[12]                   initialized
	 * tlv_type_terminator               initialized
	 * terminator_length                 initialized
	 * reserved_50[50]                   initialized
	 */
}

/* ================= main 802.3ad protocol code ========================= */

/**
 * ad_lacpdu_send - send out a lacpdu packet on a given port
 * @port: the port we're looking at
 *
 * Returns:   0 on success
 *          < 0 on error
 */
static int ad_lacpdu_send(struct port *port)
{
	struct slave *slave = port->slave;
	struct sk_buff *skb;
	struct lacpdu_header *lacpdu_header;
	int length = sizeof(struct lacpdu_header);

	skb = dev_alloc_skb(length);
	if (!skb)
		return -ENOMEM;

	atomic64_inc(&SLAVE_AD_INFO(slave)->stats.lacpdu_tx);
	atomic64_inc(&BOND_AD_INFO(slave->bond).stats.lacpdu_tx);

	skb->dev = slave->dev;
	skb_reset_mac_header(skb);
	skb->network_header = skb->mac_header + ETH_HLEN;
	skb->protocol = PKT_TYPE_LACPDU;
	skb->priority = TC_PRIO_CONTROL;

	lacpdu_header = skb_put(skb, length);

	ether_addr_copy(lacpdu_header->hdr.h_dest, lacpdu_mcast_addr);
	/* Note: source address is set to be the member's PERMANENT address,
	 * because we use it to identify loopback lacpdus in receive.
	 */
	ether_addr_copy(lacpdu_header->hdr.h_source, slave->perm_hwaddr);
	lacpdu_header->hdr.h_proto = PKT_TYPE_LACPDU;

	lacpdu_header->lacpdu = port->lacpdu;

	dev_queue_xmit(skb);

	return 0;
}

/**
 * ad_marker_send - send marker information/response on a given port
 * @port: the port we're looking at
 * @marker: marker data to send
 *
 * Returns:   0 on success
 *          < 0 on error
 */
static int ad_marker_send(struct port *port, struct bond_marker *marker)
{
	struct slave *slave = port->slave;
	struct sk_buff *skb;
	struct bond_marker_header *marker_header;
	int length = sizeof(struct bond_marker_header);

	skb = dev_alloc_skb(length + 16);
	if (!skb)
		return -ENOMEM;

	switch (marker->tlv_type) {
	case AD_MARKER_INFORMATION_SUBTYPE:
		atomic64_inc(&SLAVE_AD_INFO(slave)->stats.marker_tx);
		atomic64_inc(&BOND_AD_INFO(slave->bond).stats.marker_tx);
		break;
	case AD_MARKER_RESPONSE_SUBTYPE:
		atomic64_inc(&SLAVE_AD_INFO(slave)->stats.marker_resp_tx);
		atomic64_inc(&BOND_AD_INFO(slave->bond).stats.marker_resp_tx);
		break;
	}

	skb_reserve(skb, 16);

	skb->dev = slave->dev;
	skb_reset_mac_header(skb);
	skb->network_header = skb->mac_header + ETH_HLEN;
	skb->protocol = PKT_TYPE_LACPDU;

	marker_header = skb_put(skb, length);

	ether_addr_copy(marker_header->hdr.h_dest, lacpdu_mcast_addr);
	/* Note: source address is set to be the member's PERMANENT address,
	 * because we use it to identify loopback MARKERs in receive.
	 */
	ether_addr_copy(marker_header->hdr.h_source, slave->perm_hwaddr);
	marker_header->hdr.h_proto = PKT_TYPE_LACPDU;

	marker_header->marker = *marker;

	dev_queue_xmit(skb);

	return 0;
}

/**
 * ad_mux_machine - handle a port's mux state machine
 * @port: the port we're looking at
 * @update_slave_arr: Does slave array need update?
 */
static void ad_mux_machine(struct port *port, bool *update_slave_arr)
{
	struct bonding *bond = __get_bond_by_port(port);
	mux_states_t last_state;

	/* keep current State Machine state to compare later if it was
	 * changed
	 */
	last_state = port->sm_mux_state;

	if (port->sm_vars & AD_PORT_BEGIN) {
		port->sm_mux_state = AD_MUX_DETACHED;
	} else {
		switch (port->sm_mux_state) {
		case AD_MUX_DETACHED:
			if ((port->sm_vars & AD_PORT_SELECTED)
			    || (port->sm_vars & AD_PORT_STANDBY))
				/* if SELECTED or STANDBY */
				port->sm_mux_state = AD_MUX_WAITING;
			break;
		case AD_MUX_WAITING:
			/* if SELECTED == FALSE return to DETACH state */
			if (!(port->sm_vars & AD_PORT_SELECTED)) {
				port->sm_vars &= ~AD_PORT_READY_N;
				/* in order to withhold the Selection Logic to
				 * check all ports READY_N value every callback
				 * cycle to update ready variable, we check
				 * READY_N and update READY here
				 */
				__set_agg_ports_ready(port->aggregator, __agg_ports_are_ready(port->aggregator));
				port->sm_mux_state = AD_MUX_DETACHED;
				break;
			}

			/* check if the wait_while_timer expired */
			if (port->sm_mux_timer_counter
			    && !(--port->sm_mux_timer_counter))
				port->sm_vars |= AD_PORT_READY_N;

			/* in order to withhold the selection logic to check
			 * all ports READY_N value every callback cycle to
			 * update ready variable, we check READY_N and update
			 * READY here
			 */
			__set_agg_ports_ready(port->aggregator, __agg_ports_are_ready(port->aggregator));

			/* if the wait_while_timer expired, and the port is
			 * in READY state, move to ATTACHED state
			 */
			if ((port->sm_vars & AD_PORT_READY)
			    && !port->sm_mux_timer_counter)
				port->sm_mux_state = AD_MUX_ATTACHED;
			break;
		case AD_MUX_ATTACHED:
			/* check also if agg_select_timer expired (so that
			 * enabling the port takes place only after this timer)
			 */
			if ((port->sm_vars & AD_PORT_SELECTED) &&
			    (port->partner_oper.port_state & LACP_STATE_SYNCHRONIZATION) &&
			    !__check_agg_selection_timer(port)) {
				if (port->aggregator->is_active) {
					int state = AD_MUX_COLLECTING_DISTRIBUTING;

					if (!bond->params.coupled_control)
						state =
AD_MUX_COLLECTING; port->sm_mux_state = state; } } else if (!(port->sm_vars & AD_PORT_SELECTED) || (port->sm_vars & AD_PORT_STANDBY)) { /* if UNSELECTED or STANDBY */ port->sm_vars &= ~AD_PORT_READY_N; /* in order to withhold the selection logic to * check all ports READY_N value every callback * cycle to update ready variable, we check * READY_N and update READY here */ __set_agg_ports_ready(port->aggregator, __agg_ports_are_ready(port->aggregator)); port->sm_mux_state = AD_MUX_DETACHED; } else if (port->aggregator->is_active) { port->actor_oper_port_state |= LACP_STATE_SYNCHRONIZATION; } break; case AD_MUX_COLLECTING_DISTRIBUTING: if (!__port_move_to_attached_state(port)) { /* if port state hasn't changed make * sure that a collecting distributing * port in an active aggregator is enabled */ if (port->aggregator->is_active && !__port_is_collecting_distributing(port)) { __enable_port(port); *update_slave_arr = true; } } break; case AD_MUX_COLLECTING: if (!__port_move_to_attached_state(port)) { if ((port->sm_vars & AD_PORT_SELECTED) && (port->partner_oper.port_state & LACP_STATE_SYNCHRONIZATION) && (port->partner_oper.port_state & LACP_STATE_COLLECTING)) { port->sm_mux_state = AD_MUX_DISTRIBUTING; } else { /* If port state hasn't changed, make sure that a collecting * port is enabled for an active aggregator. 
*/ struct slave *slave = port->slave; if (port->aggregator->is_active && bond_is_slave_rx_disabled(slave)) { ad_enable_collecting(port); *update_slave_arr = true; } } } break; case AD_MUX_DISTRIBUTING: if (!(port->sm_vars & AD_PORT_SELECTED) || (port->sm_vars & AD_PORT_STANDBY) || !(port->partner_oper.port_state & LACP_STATE_COLLECTING) || !(port->partner_oper.port_state & LACP_STATE_SYNCHRONIZATION) || !(port->actor_oper_port_state & LACP_STATE_SYNCHRONIZATION)) { port->sm_mux_state = AD_MUX_COLLECTING; } else { /* if port state hasn't changed make * sure that a collecting distributing * port in an active aggregator is enabled */ if (port->aggregator && port->aggregator->is_active && !__port_is_collecting_distributing(port)) { __enable_port(port); *update_slave_arr = true; } } break; default: break; } } /* check if the state machine was changed */ if (port->sm_mux_state != last_state) { slave_dbg(port->slave->bond->dev, port->slave->dev, "Mux Machine: Port=%d, Last State=%d, Curr State=%d\n", port->actor_port_number, last_state, port->sm_mux_state); switch (port->sm_mux_state) { case AD_MUX_DETACHED: port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION; ad_disable_collecting_distributing(port, update_slave_arr); port->actor_oper_port_state &= ~LACP_STATE_COLLECTING; port->actor_oper_port_state &= ~LACP_STATE_DISTRIBUTING; port->ntt = true; break; case AD_MUX_WAITING: port->sm_mux_timer_counter = __ad_timer_to_ticks(AD_WAIT_WHILE_TIMER, 0); break; case AD_MUX_ATTACHED: if (port->aggregator->is_active) port->actor_oper_port_state |= LACP_STATE_SYNCHRONIZATION; else port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION; port->actor_oper_port_state &= ~LACP_STATE_COLLECTING; port->actor_oper_port_state &= ~LACP_STATE_DISTRIBUTING; ad_disable_collecting_distributing(port, update_slave_arr); port->ntt = true; break; case AD_MUX_COLLECTING_DISTRIBUTING: port->actor_oper_port_state |= LACP_STATE_COLLECTING; port->actor_oper_port_state |= 
LACP_STATE_DISTRIBUTING; port->actor_oper_port_state |= LACP_STATE_SYNCHRONIZATION; ad_enable_collecting_distributing(port, update_slave_arr); port->ntt = true; break; case AD_MUX_COLLECTING: port->actor_oper_port_state |= LACP_STATE_COLLECTING; port->actor_oper_port_state &= ~LACP_STATE_DISTRIBUTING; port->actor_oper_port_state |= LACP_STATE_SYNCHRONIZATION; ad_enable_collecting(port); ad_disable_distributing(port, update_slave_arr); port->ntt = true; break; case AD_MUX_DISTRIBUTING: port->actor_oper_port_state |= LACP_STATE_DISTRIBUTING; port->actor_oper_port_state |= LACP_STATE_SYNCHRONIZATION; ad_enable_collecting_distributing(port, update_slave_arr); break; default: break; } } } /** * ad_rx_machine - handle a port's rx State Machine * @lacpdu: the lacpdu we've received * @port: the port we're looking at * * If lacpdu arrived, stop previous timer (if exists) and set the next state as * CURRENT. If timer expired set the state machine in the proper state. * In other cases, this function checks if we need to switch to other state. 
*/ static void ad_rx_machine(struct lacpdu *lacpdu, struct port *port) { rx_states_t last_state; /* keep current State Machine state to compare later if it was * changed */ last_state = port->sm_rx_state; if (lacpdu) { atomic64_inc(&SLAVE_AD_INFO(port->slave)->stats.lacpdu_rx); atomic64_inc(&BOND_AD_INFO(port->slave->bond).stats.lacpdu_rx); } /* check if state machine should change state */ /* first, check if port was reinitialized */ if (port->sm_vars & AD_PORT_BEGIN) { port->sm_rx_state = AD_RX_INITIALIZE; port->sm_vars |= AD_PORT_CHURNED; /* check if port is not enabled */ } else if (!(port->sm_vars & AD_PORT_BEGIN) && !port->is_enabled) port->sm_rx_state = AD_RX_PORT_DISABLED; /* check if new lacpdu arrived */ else if (lacpdu && ((port->sm_rx_state == AD_RX_EXPIRED) || (port->sm_rx_state == AD_RX_DEFAULTED) || (port->sm_rx_state == AD_RX_CURRENT))) { if (port->sm_rx_state != AD_RX_CURRENT) port->sm_vars |= AD_PORT_CHURNED; port->sm_rx_timer_counter = 0; port->sm_rx_state = AD_RX_CURRENT; } else { /* if timer is on, and if it is expired */ if (port->sm_rx_timer_counter && !(--port->sm_rx_timer_counter)) { switch (port->sm_rx_state) { case AD_RX_EXPIRED: port->sm_rx_state = AD_RX_DEFAULTED; break; case AD_RX_CURRENT: port->sm_rx_state = AD_RX_EXPIRED; break; default: break; } } else { /* if no lacpdu arrived and no timer is on */ switch (port->sm_rx_state) { case AD_RX_PORT_DISABLED: if (port->is_enabled && (port->sm_vars & AD_PORT_LACP_ENABLED)) port->sm_rx_state = AD_RX_EXPIRED; else if (port->is_enabled && ((port->sm_vars & AD_PORT_LACP_ENABLED) == 0)) port->sm_rx_state = AD_RX_LACP_DISABLED; break; default: break; } } } /* check if the State machine was changed or new lacpdu arrived */ if ((port->sm_rx_state != last_state) || (lacpdu)) { slave_dbg(port->slave->bond->dev, port->slave->dev, "Rx Machine: Port=%d, Last State=%d, Curr State=%d\n", port->actor_port_number, last_state, port->sm_rx_state); switch (port->sm_rx_state) { case AD_RX_INITIALIZE: if 
(!(port->actor_oper_port_key & AD_DUPLEX_KEY_MASKS)) port->sm_vars &= ~AD_PORT_LACP_ENABLED; else port->sm_vars |= AD_PORT_LACP_ENABLED; port->sm_vars &= ~AD_PORT_SELECTED; __record_default(port); port->actor_oper_port_state &= ~LACP_STATE_EXPIRED; port->sm_rx_state = AD_RX_PORT_DISABLED; fallthrough; case AD_RX_PORT_DISABLED: port->sm_vars &= ~AD_PORT_MATCHED; break; case AD_RX_LACP_DISABLED: port->sm_vars &= ~AD_PORT_SELECTED; __record_default(port); port->partner_oper.port_state &= ~LACP_STATE_AGGREGATION; port->sm_vars |= AD_PORT_MATCHED; port->actor_oper_port_state &= ~LACP_STATE_EXPIRED; break; case AD_RX_EXPIRED: /* Reset of the Synchronization flag (Standard 43.4.12) * This reset cause to disable this port in the * COLLECTING_DISTRIBUTING state of the mux machine in * case of EXPIRED even if LINK_DOWN didn't arrive for * the port. */ port->partner_oper.port_state &= ~LACP_STATE_SYNCHRONIZATION; port->sm_vars &= ~AD_PORT_MATCHED; port->partner_oper.port_state |= LACP_STATE_LACP_TIMEOUT; port->partner_oper.port_state |= LACP_STATE_LACP_ACTIVITY; port->sm_rx_timer_counter = __ad_timer_to_ticks(AD_CURRENT_WHILE_TIMER, (u16)(AD_SHORT_TIMEOUT)); port->actor_oper_port_state |= LACP_STATE_EXPIRED; port->sm_vars |= AD_PORT_CHURNED; break; case AD_RX_DEFAULTED: __update_default_selected(port); __record_default(port); port->sm_vars |= AD_PORT_MATCHED; port->actor_oper_port_state &= ~LACP_STATE_EXPIRED; break; case AD_RX_CURRENT: /* detect loopback situation */ if (MAC_ADDRESS_EQUAL(&(lacpdu->actor_system), &(port->actor_system))) { slave_err(port->slave->bond->dev, port->slave->dev, "An illegal loopback occurred on slave\n" "Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports\n"); return; } __update_selected(lacpdu, port); __update_ntt(lacpdu, port); __record_pdu(lacpdu, port); port->sm_rx_timer_counter = __ad_timer_to_ticks(AD_CURRENT_WHILE_TIMER, (u16)(port->actor_oper_port_state & LACP_STATE_LACP_TIMEOUT)); 
port->actor_oper_port_state &= ~LACP_STATE_EXPIRED; break; default: break; } } } /** * ad_churn_machine - handle port churn's state machine * @port: the port we're looking at * */ static void ad_churn_machine(struct port *port) { if (port->sm_vars & AD_PORT_CHURNED) { port->sm_vars &= ~AD_PORT_CHURNED; port->sm_churn_actor_state = AD_CHURN_MONITOR; port->sm_churn_partner_state = AD_CHURN_MONITOR; port->sm_churn_actor_timer_counter = __ad_timer_to_ticks(AD_ACTOR_CHURN_TIMER, 0); port->sm_churn_partner_timer_counter = __ad_timer_to_ticks(AD_PARTNER_CHURN_TIMER, 0); return; } if (port->sm_churn_actor_timer_counter && !(--port->sm_churn_actor_timer_counter) && port->sm_churn_actor_state == AD_CHURN_MONITOR) { if (port->actor_oper_port_state & LACP_STATE_SYNCHRONIZATION) { port->sm_churn_actor_state = AD_NO_CHURN; } else { port->churn_actor_count++; port->sm_churn_actor_state = AD_CHURN; } } if (port->sm_churn_partner_timer_counter && !(--port->sm_churn_partner_timer_counter) && port->sm_churn_partner_state == AD_CHURN_MONITOR) { if (port->partner_oper.port_state & LACP_STATE_SYNCHRONIZATION) { port->sm_churn_partner_state = AD_NO_CHURN; } else { port->churn_partner_count++; port->sm_churn_partner_state = AD_CHURN; } } } /** * ad_tx_machine - handle a port's tx state machine * @port: the port we're looking at */ static void ad_tx_machine(struct port *port) { /* check if tx timer expired, to verify that we do not send more than * 3 packets per second */ if (port->sm_tx_timer_counter && !(--port->sm_tx_timer_counter)) { /* check if there is something to send */ if (port->ntt && (port->sm_vars & AD_PORT_LACP_ENABLED)) { __update_lacpdu_from_port(port); if (ad_lacpdu_send(port) >= 0) { slave_dbg(port->slave->bond->dev, port->slave->dev, "Sent LACPDU on port %d\n", port->actor_port_number); /* mark ntt as false, so it will not be sent * again until demanded */ port->ntt = false; } } /* restart tx timer(to verify that we will not exceed * AD_MAX_TX_IN_SECOND */ 
port->sm_tx_timer_counter = ad_ticks_per_sec/AD_MAX_TX_IN_SECOND; } } /** * ad_periodic_machine - handle a port's periodic state machine * @port: the port we're looking at * @bond_params: bond parameters we will use * * Turn the ntt flag on periodically to perform periodic transmission of LACPDUs. */ static void ad_periodic_machine(struct port *port, struct bond_params *bond_params) { periodic_states_t last_state; /* keep current state machine state to compare later if it was changed */ last_state = port->sm_periodic_state; /* check if port was reinitialized */ if (((port->sm_vars & AD_PORT_BEGIN) || !(port->sm_vars & AD_PORT_LACP_ENABLED) || !port->is_enabled) || (!(port->actor_oper_port_state & LACP_STATE_LACP_ACTIVITY) && !(port->partner_oper.port_state & LACP_STATE_LACP_ACTIVITY)) || !bond_params->lacp_active) { port->sm_periodic_state = AD_NO_PERIODIC; } /* check if state machine should change state */ else if (port->sm_periodic_timer_counter) { /* check if periodic state machine expired */ if (!(--port->sm_periodic_timer_counter)) { /* if expired then do tx */ port->sm_periodic_state = AD_PERIODIC_TX; } else { /* If not expired, check if there is some new timeout * parameter from the partner state */ switch (port->sm_periodic_state) { case AD_FAST_PERIODIC: if (!(port->partner_oper.port_state & LACP_STATE_LACP_TIMEOUT)) port->sm_periodic_state = AD_SLOW_PERIODIC; break; case AD_SLOW_PERIODIC: if ((port->partner_oper.port_state & LACP_STATE_LACP_TIMEOUT)) { port->sm_periodic_timer_counter = 0; port->sm_periodic_state = AD_PERIODIC_TX; } break; default: break; } } } else { switch (port->sm_periodic_state) { case AD_NO_PERIODIC: port->sm_periodic_state = AD_FAST_PERIODIC; break; case AD_PERIODIC_TX: if (!(port->partner_oper.port_state & LACP_STATE_LACP_TIMEOUT)) port->sm_periodic_state = AD_SLOW_PERIODIC; else port->sm_periodic_state = AD_FAST_PERIODIC; break; default: break; } } /* check if the state machine was changed */ if (port->sm_periodic_state != last_state) 
{ slave_dbg(port->slave->bond->dev, port->slave->dev, "Periodic Machine: Port=%d, Last State=%d, Curr State=%d\n", port->actor_port_number, last_state, port->sm_periodic_state); switch (port->sm_periodic_state) { case AD_NO_PERIODIC: port->sm_periodic_timer_counter = 0; break; case AD_FAST_PERIODIC: /* decrement 1 tick we lost in the PERIODIC_TX cycle */ port->sm_periodic_timer_counter = __ad_timer_to_ticks(AD_PERIODIC_TIMER, (u16)(AD_FAST_PERIODIC_TIME))-1; break; case AD_SLOW_PERIODIC: /* decrement 1 tick we lost in the PERIODIC_TX cycle */ port->sm_periodic_timer_counter = __ad_timer_to_ticks(AD_PERIODIC_TIMER, (u16)(AD_SLOW_PERIODIC_TIME))-1; break; case AD_PERIODIC_TX: port->ntt = true; break; default: break; } } } /** * ad_port_selection_logic - select aggregation groups * @port: the port we're looking at * @update_slave_arr: Does slave array need update? * * Select aggregation groups and assign each port to its aggregator. The * selection logic is called during initialization (after all the handshakes), * and after every received lacpdu (if selected is off). 
*/ static void ad_port_selection_logic(struct port *port, bool *update_slave_arr) { struct aggregator *aggregator, *free_aggregator = NULL, *temp_aggregator; struct port *last_port = NULL, *curr_port; struct list_head *iter; struct bonding *bond; struct slave *slave; int found = 0; /* if the port is already Selected, do nothing */ if (port->sm_vars & AD_PORT_SELECTED) return; bond = __get_bond_by_port(port); /* if the port is connected to other aggregator, detach it */ if (port->aggregator) { /* detach the port from its former aggregator */ temp_aggregator = port->aggregator; for (curr_port = temp_aggregator->lag_ports; curr_port; last_port = curr_port, curr_port = curr_port->next_port_in_aggregator) { if (curr_port == port) { temp_aggregator->num_of_ports--; /* if it is the first port attached to the * aggregator */ if (!last_port) { temp_aggregator->lag_ports = port->next_port_in_aggregator; } else { /* not the first port attached to the * aggregator */ last_port->next_port_in_aggregator = port->next_port_in_aggregator; } /* clear the port's relations to this * aggregator */ port->aggregator = NULL; port->next_port_in_aggregator = NULL; port->actor_port_aggregator_identifier = 0; slave_dbg(bond->dev, port->slave->dev, "Port %d left LAG %d\n", port->actor_port_number, temp_aggregator->aggregator_identifier); /* if the aggregator is empty, clear its * parameters, and set it ready to be attached */ if (!temp_aggregator->lag_ports) ad_clear_agg(temp_aggregator); break; } } if (!curr_port) { /* meaning: the port was related to an aggregator * but was not on the aggregator port list */ net_warn_ratelimited("%s: (slave %s): Warning: Port %d was related to aggregator %d but was not on its port list\n", port->slave->bond->dev->name, port->slave->dev->name, port->actor_port_number, port->aggregator->aggregator_identifier); } } /* search on all aggregators for a suitable aggregator for this port */ bond_for_each_slave(bond, slave, iter) { aggregator = 
&(SLAVE_AD_INFO(slave)->aggregator); /* keep a free aggregator for later use (if needed) */ if (!aggregator->lag_ports) { if (!free_aggregator) free_aggregator = aggregator; continue; } /* check if current aggregator suits us */ if (((aggregator->actor_oper_aggregator_key == port->actor_oper_port_key) && /* if all parameters match AND */ MAC_ADDRESS_EQUAL(&(aggregator->partner_system), &(port->partner_oper.system)) && (aggregator->partner_system_priority == port->partner_oper.system_priority) && (aggregator->partner_oper_aggregator_key == port->partner_oper.key) ) && ((__agg_has_partner(aggregator) && /* partner answers */ !aggregator->is_individual) /* and is not individual */ ) ) { /* attach to the aggregator we found */ port->aggregator = aggregator; port->actor_port_aggregator_identifier = port->aggregator->aggregator_identifier; port->next_port_in_aggregator = aggregator->lag_ports; port->aggregator->num_of_ports++; aggregator->lag_ports = port; slave_dbg(bond->dev, slave->dev, "Port %d joined LAG %d (existing LAG)\n", port->actor_port_number, port->aggregator->aggregator_identifier); /* mark this port as selected */ port->sm_vars |= AD_PORT_SELECTED; found = 1; break; } } /* the port couldn't find an aggregator - attach it to a new * aggregator */ if (!found) { if (free_aggregator) { /* assign port a new aggregator */ port->aggregator = free_aggregator; port->actor_port_aggregator_identifier = port->aggregator->aggregator_identifier; /* update the new aggregator's parameters * if the port got a response from the end-user */ if (port->actor_oper_port_key & AD_DUPLEX_KEY_MASKS) /* if port is full duplex */ port->aggregator->is_individual = false; else port->aggregator->is_individual = true; port->aggregator->actor_admin_aggregator_key = port->actor_admin_port_key; port->aggregator->actor_oper_aggregator_key = port->actor_oper_port_key; port->aggregator->partner_system = port->partner_oper.system; port->aggregator->partner_system_priority = 
port->partner_oper.system_priority; port->aggregator->partner_oper_aggregator_key = port->partner_oper.key; port->aggregator->receive_state = 1; port->aggregator->transmit_state = 1; port->aggregator->lag_ports = port; port->aggregator->num_of_ports++; /* mark this port as selected */ port->sm_vars |= AD_PORT_SELECTED; slave_dbg(bond->dev, port->slave->dev, "Port %d joined LAG %d (new LAG)\n", port->actor_port_number, port->aggregator->aggregator_identifier); } else { slave_err(bond->dev, port->slave->dev, "Port %d did not find a suitable aggregator\n", port->actor_port_number); return; } } /* if all aggregator's ports are READY_N == TRUE, set ready=TRUE * in all aggregator's ports, else set ready=FALSE in all * aggregator's ports */ __set_agg_ports_ready(port->aggregator, __agg_ports_are_ready(port->aggregator)); aggregator = __get_first_agg(port); ad_agg_selection_logic(aggregator, update_slave_arr); if (!port->aggregator->is_active) port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION; } /* Decide if "curr" is a better choice for the new active aggregator than * the current best, according to the ad_select policy. */ static struct aggregator *ad_agg_selection_test(struct aggregator *best, struct aggregator *curr) { /* 0. If no best, select current. * * 1. If the current agg is not individual, and the best is * individual, select current. * * 2. If current agg is individual and the best is not, keep best. * * 3. Therefore, current and best are both individual or both not * individual, so: * * 3a. If current agg partner replied, and best agg partner did not, * select current. * * 3b. If current agg partner did not reply and best agg partner * did reply, keep best. * * 4. Therefore, current and best both have partner replies or * both do not, so perform selection policy: * * BOND_AD_COUNT: Select by count of ports. If count is equal, * select by bandwidth. * * BOND_AD_STABLE, BOND_AD_BANDWIDTH: Select by bandwidth. 
*/ if (!best) return curr; if (!curr->is_individual && best->is_individual) return curr; if (curr->is_individual && !best->is_individual) return best; if (__agg_has_partner(curr) && !__agg_has_partner(best)) return curr; if (!__agg_has_partner(curr) && __agg_has_partner(best)) return best; switch (__get_agg_selection_mode(curr->lag_ports)) { case BOND_AD_COUNT: if (__agg_active_ports(curr) > __agg_active_ports(best)) return curr; if (__agg_active_ports(curr) < __agg_active_ports(best)) return best; fallthrough; case BOND_AD_STABLE: case BOND_AD_BANDWIDTH: if (__get_agg_bandwidth(curr) > __get_agg_bandwidth(best)) return curr; break; default: net_warn_ratelimited("%s: (slave %s): Impossible agg select mode %d\n", curr->slave->bond->dev->name, curr->slave->dev->name, __get_agg_selection_mode(curr->lag_ports)); break; } return best; } static int agg_device_up(const struct aggregator *agg) { struct port *port = agg->lag_ports; if (!port) return 0; for (port = agg->lag_ports; port; port = port->next_port_in_aggregator) { if (netif_running(port->slave->dev) && netif_carrier_ok(port->slave->dev)) return 1; } return 0; } /** * ad_agg_selection_logic - select an aggregation group for a team * @agg: the aggregator we're looking at * @update_slave_arr: Does slave array need update? * * It is assumed that only one aggregator may be selected for a team. * * The logic of this function is to select the aggregator according to * the ad_select policy: * * BOND_AD_STABLE: select the aggregator with the most ports attached to * it, and to reselect the active aggregator only if the previous * aggregator has no more ports related to it. * * BOND_AD_BANDWIDTH: select the aggregator with the highest total * bandwidth, and reselect whenever a link state change takes place or the * set of slaves in the bond changes. 
* * BOND_AD_COUNT: select the aggregator with largest number of ports * (slaves), and reselect whenever a link state change takes place or the * set of slaves in the bond changes. * * FIXME: this function MUST be called with the first agg in the bond, or * __get_active_agg() won't work correctly. This function should be better * called with the bond itself, and retrieve the first agg from it. */ static void ad_agg_selection_logic(struct aggregator *agg, bool *update_slave_arr) { struct aggregator *best, *active, *origin; struct bonding *bond = agg->slave->bond; struct list_head *iter; struct slave *slave; struct port *port; rcu_read_lock(); origin = agg; active = __get_active_agg(agg); best = (active && agg_device_up(active)) ? active : NULL; bond_for_each_slave_rcu(bond, slave, iter) { agg = &(SLAVE_AD_INFO(slave)->aggregator); agg->is_active = 0; if (__agg_active_ports(agg) && agg_device_up(agg)) best = ad_agg_selection_test(best, agg); } if (best && __get_agg_selection_mode(best->lag_ports) == BOND_AD_STABLE) { /* For the STABLE policy, don't replace the old active * aggregator if it's still active (it has an answering * partner) or if both the best and active don't have an * answering partner. */ if (active && active->lag_ports && __agg_active_ports(active) && (__agg_has_partner(active) || (!__agg_has_partner(active) && !__agg_has_partner(best)))) { if (!(!active->actor_oper_aggregator_key && best->actor_oper_aggregator_key)) { best = NULL; active->is_active = 1; } } } if (best && (best == active)) { best = NULL; active->is_active = 1; } /* if there is new best aggregator, activate it */ if (best) { netdev_dbg(bond->dev, "(slave %s): best Agg=%d; P=%d; a k=%d; p k=%d; Ind=%d; Act=%d\n", best->slave ? best->slave->dev->name : "NULL", best->aggregator_identifier, best->num_of_ports, best->actor_oper_aggregator_key, best->partner_oper_aggregator_key, best->is_individual, best->is_active); netdev_dbg(bond->dev, "(slave %s): best ports %p slave %p\n", best->slave ? 
best->slave->dev->name : "NULL", best->lag_ports, best->slave); bond_for_each_slave_rcu(bond, slave, iter) { agg = &(SLAVE_AD_INFO(slave)->aggregator); slave_dbg(bond->dev, slave->dev, "Agg=%d; P=%d; a k=%d; p k=%d; Ind=%d; Act=%d\n", agg->aggregator_identifier, agg->num_of_ports, agg->actor_oper_aggregator_key, agg->partner_oper_aggregator_key, agg->is_individual, agg->is_active); } /* check if any partner replies */ if (best->is_individual) net_warn_ratelimited("%s: Warning: No 802.3ad response from the link partner for any adapters in the bond\n", bond->dev->name); best->is_active = 1; netdev_dbg(bond->dev, "(slave %s): LAG %d chosen as the active LAG\n", best->slave ? best->slave->dev->name : "NULL", best->aggregator_identifier); netdev_dbg(bond->dev, "(slave %s): Agg=%d; P=%d; a k=%d; p k=%d; Ind=%d; Act=%d\n", best->slave ? best->slave->dev->name : "NULL", best->aggregator_identifier, best->num_of_ports, best->actor_oper_aggregator_key, best->partner_oper_aggregator_key, best->is_individual, best->is_active); /* disable the ports that were related to the former * active_aggregator */ if (active) { for (port = active->lag_ports; port; port = port->next_port_in_aggregator) { __disable_port(port); } } /* Slave array needs update. 
*/ *update_slave_arr = true; } /* if the selected aggregator is of join individuals * (partner_system is NULL), enable their ports */ active = __get_active_agg(origin); if (active) { if (!__agg_has_partner(active)) { for (port = active->lag_ports; port; port = port->next_port_in_aggregator) { __enable_port(port); } *update_slave_arr = true; } } rcu_read_unlock(); bond_3ad_set_carrier(bond); } /** * ad_clear_agg - clear a given aggregator's parameters * @aggregator: the aggregator we're looking at */ static void ad_clear_agg(struct aggregator *aggregator) { if (aggregator) { aggregator->is_individual = false; aggregator->actor_admin_aggregator_key = 0; aggregator->actor_oper_aggregator_key = 0; eth_zero_addr(aggregator->partner_system.mac_addr_value); aggregator->partner_system_priority = 0; aggregator->partner_oper_aggregator_key = 0; aggregator->receive_state = 0; aggregator->transmit_state = 0; aggregator->lag_ports = NULL; aggregator->is_active = 0; aggregator->num_of_ports = 0; pr_debug("%s: LAG %d was cleared\n", aggregator->slave ? aggregator->slave->dev->name : "NULL", aggregator->aggregator_identifier); } } /** * ad_initialize_agg - initialize a given aggregator's parameters * @aggregator: the aggregator we're looking at */ static void ad_initialize_agg(struct aggregator *aggregator) { if (aggregator) { ad_clear_agg(aggregator); eth_zero_addr(aggregator->aggregator_mac_address.mac_addr_value); aggregator->aggregator_identifier = 0; aggregator->slave = NULL; } } /** * ad_initialize_port - initialize a given port's parameters * @port: the port we're looking at * @lacp_fast: boolean. 
whether fast periodic should be used */ static void ad_initialize_port(struct port *port, int lacp_fast) { static const struct port_params tmpl = { .system_priority = 0xffff, .key = 1, .port_number = 1, .port_priority = 0xff, .port_state = 1, }; static const struct lacpdu lacpdu = { .subtype = 0x01, .version_number = 0x01, .tlv_type_actor_info = 0x01, .actor_information_length = 0x14, .tlv_type_partner_info = 0x02, .partner_information_length = 0x14, .tlv_type_collector_info = 0x03, .collector_information_length = 0x10, .collector_max_delay = htons(AD_COLLECTOR_MAX_DELAY), }; if (port) { port->actor_port_priority = 0xff; port->actor_port_aggregator_identifier = 0; port->ntt = false; port->actor_admin_port_state = LACP_STATE_AGGREGATION | LACP_STATE_LACP_ACTIVITY; port->actor_oper_port_state = LACP_STATE_AGGREGATION | LACP_STATE_LACP_ACTIVITY; if (lacp_fast) port->actor_oper_port_state |= LACP_STATE_LACP_TIMEOUT; memcpy(&port->partner_admin, &tmpl, sizeof(tmpl)); memcpy(&port->partner_oper, &tmpl, sizeof(tmpl)); port->is_enabled = true; /* private parameters */ port->sm_vars = AD_PORT_BEGIN | AD_PORT_LACP_ENABLED; port->sm_rx_state = 0; port->sm_rx_timer_counter = 0; port->sm_periodic_state = 0; port->sm_periodic_timer_counter = 0; port->sm_mux_state = 0; port->sm_mux_timer_counter = 0; port->sm_tx_state = 0; port->aggregator = NULL; port->next_port_in_aggregator = NULL; port->transaction_id = 0; port->sm_churn_actor_timer_counter = 0; port->sm_churn_actor_state = 0; port->churn_actor_count = 0; port->sm_churn_partner_timer_counter = 0; port->sm_churn_partner_state = 0; port->churn_partner_count = 0; memcpy(&port->lacpdu, &lacpdu, sizeof(lacpdu)); } } /** * ad_enable_collecting - enable a port's receive * @port: the port we're looking at * * Enable @port if it's in an active aggregator */ static void ad_enable_collecting(struct port *port) { if (port->aggregator->is_active) { struct slave *slave = port->slave; slave_dbg(slave->bond->dev, slave->dev, "Enabling 
collecting on port %d (LAG %d)\n", port->actor_port_number, port->aggregator->aggregator_identifier); __enable_collecting_port(port); } } /** * ad_disable_distributing - disable a port's transmit * @port: the port we're looking at * @update_slave_arr: Does slave array need update? */ static void ad_disable_distributing(struct port *port, bool *update_slave_arr) { if (port->aggregator && __agg_has_partner(port->aggregator)) { slave_dbg(port->slave->bond->dev, port->slave->dev, "Disabling distributing on port %d (LAG %d)\n", port->actor_port_number, port->aggregator->aggregator_identifier); __disable_distributing_port(port); /* Slave array needs an update */ *update_slave_arr = true; } } /** * ad_enable_collecting_distributing - enable a port's transmit/receive * @port: the port we're looking at * @update_slave_arr: Does slave array need update? * * Enable @port if it's in an active aggregator */ static void ad_enable_collecting_distributing(struct port *port, bool *update_slave_arr) { if (port->aggregator->is_active) { slave_dbg(port->slave->bond->dev, port->slave->dev, "Enabling port %d (LAG %d)\n", port->actor_port_number, port->aggregator->aggregator_identifier); __enable_port(port); /* Slave array needs update */ *update_slave_arr = true; } } /** * ad_disable_collecting_distributing - disable a port's transmit/receive * @port: the port we're looking at * @update_slave_arr: Does slave array need update? 
*/ static void ad_disable_collecting_distributing(struct port *port, bool *update_slave_arr) { if (port->aggregator && __agg_has_partner(port->aggregator)) { slave_dbg(port->slave->bond->dev, port->slave->dev, "Disabling port %d (LAG %d)\n", port->actor_port_number, port->aggregator->aggregator_identifier); __disable_port(port); /* Slave array needs an update */ *update_slave_arr = true; } } /** * ad_marker_info_received - handle receive of a Marker information frame * @marker_info: Marker info received * @port: the port we're looking at */ static void ad_marker_info_received(struct bond_marker *marker_info, struct port *port) { struct bond_marker marker; atomic64_inc(&SLAVE_AD_INFO(port->slave)->stats.marker_rx); atomic64_inc(&BOND_AD_INFO(port->slave->bond).stats.marker_rx); /* copy the received marker data to the response marker */ memcpy(&marker, marker_info, sizeof(struct bond_marker)); /* change the marker subtype to marker response */ marker.tlv_type = AD_MARKER_RESPONSE_SUBTYPE; /* send the marker response */ if (ad_marker_send(port, &marker) >= 0) slave_dbg(port->slave->bond->dev, port->slave->dev, "Sent Marker Response on port %d\n", port->actor_port_number); } /** * ad_marker_response_received - handle receive of a marker response frame * @marker: marker PDU received * @port: the port we're looking at * * This function does nothing since we decided not to implement send and handle * response for marker PDU's, in this stage, but only to respond to marker * information. 
*/ static void ad_marker_response_received(struct bond_marker *marker, struct port *port) { atomic64_inc(&SLAVE_AD_INFO(port->slave)->stats.marker_resp_rx); atomic64_inc(&BOND_AD_INFO(port->slave->bond).stats.marker_resp_rx); /* DO NOTHING, SINCE WE DECIDED NOT TO IMPLEMENT THIS FEATURE FOR NOW */ } /* ========= AD exported functions to the main bonding code ========= */ /* Check aggregators status in team every T seconds */
#define AD_AGGREGATOR_SELECTION_TIMER 8
/** * bond_3ad_initiate_agg_selection - initiate aggregator selection * @bond: bonding struct * @timeout: timeout value to set * * Set the aggregation selection timer, to initiate an agg selection in * the very near future. Called during first initialization, and during * any down to up transitions of the bond. */ void bond_3ad_initiate_agg_selection(struct bonding *bond, int timeout) { atomic_set(&BOND_AD_INFO(bond).agg_select_timer, timeout); } /** * bond_3ad_initialize - initialize a bond's 802.3ad parameters and structures * @bond: bonding struct to work on * * Can be called only after the mac address of the bond is set. */ void bond_3ad_initialize(struct bonding *bond) { BOND_AD_INFO(bond).aggregator_identifier = 0; BOND_AD_INFO(bond).system.sys_priority = bond->params.ad_actor_sys_prio; if (is_zero_ether_addr(bond->params.ad_actor_system)) BOND_AD_INFO(bond).system.sys_mac_addr = *((struct mac_addr *)bond->dev->dev_addr); else BOND_AD_INFO(bond).system.sys_mac_addr = *((struct mac_addr *)bond->params.ad_actor_system); bond_3ad_initiate_agg_selection(bond, AD_AGGREGATOR_SELECTION_TIMER * ad_ticks_per_sec); } /** * bond_3ad_bind_slave - initialize a slave's port * @slave: slave struct to work on */ void bond_3ad_bind_slave(struct slave *slave) { struct bonding *bond = bond_get_bond_by_slave(slave); struct port *port; struct aggregator *aggregator; /* check that the slave has not been initialized yet. 
*/ if (SLAVE_AD_INFO(slave)->port.slave != slave) { /* port initialization */ port = &(SLAVE_AD_INFO(slave)->port); ad_initialize_port(port, bond->params.lacp_fast); port->slave = slave; port->actor_port_number = SLAVE_AD_INFO(slave)->id; /* key is determined according to the link speed, duplex and * user key */ port->actor_admin_port_key = bond->params.ad_user_port_key << 6; ad_update_actor_keys(port, false); /* actor system is the bond's system */ __ad_actor_update_port(port); /* tx timer(to verify that no more than MAX_TX_IN_SECOND * lacpdu's are sent in one second) */ port->sm_tx_timer_counter = ad_ticks_per_sec/AD_MAX_TX_IN_SECOND; __disable_port(port); /* aggregator initialization */ aggregator = &(SLAVE_AD_INFO(slave)->aggregator); ad_initialize_agg(aggregator); aggregator->aggregator_mac_address = *((struct mac_addr *)bond->dev->dev_addr); aggregator->aggregator_identifier = ++BOND_AD_INFO(bond).aggregator_identifier; aggregator->slave = slave; aggregator->is_active = 0; aggregator->num_of_ports = 0; } } /** * bond_3ad_unbind_slave - deinitialize a slave's port * @slave: slave struct to work on * * Search for the aggregator that is related to this port, remove the * aggregator and assign another aggregator for other port related to it * (if any), and remove the port. 
*/ void bond_3ad_unbind_slave(struct slave *slave) { struct port *port, *prev_port, *temp_port; struct aggregator *aggregator, *new_aggregator, *temp_aggregator; int select_new_active_agg = 0; struct bonding *bond = slave->bond; struct slave *slave_iter; struct list_head *iter; bool dummy_slave_update; /* Ignore this value as caller updates array */ /* Sync against bond_3ad_state_machine_handler() */ spin_lock_bh(&bond->mode_lock); aggregator = &(SLAVE_AD_INFO(slave)->aggregator); port = &(SLAVE_AD_INFO(slave)->port); /* if slave is null, the whole port is not initialized */ if (!port->slave) { slave_warn(bond->dev, slave->dev, "Trying to unbind an uninitialized port\n"); goto out; } slave_dbg(bond->dev, slave->dev, "Unbinding Link Aggregation Group %d\n", aggregator->aggregator_identifier); /* Tell the partner that this port is not suitable for aggregation */ port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION; port->actor_oper_port_state &= ~LACP_STATE_COLLECTING; port->actor_oper_port_state &= ~LACP_STATE_DISTRIBUTING; port->actor_oper_port_state &= ~LACP_STATE_AGGREGATION; __update_lacpdu_from_port(port); ad_lacpdu_send(port); /* check if this aggregator is occupied */ if (aggregator->lag_ports) { /* check if there are other ports related to this aggregator * except the port related to this slave(thats ensure us that * there is a reason to search for new aggregator, and that we * will find one */ if ((aggregator->lag_ports != port) || (aggregator->lag_ports->next_port_in_aggregator)) { /* find new aggregator for the related port(s) */ bond_for_each_slave(bond, slave_iter, iter) { new_aggregator = &(SLAVE_AD_INFO(slave_iter)->aggregator); /* if the new aggregator is empty, or it is * connected to our port only */ if (!new_aggregator->lag_ports || ((new_aggregator->lag_ports == port) && !new_aggregator->lag_ports->next_port_in_aggregator)) break; } if (!slave_iter) new_aggregator = NULL; /* if new aggregator found, copy the aggregator's * parameters and 
connect the related lag_ports to the * new aggregator */ if ((new_aggregator) && ((!new_aggregator->lag_ports) || ((new_aggregator->lag_ports == port) && !new_aggregator->lag_ports->next_port_in_aggregator))) { slave_dbg(bond->dev, slave->dev, "Some port(s) related to LAG %d - replacing with LAG %d\n", aggregator->aggregator_identifier, new_aggregator->aggregator_identifier); if ((new_aggregator->lag_ports == port) && new_aggregator->is_active) { slave_info(bond->dev, slave->dev, "Removing an active aggregator\n"); select_new_active_agg = 1; } new_aggregator->is_individual = aggregator->is_individual; new_aggregator->actor_admin_aggregator_key = aggregator->actor_admin_aggregator_key; new_aggregator->actor_oper_aggregator_key = aggregator->actor_oper_aggregator_key; new_aggregator->partner_system = aggregator->partner_system; new_aggregator->partner_system_priority = aggregator->partner_system_priority; new_aggregator->partner_oper_aggregator_key = aggregator->partner_oper_aggregator_key; new_aggregator->receive_state = aggregator->receive_state; new_aggregator->transmit_state = aggregator->transmit_state; new_aggregator->lag_ports = aggregator->lag_ports; new_aggregator->is_active = aggregator->is_active; new_aggregator->num_of_ports = aggregator->num_of_ports; /* update the information that is written on * the ports about the aggregator */ for (temp_port = aggregator->lag_ports; temp_port; temp_port = temp_port->next_port_in_aggregator) { temp_port->aggregator = new_aggregator; temp_port->actor_port_aggregator_identifier = new_aggregator->aggregator_identifier; } ad_clear_agg(aggregator); if (select_new_active_agg) ad_agg_selection_logic(__get_first_agg(port), &dummy_slave_update); } else { slave_warn(bond->dev, slave->dev, "unbinding aggregator, and could not find a new aggregator for its ports\n"); } } else { /* in case that the only port related to this * aggregator is the one we want to remove */ select_new_active_agg = aggregator->is_active; 
ad_clear_agg(aggregator); if (select_new_active_agg) { slave_info(bond->dev, slave->dev, "Removing an active aggregator\n"); /* select new active aggregator */ temp_aggregator = __get_first_agg(port); if (temp_aggregator) ad_agg_selection_logic(temp_aggregator, &dummy_slave_update); } } } slave_dbg(bond->dev, slave->dev, "Unbinding port %d\n", port->actor_port_number); /* find the aggregator that this port is connected to */ bond_for_each_slave(bond, slave_iter, iter) { temp_aggregator = &(SLAVE_AD_INFO(slave_iter)->aggregator); prev_port = NULL; /* search the port in the aggregator's related ports */ for (temp_port = temp_aggregator->lag_ports; temp_port; prev_port = temp_port, temp_port = temp_port->next_port_in_aggregator) { if (temp_port == port) { /* the aggregator found - detach the port from * this aggregator */ if (prev_port) prev_port->next_port_in_aggregator = temp_port->next_port_in_aggregator; else temp_aggregator->lag_ports = temp_port->next_port_in_aggregator; temp_aggregator->num_of_ports--; if (__agg_active_ports(temp_aggregator) == 0) { select_new_active_agg = temp_aggregator->is_active; if (temp_aggregator->num_of_ports == 0) ad_clear_agg(temp_aggregator); if (select_new_active_agg) { slave_info(bond->dev, slave->dev, "Removing an active aggregator\n"); /* select new active aggregator */ ad_agg_selection_logic(__get_first_agg(port), &dummy_slave_update); } } break; } } } port->slave = NULL; out: spin_unlock_bh(&bond->mode_lock); } /** * bond_3ad_update_ad_actor_settings - reflect change of actor settings to ports * @bond: bonding struct to work on * * If an ad_actor setting gets changed we need to update the individual port * settings so the bond device will use the new values when it gets upped. 
*/ void bond_3ad_update_ad_actor_settings(struct bonding *bond) { struct list_head *iter; struct slave *slave; ASSERT_RTNL(); BOND_AD_INFO(bond).system.sys_priority = bond->params.ad_actor_sys_prio; if (is_zero_ether_addr(bond->params.ad_actor_system)) BOND_AD_INFO(bond).system.sys_mac_addr = *((struct mac_addr *)bond->dev->dev_addr); else BOND_AD_INFO(bond).system.sys_mac_addr = *((struct mac_addr *)bond->params.ad_actor_system); spin_lock_bh(&bond->mode_lock); bond_for_each_slave(bond, slave, iter) { struct port *port = &(SLAVE_AD_INFO(slave))->port; __ad_actor_update_port(port); port->ntt = true; } spin_unlock_bh(&bond->mode_lock); } /** * bond_agg_timer_advance - advance agg_select_timer * @bond: bonding structure * * Return true when agg_select_timer reaches 0. */ static bool bond_agg_timer_advance(struct bonding *bond) { int val, nval; while (1) { val = atomic_read(&BOND_AD_INFO(bond).agg_select_timer); if (!val) return false; nval = val - 1; if (atomic_cmpxchg(&BOND_AD_INFO(bond).agg_select_timer, val, nval) == val) break; } return nval == 0; } /** * bond_3ad_state_machine_handler - handle state machines timeout * @work: work context to fetch bonding struct to work on from * * The state machine handling concept in this module is to check every tick * which state machine should operate any function. The execution order is * round robin, so when we have an interaction between state machines, the * reply of one to each other might be delayed until next tick. * * This function also complete the initialization when the agg_select_timer * times out, and it selects an aggregator for the ports that are yet not * related to any aggregator, and selects the active aggregator for a bond. 
*/ void bond_3ad_state_machine_handler(struct work_struct *work) { struct bonding *bond = container_of(work, struct bonding, ad_work.work); struct aggregator *aggregator; struct list_head *iter; struct slave *slave; struct port *port; bool should_notify_rtnl = BOND_SLAVE_NOTIFY_LATER; bool update_slave_arr = false; /* Lock to protect data accessed by all (e.g., port->sm_vars) and * against running with bond_3ad_unbind_slave. ad_rx_machine may run * concurrently due to incoming LACPDU as well. */ spin_lock_bh(&bond->mode_lock); rcu_read_lock(); /* check if there are any slaves */ if (!bond_has_slaves(bond)) goto re_arm; if (bond_agg_timer_advance(bond)) { slave = bond_first_slave_rcu(bond); port = slave ? &(SLAVE_AD_INFO(slave)->port) : NULL; /* select the active aggregator for the bond */ if (port) { if (!port->slave) { net_warn_ratelimited("%s: Warning: bond's first port is uninitialized\n", bond->dev->name); goto re_arm; } aggregator = __get_first_agg(port); ad_agg_selection_logic(aggregator, &update_slave_arr); } bond_3ad_set_carrier(bond); } /* for each port run the state machines */ bond_for_each_slave_rcu(bond, slave, iter) { port = &(SLAVE_AD_INFO(slave)->port); if (!port->slave) { net_warn_ratelimited("%s: Warning: Found an uninitialized port\n", bond->dev->name); goto re_arm; } ad_rx_machine(NULL, port); ad_periodic_machine(port, &bond->params); ad_port_selection_logic(port, &update_slave_arr); ad_mux_machine(port, &update_slave_arr); ad_tx_machine(port); ad_churn_machine(port); /* turn off the BEGIN bit, since we already handled it */ if (port->sm_vars & AD_PORT_BEGIN) port->sm_vars &= ~AD_PORT_BEGIN; } re_arm: bond_for_each_slave_rcu(bond, slave, iter) { if (slave->should_notify) { should_notify_rtnl = BOND_SLAVE_NOTIFY_NOW; break; } } rcu_read_unlock(); spin_unlock_bh(&bond->mode_lock); if (update_slave_arr) bond_slave_arr_work_rearm(bond, 0); if (should_notify_rtnl && rtnl_trylock()) { bond_slave_state_notify(bond); rtnl_unlock(); } 
queue_delayed_work(bond->wq, &bond->ad_work, ad_delta_in_ticks); } /** * bond_3ad_rx_indication - handle a received frame * @lacpdu: received lacpdu * @slave: slave struct to work on * * It is assumed that frames that were sent on this NIC don't returned as new * received frames (loopback). Since only the payload is given to this * function, it check for loopback. */ static int bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave) { struct bonding *bond = slave->bond; int ret = RX_HANDLER_ANOTHER; struct bond_marker *marker; struct port *port; atomic64_t *stat; port = &(SLAVE_AD_INFO(slave)->port); if (!port->slave) { net_warn_ratelimited("%s: Warning: port of slave %s is uninitialized\n", slave->dev->name, slave->bond->dev->name); return ret; } switch (lacpdu->subtype) { case AD_TYPE_LACPDU: ret = RX_HANDLER_CONSUMED; slave_dbg(slave->bond->dev, slave->dev, "Received LACPDU on port %d\n", port->actor_port_number); /* Protect against concurrent state machines */ spin_lock(&slave->bond->mode_lock); ad_rx_machine(lacpdu, port); spin_unlock(&slave->bond->mode_lock); break; case AD_TYPE_MARKER: ret = RX_HANDLER_CONSUMED; /* No need to convert fields to Little Endian since we * don't use the marker's fields. 
		 */
		marker = (struct bond_marker *)lacpdu;
		switch (marker->tlv_type) {
		case AD_MARKER_INFORMATION_SUBTYPE:
			slave_dbg(slave->bond->dev, slave->dev,
				  "Received Marker Information on port %d\n",
				  port->actor_port_number);
			ad_marker_info_received(marker, port);
			break;
		case AD_MARKER_RESPONSE_SUBTYPE:
			slave_dbg(slave->bond->dev, slave->dev,
				  "Received Marker Response on port %d\n",
				  port->actor_port_number);
			ad_marker_response_received(marker, port);
			break;
		default:
			slave_dbg(slave->bond->dev, slave->dev,
				  "Received an unknown Marker subtype on port %d\n",
				  port->actor_port_number);
			stat = &SLAVE_AD_INFO(slave)->stats.marker_unknown_rx;
			atomic64_inc(stat);
			stat = &BOND_AD_INFO(bond).stats.marker_unknown_rx;
			atomic64_inc(stat);
		}
		break;
	default:
		atomic64_inc(&SLAVE_AD_INFO(slave)->stats.lacpdu_unknown_rx);
		atomic64_inc(&BOND_AD_INFO(bond).stats.lacpdu_unknown_rx);
	}

	return ret;
}

/**
 * ad_update_actor_keys - Update the oper / admin keys for a port based on
 * its current speed and duplex settings.
 *
 * @port: the port we're looking at
 * @reset: Boolean to just reset the speed and the duplex part of the key
 *
 * The logic to change the oper / admin keys is:
 * (a) A full duplex port can participate in LACP with partner.
 * (b) When the speed is changed, LACP needs to be reinitiated.
*/ static void ad_update_actor_keys(struct port *port, bool reset) { u8 duplex = 0; u16 ospeed = 0, speed = 0; u16 old_oper_key = port->actor_oper_port_key; port->actor_admin_port_key &= ~(AD_SPEED_KEY_MASKS|AD_DUPLEX_KEY_MASKS); if (!reset) { speed = __get_link_speed(port); ospeed = (old_oper_key & AD_SPEED_KEY_MASKS) >> 1; duplex = __get_duplex(port); port->actor_admin_port_key |= (speed << 1) | duplex; } port->actor_oper_port_key = port->actor_admin_port_key; if (old_oper_key != port->actor_oper_port_key) { /* Only 'duplex' port participates in LACP */ if (duplex) port->sm_vars |= AD_PORT_LACP_ENABLED; else port->sm_vars &= ~AD_PORT_LACP_ENABLED; if (!reset) { if (!speed) { slave_err(port->slave->bond->dev, port->slave->dev, "speed changed to 0 on port %d\n", port->actor_port_number); } else if (duplex && ospeed != speed) { /* Speed change restarts LACP state-machine */ port->sm_vars |= AD_PORT_BEGIN; } } } } /** * bond_3ad_adapter_speed_duplex_changed - handle a slave's speed / duplex * change indication * * @slave: slave struct to work on * * Handle reselection of aggregator (if needed) for this port. */ void bond_3ad_adapter_speed_duplex_changed(struct slave *slave) { struct port *port; port = &(SLAVE_AD_INFO(slave)->port); /* if slave is null, the whole port is not initialized */ if (!port->slave) { slave_warn(slave->bond->dev, slave->dev, "speed/duplex changed for uninitialized port\n"); return; } spin_lock_bh(&slave->bond->mode_lock); ad_update_actor_keys(port, false); spin_unlock_bh(&slave->bond->mode_lock); slave_dbg(slave->bond->dev, slave->dev, "Port %d changed speed/duplex\n", port->actor_port_number); } /** * bond_3ad_handle_link_change - handle a slave's link status change indication * @slave: slave struct to work on * @link: whether the link is now up or down * * Handle reselection of aggregator (if needed) for this port. 
 */
void bond_3ad_handle_link_change(struct slave *slave, char link)
{
	struct aggregator *agg;
	struct port *port;
	bool dummy;

	port = &(SLAVE_AD_INFO(slave)->port);

	/* if slave is null, the whole port is not initialized */
	if (!port->slave) {
		slave_warn(slave->bond->dev, slave->dev,
			   "link status changed for uninitialized port\n");
		return;
	}

	spin_lock_bh(&slave->bond->mode_lock);
	/* on link down we are zeroing duplex and speed since
	 * some of the adaptors(ce1000.lan) report full duplex/speed
	 * instead of N/A(duplex) / 0(speed).
	 *
	 * on link up we are forcing recheck on the duplex and speed since
	 * some of the adaptors(ce1000.lan) report.
	 */
	if (link == BOND_LINK_UP) {
		port->is_enabled = true;
		ad_update_actor_keys(port, false);
	} else {
		/* link has failed */
		port->is_enabled = false;
		ad_update_actor_keys(port, true);
	}
	agg = __get_first_agg(port);
	ad_agg_selection_logic(agg, &dummy);

	spin_unlock_bh(&slave->bond->mode_lock);

	slave_dbg(slave->bond->dev, slave->dev,
		  "Port %d changed link status to %s\n",
		  port->actor_port_number,
		  link == BOND_LINK_UP ? "UP" : "DOWN");

	/* RTNL is held and mode_lock is released so it's safe
	 * to update slave_array here.
	 */
	bond_update_slave_arr(slave->bond, NULL);
}

/**
 * bond_3ad_set_carrier - set link state for bonding master
 * @bond: bonding structure
 *
 * if we have an active aggregator, we're up, if not, we're down.
 * Presumes that we cannot have an active aggregator if there are
 * no slaves with link up.
 *
 * This behavior complies with IEEE 802.3 section 43.3.9.
 *
 * Called by bond_set_carrier(). Return zero if carrier state does not
 * change, nonzero if it does.
 */
int bond_3ad_set_carrier(struct bonding *bond)
{
	struct aggregator *active;
	struct slave *first_slave;
	int ret = 1;

	rcu_read_lock();
	first_slave = bond_first_slave_rcu(bond);
	if (!first_slave) {
		ret = 0;
		goto out;
	}
	active = __get_active_agg(&(SLAVE_AD_INFO(first_slave)->aggregator));
	if (active) {
		/* are enough slaves available to consider link up?
*/ if (__agg_active_ports(active) < bond->params.min_links) { if (netif_carrier_ok(bond->dev)) { netif_carrier_off(bond->dev); goto out; } } else if (!netif_carrier_ok(bond->dev)) { netif_carrier_on(bond->dev); goto out; } } else if (netif_carrier_ok(bond->dev)) { netif_carrier_off(bond->dev); } out: rcu_read_unlock(); return ret; } /** * __bond_3ad_get_active_agg_info - get information of the active aggregator * @bond: bonding struct to work on * @ad_info: ad_info struct to fill with the bond's info * * Returns: 0 on success * < 0 on error */ int __bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info) { struct aggregator *aggregator = NULL; struct list_head *iter; struct slave *slave; struct port *port; bond_for_each_slave_rcu(bond, slave, iter) { port = &(SLAVE_AD_INFO(slave)->port); if (port->aggregator && port->aggregator->is_active) { aggregator = port->aggregator; break; } } if (!aggregator) return -1; ad_info->aggregator_id = aggregator->aggregator_identifier; ad_info->ports = __agg_active_ports(aggregator); ad_info->actor_key = aggregator->actor_oper_aggregator_key; ad_info->partner_key = aggregator->partner_oper_aggregator_key; ether_addr_copy(ad_info->partner_system, aggregator->partner_system.mac_addr_value); return 0; } int bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info) { int ret; rcu_read_lock(); ret = __bond_3ad_get_active_agg_info(bond, ad_info); rcu_read_unlock(); return ret; } int bond_3ad_lacpdu_recv(const struct sk_buff *skb, struct bonding *bond, struct slave *slave) { struct lacpdu *lacpdu, _lacpdu; if (skb->protocol != PKT_TYPE_LACPDU) return RX_HANDLER_ANOTHER; if (!MAC_ADDRESS_EQUAL(eth_hdr(skb)->h_dest, lacpdu_mcast_addr)) return RX_HANDLER_ANOTHER; lacpdu = skb_header_pointer(skb, 0, sizeof(_lacpdu), &_lacpdu); if (!lacpdu) { atomic64_inc(&SLAVE_AD_INFO(slave)->stats.lacpdu_illegal_rx); atomic64_inc(&BOND_AD_INFO(bond).stats.lacpdu_illegal_rx); return RX_HANDLER_ANOTHER; } return 
bond_3ad_rx_indication(lacpdu, slave); } /** * bond_3ad_update_lacp_rate - change the lacp rate * @bond: bonding struct * * When modify lacp_rate parameter via sysfs, * update actor_oper_port_state of each port. * * Hold bond->mode_lock, * so we can modify port->actor_oper_port_state, * no matter bond is up or down. */ void bond_3ad_update_lacp_rate(struct bonding *bond) { struct port *port = NULL; struct list_head *iter; struct slave *slave; int lacp_fast; lacp_fast = bond->params.lacp_fast; spin_lock_bh(&bond->mode_lock); bond_for_each_slave(bond, slave, iter) { port = &(SLAVE_AD_INFO(slave)->port); if (lacp_fast) port->actor_oper_port_state |= LACP_STATE_LACP_TIMEOUT; else port->actor_oper_port_state &= ~LACP_STATE_LACP_TIMEOUT; } spin_unlock_bh(&bond->mode_lock); } size_t bond_3ad_stats_size(void) { return nla_total_size_64bit(sizeof(u64)) + /* BOND_3AD_STAT_LACPDU_RX */ nla_total_size_64bit(sizeof(u64)) + /* BOND_3AD_STAT_LACPDU_TX */ nla_total_size_64bit(sizeof(u64)) + /* BOND_3AD_STAT_LACPDU_UNKNOWN_RX */ nla_total_size_64bit(sizeof(u64)) + /* BOND_3AD_STAT_LACPDU_ILLEGAL_RX */ nla_total_size_64bit(sizeof(u64)) + /* BOND_3AD_STAT_MARKER_RX */ nla_total_size_64bit(sizeof(u64)) + /* BOND_3AD_STAT_MARKER_TX */ nla_total_size_64bit(sizeof(u64)) + /* BOND_3AD_STAT_MARKER_RESP_RX */ nla_total_size_64bit(sizeof(u64)) + /* BOND_3AD_STAT_MARKER_RESP_TX */ nla_total_size_64bit(sizeof(u64)); /* BOND_3AD_STAT_MARKER_UNKNOWN_RX */ } int bond_3ad_stats_fill(struct sk_buff *skb, struct bond_3ad_stats *stats) { u64 val; val = atomic64_read(&stats->lacpdu_rx); if (nla_put_u64_64bit(skb, BOND_3AD_STAT_LACPDU_RX, val, BOND_3AD_STAT_PAD)) return -EMSGSIZE; val = atomic64_read(&stats->lacpdu_tx); if (nla_put_u64_64bit(skb, BOND_3AD_STAT_LACPDU_TX, val, BOND_3AD_STAT_PAD)) return -EMSGSIZE; val = atomic64_read(&stats->lacpdu_unknown_rx); if (nla_put_u64_64bit(skb, BOND_3AD_STAT_LACPDU_UNKNOWN_RX, val, BOND_3AD_STAT_PAD)) return -EMSGSIZE; val = 
		atomic64_read(&stats->lacpdu_illegal_rx);
	if (nla_put_u64_64bit(skb, BOND_3AD_STAT_LACPDU_ILLEGAL_RX, val,
			      BOND_3AD_STAT_PAD))
		return -EMSGSIZE;
	val = atomic64_read(&stats->marker_rx);
	if (nla_put_u64_64bit(skb, BOND_3AD_STAT_MARKER_RX, val,
			      BOND_3AD_STAT_PAD))
		return -EMSGSIZE;
	val = atomic64_read(&stats->marker_tx);
	if (nla_put_u64_64bit(skb, BOND_3AD_STAT_MARKER_TX, val,
			      BOND_3AD_STAT_PAD))
		return -EMSGSIZE;
	val = atomic64_read(&stats->marker_resp_rx);
	if (nla_put_u64_64bit(skb, BOND_3AD_STAT_MARKER_RESP_RX, val,
			      BOND_3AD_STAT_PAD))
		return -EMSGSIZE;
	val = atomic64_read(&stats->marker_resp_tx);
	if (nla_put_u64_64bit(skb, BOND_3AD_STAT_MARKER_RESP_TX, val,
			      BOND_3AD_STAT_PAD))
		return -EMSGSIZE;
	val = atomic64_read(&stats->marker_unknown_rx);
	if (nla_put_u64_64bit(skb, BOND_3AD_STAT_MARKER_UNKNOWN_RX, val,
			      BOND_3AD_STAT_PAD))
		return -EMSGSIZE;

	return 0;
}
// SPDX-License-Identifier: GPL-2.0-only
/*
 * In memory quota format relies on quota infrastructure to store dquot
 * information for us. While conventional quota formats for file systems
 * with persistent storage can load quota information into dquot from the
 * storage on-demand and hence quota dquot shrinker can free any dquot
 * that is not currently being used, it must be avoided here. Otherwise we
 * can lose valuable information, user provided limits, because there is
 * no persistent storage to load the information from afterwards.
 *
 * One piece of information that the in-memory quota format needs to keep
 * track of is a sorted list of ids for each quota type. This is done by
 * utilizing an rb tree whose root is stored in mem_dqinfo->dqi_priv for
 * each quota type.
 *
 * This format can be used to support quota on file systems without
 * persistent storage, such as tmpfs.
 *
 * Author:	Lukas Czerner <lczerner@redhat.com>
 *		Carlos Maiolino <cmaiolino@redhat.com>
 *
 * Copyright (C) 2023 Red Hat, Inc.
 */
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/mount.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/rbtree.h>
#include <linux/shmem_fs.h>

#include <linux/quotaops.h>
#include <linux/quota.h>

/*
 * The following constants define the amount of time given a user
 * before the soft limits are treated as hard limits (usually resulting
 * in an allocation failure). The timer is started when the user crosses
 * their soft limit, it is reset when they go below their soft limit.
 */
#define SHMEM_MAX_IQ_TIME 604800	/* (7*24*60*60) 1 week */
#define SHMEM_MAX_DQ_TIME 604800	/* (7*24*60*60) 1 week */

struct quota_id {
	struct rb_node	node;
	qid_t		id;
	qsize_t		bhardlimit;
	qsize_t		bsoftlimit;
	qsize_t		ihardlimit;
	qsize_t		isoftlimit;
};

static int shmem_check_quota_file(struct super_block *sb, int type)
{
	/* There is no real quota file, nothing to do */
	return 1;
}

/*
 * There is no real quota file.
Just allocate rb_root for quota ids and * set limits */ static int shmem_read_file_info(struct super_block *sb, int type) { struct quota_info *dqopt = sb_dqopt(sb); struct mem_dqinfo *info = &dqopt->info[type]; info->dqi_priv = kzalloc(sizeof(struct rb_root), GFP_NOFS); if (!info->dqi_priv) return -ENOMEM; info->dqi_max_spc_limit = SHMEM_QUOTA_MAX_SPC_LIMIT; info->dqi_max_ino_limit = SHMEM_QUOTA_MAX_INO_LIMIT; info->dqi_bgrace = SHMEM_MAX_DQ_TIME; info->dqi_igrace = SHMEM_MAX_IQ_TIME; info->dqi_flags = 0; return 0; } static int shmem_write_file_info(struct super_block *sb, int type) { /* There is no real quota file, nothing to do */ return 0; } /* * Free all the quota_id entries in the rb tree and rb_root. */ static int shmem_free_file_info(struct super_block *sb, int type) { struct mem_dqinfo *info = &sb_dqopt(sb)->info[type]; struct rb_root *root = info->dqi_priv; struct quota_id *entry; struct rb_node *node; info->dqi_priv = NULL; node = rb_first(root); while (node) { entry = rb_entry(node, struct quota_id, node); node = rb_next(&entry->node); rb_erase(&entry->node, root); kfree(entry); } kfree(root); return 0; } static int shmem_get_next_id(struct super_block *sb, struct kqid *qid) { struct mem_dqinfo *info = sb_dqinfo(sb, qid->type); struct rb_node *node; qid_t id = from_kqid(&init_user_ns, *qid); struct quota_info *dqopt = sb_dqopt(sb); struct quota_id *entry = NULL; int ret = 0; if (!sb_has_quota_active(sb, qid->type)) return -ESRCH; down_read(&dqopt->dqio_sem); node = ((struct rb_root *)info->dqi_priv)->rb_node; while (node) { entry = rb_entry(node, struct quota_id, node); if (id < entry->id) node = node->rb_left; else if (id > entry->id) node = node->rb_right; else goto got_next_id; } if (!entry) { ret = -ENOENT; goto out_unlock; } if (id > entry->id) { node = rb_next(&entry->node); if (!node) { ret = -ENOENT; goto out_unlock; } entry = rb_entry(node, struct quota_id, node); } got_next_id: *qid = make_kqid(&init_user_ns, qid->type, entry->id); out_unlock: 
up_read(&dqopt->dqio_sem); return ret; } /* * Load dquot with limits from existing entry, or create the new entry if * it does not exist. */ static int shmem_acquire_dquot(struct dquot *dquot) { struct mem_dqinfo *info = sb_dqinfo(dquot->dq_sb, dquot->dq_id.type); struct rb_node **n; struct shmem_sb_info *sbinfo = dquot->dq_sb->s_fs_info; struct rb_node *parent = NULL, *new_node = NULL; struct quota_id *new_entry, *entry; qid_t id = from_kqid(&init_user_ns, dquot->dq_id); struct quota_info *dqopt = sb_dqopt(dquot->dq_sb); int ret = 0; mutex_lock(&dquot->dq_lock); down_write(&dqopt->dqio_sem); n = &((struct rb_root *)info->dqi_priv)->rb_node; while (*n) { parent = *n; entry = rb_entry(parent, struct quota_id, node); if (id < entry->id) n = &(*n)->rb_left; else if (id > entry->id) n = &(*n)->rb_right; else goto found; } /* We don't have entry for this id yet, create it */ new_entry = kzalloc(sizeof(struct quota_id), GFP_NOFS); if (!new_entry) { ret = -ENOMEM; goto out_unlock; } new_entry->id = id; if (dquot->dq_id.type == USRQUOTA) { new_entry->bhardlimit = sbinfo->qlimits.usrquota_bhardlimit; new_entry->ihardlimit = sbinfo->qlimits.usrquota_ihardlimit; } else if (dquot->dq_id.type == GRPQUOTA) { new_entry->bhardlimit = sbinfo->qlimits.grpquota_bhardlimit; new_entry->ihardlimit = sbinfo->qlimits.grpquota_ihardlimit; } new_node = &new_entry->node; rb_link_node(new_node, parent, n); rb_insert_color(new_node, (struct rb_root *)info->dqi_priv); entry = new_entry; found: /* Load the stored limits from the tree */ spin_lock(&dquot->dq_dqb_lock); dquot->dq_dqb.dqb_bhardlimit = entry->bhardlimit; dquot->dq_dqb.dqb_bsoftlimit = entry->bsoftlimit; dquot->dq_dqb.dqb_ihardlimit = entry->ihardlimit; dquot->dq_dqb.dqb_isoftlimit = entry->isoftlimit; if (!dquot->dq_dqb.dqb_bhardlimit && !dquot->dq_dqb.dqb_bsoftlimit && !dquot->dq_dqb.dqb_ihardlimit && !dquot->dq_dqb.dqb_isoftlimit) set_bit(DQ_FAKE_B, &dquot->dq_flags); spin_unlock(&dquot->dq_dqb_lock); /* Make sure flags update is 
visible after dquot has been filled */ smp_mb__before_atomic(); set_bit(DQ_ACTIVE_B, &dquot->dq_flags); out_unlock: up_write(&dqopt->dqio_sem); mutex_unlock(&dquot->dq_lock); return ret; } static bool shmem_is_empty_dquot(struct dquot *dquot) { struct shmem_sb_info *sbinfo = dquot->dq_sb->s_fs_info; qsize_t bhardlimit; qsize_t ihardlimit; if (dquot->dq_id.type == USRQUOTA) { bhardlimit = sbinfo->qlimits.usrquota_bhardlimit; ihardlimit = sbinfo->qlimits.usrquota_ihardlimit; } else if (dquot->dq_id.type == GRPQUOTA) { bhardlimit = sbinfo->qlimits.grpquota_bhardlimit; ihardlimit = sbinfo->qlimits.grpquota_ihardlimit; } if (test_bit(DQ_FAKE_B, &dquot->dq_flags) || (dquot->dq_dqb.dqb_curspace == 0 && dquot->dq_dqb.dqb_curinodes == 0 && dquot->dq_dqb.dqb_bhardlimit == bhardlimit && dquot->dq_dqb.dqb_ihardlimit == ihardlimit)) return true; return false; } /* * Store limits from dquot in the tree unless it's fake. If it is fake * remove the id from the tree since there is no useful information in * there. 
*/ static int shmem_release_dquot(struct dquot *dquot) { struct mem_dqinfo *info = sb_dqinfo(dquot->dq_sb, dquot->dq_id.type); struct rb_node *node; qid_t id = from_kqid(&init_user_ns, dquot->dq_id); struct quota_info *dqopt = sb_dqopt(dquot->dq_sb); struct quota_id *entry = NULL; mutex_lock(&dquot->dq_lock); /* Check whether we are not racing with some other dqget() */ if (dquot_is_busy(dquot)) goto out_dqlock; down_write(&dqopt->dqio_sem); node = ((struct rb_root *)info->dqi_priv)->rb_node; while (node) { entry = rb_entry(node, struct quota_id, node); if (id < entry->id) node = node->rb_left; else if (id > entry->id) node = node->rb_right; else goto found; } /* We should always find the entry in the rb tree */ WARN_ONCE(1, "quota id %u from dquot %p, not in rb tree!\n", id, dquot); up_write(&dqopt->dqio_sem); mutex_unlock(&dquot->dq_lock); return -ENOENT; found: if (shmem_is_empty_dquot(dquot)) { /* Remove entry from the tree */ rb_erase(&entry->node, info->dqi_priv); kfree(entry); } else { /* Store the limits in the tree */ spin_lock(&dquot->dq_dqb_lock); entry->bhardlimit = dquot->dq_dqb.dqb_bhardlimit; entry->bsoftlimit = dquot->dq_dqb.dqb_bsoftlimit; entry->ihardlimit = dquot->dq_dqb.dqb_ihardlimit; entry->isoftlimit = dquot->dq_dqb.dqb_isoftlimit; spin_unlock(&dquot->dq_dqb_lock); } clear_bit(DQ_ACTIVE_B, &dquot->dq_flags); up_write(&dqopt->dqio_sem); out_dqlock: mutex_unlock(&dquot->dq_lock); return 0; } static int shmem_mark_dquot_dirty(struct dquot *dquot) { return 0; } static int shmem_dquot_write_info(struct super_block *sb, int type) { return 0; } static const struct quota_format_ops shmem_format_ops = { .check_quota_file = shmem_check_quota_file, .read_file_info = shmem_read_file_info, .write_file_info = shmem_write_file_info, .free_file_info = shmem_free_file_info, }; struct quota_format_type shmem_quota_format = { .qf_fmt_id = QFMT_SHMEM, .qf_ops = &shmem_format_ops, .qf_owner = THIS_MODULE }; const struct dquot_operations shmem_quota_operations = { 
.acquire_dquot = shmem_acquire_dquot, .release_dquot = shmem_release_dquot, .alloc_dquot = dquot_alloc, .destroy_dquot = dquot_destroy, .write_info = shmem_dquot_write_info, .mark_dirty = shmem_mark_dquot_dirty, .get_next_id = shmem_get_next_id, }; |
// SPDX-License-Identifier: GPL-2.0
/*
 * Adiantum length-preserving encryption mode
 *
 * Copyright 2018 Google LLC
 */

/*
 * Adiantum is a tweakable, length-preserving encryption mode designed for fast
 * and secure disk encryption, especially on CPUs without dedicated crypto
 * instructions.  Adiantum encrypts each sector using the XChaCha12 stream
 * cipher, two passes of an ε-almost-∆-universal (ε-∆U) hash function based on
 * NH and Poly1305, and an invocation of the AES-256 block cipher on a single
 * 16-byte block.  See the paper for details:
 *
 *	Adiantum: length-preserving encryption for entry-level processors
 *	(https://eprint.iacr.org/2018/720.pdf)
 *
 * For flexibility, this implementation also allows other ciphers:
 *
 *	- Stream cipher: XChaCha12 or XChaCha20
 *	- Block cipher: any with a 128-bit block size and 256-bit key
 *
 * This implementation doesn't currently allow other ε-∆U hash functions, i.e.
 * HPolyC is not supported.  This is because Adiantum is ~20% faster than HPolyC
 * but still provably as secure, and also the ε-∆U hash function of HBSH is
 * formally defined to take two inputs (tweak, message) which makes it difficult
 * to wrap with the crypto_shash API.  Rather, some details need to be handled
 * here.  Nevertheless, if needed in the future, support for other ε-∆U hash
 * functions could be added here.
 */

#include <crypto/b128ops.h>
#include <crypto/chacha.h>
#include <crypto/internal/cipher.h>
#include <crypto/internal/hash.h>
#include <crypto/internal/poly1305.h>
#include <crypto/internal/skcipher.h>
#include <crypto/nhpoly1305.h>
#include <crypto/scatterwalk.h>
#include <linux/module.h>

/*
 * Size of right-hand part of input data, in bytes; also the block cipher's
 * block size and the size of the hash function's output.
 */
#define BLOCKCIPHER_BLOCK_SIZE		16

/* Size of the block cipher key (K_E) in bytes */
#define BLOCKCIPHER_KEY_SIZE		32

/* Size of the hash key (K_H) in bytes */
#define HASH_KEY_SIZE		(POLY1305_BLOCK_SIZE + NHPOLY1305_KEY_SIZE)

/*
 * The specification allows variable-length tweaks, but Linux's crypto API
 * currently only allows algorithms to support a single length.  The "natural"
 * tweak length for Adiantum is 16, since that fits into one Poly1305 block for
 * the best performance.  But longer tweaks are useful for fscrypt, to avoid
 * needing to derive per-file keys.  So instead we use two blocks, or 32 bytes.
 */
#define TWEAK_SIZE		32

struct adiantum_instance_ctx {
	struct crypto_skcipher_spawn streamcipher_spawn;
	struct crypto_cipher_spawn blockcipher_spawn;
	struct crypto_shash_spawn hash_spawn;
};

struct adiantum_tfm_ctx {
	struct crypto_skcipher *streamcipher;
	struct crypto_cipher *blockcipher;
	struct crypto_shash *hash;
	struct poly1305_core_key header_hash_key;
};

struct adiantum_request_ctx {

	/*
	 * Buffer for right-hand part of data, i.e.
	 *
	 *    P_R => P_M => C_M => C_R when encrypting, or
	 *    C_R => C_M => P_M => P_R when decrypting.
	 *
	 * Also used to build the IV for the stream cipher.
*/ union { u8 bytes[XCHACHA_IV_SIZE]; __le32 words[XCHACHA_IV_SIZE / sizeof(__le32)]; le128 bignum; /* interpret as element of Z/(2^{128}Z) */ } rbuf; bool enc; /* true if encrypting, false if decrypting */ /* * The result of the Poly1305 ε-∆U hash function applied to * (bulk length, tweak) */ le128 header_hash; /* Sub-requests, must be last */ union { struct shash_desc hash_desc; struct skcipher_request streamcipher_req; } u; }; /* * Given the XChaCha stream key K_S, derive the block cipher key K_E and the * hash key K_H as follows: * * K_E || K_H || ... = XChaCha(key=K_S, nonce=1||0^191) * * Note that this denotes using bits from the XChaCha keystream, which here we * get indirectly by encrypting a buffer containing all 0's. */ static int adiantum_setkey(struct crypto_skcipher *tfm, const u8 *key, unsigned int keylen) { struct adiantum_tfm_ctx *tctx = crypto_skcipher_ctx(tfm); struct { u8 iv[XCHACHA_IV_SIZE]; u8 derived_keys[BLOCKCIPHER_KEY_SIZE + HASH_KEY_SIZE]; struct scatterlist sg; struct crypto_wait wait; struct skcipher_request req; /* must be last */ } *data; u8 *keyp; int err; /* Set the stream cipher key (K_S) */ crypto_skcipher_clear_flags(tctx->streamcipher, CRYPTO_TFM_REQ_MASK); crypto_skcipher_set_flags(tctx->streamcipher, crypto_skcipher_get_flags(tfm) & CRYPTO_TFM_REQ_MASK); err = crypto_skcipher_setkey(tctx->streamcipher, key, keylen); if (err) return err; /* Derive the subkeys */ data = kzalloc(sizeof(*data) + crypto_skcipher_reqsize(tctx->streamcipher), GFP_KERNEL); if (!data) return -ENOMEM; data->iv[0] = 1; sg_init_one(&data->sg, data->derived_keys, sizeof(data->derived_keys)); crypto_init_wait(&data->wait); skcipher_request_set_tfm(&data->req, tctx->streamcipher); skcipher_request_set_callback(&data->req, CRYPTO_TFM_REQ_MAY_SLEEP | CRYPTO_TFM_REQ_MAY_BACKLOG, crypto_req_done, &data->wait); skcipher_request_set_crypt(&data->req, &data->sg, &data->sg, sizeof(data->derived_keys), data->iv); err = 
crypto_wait_req(crypto_skcipher_encrypt(&data->req), &data->wait); if (err) goto out; keyp = data->derived_keys; /* Set the block cipher key (K_E) */ crypto_cipher_clear_flags(tctx->blockcipher, CRYPTO_TFM_REQ_MASK); crypto_cipher_set_flags(tctx->blockcipher, crypto_skcipher_get_flags(tfm) & CRYPTO_TFM_REQ_MASK); err = crypto_cipher_setkey(tctx->blockcipher, keyp, BLOCKCIPHER_KEY_SIZE); if (err) goto out; keyp += BLOCKCIPHER_KEY_SIZE; /* Set the hash key (K_H) */ poly1305_core_setkey(&tctx->header_hash_key, keyp); keyp += POLY1305_BLOCK_SIZE; crypto_shash_clear_flags(tctx->hash, CRYPTO_TFM_REQ_MASK); crypto_shash_set_flags(tctx->hash, crypto_skcipher_get_flags(tfm) & CRYPTO_TFM_REQ_MASK); err = crypto_shash_setkey(tctx->hash, keyp, NHPOLY1305_KEY_SIZE); keyp += NHPOLY1305_KEY_SIZE; WARN_ON(keyp != &data->derived_keys[ARRAY_SIZE(data->derived_keys)]); out: kfree_sensitive(data); return err; } /* Addition in Z/(2^{128}Z) */ static inline void le128_add(le128 *r, const le128 *v1, const le128 *v2) { u64 x = le64_to_cpu(v1->b); u64 y = le64_to_cpu(v2->b); r->b = cpu_to_le64(x + y); r->a = cpu_to_le64(le64_to_cpu(v1->a) + le64_to_cpu(v2->a) + (x + y < x)); } /* Subtraction in Z/(2^{128}Z) */ static inline void le128_sub(le128 *r, const le128 *v1, const le128 *v2) { u64 x = le64_to_cpu(v1->b); u64 y = le64_to_cpu(v2->b); r->b = cpu_to_le64(x - y); r->a = cpu_to_le64(le64_to_cpu(v1->a) - le64_to_cpu(v2->a) - (x - y > x)); } /* * Apply the Poly1305 ε-∆U hash function to (bulk length, tweak) and save the * result to rctx->header_hash. This is the calculation * * H_T ← Poly1305_{K_T}(bin_{128}(|L|) || T) * * from the procedure in section 6.4 of the Adiantum paper. The resulting value * is reused in both the first and second hash steps. 
Specifically, it's added * to the result of an independently keyed ε-∆U hash function (for equal length * inputs only) taken over the left-hand part (the "bulk") of the message, to * give the overall Adiantum hash of the (tweak, left-hand part) pair. */ static void adiantum_hash_header(struct skcipher_request *req) { struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); const struct adiantum_tfm_ctx *tctx = crypto_skcipher_ctx(tfm); struct adiantum_request_ctx *rctx = skcipher_request_ctx(req); const unsigned int bulk_len = req->cryptlen - BLOCKCIPHER_BLOCK_SIZE; struct { __le64 message_bits; __le64 padding; } header = { .message_bits = cpu_to_le64((u64)bulk_len * 8) }; struct poly1305_state state; poly1305_core_init(&state); BUILD_BUG_ON(sizeof(header) % POLY1305_BLOCK_SIZE != 0); poly1305_core_blocks(&state, &tctx->header_hash_key, &header, sizeof(header) / POLY1305_BLOCK_SIZE, 1); BUILD_BUG_ON(TWEAK_SIZE % POLY1305_BLOCK_SIZE != 0); poly1305_core_blocks(&state, &tctx->header_hash_key, req->iv, TWEAK_SIZE / POLY1305_BLOCK_SIZE, 1); poly1305_core_emit(&state, NULL, &rctx->header_hash); } /* Hash the left-hand part (the "bulk") of the message using NHPoly1305 */ static int adiantum_hash_message(struct skcipher_request *req, struct scatterlist *sgl, unsigned int nents, le128 *digest) { struct adiantum_request_ctx *rctx = skcipher_request_ctx(req); const unsigned int bulk_len = req->cryptlen - BLOCKCIPHER_BLOCK_SIZE; struct shash_desc *hash_desc = &rctx->u.hash_desc; struct sg_mapping_iter miter; unsigned int i, n; int err; err = crypto_shash_init(hash_desc); if (err) return err; sg_miter_start(&miter, sgl, nents, SG_MITER_FROM_SG | SG_MITER_ATOMIC); for (i = 0; i < bulk_len; i += n) { sg_miter_next(&miter); n = min_t(unsigned int, miter.length, bulk_len - i); err = crypto_shash_update(hash_desc, miter.addr, n); if (err) break; } sg_miter_stop(&miter); if (err) return err; return crypto_shash_final(hash_desc, (u8 *)digest); } /* Continue Adiantum 
encryption/decryption after the stream cipher step */ static int adiantum_finish(struct skcipher_request *req) { struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); const struct adiantum_tfm_ctx *tctx = crypto_skcipher_ctx(tfm); struct adiantum_request_ctx *rctx = skcipher_request_ctx(req); const unsigned int bulk_len = req->cryptlen - BLOCKCIPHER_BLOCK_SIZE; struct scatterlist *dst = req->dst; const unsigned int dst_nents = sg_nents(dst); le128 digest; int err; /* If decrypting, decrypt C_M with the block cipher to get P_M */ if (!rctx->enc) crypto_cipher_decrypt_one(tctx->blockcipher, rctx->rbuf.bytes, rctx->rbuf.bytes); /* * Second hash step * enc: C_R = C_M - H_{K_H}(T, C_L) * dec: P_R = P_M - H_{K_H}(T, P_L) */ rctx->u.hash_desc.tfm = tctx->hash; le128_sub(&rctx->rbuf.bignum, &rctx->rbuf.bignum, &rctx->header_hash); if (dst_nents == 1 && dst->offset + req->cryptlen <= PAGE_SIZE) { /* Fast path for single-page destination */ struct page *page = sg_page(dst); void *virt = kmap_local_page(page) + dst->offset; err = crypto_shash_digest(&rctx->u.hash_desc, virt, bulk_len, (u8 *)&digest); if (err) { kunmap_local(virt); return err; } le128_sub(&rctx->rbuf.bignum, &rctx->rbuf.bignum, &digest); memcpy(virt + bulk_len, &rctx->rbuf.bignum, sizeof(le128)); flush_dcache_page(page); kunmap_local(virt); } else { /* Slow path that works for any destination scatterlist */ err = adiantum_hash_message(req, dst, dst_nents, &digest); if (err) return err; le128_sub(&rctx->rbuf.bignum, &rctx->rbuf.bignum, &digest); scatterwalk_map_and_copy(&rctx->rbuf.bignum, dst, bulk_len, sizeof(le128), 1); } return 0; } static void adiantum_streamcipher_done(void *data, int err) { struct skcipher_request *req = data; if (!err) err = adiantum_finish(req); skcipher_request_complete(req, err); } static int adiantum_crypt(struct skcipher_request *req, bool enc) { struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); const struct adiantum_tfm_ctx *tctx = crypto_skcipher_ctx(tfm); struct 
adiantum_request_ctx *rctx = skcipher_request_ctx(req); const unsigned int bulk_len = req->cryptlen - BLOCKCIPHER_BLOCK_SIZE; struct scatterlist *src = req->src; const unsigned int src_nents = sg_nents(src); unsigned int stream_len; le128 digest; int err; if (req->cryptlen < BLOCKCIPHER_BLOCK_SIZE) return -EINVAL; rctx->enc = enc; /* * First hash step * enc: P_M = P_R + H_{K_H}(T, P_L) * dec: C_M = C_R + H_{K_H}(T, C_L) */ adiantum_hash_header(req); rctx->u.hash_desc.tfm = tctx->hash; if (src_nents == 1 && src->offset + req->cryptlen <= PAGE_SIZE) { /* Fast path for single-page source */ void *virt = kmap_local_page(sg_page(src)) + src->offset; err = crypto_shash_digest(&rctx->u.hash_desc, virt, bulk_len, (u8 *)&digest); memcpy(&rctx->rbuf.bignum, virt + bulk_len, sizeof(le128)); kunmap_local(virt); } else { /* Slow path that works for any source scatterlist */ err = adiantum_hash_message(req, src, src_nents, &digest); scatterwalk_map_and_copy(&rctx->rbuf.bignum, src, bulk_len, sizeof(le128), 0); } if (err) return err; le128_add(&rctx->rbuf.bignum, &rctx->rbuf.bignum, &rctx->header_hash); le128_add(&rctx->rbuf.bignum, &rctx->rbuf.bignum, &digest); /* If encrypting, encrypt P_M with the block cipher to get C_M */ if (enc) crypto_cipher_encrypt_one(tctx->blockcipher, rctx->rbuf.bytes, rctx->rbuf.bytes); /* Initialize the rest of the XChaCha IV (first part is C_M) */ BUILD_BUG_ON(BLOCKCIPHER_BLOCK_SIZE != 16); BUILD_BUG_ON(XCHACHA_IV_SIZE != 32); /* nonce || stream position */ rctx->rbuf.words[4] = cpu_to_le32(1); rctx->rbuf.words[5] = 0; rctx->rbuf.words[6] = 0; rctx->rbuf.words[7] = 0; /* * XChaCha needs to be done on all the data except the last 16 bytes; * for disk encryption that usually means 4080 or 496 bytes. But ChaCha * implementations tend to be most efficient when passed a whole number * of 64-byte ChaCha blocks, or sometimes even a multiple of 256 bytes. 
* And here it doesn't matter whether the last 16 bytes are written to, * as the second hash step will overwrite them. Thus, round the XChaCha * length up to the next 64-byte boundary if possible. */ stream_len = bulk_len; if (round_up(stream_len, CHACHA_BLOCK_SIZE) <= req->cryptlen) stream_len = round_up(stream_len, CHACHA_BLOCK_SIZE); skcipher_request_set_tfm(&rctx->u.streamcipher_req, tctx->streamcipher); skcipher_request_set_crypt(&rctx->u.streamcipher_req, req->src, req->dst, stream_len, &rctx->rbuf); skcipher_request_set_callback(&rctx->u.streamcipher_req, req->base.flags, adiantum_streamcipher_done, req); return crypto_skcipher_encrypt(&rctx->u.streamcipher_req) ?: adiantum_finish(req); } static int adiantum_encrypt(struct skcipher_request *req) { return adiantum_crypt(req, true); } static int adiantum_decrypt(struct skcipher_request *req) { return adiantum_crypt(req, false); } static int adiantum_init_tfm(struct crypto_skcipher *tfm) { struct skcipher_instance *inst = skcipher_alg_instance(tfm); struct adiantum_instance_ctx *ictx = skcipher_instance_ctx(inst); struct adiantum_tfm_ctx *tctx = crypto_skcipher_ctx(tfm); struct crypto_skcipher *streamcipher; struct crypto_cipher *blockcipher; struct crypto_shash *hash; unsigned int subreq_size; int err; streamcipher = crypto_spawn_skcipher(&ictx->streamcipher_spawn); if (IS_ERR(streamcipher)) return PTR_ERR(streamcipher); blockcipher = crypto_spawn_cipher(&ictx->blockcipher_spawn); if (IS_ERR(blockcipher)) { err = PTR_ERR(blockcipher); goto err_free_streamcipher; } hash = crypto_spawn_shash(&ictx->hash_spawn); if (IS_ERR(hash)) { err = PTR_ERR(hash); goto err_free_blockcipher; } tctx->streamcipher = streamcipher; tctx->blockcipher = blockcipher; tctx->hash = hash; BUILD_BUG_ON(offsetofend(struct adiantum_request_ctx, u) != sizeof(struct adiantum_request_ctx)); subreq_size = max(sizeof_field(struct adiantum_request_ctx, u.hash_desc) + crypto_shash_descsize(hash), sizeof_field(struct adiantum_request_ctx, 
u.streamcipher_req) + crypto_skcipher_reqsize(streamcipher)); crypto_skcipher_set_reqsize(tfm, offsetof(struct adiantum_request_ctx, u) + subreq_size); return 0; err_free_blockcipher: crypto_free_cipher(blockcipher); err_free_streamcipher: crypto_free_skcipher(streamcipher); return err; } static void adiantum_exit_tfm(struct crypto_skcipher *tfm) { struct adiantum_tfm_ctx *tctx = crypto_skcipher_ctx(tfm); crypto_free_skcipher(tctx->streamcipher); crypto_free_cipher(tctx->blockcipher); crypto_free_shash(tctx->hash); } static void adiantum_free_instance(struct skcipher_instance *inst) { struct adiantum_instance_ctx *ictx = skcipher_instance_ctx(inst); crypto_drop_skcipher(&ictx->streamcipher_spawn); crypto_drop_cipher(&ictx->blockcipher_spawn); crypto_drop_shash(&ictx->hash_spawn); kfree(inst); } /* * Check for a supported set of inner algorithms. * See the comment at the beginning of this file. */ static bool adiantum_supported_algorithms(struct skcipher_alg_common *streamcipher_alg, struct crypto_alg *blockcipher_alg, struct shash_alg *hash_alg) { if (strcmp(streamcipher_alg->base.cra_name, "xchacha12") != 0 && strcmp(streamcipher_alg->base.cra_name, "xchacha20") != 0) return false; if (blockcipher_alg->cra_cipher.cia_min_keysize > BLOCKCIPHER_KEY_SIZE || blockcipher_alg->cra_cipher.cia_max_keysize < BLOCKCIPHER_KEY_SIZE) return false; if (blockcipher_alg->cra_blocksize != BLOCKCIPHER_BLOCK_SIZE) return false; if (strcmp(hash_alg->base.cra_name, "nhpoly1305") != 0) return false; return true; } static int adiantum_create(struct crypto_template *tmpl, struct rtattr **tb) { u32 mask; const char *nhpoly1305_name; struct skcipher_instance *inst; struct adiantum_instance_ctx *ictx; struct skcipher_alg_common *streamcipher_alg; struct crypto_alg *blockcipher_alg; struct shash_alg *hash_alg; int err; err = crypto_check_attr_type(tb, CRYPTO_ALG_TYPE_SKCIPHER, &mask); if (err) return err; inst = kzalloc(sizeof(*inst) + sizeof(*ictx), GFP_KERNEL); if (!inst) return -ENOMEM; 
ictx = skcipher_instance_ctx(inst); /* Stream cipher, e.g. "xchacha12" */ err = crypto_grab_skcipher(&ictx->streamcipher_spawn, skcipher_crypto_instance(inst), crypto_attr_alg_name(tb[1]), 0, mask); if (err) goto err_free_inst; streamcipher_alg = crypto_spawn_skcipher_alg_common(&ictx->streamcipher_spawn); /* Block cipher, e.g. "aes" */ err = crypto_grab_cipher(&ictx->blockcipher_spawn, skcipher_crypto_instance(inst), crypto_attr_alg_name(tb[2]), 0, mask); if (err) goto err_free_inst; blockcipher_alg = crypto_spawn_cipher_alg(&ictx->blockcipher_spawn); /* NHPoly1305 ε-∆U hash function */ nhpoly1305_name = crypto_attr_alg_name(tb[3]); if (nhpoly1305_name == ERR_PTR(-ENOENT)) nhpoly1305_name = "nhpoly1305"; err = crypto_grab_shash(&ictx->hash_spawn, skcipher_crypto_instance(inst), nhpoly1305_name, 0, mask); if (err) goto err_free_inst; hash_alg = crypto_spawn_shash_alg(&ictx->hash_spawn); /* Check the set of algorithms */ if (!adiantum_supported_algorithms(streamcipher_alg, blockcipher_alg, hash_alg)) { pr_warn("Unsupported Adiantum instantiation: (%s,%s,%s)\n", streamcipher_alg->base.cra_name, blockcipher_alg->cra_name, hash_alg->base.cra_name); err = -EINVAL; goto err_free_inst; } /* Instance fields */ err = -ENAMETOOLONG; if (snprintf(inst->alg.base.cra_name, CRYPTO_MAX_ALG_NAME, "adiantum(%s,%s)", streamcipher_alg->base.cra_name, blockcipher_alg->cra_name) >= CRYPTO_MAX_ALG_NAME) goto err_free_inst; if (snprintf(inst->alg.base.cra_driver_name, CRYPTO_MAX_ALG_NAME, "adiantum(%s,%s,%s)", streamcipher_alg->base.cra_driver_name, blockcipher_alg->cra_driver_name, hash_alg->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME) goto err_free_inst; inst->alg.base.cra_blocksize = BLOCKCIPHER_BLOCK_SIZE; inst->alg.base.cra_ctxsize = sizeof(struct adiantum_tfm_ctx); inst->alg.base.cra_alignmask = streamcipher_alg->base.cra_alignmask; /* * The block cipher is only invoked once per message, so for long * messages (e.g. 
sectors for disk encryption) its performance doesn't * matter as much as that of the stream cipher and hash function. Thus, * weigh the block cipher's ->cra_priority less. */ inst->alg.base.cra_priority = (4 * streamcipher_alg->base.cra_priority + 2 * hash_alg->base.cra_priority + blockcipher_alg->cra_priority) / 7; inst->alg.setkey = adiantum_setkey; inst->alg.encrypt = adiantum_encrypt; inst->alg.decrypt = adiantum_decrypt; inst->alg.init = adiantum_init_tfm; inst->alg.exit = adiantum_exit_tfm; inst->alg.min_keysize = streamcipher_alg->min_keysize; inst->alg.max_keysize = streamcipher_alg->max_keysize; inst->alg.ivsize = TWEAK_SIZE; inst->free = adiantum_free_instance; err = skcipher_register_instance(tmpl, inst); if (err) { err_free_inst: adiantum_free_instance(inst); } return err; } /* adiantum(streamcipher_name, blockcipher_name [, nhpoly1305_name]) */ static struct crypto_template adiantum_tmpl = { .name = "adiantum", .create = adiantum_create, .module = THIS_MODULE, }; static int __init adiantum_module_init(void) { return crypto_register_template(&adiantum_tmpl); } static void __exit adiantum_module_exit(void) { crypto_unregister_template(&adiantum_tmpl); } subsys_initcall(adiantum_module_init); module_exit(adiantum_module_exit); MODULE_DESCRIPTION("Adiantum length-preserving encryption mode"); MODULE_LICENSE("GPL v2"); MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>"); MODULE_ALIAS_CRYPTO("adiantum"); MODULE_IMPORT_NS("CRYPTO_INTERNAL"); |
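The le128_add() and le128_sub() helpers above implement arithmetic in Z/(2^{128}Z) on two 64-bit halves, detecting the carry or borrow from unsigned wrap-around of the low half. The following is a minimal userspace sketch of that idea, not the kernel code itself: `struct u128`, `u128_add()` and `u128_sub()` are hypothetical names, and the halves are kept in host byte order rather than as `__le64` fields.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical host-endian stand-in for the kernel's le128:
 * "lo" plays the role of le128.b, "hi" the role of le128.a.
 */
struct u128 {
	uint64_t lo;
	uint64_t hi;
};

/*
 * r = v1 + v2 (mod 2^128).
 * The low-half sum wrapped around exactly when it is smaller than v1.lo,
 * in which case a carry of 1 is propagated into the high half.
 */
static struct u128 u128_add(struct u128 v1, struct u128 v2)
{
	struct u128 r;

	r.lo = v1.lo + v2.lo;
	r.hi = v1.hi + v2.hi + (r.lo < v1.lo);
	return r;
}

/*
 * r = v1 - v2 (mod 2^128).
 * The low-half difference wrapped around exactly when it is larger than
 * v1.lo, in which case a borrow of 1 is taken from the high half.
 */
static struct u128 u128_sub(struct u128 v1, struct u128 v2)
{
	struct u128 r;

	r.lo = v1.lo - v2.lo;
	r.hi = v1.hi - v2.hi - (r.lo > v1.lo);
	return r;
}
```

Because subtraction is the exact inverse of addition modulo 2^128, the second hash step can undo the first one, which is what lets the same right-hand buffer be reused for both encryption and decryption.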
// SPDX-License-Identifier: GPL-2.0+
/*
 *  Universal/legacy driver for 8250/16550-type serial ports
 *
 *  Based on drivers/char/serial.c, by Linus Torvalds, Theodore Ts'o.
 *
 *  Copyright (C) 2001 Russell King.
 *
 *  Supports:
 *	      early_serial_setup() ports
 *	      userspace-configurable "phantom" ports
 *	      serial8250_register_8250_port() ports
 */

#include <linux/acpi.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/ioport.h>
#include <linux/init.h>
#include <linux/console.h>
#include <linux/sysrq.h>
#include <linux/delay.h>
#include <linux/platform_device.h>
#include <linux/pm_runtime.h>
#include <linux/tty.h>
#include <linux/ratelimit.h>
#include <linux/tty_flip.h>
#include <linux/serial.h>
#include <linux/serial_8250.h>
#include <linux/nmi.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/string_helpers.h>
#include <linux/uaccess.h>
#include <linux/io.h>

#include <asm/irq.h>

#include "8250.h"

#define PASS_LIMIT	512

struct irq_info {
	struct			hlist_node node;
	int			irq;
	spinlock_t		lock;	/* Protects list not the hash */
	struct list_head	*head;
};

#define NR_IRQ_HASH		32	/* Can be adjusted later */
static struct hlist_head irq_lists[NR_IRQ_HASH];
static DEFINE_MUTEX(hash_mutex);	/* Used to walk the hash */

/*
 * This is the serial driver's interrupt routine.
 *
 * Arjan thinks the old way was overly complex, so it got simplified.
 * Alan disagrees, saying that we need the complexity to handle the weird
 * nature of ISA shared interrupts.  (This is a special exception.)
 *
 * In order to handle ISA shared interrupts properly, we need to check
 * that all ports have been serviced, and therefore the ISA interrupt
 * line has been de-asserted.
 *
 * This means we need to loop through all ports, checking that they
 * don't have an interrupt pending.
*/ static irqreturn_t serial8250_interrupt(int irq, void *dev_id) { struct irq_info *i = dev_id; struct list_head *l, *end = NULL; int pass_counter = 0, handled = 0; pr_debug("%s(%d): start\n", __func__, irq); spin_lock(&i->lock); l = i->head; do { struct uart_8250_port *up; struct uart_port *port; up = list_entry(l, struct uart_8250_port, list); port = &up->port; if (port->handle_irq(port)) { handled = 1; end = NULL; } else if (end == NULL) end = l; l = l->next; if (l == i->head && pass_counter++ > PASS_LIMIT) break; } while (l != end); spin_unlock(&i->lock); pr_debug("%s(%d): end\n", __func__, irq); return IRQ_RETVAL(handled); } /* * To support ISA shared interrupts, we need to have one interrupt * handler that ensures that the IRQ line has been deasserted * before returning. Failing to do this will result in the IRQ * line being stuck active, and, since ISA irqs are edge triggered, * no more IRQs will be seen. */ static void serial_do_unlink(struct irq_info *i, struct uart_8250_port *up) { spin_lock_irq(&i->lock); if (!list_empty(i->head)) { if (i->head == &up->list) i->head = i->head->next; list_del(&up->list); } else { BUG_ON(i->head != &up->list); i->head = NULL; } spin_unlock_irq(&i->lock); /* List empty so throw away the hash node */ if (i->head == NULL) { hlist_del(&i->node); kfree(i); } } static int serial_link_irq_chain(struct uart_8250_port *up) { struct hlist_head *h; struct irq_info *i; int ret; mutex_lock(&hash_mutex); h = &irq_lists[up->port.irq % NR_IRQ_HASH]; hlist_for_each_entry(i, h, node) if (i->irq == up->port.irq) break; if (i == NULL) { i = kzalloc(sizeof(struct irq_info), GFP_KERNEL); if (i == NULL) { mutex_unlock(&hash_mutex); return -ENOMEM; } spin_lock_init(&i->lock); i->irq = up->port.irq; hlist_add_head(&i->node, h); } mutex_unlock(&hash_mutex); spin_lock_irq(&i->lock); if (i->head) { list_add(&up->list, i->head); spin_unlock_irq(&i->lock); ret = 0; } else { INIT_LIST_HEAD(&up->list); i->head = &up->list; spin_unlock_irq(&i->lock); ret 
= request_irq(up->port.irq, serial8250_interrupt, up->port.irqflags, up->port.name, i); if (ret < 0) serial_do_unlink(i, up); } return ret; } static void serial_unlink_irq_chain(struct uart_8250_port *up) { struct irq_info *i; struct hlist_head *h; mutex_lock(&hash_mutex); h = &irq_lists[up->port.irq % NR_IRQ_HASH]; hlist_for_each_entry(i, h, node) if (i->irq == up->port.irq) break; BUG_ON(i == NULL); BUG_ON(i->head == NULL); if (list_empty(i->head)) free_irq(up->port.irq, i); serial_do_unlink(i, up); mutex_unlock(&hash_mutex); } /* * This function is used to handle ports that do not have an * interrupt. This doesn't work very well for 16450's, but gives * barely passable results for a 16550A. (Although at the expense * of much CPU overhead). */ static void serial8250_timeout(struct timer_list *t) { struct uart_8250_port *up = from_timer(up, t, timer); up->port.handle_irq(&up->port); mod_timer(&up->timer, jiffies + uart_poll_timeout(&up->port)); } static void serial8250_backup_timeout(struct timer_list *t) { struct uart_8250_port *up = from_timer(up, t, timer); unsigned int iir, ier = 0, lsr; unsigned long flags; uart_port_lock_irqsave(&up->port, &flags); /* * Must disable interrupts or else we risk racing with the interrupt * based handler. */ if (up->port.irq) { ier = serial_in(up, UART_IER); serial_out(up, UART_IER, 0); } iir = serial_in(up, UART_IIR); /* * This should be a safe test for anyone who doesn't trust the * IIR bits on their UART, but it's specifically designed for * the "Diva" UART used on the management processor on many HP * ia64 and parisc boxes. 
*/ lsr = serial_lsr_in(up); if ((iir & UART_IIR_NO_INT) && (up->ier & UART_IER_THRI) && (!kfifo_is_empty(&up->port.state->port.xmit_fifo) || up->port.x_char) && (lsr & UART_LSR_THRE)) { iir &= ~(UART_IIR_ID | UART_IIR_NO_INT); iir |= UART_IIR_THRI; } if (!(iir & UART_IIR_NO_INT)) serial8250_tx_chars(up); if (up->port.irq) serial_out(up, UART_IER, ier); uart_port_unlock_irqrestore(&up->port, flags); /* Standard timer interval plus 0.2s to keep the port running */ mod_timer(&up->timer, jiffies + uart_poll_timeout(&up->port) + HZ / 5); } static void univ8250_setup_timer(struct uart_8250_port *up) { struct uart_port *port = &up->port; /* * The above check will only give an accurate result the first time * the port is opened so this value needs to be preserved. */ if (up->bugs & UART_BUG_THRE) { pr_debug("%s - using backup timer\n", port->name); up->timer.function = serial8250_backup_timeout; mod_timer(&up->timer, jiffies + uart_poll_timeout(port) + HZ / 5); } /* * If the "interrupt" for this port doesn't correspond with any * hardware interrupt, we use a timer-based system. The original * driver used to do this with IRQ0. 
*/ if (!port->irq) mod_timer(&up->timer, jiffies + uart_poll_timeout(port)); } static int univ8250_setup_irq(struct uart_8250_port *up) { struct uart_port *port = &up->port; if (port->irq) return serial_link_irq_chain(up); return 0; } static void univ8250_release_irq(struct uart_8250_port *up) { struct uart_port *port = &up->port; timer_delete_sync(&up->timer); up->timer.function = serial8250_timeout; if (port->irq) serial_unlink_irq_chain(up); } const struct uart_ops *univ8250_port_base_ops = NULL; struct uart_ops univ8250_port_ops; static const struct uart_8250_ops univ8250_driver_ops = { .setup_irq = univ8250_setup_irq, .release_irq = univ8250_release_irq, .setup_timer = univ8250_setup_timer, }; static struct uart_8250_port serial8250_ports[UART_NR]; /** * serial8250_get_port - retrieve struct uart_8250_port * @line: serial line number * * This function retrieves struct uart_8250_port for the specific line. * This struct *must* *not* be used to perform a 8250 or serial core operation * which is not accessible otherwise. Its only purpose is to make the struct * accessible to the runtime-pm callbacks for context suspend/restore. * The lock assumption made here is none because runtime-pm suspend/resume * callbacks should not be invoked if there is any operation performed on the * port. */ struct uart_8250_port *serial8250_get_port(int line) { return &serial8250_ports[line]; } EXPORT_SYMBOL_GPL(serial8250_get_port); static inline void serial8250_apply_quirks(struct uart_8250_port *up) { up->port.quirks |= skip_txen_test ? 
UPQ_NO_TXEN_TEST : 0; } struct uart_8250_port *serial8250_setup_port(int index) { struct uart_8250_port *up; if (index >= UART_NR) return NULL; up = &serial8250_ports[index]; up->port.line = index; up->port.port_id = index; serial8250_init_port(up); if (!univ8250_port_base_ops) univ8250_port_base_ops = up->port.ops; up->port.ops = &univ8250_port_ops; timer_setup(&up->timer, serial8250_timeout, 0); up->ops = &univ8250_driver_ops; serial8250_set_defaults(up); return up; } void __init serial8250_register_ports(struct uart_driver *drv, struct device *dev) { int i; for (i = 0; i < nr_uarts; i++) { struct uart_8250_port *up = &serial8250_ports[i]; if (up->port.type == PORT_8250_CIR) continue; if (up->port.dev) continue; up->port.dev = dev; if (uart_console_registered(&up->port)) pm_runtime_get_sync(up->port.dev); serial8250_apply_quirks(up); uart_add_one_port(drv, &up->port); } } #ifdef CONFIG_SERIAL_8250_CONSOLE static void univ8250_console_write(struct console *co, const char *s, unsigned int count) { struct uart_8250_port *up = &serial8250_ports[co->index]; serial8250_console_write(up, s, count); } static int univ8250_console_setup(struct console *co, char *options) { struct uart_8250_port *up; struct uart_port *port; int retval, i; /* * Check whether an invalid uart number has been specified, and * if so, search for the first available port that does have * console support. */ if (co->index < 0 || co->index >= UART_NR) co->index = 0; /* * If the console is past the initial isa ports, init more ports up to * co->index as needed and increment nr_uarts accordingly. 
*/ for (i = nr_uarts; i <= co->index; i++) { up = serial8250_setup_port(i); if (!up) return -ENODEV; nr_uarts++; } port = &serial8250_ports[co->index].port; /* link port to console */ uart_port_set_cons(port, co); retval = serial8250_console_setup(port, options, false); if (retval != 0) uart_port_set_cons(port, NULL); return retval; } static int univ8250_console_exit(struct console *co) { struct uart_port *port; port = &serial8250_ports[co->index].port; return serial8250_console_exit(port); } /** * univ8250_console_match - non-standard console matching * @co: registering console * @name: name from console command line * @idx: index from console command line * @options: ptr to option string from console command line * * Only attempts to match console command lines of the form: * console=uart[8250],io|mmio|mmio16|mmio32,<addr>[,<options>] * console=uart[8250],0x<addr>[,<options>] * This form is used to register an initial earlycon boot console and * replace it with the serial8250_console at 8250 driver init. * * Performs console setup for a match (as required by interface) * If no <options> are specified, then assume the h/w is already setup. 
* * Returns 0 if console matches; otherwise non-zero to use default matching */ static int univ8250_console_match(struct console *co, char *name, int idx, char *options) { char match[] = "uart"; /* 8250-specific earlycon name */ unsigned char iotype; resource_size_t addr; int i; if (strncmp(name, match, 4) != 0) return -ENODEV; if (uart_parse_earlycon(options, &iotype, &addr, &options)) return -ENODEV; /* try to match the port specified on the command line */ for (i = 0; i < nr_uarts; i++) { struct uart_port *port = &serial8250_ports[i].port; if (port->iotype != iotype) continue; if ((iotype == UPIO_MEM || iotype == UPIO_MEM16 || iotype == UPIO_MEM32 || iotype == UPIO_MEM32BE) && (port->mapbase != addr)) continue; if (iotype == UPIO_PORT && port->iobase != addr) continue; co->index = i; uart_port_set_cons(port, co); return serial8250_console_setup(port, options, true); } return -ENODEV; } static struct console univ8250_console = { .name = "ttyS", .write = univ8250_console_write, .device = uart_console_device, .setup = univ8250_console_setup, .exit = univ8250_console_exit, .match = univ8250_console_match, .flags = CON_PRINTBUFFER | CON_ANYTIME, .index = -1, .data = &serial8250_reg, }; static int __init univ8250_console_init(void) { if (nr_uarts == 0) return -ENODEV; serial8250_isa_init_ports(); register_console(&univ8250_console); return 0; } console_initcall(univ8250_console_init); #define SERIAL8250_CONSOLE (&univ8250_console) #else #define SERIAL8250_CONSOLE NULL #endif struct uart_driver serial8250_reg = { .owner = THIS_MODULE, .driver_name = "serial", .dev_name = "ttyS", .major = TTY_MAJOR, .minor = 64, .cons = SERIAL8250_CONSOLE, }; /* * early_serial_setup - early registration for 8250 ports * * Setup an 8250 port structure prior to console initialisation. Use * after console initialisation will cause undefined behaviour. 
*/ int __init early_serial_setup(struct uart_port *port) { struct uart_port *p; if (port->line >= ARRAY_SIZE(serial8250_ports) || nr_uarts == 0) return -ENODEV; serial8250_isa_init_ports(); p = &serial8250_ports[port->line].port; p->iobase = port->iobase; p->membase = port->membase; p->irq = port->irq; p->irqflags = port->irqflags; p->uartclk = port->uartclk; p->fifosize = port->fifosize; p->regshift = port->regshift; p->iotype = port->iotype; p->flags = port->flags; p->mapbase = port->mapbase; p->mapsize = port->mapsize; p->private_data = port->private_data; p->type = port->type; p->line = port->line; serial8250_set_defaults(up_to_u8250p(p)); if (port->serial_in) p->serial_in = port->serial_in; if (port->serial_out) p->serial_out = port->serial_out; if (port->handle_irq) p->handle_irq = port->handle_irq; return 0; } /** * serial8250_suspend_port - suspend one serial port * @line: serial line number * * Suspend one serial port. */ void serial8250_suspend_port(int line) { struct uart_8250_port *up = &serial8250_ports[line]; struct uart_port *port = &up->port; if (!console_suspend_enabled && uart_console(port) && port->type != PORT_8250) { unsigned char canary = 0xa5; serial_out(up, UART_SCR, canary); if (serial_in(up, UART_SCR) == canary) up->canary = canary; } uart_suspend_port(&serial8250_reg, port); } EXPORT_SYMBOL(serial8250_suspend_port); /** * serial8250_resume_port - resume one serial port * @line: serial line number * * Resume one serial port. 
*/ void serial8250_resume_port(int line) { struct uart_8250_port *up = &serial8250_ports[line]; struct uart_port *port = &up->port; up->canary = 0; if (up->capabilities & UART_NATSEMI) { /* Ensure it's still in high speed mode */ serial_port_out(port, UART_LCR, 0xE0); ns16550a_goto_highspeed(up); serial_port_out(port, UART_LCR, 0); port->uartclk = 921600*16; } uart_resume_port(&serial8250_reg, port); } EXPORT_SYMBOL(serial8250_resume_port); /* * serial8250_register_8250_port and serial8250_unregister_port allows for * 16x50 serial ports to be configured at run-time, to support PCMCIA * modems and PCI multiport cards. */ static DEFINE_MUTEX(serial_mutex); static struct uart_8250_port *serial8250_find_match_or_unused(const struct uart_port *port) { int i; /* * First, find a port entry which matches. */ for (i = 0; i < nr_uarts; i++) if (uart_match_port(&serial8250_ports[i].port, port)) return &serial8250_ports[i]; /* try line number first if still available */ i = port->line; if (i < nr_uarts && serial8250_ports[i].port.type == PORT_UNKNOWN && serial8250_ports[i].port.iobase == 0) return &serial8250_ports[i]; /* * We didn't find a matching entry, so look for the first * free entry. We look for one which hasn't been previously * used (indicated by zero iobase). */ for (i = 0; i < nr_uarts; i++) if (serial8250_ports[i].port.type == PORT_UNKNOWN && serial8250_ports[i].port.iobase == 0) return &serial8250_ports[i]; /* * That also failed. Last resort is to find any entry which * doesn't have a real port associated with it. 
*/ for (i = 0; i < nr_uarts; i++) if (serial8250_ports[i].port.type == PORT_UNKNOWN) return &serial8250_ports[i]; return NULL; } static void serial_8250_overrun_backoff_work(struct work_struct *work) { struct uart_8250_port *up = container_of(to_delayed_work(work), struct uart_8250_port, overrun_backoff); struct uart_port *port = &up->port; unsigned long flags; uart_port_lock_irqsave(port, &flags); up->ier |= UART_IER_RLSI | UART_IER_RDI; serial_out(up, UART_IER, up->ier); uart_port_unlock_irqrestore(port, flags); } /** * serial8250_register_8250_port - register a serial port * @up: serial port template * * Configure the serial port specified by the request. If the * port exists and is in use, it is hung up and unregistered * first. * * The port is then probed and if necessary the IRQ is autodetected * If this fails an error is returned. * * On success the port is ready to use and the line number is returned. */ int serial8250_register_8250_port(const struct uart_8250_port *up) { struct uart_8250_port *uart; int ret = -ENOSPC; if (up->port.uartclk == 0) return -EINVAL; mutex_lock(&serial_mutex); uart = serial8250_find_match_or_unused(&up->port); if (!uart) { /* * If the port is past the initial isa ports, initialize a new * port and increment nr_uarts accordingly. 
*/ uart = serial8250_setup_port(nr_uarts); if (!uart) goto unlock; nr_uarts++; } if (uart->port.type != PORT_8250_CIR) { struct mctrl_gpios *gpios; if (uart->port.dev) uart_remove_one_port(&serial8250_reg, &uart->port); uart->port.ctrl_id = up->port.ctrl_id; uart->port.port_id = up->port.port_id; uart->port.iobase = up->port.iobase; uart->port.membase = up->port.membase; uart->port.irq = up->port.irq; uart->port.irqflags = up->port.irqflags; uart->port.uartclk = up->port.uartclk; uart->port.fifosize = up->port.fifosize; uart->port.regshift = up->port.regshift; uart->port.iotype = up->port.iotype; uart->port.flags = up->port.flags | UPF_BOOT_AUTOCONF; uart->bugs = up->bugs; uart->port.mapbase = up->port.mapbase; uart->port.mapsize = up->port.mapsize; uart->port.private_data = up->port.private_data; uart->tx_loadsz = up->tx_loadsz; uart->capabilities = up->capabilities; uart->port.throttle = up->port.throttle; uart->port.unthrottle = up->port.unthrottle; uart->port.rs485_config = up->port.rs485_config; uart->port.rs485_supported = up->port.rs485_supported; uart->port.rs485 = up->port.rs485; uart->rs485_start_tx = up->rs485_start_tx; uart->rs485_stop_tx = up->rs485_stop_tx; uart->lsr_save_mask = up->lsr_save_mask; uart->dma = up->dma; /* Take tx_loadsz from fifosize if it wasn't set separately */ if (uart->port.fifosize && !uart->tx_loadsz) uart->tx_loadsz = uart->port.fifosize; if (up->port.dev) { uart->port.dev = up->port.dev; ret = uart_get_rs485_mode(&uart->port); if (ret) goto err; } if (up->port.flags & UPF_FIXED_TYPE) uart->port.type = up->port.type; /* * Only call mctrl_gpio_init(), if the device has no ACPI * companion device */ if (!has_acpi_companion(uart->port.dev)) { gpios = mctrl_gpio_init(&uart->port, 0); if (IS_ERR(gpios)) { ret = PTR_ERR(gpios); goto err; } else { uart->gpios = gpios; } } serial8250_set_defaults(uart); /* Possibly override default I/O functions. 
*/ if (up->port.serial_in) uart->port.serial_in = up->port.serial_in; if (up->port.serial_out) uart->port.serial_out = up->port.serial_out; if (up->port.handle_irq) uart->port.handle_irq = up->port.handle_irq; /* Possibly override set_termios call */ if (up->port.set_termios) uart->port.set_termios = up->port.set_termios; if (up->port.set_ldisc) uart->port.set_ldisc = up->port.set_ldisc; if (up->port.get_mctrl) uart->port.get_mctrl = up->port.get_mctrl; if (up->port.set_mctrl) uart->port.set_mctrl = up->port.set_mctrl; if (up->port.get_divisor) uart->port.get_divisor = up->port.get_divisor; if (up->port.set_divisor) uart->port.set_divisor = up->port.set_divisor; if (up->port.startup) uart->port.startup = up->port.startup; if (up->port.shutdown) uart->port.shutdown = up->port.shutdown; if (up->port.pm) uart->port.pm = up->port.pm; if (up->port.handle_break) uart->port.handle_break = up->port.handle_break; if (up->dl_read) uart->dl_read = up->dl_read; if (up->dl_write) uart->dl_write = up->dl_write; if (uart->port.type != PORT_8250_CIR) { if (uart_console_registered(&uart->port)) pm_runtime_get_sync(uart->port.dev); if (serial8250_isa_config != NULL) serial8250_isa_config(0, &uart->port, &uart->capabilities); serial8250_apply_quirks(uart); ret = uart_add_one_port(&serial8250_reg, &uart->port); if (ret) goto err; ret = uart->port.line; } else { dev_info(uart->port.dev, "skipping CIR port at 0x%lx / 0x%llx, IRQ %d\n", uart->port.iobase, (unsigned long long)uart->port.mapbase, uart->port.irq); ret = 0; } if (!uart->lsr_save_mask) uart->lsr_save_mask = LSR_SAVE_FLAGS; /* Use default LSR mask */ /* Initialise interrupt backoff work if required */ if (up->overrun_backoff_time_ms > 0) { uart->overrun_backoff_time_ms = up->overrun_backoff_time_ms; INIT_DELAYED_WORK(&uart->overrun_backoff, serial_8250_overrun_backoff_work); } else { uart->overrun_backoff_time_ms = 0; } } unlock: mutex_unlock(&serial_mutex); return ret; err: uart->port.dev = NULL; mutex_unlock(&serial_mutex); 
	return ret;
}
EXPORT_SYMBOL(serial8250_register_8250_port);

/**
 * serial8250_unregister_port - remove a 16x50 serial port at runtime
 * @line: serial line number
 *
 * Remove one serial port.  This may not be called from interrupt
 * context.  We hand the port back to our control.
 */
void serial8250_unregister_port(int line)
{
	struct uart_8250_port *uart = &serial8250_ports[line];

	mutex_lock(&serial_mutex);

	if (uart->em485) {
		unsigned long flags;

		uart_port_lock_irqsave(&uart->port, &flags);
		serial8250_em485_destroy(uart);
		uart_port_unlock_irqrestore(&uart->port, flags);
	}

	uart_remove_one_port(&serial8250_reg, &uart->port);
	if (serial8250_isa_devs) {
		uart->port.flags &= ~UPF_BOOT_AUTOCONF;
		uart->port.type = PORT_UNKNOWN;
		uart->port.dev = &serial8250_isa_devs->dev;
		uart->port.port_id = line;
		uart->capabilities = 0;
		serial8250_init_port(uart);
		serial8250_apply_quirks(uart);
		uart_add_one_port(&serial8250_reg, &uart->port);
	} else {
		uart->port.dev = NULL;
	}
	mutex_unlock(&serial_mutex);
}
EXPORT_SYMBOL(serial8250_unregister_port);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Generic 8250/16x50 serial driver");
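The fallback order that serial8250_find_match_or_unused() walks through (existing match, then the requested line, then a never-used entry, then any entry without a real port) can be tried in userspace. The following is only a rough sketch under stated assumptions: `struct slot`, `enum slot_type` and `find_match_or_unused()` are hypothetical stand-ins, and port matching is reduced to comparing a non-zero I/O base, which is much simpler than what uart_match_port() does.

```c
#include <stddef.h>

/* Hypothetical stand-ins for PORT_UNKNOWN and a detected port type. */
enum slot_type { SLOT_UNKNOWN = 0, SLOT_16550A };

/* Hypothetical, reduced view of one serial8250_ports[] entry. */
struct slot {
	enum slot_type type;
	unsigned long iobase;	/* 0 means "never used" in this sketch */
};

/*
 * Pick a table entry for a port at a non-zero 'iobase' that would like
 * to become 'line'. Returns the chosen index, or -1 if nothing fits.
 */
static int find_match_or_unused(const struct slot *tbl, size_t n,
				unsigned long iobase, size_t line)
{
	size_t i;

	/* 1. An entry that already describes the same port. */
	for (i = 0; i < n; i++)
		if (tbl[i].iobase == iobase)
			return (int)i;

	/* 2. The requested line, if it is still untouched. */
	if (line < n && tbl[line].type == SLOT_UNKNOWN &&
	    tbl[line].iobase == 0)
		return (int)line;

	/* 3. The first entry that has never been used. */
	for (i = 0; i < n; i++)
		if (tbl[i].type == SLOT_UNKNOWN && tbl[i].iobase == 0)
			return (int)i;

	/* 4. Last resort: any entry without a real port behind it. */
	for (i = 0; i < n; i++)
		if (tbl[i].type == SLOT_UNKNOWN)
			return (int)i;

	return -1;
}
```

Ordering the checks this way keeps a re-registered port on its old line where possible and reuses previously occupied entries only when no clean entry is left.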
// SPDX-License-Identifier: GPL-2.0
/*
 * quota.c - CephFS quota
 *
 * Copyright (C) 2017-2018 SUSE
 */

#include <linux/statfs.h>

#include "super.h"
#include "mds_client.h"

void ceph_adjust_quota_realms_count(struct inode *inode, bool inc)
{
	struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
	if (inc)
		atomic64_inc(&mdsc->quotarealms_count);
	else
		atomic64_dec(&mdsc->quotarealms_count);
}

static inline bool ceph_has_realms_with_quotas(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;
	struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(sb);
	struct inode *root = d_inode(sb->s_root);

	if (atomic64_read(&mdsc->quotarealms_count) > 0)
		return true;
	/* if root is the real CephFS root, we don't have quota realms */
	if (root && ceph_ino(root) == CEPH_INO_ROOT)
		return false;
	/* MDS stray dirs have no quota realms */
	if (ceph_vino_is_reserved(ceph_inode(inode)->i_vino))
		return false;
	/* otherwise, we can't know for sure */
	return true;
}

void ceph_handle_quota(struct ceph_mds_client *mdsc,
		       struct ceph_mds_session *session,
		       struct ceph_msg *msg)
{
	struct super_block *sb = mdsc->fsc->sb;
	struct ceph_mds_quota *h = msg->front.iov_base;
	struct ceph_client *cl = mdsc->fsc->client;
	struct ceph_vino vino;
	struct inode *inode;
	struct ceph_inode_info *ci;

	if (!ceph_inc_mds_stopping_blocker(mdsc, session))
		return;

	if (msg->front.iov_len < sizeof(*h)) {
		pr_err_client(cl, "corrupt message mds%d len %d\n",
			      session->s_mds, (int)msg->front.iov_len);
		ceph_msg_dump(msg);
		goto out;
	}

	/* lookup inode */
	vino.ino = le64_to_cpu(h->ino);
	vino.snap = CEPH_NOSNAP;
	inode = ceph_find_inode(sb, vino);
	if (!inode) {
		pr_warn_client(cl, "failed to find inode %llx\n", vino.ino);
		goto out;
	}
	ci = ceph_inode(inode);

	spin_lock(&ci->i_ceph_lock);
	ci->i_rbytes = le64_to_cpu(h->rbytes);
	ci->i_rfiles = le64_to_cpu(h->rfiles);
	ci->i_rsubdirs = le64_to_cpu(h->rsubdirs);
	__ceph_update_quota(ci,
le64_to_cpu(h->max_bytes), le64_to_cpu(h->max_files)); spin_unlock(&ci->i_ceph_lock); iput(inode); out: ceph_dec_mds_stopping_blocker(mdsc); } static struct ceph_quotarealm_inode * find_quotarealm_inode(struct ceph_mds_client *mdsc, u64 ino) { struct ceph_quotarealm_inode *qri = NULL; struct rb_node **node, *parent = NULL; struct ceph_client *cl = mdsc->fsc->client; mutex_lock(&mdsc->quotarealms_inodes_mutex); node = &(mdsc->quotarealms_inodes.rb_node); while (*node) { parent = *node; qri = container_of(*node, struct ceph_quotarealm_inode, node); if (ino < qri->ino) node = &((*node)->rb_left); else if (ino > qri->ino) node = &((*node)->rb_right); else break; } if (!qri || (qri->ino != ino)) { /* Not found, create a new one and insert it */ qri = kmalloc(sizeof(*qri), GFP_KERNEL); if (qri) { qri->ino = ino; qri->inode = NULL; qri->timeout = 0; mutex_init(&qri->mutex); rb_link_node(&qri->node, parent, node); rb_insert_color(&qri->node, &mdsc->quotarealms_inodes); } else pr_warn_client(cl, "Failed to alloc quotarealms_inode\n"); } mutex_unlock(&mdsc->quotarealms_inodes_mutex); return qri; } /* * This function will try to lookup a realm inode which isn't visible in the * filesystem mountpoint. A list of these kind of inodes (not visible) is * maintained in the mdsc and freed only when the filesystem is umounted. * * Note that these inodes are kept in this list even if the lookup fails, which * allows to prevent useless lookup requests. 
*/ static struct inode *lookup_quotarealm_inode(struct ceph_mds_client *mdsc, struct super_block *sb, struct ceph_snap_realm *realm) { struct ceph_client *cl = mdsc->fsc->client; struct ceph_quotarealm_inode *qri; struct inode *in; qri = find_quotarealm_inode(mdsc, realm->ino); if (!qri) return NULL; mutex_lock(&qri->mutex); if (qri->inode && ceph_is_any_caps(qri->inode)) { /* A request has already returned the inode */ mutex_unlock(&qri->mutex); return qri->inode; } /* Check if this inode lookup has failed recently */ if (qri->timeout && time_before_eq(jiffies, qri->timeout)) { mutex_unlock(&qri->mutex); return NULL; } if (qri->inode) { /* get caps */ int ret = __ceph_do_getattr(qri->inode, NULL, CEPH_STAT_CAP_INODE, true); if (ret >= 0) in = qri->inode; else in = ERR_PTR(ret); } else { in = ceph_lookup_inode(sb, realm->ino); } if (IS_ERR(in)) { doutc(cl, "Can't lookup inode %llx (err: %ld)\n", realm->ino, PTR_ERR(in)); qri->timeout = jiffies + secs_to_jiffies(60); /* XXX */ } else { qri->timeout = 0; qri->inode = in; } mutex_unlock(&qri->mutex); return in; } void ceph_cleanup_quotarealms_inodes(struct ceph_mds_client *mdsc) { struct ceph_quotarealm_inode *qri; struct rb_node *node; /* * It should now be safe to clean quotarealms_inode tree without holding * mdsc->quotarealms_inodes_mutex... */ mutex_lock(&mdsc->quotarealms_inodes_mutex); while (!RB_EMPTY_ROOT(&mdsc->quotarealms_inodes)) { node = rb_first(&mdsc->quotarealms_inodes); qri = rb_entry(node, struct ceph_quotarealm_inode, node); rb_erase(node, &mdsc->quotarealms_inodes); iput(qri->inode); kfree(qri); } mutex_unlock(&mdsc->quotarealms_inodes_mutex); } /* * This function walks through the snaprealm for an inode and set the * realmp with the first snaprealm that has quotas set (max_files, * max_bytes, or any, depending on the 'which_quota' argument). If the root is * reached, set the realmp with the root ceph_snap_realm instead. 
* * Note that the caller is responsible for calling ceph_put_snap_realm() on the * returned realm. * * Callers of this function need to hold mdsc->snap_rwsem. However, if there's * a need to do an inode lookup, this rwsem will be temporarily dropped. Hence * the 'retry' argument: if rwsem needs to be dropped and 'retry' is 'false' * this function will return -EAGAIN; otherwise, the snaprealms walk-through * will be restarted. */ static int get_quota_realm(struct ceph_mds_client *mdsc, struct inode *inode, enum quota_get_realm which_quota, struct ceph_snap_realm **realmp, bool retry) { struct ceph_client *cl = mdsc->fsc->client; struct ceph_inode_info *ci = NULL; struct ceph_snap_realm *realm, *next; struct inode *in; bool has_quota; if (realmp) *realmp = NULL; if (ceph_snap(inode) != CEPH_NOSNAP) return 0; restart: realm = ceph_inode(inode)->i_snap_realm; if (realm) ceph_get_snap_realm(mdsc, realm); else pr_err_ratelimited_client(cl, "%p %llx.%llx null i_snap_realm\n", inode, ceph_vinop(inode)); while (realm) { bool has_inode; spin_lock(&realm->inodes_with_caps_lock); has_inode = realm->inode; in = has_inode ? 
igrab(realm->inode) : NULL;
		spin_unlock(&realm->inodes_with_caps_lock);
		if (has_inode && !in)
			break;
		if (!in) {
			up_read(&mdsc->snap_rwsem);
			in = lookup_quotarealm_inode(mdsc, inode->i_sb, realm);
			down_read(&mdsc->snap_rwsem);
			if (IS_ERR_OR_NULL(in))
				break;
			ceph_put_snap_realm(mdsc, realm);
			if (!retry)
				return -EAGAIN;
			goto restart;
		}

		ci = ceph_inode(in);
		has_quota = __ceph_has_quota(ci, which_quota);
		iput(in);

		next = realm->parent;
		if (has_quota || !next) {
			if (realmp)
				*realmp = realm;
			return 0;
		}

		ceph_get_snap_realm(mdsc, next);
		ceph_put_snap_realm(mdsc, realm);
		realm = next;
	}
	if (realm)
		ceph_put_snap_realm(mdsc, realm);

	return 0;
}

bool ceph_quota_is_same_realm(struct inode *old, struct inode *new)
{
	struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(old->i_sb);
	struct ceph_snap_realm *old_realm, *new_realm;
	bool is_same;
	int ret;

restart:
	/*
	 * We need to lookup 2 quota realms atomically, i.e. with snap_rwsem.
	 * However, get_quota_realm may drop it temporarily.  By setting the
	 * 'retry' parameter to 'false', we'll get -EAGAIN if the rwsem was
	 * dropped and we can then restart the whole operation.
	 */
	down_read(&mdsc->snap_rwsem);
	get_quota_realm(mdsc, old, QUOTA_GET_ANY, &old_realm, true);
	ret = get_quota_realm(mdsc, new, QUOTA_GET_ANY, &new_realm, false);
	if (ret == -EAGAIN) {
		up_read(&mdsc->snap_rwsem);
		if (old_realm)
			ceph_put_snap_realm(mdsc, old_realm);
		goto restart;
	}
	is_same = (old_realm == new_realm);
	up_read(&mdsc->snap_rwsem);

	if (old_realm)
		ceph_put_snap_realm(mdsc, old_realm);
	if (new_realm)
		ceph_put_snap_realm(mdsc, new_realm);

	return is_same;
}

enum quota_check_op {
	QUOTA_CHECK_MAX_FILES_OP,	/* check quota max_files limit */
	QUOTA_CHECK_MAX_BYTES_OP,	/* check quota max_bytes limit */
	QUOTA_CHECK_MAX_BYTES_APPROACHING_OP	/* check if quota max_bytes limit is approaching */
};

/*
 * check_quota_exceeded() will walk up the snaprealm hierarchy and, for each
 * realm, it will execute the quota check operation defined by the 'op' parameter.
* The snaprealm walk is interrupted if the quota check detects that the quota * is exceeded or if the root inode is reached. */ static bool check_quota_exceeded(struct inode *inode, enum quota_check_op op, loff_t delta) { struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb); struct ceph_client *cl = mdsc->fsc->client; struct ceph_inode_info *ci; struct ceph_snap_realm *realm, *next; struct inode *in; u64 max, rvalue; bool exceeded = false; if (ceph_snap(inode) != CEPH_NOSNAP) return false; down_read(&mdsc->snap_rwsem); restart: realm = ceph_inode(inode)->i_snap_realm; if (realm) ceph_get_snap_realm(mdsc, realm); else pr_err_ratelimited_client(cl, "%p %llx.%llx null i_snap_realm\n", inode, ceph_vinop(inode)); while (realm) { bool has_inode; spin_lock(&realm->inodes_with_caps_lock); has_inode = realm->inode; in = has_inode ? igrab(realm->inode) : NULL; spin_unlock(&realm->inodes_with_caps_lock); if (has_inode && !in) break; if (!in) { up_read(&mdsc->snap_rwsem); in = lookup_quotarealm_inode(mdsc, inode->i_sb, realm); down_read(&mdsc->snap_rwsem); if (IS_ERR_OR_NULL(in)) break; ceph_put_snap_realm(mdsc, realm); goto restart; } ci = ceph_inode(in); spin_lock(&ci->i_ceph_lock); if (op == QUOTA_CHECK_MAX_FILES_OP) { max = ci->i_max_files; rvalue = ci->i_rfiles + ci->i_rsubdirs; } else { max = ci->i_max_bytes; rvalue = ci->i_rbytes; } spin_unlock(&ci->i_ceph_lock); switch (op) { case QUOTA_CHECK_MAX_FILES_OP: case QUOTA_CHECK_MAX_BYTES_OP: exceeded = (max && (rvalue + delta > max)); break; case QUOTA_CHECK_MAX_BYTES_APPROACHING_OP: if (max) { if (rvalue >= max) exceeded = true; else { /* * when we're writing more that 1/16th * of the available space */ exceeded = (((max - rvalue) >> 4) < delta); } } break; default: /* Shouldn't happen */ pr_warn_client(cl, "Invalid quota check op (%d)\n", op); exceeded = true; /* Just break the loop */ } iput(in); next = realm->parent; if (exceeded || !next) break; ceph_get_snap_realm(mdsc, next); ceph_put_snap_realm(mdsc, realm); 
		realm = next;
	}
	if (realm)
		ceph_put_snap_realm(mdsc, realm);
	up_read(&mdsc->snap_rwsem);

	return exceeded;
}

/*
 * ceph_quota_is_max_files_exceeded - check if we can create a new file
 * @inode:	directory where a new file is being created
 *
 * This function returns true if creating a new file would exceed a
 * max_files quota; it returns false otherwise.  It is necessary to walk
 * through the snaprealm hierarchy (until the FS root) to check all realms
 * with quotas set.
 */
bool ceph_quota_is_max_files_exceeded(struct inode *inode)
{
	if (!ceph_has_realms_with_quotas(inode))
		return false;

	WARN_ON(!S_ISDIR(inode->i_mode));

	return check_quota_exceeded(inode, QUOTA_CHECK_MAX_FILES_OP, 1);
}

/*
 * ceph_quota_is_max_bytes_exceeded - check if we can write to a file
 * @inode:	inode being written
 * @newsize:	new size if write succeeds
 *
 * This function returns true if growing the file to @newsize would exceed a
 * max_bytes quota; it returns false otherwise.
 */
bool ceph_quota_is_max_bytes_exceeded(struct inode *inode, loff_t newsize)
{
	loff_t size = i_size_read(inode);

	if (!ceph_has_realms_with_quotas(inode))
		return false;

	/* return immediately if we're decreasing file size */
	if (newsize <= size)
		return false;

	return check_quota_exceeded(inode, QUOTA_CHECK_MAX_BYTES_OP, (newsize - size));
}

/*
 * ceph_quota_is_max_bytes_approaching - check if we're reaching max_bytes
 * @inode:	inode being written
 * @newsize:	new size if write succeeds
 *
 * This function returns true if the new file size @newsize will be consuming
 * more than 1/16th of the available quota space; it returns false otherwise.
*/ bool ceph_quota_is_max_bytes_approaching(struct inode *inode, loff_t newsize) { loff_t size = ceph_inode(inode)->i_reported_size; if (!ceph_has_realms_with_quotas(inode)) return false; /* return immediately if we're decreasing file size */ if (newsize <= size) return false; return check_quota_exceeded(inode, QUOTA_CHECK_MAX_BYTES_APPROACHING_OP, (newsize - size)); } /* * ceph_quota_update_statfs - if root has quota update statfs with quota status * @fsc: filesystem client instance * @buf: statfs to update * * If the mounted filesystem root has max_bytes quota set, update the filesystem * statistics with the quota status. * * This function returns true if the stats have been updated, false otherwise. */ bool ceph_quota_update_statfs(struct ceph_fs_client *fsc, struct kstatfs *buf) { struct ceph_mds_client *mdsc = fsc->mdsc; struct ceph_inode_info *ci; struct ceph_snap_realm *realm; struct inode *in; u64 total = 0, used, free; bool is_updated = false; down_read(&mdsc->snap_rwsem); get_quota_realm(mdsc, d_inode(fsc->sb->s_root), QUOTA_GET_MAX_BYTES, &realm, true); up_read(&mdsc->snap_rwsem); if (!realm) return false; spin_lock(&realm->inodes_with_caps_lock); in = realm->inode ? igrab(realm->inode) : NULL; spin_unlock(&realm->inodes_with_caps_lock); if (in) { ci = ceph_inode(in); spin_lock(&ci->i_ceph_lock); if (ci->i_max_bytes) { total = ci->i_max_bytes >> CEPH_BLOCK_SHIFT; used = ci->i_rbytes >> CEPH_BLOCK_SHIFT; /* For quota size less than 4MB, use 4KB block size */ if (!total) { total = ci->i_max_bytes >> CEPH_4K_BLOCK_SHIFT; used = ci->i_rbytes >> CEPH_4K_BLOCK_SHIFT; buf->f_frsize = 1 << CEPH_4K_BLOCK_SHIFT; } /* It is possible for a quota to be exceeded. * Report 'zero' in that case */ free = total > used ? total - used : 0; /* For quota size less than 4KB, report the * total=used=4KB,free=0 when quota is full * and total=free=4KB, used=0 otherwise */ if (!total) { total = 1; free = ci->i_max_bytes > ci->i_rbytes ? 
1 : 0;
				buf->f_frsize = 1 << CEPH_4K_BLOCK_SHIFT;
			}
		}
		spin_unlock(&ci->i_ceph_lock);
		if (total) {
			buf->f_blocks = total;
			buf->f_bfree = free;
			buf->f_bavail = free;
			is_updated = true;
		}
		iput(in);
	}
	ceph_put_snap_realm(mdsc, realm);

	return is_updated;
}
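The per-realm arithmetic that check_quota_exceeded() applies can be exercised outside the kernel. What follows is a minimal userspace sketch, not the kernel code: `quota_exceeded()` and `quota_approaching()` are hypothetical helper names, all values are plain `uint64_t`, and the locking and the walk up the snaprealm hierarchy are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the MAX_FILES/MAX_BYTES checks:
 * true when adding 'delta' to the recursive usage 'rvalue' would go
 * past a non-zero limit 'max'. A limit of 0 means "no quota set".
 */
static bool quota_exceeded(uint64_t max, uint64_t rvalue, uint64_t delta)
{
	return max && (rvalue + delta > max);
}

/*
 * Hypothetical helper mirroring the MAX_BYTES_APPROACHING check:
 * true when the limit is already reached, or when 'delta' is larger
 * than 1/16th of the space still available under the limit.
 */
static bool quota_approaching(uint64_t max, uint64_t rvalue, uint64_t delta)
{
	if (!max)
		return false;
	if (rvalue >= max)
		return true;
	return ((max - rvalue) >> 4) < delta;
}
```

With a limit of 1600 bytes and nothing used, 1/16th of the free space is 100 bytes, so under these assumptions a 100-byte write is not yet reported as approaching the limit while a 101-byte write is.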
487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 | // SPDX-License-Identifier: GPL-2.0 /* * Native support for the I/O-Warrior USB devices * * Copyright (c) 2003-2005, 2020 Code Mercenaries GmbH * written 
by Christian Lucht <lucht@codemercs.com> and * Christoph Jung <jung@codemercs.com> * * based on * usb-skeleton.c by Greg Kroah-Hartman <greg@kroah.com> * brlvger.c by Stephane Dalton <sdalton@videotron.ca> * and Stephane Doyon <s.doyon@videotron.ca> * * Released under the GPLv2. */ #include <linux/module.h> #include <linux/usb.h> #include <linux/slab.h> #include <linux/sched.h> #include <linux/mutex.h> #include <linux/poll.h> #include <linux/usb/iowarrior.h> #define DRIVER_AUTHOR "Christian Lucht <lucht@codemercs.com>" #define DRIVER_DESC "USB IO-Warrior driver" #define USB_VENDOR_ID_CODEMERCS 1984 /* low speed iowarrior */ #define USB_DEVICE_ID_CODEMERCS_IOW40 0x1500 #define USB_DEVICE_ID_CODEMERCS_IOW24 0x1501 #define USB_DEVICE_ID_CODEMERCS_IOWPV1 0x1511 #define USB_DEVICE_ID_CODEMERCS_IOWPV2 0x1512 /* full speed iowarrior */ #define USB_DEVICE_ID_CODEMERCS_IOW56 0x1503 /* fuller speed iowarrior */ #define USB_DEVICE_ID_CODEMERCS_IOW28 0x1504 #define USB_DEVICE_ID_CODEMERCS_IOW28L 0x1505 #define USB_DEVICE_ID_CODEMERCS_IOW100 0x1506 /* OEMed devices */ #define USB_DEVICE_ID_CODEMERCS_IOW24SAG 0x158a #define USB_DEVICE_ID_CODEMERCS_IOW56AM 0x158b /* Get a minor range for your devices from the usb maintainer */ #ifdef CONFIG_USB_DYNAMIC_MINORS #define IOWARRIOR_MINOR_BASE 0 #else #define IOWARRIOR_MINOR_BASE 208 // SKELETON_MINOR_BASE 192 + 16, not official yet #endif /* interrupt input queue size */ #define MAX_INTERRUPT_BUFFER 16 /* maximum number of urbs that are submitted for writes at the same time, this applies to the IOWarrior56 only! IOWarrior24 and IOWarrior40 use synchronous usb_control_msg calls. 
*/ #define MAX_WRITES_IN_FLIGHT 4 MODULE_AUTHOR(DRIVER_AUTHOR); MODULE_DESCRIPTION(DRIVER_DESC); MODULE_LICENSE("GPL"); static struct usb_driver iowarrior_driver; /*--------------*/ /* data */ /*--------------*/ /* Structure to hold all of our device specific stuff */ struct iowarrior { struct mutex mutex; /* locks this structure */ struct usb_device *udev; /* save off the usb device pointer */ struct usb_interface *interface; /* the interface for this device */ unsigned char minor; /* the starting minor number for this device */ struct usb_endpoint_descriptor *int_out_endpoint; /* endpoint for writing (used by the IOW56, IOW28 and IOW100 families) */ struct usb_endpoint_descriptor *int_in_endpoint; /* endpoint for reading */ struct urb *int_in_urb; /* the urb for reading data */ unsigned char *int_in_buffer; /* buffer for data to be read */ unsigned char serial_number; /* to detect lost packets */ unsigned char *read_queue; /* size is MAX_INTERRUPT_BUFFER * (report_size + 1) */ wait_queue_head_t read_wait; wait_queue_head_t write_wait; /* wait-queue for writing to the device */ atomic_t write_busy; /* number of write-urbs submitted */ atomic_t read_idx; atomic_t intr_idx; atomic_t overflow_flag; /* signals an index 'rollover' */ int present; /* this is 1 as long as the device is connected */ int opened; /* this is 1 if the device is currently open */ char chip_serial[9]; /* the serial number string of the chip connected */ int report_size; /* number of bytes in a report */ u16 product_id; struct usb_anchor submitted; }; /*--------------*/ /* globals */ /*--------------*/ #define USB_REQ_GET_REPORT 0x01 //#if 0 static int usb_get_report(struct usb_device *dev, struct usb_host_interface *inter, unsigned char type, unsigned char id, void *buf, int size) { return usb_control_msg(dev, usb_rcvctrlpipe(dev, 0), USB_REQ_GET_REPORT, USB_DIR_IN | USB_TYPE_CLASS | USB_RECIP_INTERFACE, (type << 8) + id, inter->desc.bInterfaceNumber, buf, size, USB_CTRL_GET_TIMEOUT); } //#endif #define USB_REQ_SET_REPORT 
0x09 static int usb_set_report(struct usb_interface *intf, unsigned char type, unsigned char id, void *buf, int size) { return usb_control_msg(interface_to_usbdev(intf), usb_sndctrlpipe(interface_to_usbdev(intf), 0), USB_REQ_SET_REPORT, USB_TYPE_CLASS | USB_RECIP_INTERFACE, (type << 8) + id, intf->cur_altsetting->desc.bInterfaceNumber, buf, size, 1000); } /*---------------------*/ /* driver registration */ /*---------------------*/ /* table of devices that work with this driver */ static const struct usb_device_id iowarrior_ids[] = { {USB_DEVICE(USB_VENDOR_ID_CODEMERCS, USB_DEVICE_ID_CODEMERCS_IOW40)}, {USB_DEVICE(USB_VENDOR_ID_CODEMERCS, USB_DEVICE_ID_CODEMERCS_IOW24)}, {USB_DEVICE(USB_VENDOR_ID_CODEMERCS, USB_DEVICE_ID_CODEMERCS_IOWPV1)}, {USB_DEVICE(USB_VENDOR_ID_CODEMERCS, USB_DEVICE_ID_CODEMERCS_IOWPV2)}, {USB_DEVICE(USB_VENDOR_ID_CODEMERCS, USB_DEVICE_ID_CODEMERCS_IOW56)}, {USB_DEVICE(USB_VENDOR_ID_CODEMERCS, USB_DEVICE_ID_CODEMERCS_IOW24SAG)}, {USB_DEVICE(USB_VENDOR_ID_CODEMERCS, USB_DEVICE_ID_CODEMERCS_IOW56AM)}, {USB_DEVICE(USB_VENDOR_ID_CODEMERCS, USB_DEVICE_ID_CODEMERCS_IOW28)}, {USB_DEVICE(USB_VENDOR_ID_CODEMERCS, USB_DEVICE_ID_CODEMERCS_IOW28L)}, {USB_DEVICE(USB_VENDOR_ID_CODEMERCS, USB_DEVICE_ID_CODEMERCS_IOW100)}, {} /* Terminating entry */ }; MODULE_DEVICE_TABLE(usb, iowarrior_ids); /* * USB callback handler for reading data */ static void iowarrior_callback(struct urb *urb) { struct iowarrior *dev = urb->context; int intr_idx; int read_idx; int aux_idx; int offset; int status = urb->status; int retval; switch (status) { case 0: /* success */ break; case -ECONNRESET: case -ENOENT: case -ESHUTDOWN: return; default: goto exit; } intr_idx = atomic_read(&dev->intr_idx); /* aux_idx become previous intr_idx */ aux_idx = (intr_idx == 0) ? 
(MAX_INTERRUPT_BUFFER - 1) : (intr_idx - 1); read_idx = atomic_read(&dev->read_idx); /* queue is not empty and it's interface 0 */ if ((intr_idx != read_idx) && (dev->interface->cur_altsetting->desc.bInterfaceNumber == 0)) { /* + 1 for serial number */ offset = aux_idx * (dev->report_size + 1); if (!memcmp (dev->read_queue + offset, urb->transfer_buffer, dev->report_size)) { /* equal values on interface 0 will be ignored */ goto exit; } } /* aux_idx become next intr_idx */ aux_idx = (intr_idx == (MAX_INTERRUPT_BUFFER - 1)) ? 0 : (intr_idx + 1); if (read_idx == aux_idx) { /* queue full, dropping oldest input */ read_idx = (++read_idx == MAX_INTERRUPT_BUFFER) ? 0 : read_idx; atomic_set(&dev->read_idx, read_idx); atomic_set(&dev->overflow_flag, 1); } /* +1 for serial number */ offset = intr_idx * (dev->report_size + 1); memcpy(dev->read_queue + offset, urb->transfer_buffer, dev->report_size); *(dev->read_queue + offset + (dev->report_size)) = dev->serial_number++; atomic_set(&dev->intr_idx, aux_idx); /* tell the blocking read about the new data */ wake_up_interruptible(&dev->read_wait); exit: retval = usb_submit_urb(urb, GFP_ATOMIC); if (retval) dev_err(&dev->interface->dev, "%s - usb_submit_urb failed with result %d\n", __func__, retval); } /* * USB Callback handler for write-ops */ static void iowarrior_write_callback(struct urb *urb) { struct iowarrior *dev; int status = urb->status; dev = urb->context; /* sync/async unlink faults aren't errors */ if (status && !(status == -ENOENT || status == -ECONNRESET || status == -ESHUTDOWN)) { dev_dbg(&dev->interface->dev, "nonzero write bulk status received: %d\n", status); } /* free up our allocated buffer */ usb_free_coherent(urb->dev, urb->transfer_buffer_length, urb->transfer_buffer, urb->transfer_dma); /* tell a waiting writer the interrupt-out-pipe is available again */ atomic_dec(&dev->write_busy); wake_up_interruptible(&dev->write_wait); } /* * iowarrior_delete */ static inline void iowarrior_delete(struct iowarrior 
*dev) { dev_dbg(&dev->interface->dev, "minor %d\n", dev->minor); kfree(dev->int_in_buffer); usb_free_urb(dev->int_in_urb); kfree(dev->read_queue); usb_put_intf(dev->interface); kfree(dev); } /*---------------------*/ /* fops implementation */ /*---------------------*/ static int read_index(struct iowarrior *dev) { int intr_idx, read_idx; read_idx = atomic_read(&dev->read_idx); intr_idx = atomic_read(&dev->intr_idx); return (read_idx == intr_idx ? -1 : read_idx); } /* * iowarrior_read */ static ssize_t iowarrior_read(struct file *file, char __user *buffer, size_t count, loff_t *ppos) { struct iowarrior *dev; int read_idx; int offset; int retval; dev = file->private_data; if (file->f_flags & O_NONBLOCK) { retval = mutex_trylock(&dev->mutex); if (!retval) return -EAGAIN; } else { retval = mutex_lock_interruptible(&dev->mutex); if (retval) return -ERESTARTSYS; } /* verify that the device wasn't unplugged */ if (!dev->present) { retval = -ENODEV; goto exit; } dev_dbg(&dev->interface->dev, "minor %d, count = %zd\n", dev->minor, count); /* read count must be packet size (+ time stamp) */ if ((count != dev->report_size) && (count != (dev->report_size + 1))) { retval = -EINVAL; goto exit; } /* repeat until no buffer overrun in callback handler occur */ do { atomic_set(&dev->overflow_flag, 0); if ((read_idx = read_index(dev)) == -1) { /* queue empty */ if (file->f_flags & O_NONBLOCK) { retval = -EAGAIN; goto exit; } else { //next line will return when there is either new data, or the device is unplugged int r = wait_event_interruptible(dev->read_wait, (!dev->present || (read_idx = read_index (dev)) != -1)); if (r) { //we were interrupted by a signal retval = -ERESTART; goto exit; } if (!dev->present) { //The device was unplugged retval = -ENODEV; goto exit; } if (read_idx == -1) { // Can this happen ??? 
retval = 0; goto exit; } } } offset = read_idx * (dev->report_size + 1); if (copy_to_user(buffer, dev->read_queue + offset, count)) { retval = -EFAULT; goto exit; } } while (atomic_read(&dev->overflow_flag)); read_idx = ++read_idx == MAX_INTERRUPT_BUFFER ? 0 : read_idx; atomic_set(&dev->read_idx, read_idx); mutex_unlock(&dev->mutex); return count; exit: mutex_unlock(&dev->mutex); return retval; } /* * iowarrior_write */ static ssize_t iowarrior_write(struct file *file, const char __user *user_buffer, size_t count, loff_t *ppos) { struct iowarrior *dev; int retval = 0; char *buf = NULL; /* for IOW24 and IOW56 we need a buffer */ struct urb *int_out_urb = NULL; dev = file->private_data; mutex_lock(&dev->mutex); /* verify that the device wasn't unplugged */ if (!dev->present) { retval = -ENODEV; goto exit; } dev_dbg(&dev->interface->dev, "minor %d, count = %zd\n", dev->minor, count); /* if count is 0 we're already done */ if (count == 0) { retval = 0; goto exit; } /* We only accept full reports */ if (count != dev->report_size) { retval = -EINVAL; goto exit; } switch (dev->product_id) { case USB_DEVICE_ID_CODEMERCS_IOW24: case USB_DEVICE_ID_CODEMERCS_IOW24SAG: case USB_DEVICE_ID_CODEMERCS_IOWPV1: case USB_DEVICE_ID_CODEMERCS_IOWPV2: case USB_DEVICE_ID_CODEMERCS_IOW40: /* IOW24 and IOW40 use a synchronous call */ buf = memdup_user(user_buffer, count); if (IS_ERR(buf)) { retval = PTR_ERR(buf); goto exit; } retval = usb_set_report(dev->interface, 2, 0, buf, count); kfree(buf); goto exit; case USB_DEVICE_ID_CODEMERCS_IOW56: case USB_DEVICE_ID_CODEMERCS_IOW56AM: case USB_DEVICE_ID_CODEMERCS_IOW28: case USB_DEVICE_ID_CODEMERCS_IOW28L: case USB_DEVICE_ID_CODEMERCS_IOW100: /* The IOW56 uses asynchronous IO and more urbs */ if (atomic_read(&dev->write_busy) == MAX_WRITES_IN_FLIGHT) { /* Wait until we are below the limit for submitted urbs */ if (file->f_flags & O_NONBLOCK) { retval = -EAGAIN; goto exit; } else { retval = wait_event_interruptible(dev->write_wait, (!dev->present 
|| (atomic_read (&dev-> write_busy) < MAX_WRITES_IN_FLIGHT))); if (retval) { /* we were interrupted by a signal */ retval = -ERESTART; goto exit; } if (!dev->present) { /* The device was unplugged */ retval = -ENODEV; goto exit; } if (!dev->opened) { /* We were closed while waiting for an URB */ retval = -ENODEV; goto exit; } } } atomic_inc(&dev->write_busy); int_out_urb = usb_alloc_urb(0, GFP_KERNEL); if (!int_out_urb) { retval = -ENOMEM; goto error_no_urb; } buf = usb_alloc_coherent(dev->udev, dev->report_size, GFP_KERNEL, &int_out_urb->transfer_dma); if (!buf) { retval = -ENOMEM; dev_dbg(&dev->interface->dev, "Unable to allocate buffer\n"); goto error_no_buffer; } usb_fill_int_urb(int_out_urb, dev->udev, usb_sndintpipe(dev->udev, dev->int_out_endpoint->bEndpointAddress), buf, dev->report_size, iowarrior_write_callback, dev, dev->int_out_endpoint->bInterval); int_out_urb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP; if (copy_from_user(buf, user_buffer, count)) { retval = -EFAULT; goto error; } usb_anchor_urb(int_out_urb, &dev->submitted); retval = usb_submit_urb(int_out_urb, GFP_KERNEL); if (retval) { dev_dbg(&dev->interface->dev, "submit error %d for urb nr.%d\n", retval, atomic_read(&dev->write_busy)); usb_unanchor_urb(int_out_urb); goto error; } /* submit was ok */ retval = count; usb_free_urb(int_out_urb); goto exit; default: /* what do we have here ? An unsupported Product-ID ? 
*/ dev_err(&dev->interface->dev, "%s - not supported for product=0x%x\n", __func__, dev->product_id); retval = -EFAULT; goto exit; } error: usb_free_coherent(dev->udev, dev->report_size, buf, int_out_urb->transfer_dma); error_no_buffer: usb_free_urb(int_out_urb); error_no_urb: atomic_dec(&dev->write_busy); wake_up_interruptible(&dev->write_wait); exit: mutex_unlock(&dev->mutex); return retval; } /* * iowarrior_ioctl */ static long iowarrior_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct iowarrior *dev = NULL; __u8 *buffer; __u8 __user *user_buffer; int retval; int io_res; /* checks for bytes read/written and copy_to/from_user results */ dev = file->private_data; if (!dev) return -ENODEV; buffer = kzalloc(dev->report_size, GFP_KERNEL); if (!buffer) return -ENOMEM; mutex_lock(&dev->mutex); /* verify that the device wasn't unplugged */ if (!dev->present) { retval = -ENODEV; goto error_out; } dev_dbg(&dev->interface->dev, "minor %d, cmd 0x%.4x, arg %ld\n", dev->minor, cmd, arg); retval = 0; switch (cmd) { case IOW_WRITE: if (dev->product_id == USB_DEVICE_ID_CODEMERCS_IOW24 || dev->product_id == USB_DEVICE_ID_CODEMERCS_IOW24SAG || dev->product_id == USB_DEVICE_ID_CODEMERCS_IOWPV1 || dev->product_id == USB_DEVICE_ID_CODEMERCS_IOWPV2 || dev->product_id == USB_DEVICE_ID_CODEMERCS_IOW40) { user_buffer = (__u8 __user *)arg; io_res = copy_from_user(buffer, user_buffer, dev->report_size); if (io_res) { retval = -EFAULT; } else { io_res = usb_set_report(dev->interface, 2, 0, buffer, dev->report_size); if (io_res < 0) retval = io_res; } } else { retval = -EINVAL; dev_err(&dev->interface->dev, "ioctl 'IOW_WRITE' is not supported for product=0x%x.\n", dev->product_id); } break; case IOW_READ: user_buffer = (__u8 __user *)arg; io_res = usb_get_report(dev->udev, dev->interface->cur_altsetting, 1, 0, buffer, dev->report_size); if (io_res < 0) retval = io_res; else { io_res = copy_to_user(user_buffer, buffer, dev->report_size); if (io_res) retval = -EFAULT; } 
break; case IOW_GETINFO: { /* Report available information for the device */ struct iowarrior_info info; /* needed for power consumption */ struct usb_config_descriptor *cfg_descriptor = &dev->udev->actconfig->desc; memset(&info, 0, sizeof(info)); /* directly from the descriptor */ info.vendor = le16_to_cpu(dev->udev->descriptor.idVendor); info.product = dev->product_id; info.revision = le16_to_cpu(dev->udev->descriptor.bcdDevice); /* 0==UNKNOWN, 1==LOW(usb1.1) ,2=FULL(usb1.1), 3=HIGH(usb2.0) */ info.speed = dev->udev->speed; info.if_num = dev->interface->cur_altsetting->desc.bInterfaceNumber; info.report_size = dev->report_size; /* serial number string has been read earlier 8 chars or empty string */ memcpy(info.serial, dev->chip_serial, sizeof(dev->chip_serial)); if (cfg_descriptor == NULL) { info.power = -1; /* no information available */ } else { /* the MaxPower is stored in units of 2mA to make it fit into a byte-value */ info.power = cfg_descriptor->bMaxPower * 2; } io_res = copy_to_user((struct iowarrior_info __user *)arg, &info, sizeof(struct iowarrior_info)); if (io_res) retval = -EFAULT; break; } default: /* return that we did not understand this ioctl call */ retval = -ENOTTY; break; } error_out: /* unlock the device */ mutex_unlock(&dev->mutex); kfree(buffer); return retval; } /* * iowarrior_open */ static int iowarrior_open(struct inode *inode, struct file *file) { struct iowarrior *dev = NULL; struct usb_interface *interface; int subminor; int retval = 0; subminor = iminor(inode); interface = usb_find_interface(&iowarrior_driver, subminor); if (!interface) { pr_err("%s - error, can't find device for minor %d\n", __func__, subminor); return -ENODEV; } dev = usb_get_intfdata(interface); if (!dev) return -ENODEV; mutex_lock(&dev->mutex); /* Only one process can open each device, no sharing. 
*/ if (dev->opened) { retval = -EBUSY; goto out; } /* setup interrupt handler for receiving values */ if ((retval = usb_submit_urb(dev->int_in_urb, GFP_KERNEL)) < 0) { dev_err(&interface->dev, "Error %d while submitting URB\n", retval); retval = -EFAULT; goto out; } /* increment our usage count for the driver */ ++dev->opened; /* save our object in the file's private structure */ file->private_data = dev; retval = 0; out: mutex_unlock(&dev->mutex); return retval; } /* * iowarrior_release */ static int iowarrior_release(struct inode *inode, struct file *file) { struct iowarrior *dev; int retval = 0; dev = file->private_data; if (!dev) return -ENODEV; dev_dbg(&dev->interface->dev, "minor %d\n", dev->minor); /* lock our device */ mutex_lock(&dev->mutex); if (dev->opened <= 0) { retval = -ENODEV; /* close called more than once */ mutex_unlock(&dev->mutex); } else { dev->opened = 0; /* we're closing now */ retval = 0; if (dev->present) { /* The device is still connected so we only shutdown pending read-/write-ops. */ usb_kill_urb(dev->int_in_urb); wake_up_interruptible(&dev->read_wait); wake_up_interruptible(&dev->write_wait); mutex_unlock(&dev->mutex); } else { /* The device was unplugged, cleanup resources */ mutex_unlock(&dev->mutex); iowarrior_delete(dev); } } return retval; } static __poll_t iowarrior_poll(struct file *file, poll_table * wait) { struct iowarrior *dev = file->private_data; __poll_t mask = 0; if (!dev->present) return EPOLLERR | EPOLLHUP; poll_wait(file, &dev->read_wait, wait); poll_wait(file, &dev->write_wait, wait); if (!dev->present) return EPOLLERR | EPOLLHUP; if (read_index(dev) != -1) mask |= EPOLLIN | EPOLLRDNORM; if (atomic_read(&dev->write_busy) < MAX_WRITES_IN_FLIGHT) mask |= EPOLLOUT | EPOLLWRNORM; return mask; } /* * File operations needed when we register this driver. * This assumes that this driver NEEDS file operations, * of course, which means that the driver is expected * to have a node in the /dev directory. 
If the USB * device were for a network interface then the driver * would use "struct net_driver" instead, and a serial * device would use "struct tty_driver". */ static const struct file_operations iowarrior_fops = { .owner = THIS_MODULE, .write = iowarrior_write, .read = iowarrior_read, .unlocked_ioctl = iowarrior_ioctl, .open = iowarrior_open, .release = iowarrior_release, .poll = iowarrior_poll, .llseek = noop_llseek, }; static char *iowarrior_devnode(const struct device *dev, umode_t *mode) { return kasprintf(GFP_KERNEL, "usb/%s", dev_name(dev)); } /* * usb class driver info in order to get a minor number from the usb core, * and to have the device registered with devfs and the driver core */ static struct usb_class_driver iowarrior_class = { .name = "iowarrior%d", .devnode = iowarrior_devnode, .fops = &iowarrior_fops, .minor_base = IOWARRIOR_MINOR_BASE, }; /*---------------------------------*/ /* probe and disconnect functions */ /*---------------------------------*/ /* * iowarrior_probe * * Called by the usb core when a new device is connected that it thinks * this driver might be interested in. 
*/ static int iowarrior_probe(struct usb_interface *interface, const struct usb_device_id *id) { struct usb_device *udev = interface_to_usbdev(interface); struct iowarrior *dev = NULL; struct usb_host_interface *iface_desc; int retval = -ENOMEM; int res; /* allocate memory for our device state and initialize it */ dev = kzalloc(sizeof(struct iowarrior), GFP_KERNEL); if (!dev) return retval; mutex_init(&dev->mutex); atomic_set(&dev->intr_idx, 0); atomic_set(&dev->read_idx, 0); atomic_set(&dev->overflow_flag, 0); init_waitqueue_head(&dev->read_wait); atomic_set(&dev->write_busy, 0); init_waitqueue_head(&dev->write_wait); dev->udev = udev; dev->interface = usb_get_intf(interface); iface_desc = interface->cur_altsetting; dev->product_id = le16_to_cpu(udev->descriptor.idProduct); init_usb_anchor(&dev->submitted); res = usb_find_last_int_in_endpoint(iface_desc, &dev->int_in_endpoint); if (res) { dev_err(&interface->dev, "no interrupt-in endpoint found\n"); retval = res; goto error; } if ((dev->product_id == USB_DEVICE_ID_CODEMERCS_IOW56) || (dev->product_id == USB_DEVICE_ID_CODEMERCS_IOW56AM) || (dev->product_id == USB_DEVICE_ID_CODEMERCS_IOW28) || (dev->product_id == USB_DEVICE_ID_CODEMERCS_IOW28L) || (dev->product_id == USB_DEVICE_ID_CODEMERCS_IOW100)) { res = usb_find_last_int_out_endpoint(iface_desc, &dev->int_out_endpoint); if (res) { dev_err(&interface->dev, "no interrupt-out endpoint found\n"); retval = res; goto error; } } /* we have to check the report_size often, so remember it in the endianness suitable for our machine */ dev->report_size = usb_endpoint_maxp(dev->int_in_endpoint); /* * Some devices need the report size to be different than the * endpoint size. 
*/ if (dev->interface->cur_altsetting->desc.bInterfaceNumber == 0) { switch (dev->product_id) { case USB_DEVICE_ID_CODEMERCS_IOW56: case USB_DEVICE_ID_CODEMERCS_IOW56AM: dev->report_size = 7; break; case USB_DEVICE_ID_CODEMERCS_IOW28: case USB_DEVICE_ID_CODEMERCS_IOW28L: dev->report_size = 4; break; case USB_DEVICE_ID_CODEMERCS_IOW100: dev->report_size = 12; break; } } /* create the urb and buffer for reading */ dev->int_in_urb = usb_alloc_urb(0, GFP_KERNEL); if (!dev->int_in_urb) goto error; dev->int_in_buffer = kmalloc(dev->report_size, GFP_KERNEL); if (!dev->int_in_buffer) goto error; usb_fill_int_urb(dev->int_in_urb, dev->udev, usb_rcvintpipe(dev->udev, dev->int_in_endpoint->bEndpointAddress), dev->int_in_buffer, dev->report_size, iowarrior_callback, dev, dev->int_in_endpoint->bInterval); /* create an internal buffer for interrupt data from the device */ dev->read_queue = kmalloc_array(dev->report_size + 1, MAX_INTERRUPT_BUFFER, GFP_KERNEL); if (!dev->read_queue) goto error; /* Get the serial-number of the chip */ memset(dev->chip_serial, 0x00, sizeof(dev->chip_serial)); usb_string(udev, udev->descriptor.iSerialNumber, dev->chip_serial, sizeof(dev->chip_serial)); if (strlen(dev->chip_serial) != 8) memset(dev->chip_serial, 0x00, sizeof(dev->chip_serial)); /* Set the idle timeout to 0, if this is interface 0 */ if (dev->interface->cur_altsetting->desc.bInterfaceNumber == 0) { usb_control_msg(udev, usb_sndctrlpipe(udev, 0), 0x0A, USB_TYPE_CLASS | USB_RECIP_INTERFACE, 0, 0, NULL, 0, USB_CTRL_SET_TIMEOUT); } /* allow device read and ioctl */ dev->present = 1; /* we can register the device now, as it is ready */ usb_set_intfdata(interface, dev); retval = usb_register_dev(interface, &iowarrior_class); if (retval) { /* something prevented us from registering this driver */ dev_err(&interface->dev, "Not able to get a minor for this device.\n"); goto error; } dev->minor = interface->minor; /* let the user know what node this device is now attached to */ 
dev_info(&interface->dev, "IOWarrior product=0x%x, serial=%s interface=%d " "now attached to iowarrior%d\n", dev->product_id, dev->chip_serial, iface_desc->desc.bInterfaceNumber, dev->minor - IOWARRIOR_MINOR_BASE); return retval; error: iowarrior_delete(dev); return retval; } /* * iowarrior_disconnect * * Called by the usb core when the device is removed from the system. */ static void iowarrior_disconnect(struct usb_interface *interface) { struct iowarrior *dev = usb_get_intfdata(interface); usb_deregister_dev(interface, &iowarrior_class); mutex_lock(&dev->mutex); /* prevent device read, write and ioctl */ dev->present = 0; if (dev->opened) { /* A process still holds a file descriptor to the device, so we only shut down the read/write operations in progress. Deleting the device is postponed until close() is called. */ usb_kill_urb(dev->int_in_urb); usb_kill_anchored_urbs(&dev->submitted); wake_up_interruptible(&dev->read_wait); wake_up_interruptible(&dev->write_wait); mutex_unlock(&dev->mutex); } else { /* no process is using the device, clean up now */ mutex_unlock(&dev->mutex); iowarrior_delete(dev); } } /* usb specific object needed to register this driver with the usb subsystem */ static struct usb_driver iowarrior_driver = { .name = "iowarrior", .probe = iowarrior_probe, .disconnect = iowarrior_disconnect, .id_table = iowarrior_ids, }; module_usb_driver(iowarrior_driver);
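The interrupt-in queue that iowarrior_callback() fills and iowarrior_read() drains can be easier to follow outside the kernel. What follows is a minimal user-space sketch, not the driver's code: the names report_queue, queue_init, queue_push and queue_pop, the REPORT_MAX bound, and the dedupe flag (standing in for the "interface 0" check) are assumptions made for illustration, and the atomics, locking and wait queues of the real driver are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define QUEUE_SLOTS 16 /* mirrors MAX_INTERRUPT_BUFFER in the driver */
#define REPORT_MAX  64 /* hypothetical upper bound, only for this sketch */

struct report_queue {
	/* each slot holds one report plus a trailing serial-number byte */
	unsigned char slots[QUEUE_SLOTS][REPORT_MAX + 1];
	int report_size;
	int read_idx;         /* next slot the reader consumes */
	int intr_idx;         /* next slot the producer fills */
	unsigned char serial; /* lets the reader detect lost reports */
	bool overflow;        /* set when the oldest report was dropped */
};

static void queue_init(struct report_queue *q, int report_size)
{
	assert(report_size > 0 && report_size <= REPORT_MAX);
	memset(q, 0, sizeof(*q));
	q->report_size = report_size;
}

/*
 * Producer side, modelled on iowarrior_callback().
 * Returns false when the report repeats the previous one and is skipped.
 */
static bool queue_push(struct report_queue *q, const unsigned char *report,
		       bool dedupe)
{
	int prev = (q->intr_idx == 0) ? QUEUE_SLOTS - 1 : q->intr_idx - 1;
	int next = (q->intr_idx == QUEUE_SLOTS - 1) ? 0 : q->intr_idx + 1;

	/* only compare against the previous slot while the queue is non-empty */
	if (dedupe && q->intr_idx != q->read_idx &&
	    !memcmp(q->slots[prev], report, q->report_size))
		return false;

	if (q->read_idx == next) {
		/* queue full: drop the oldest report and flag the overflow */
		q->read_idx = (q->read_idx + 1) % QUEUE_SLOTS;
		q->overflow = true;
	}

	memcpy(q->slots[q->intr_idx], report, q->report_size);
	q->slots[q->intr_idx][q->report_size] = q->serial++;
	q->intr_idx = next;
	return true;
}

/*
 * Consumer side, modelled on read_index() plus the copy in iowarrior_read().
 * Copies report_size + 1 bytes (report and serial); false means "empty".
 */
static bool queue_pop(struct report_queue *q, unsigned char *out)
{
	if (q->read_idx == q->intr_idx)
		return false;
	memcpy(out, q->slots[q->read_idx], q->report_size + 1);
	q->read_idx = (q->read_idx + 1) % QUEUE_SLOTS;
	return true;
}
```

Because one slot is always kept free to tell "full" from "empty", the sketch holds at most QUEUE_SLOTS - 1 reports, which matches the index arithmetic in the driver. A gap in the serial bytes seen by the reader indicates that reports were dropped.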
/* * * Copyright IBM Corporation, 2012 * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> * * Cgroup v2 * Copyright (C) 
2019 Red Hat, Inc. * Author: Giuseppe Scrivano <gscrivan@redhat.com> * * This program is free software; you can redistribute it and/or modify it * under the terms of version 2.1 of the GNU Lesser General Public License * as published by the Free Software Foundation. * * This program is distributed in the hope that it would be useful, but * WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. * */ #include <linux/cgroup.h> #include <linux/page_counter.h> #include <linux/slab.h> #include <linux/hugetlb.h> #include <linux/hugetlb_cgroup.h> #define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) #define MEMFILE_IDX(val) (((val) >> 16) & 0xffff) #define MEMFILE_ATTR(val) ((val) & 0xffff) /* Use t->m[0] to encode the offset */ #define MEMFILE_OFFSET(t, m0) (((offsetof(t, m0) << 16) | sizeof_field(t, m0))) #define MEMFILE_OFFSET0(val) (((val) >> 16) & 0xffff) #define MEMFILE_FIELD_SIZE(val) ((val) & 0xffff) #define DFL_TMPL_SIZE ARRAY_SIZE(hugetlb_dfl_tmpl) #define LEGACY_TMPL_SIZE ARRAY_SIZE(hugetlb_legacy_tmpl) static struct hugetlb_cgroup *root_h_cgroup __read_mostly; static struct cftype *dfl_files; static struct cftype *legacy_files; static inline struct page_counter * __hugetlb_cgroup_counter_from_cgroup(struct hugetlb_cgroup *h_cg, int idx, bool rsvd) { if (rsvd) return &h_cg->rsvd_hugepage[idx]; return &h_cg->hugepage[idx]; } static inline struct page_counter * hugetlb_cgroup_counter_from_cgroup(struct hugetlb_cgroup *h_cg, int idx) { return __hugetlb_cgroup_counter_from_cgroup(h_cg, idx, false); } static inline struct page_counter * hugetlb_cgroup_counter_from_cgroup_rsvd(struct hugetlb_cgroup *h_cg, int idx) { return __hugetlb_cgroup_counter_from_cgroup(h_cg, idx, true); } static inline struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s) { return s ? 
		container_of(s, struct hugetlb_cgroup, css) : NULL;
}

static inline
struct hugetlb_cgroup *hugetlb_cgroup_from_task(struct task_struct *task)
{
	return hugetlb_cgroup_from_css(task_css(task, hugetlb_cgrp_id));
}

static inline bool hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
{
	return (h_cg == root_h_cgroup);
}

static inline struct hugetlb_cgroup *
parent_hugetlb_cgroup(struct hugetlb_cgroup *h_cg)
{
	return hugetlb_cgroup_from_css(h_cg->css.parent);
}

static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
{
	struct hstate *h;

	for_each_hstate(h) {
		if (page_counter_read(
		    hugetlb_cgroup_counter_from_cgroup(h_cg, hstate_index(h))))
			return true;
	}
	return false;
}

static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
				struct hugetlb_cgroup *parent_h_cgroup)
{
	int idx;

	for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
		struct page_counter *fault, *fault_parent = NULL;
		struct page_counter *rsvd, *rsvd_parent = NULL;
		unsigned long limit;

		if (parent_h_cgroup) {
			fault_parent = hugetlb_cgroup_counter_from_cgroup(
				parent_h_cgroup, idx);
			rsvd_parent = hugetlb_cgroup_counter_from_cgroup_rsvd(
				parent_h_cgroup, idx);
		}
		fault = hugetlb_cgroup_counter_from_cgroup(h_cgroup, idx);
		rsvd = hugetlb_cgroup_counter_from_cgroup_rsvd(h_cgroup, idx);

		page_counter_init(fault, fault_parent, false);
		page_counter_init(rsvd, rsvd_parent, false);

		if (!cgroup_subsys_on_dfl(hugetlb_cgrp_subsys)) {
			fault->track_failcnt = true;
			rsvd->track_failcnt = true;
		}

		limit = round_down(PAGE_COUNTER_MAX,
				   pages_per_huge_page(&hstates[idx]));

		VM_BUG_ON(page_counter_set_max(fault, limit));
		VM_BUG_ON(page_counter_set_max(rsvd, limit));
	}
}

static void hugetlb_cgroup_free(struct hugetlb_cgroup *h_cgroup)
{
	int node;

	for_each_node(node)
		kfree(h_cgroup->nodeinfo[node]);
	kfree(h_cgroup);
}

static struct cgroup_subsys_state *
hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
	struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
	struct hugetlb_cgroup *h_cgroup;
	int node;

	h_cgroup = kzalloc(struct_size(h_cgroup, nodeinfo, nr_node_ids),
			   GFP_KERNEL);

	if (!h_cgroup)
		return ERR_PTR(-ENOMEM);

	if (!parent_h_cgroup)
		root_h_cgroup = h_cgroup;

	/*
	 * TODO: this routine can waste much memory for nodes which will
	 * never be onlined. It's better to use memory hotplug callback
	 * function.
	 */
	for_each_node(node) {
		/* Set node_to_alloc to NUMA_NO_NODE for offline nodes. */
		int node_to_alloc =
			node_state(node, N_NORMAL_MEMORY) ? node : NUMA_NO_NODE;
		h_cgroup->nodeinfo[node] =
			kzalloc_node(sizeof(struct hugetlb_cgroup_per_node),
				     GFP_KERNEL, node_to_alloc);
		if (!h_cgroup->nodeinfo[node])
			goto fail_alloc_nodeinfo;
	}

	hugetlb_cgroup_init(h_cgroup, parent_h_cgroup);
	return &h_cgroup->css;

fail_alloc_nodeinfo:
	hugetlb_cgroup_free(h_cgroup);
	return ERR_PTR(-ENOMEM);
}

static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
{
	hugetlb_cgroup_free(hugetlb_cgroup_from_css(css));
}

/*
 * Should be called with hugetlb_lock held.
 * Since we are holding hugetlb_lock, pages cannot get moved from
 * active list or uncharged from the cgroup, so no need to get
 * page reference and test for page active here. This function
 * cannot fail.
 */
static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
				       struct folio *folio)
{
	unsigned int nr_pages;
	struct page_counter *counter;
	struct hugetlb_cgroup *hcg;
	struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);

	hcg = hugetlb_cgroup_from_folio(folio);
	/*
	 * We can have pages in active list without any cgroup
	 * i.e., hugepage with less than 3 pages. We can safely
	 * ignore those pages.
	 */
	if (!hcg || hcg != h_cg)
		goto out;

	nr_pages = folio_nr_pages(folio);
	if (!parent) {
		parent = root_h_cgroup;
		/* root has no limit */
		page_counter_charge(&parent->hugepage[idx], nr_pages);
	}
	counter = &h_cg->hugepage[idx];
	/* Take the pages off the local counter */
	page_counter_cancel(counter, nr_pages);

	set_hugetlb_cgroup(folio, parent);
out:
	return;
}

/*
 * Force the hugetlb cgroup to empty the hugetlb resources by moving them to
 * the parent cgroup.
 */
static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
{
	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
	struct hstate *h;
	struct folio *folio;

	do {
		for_each_hstate(h) {
			spin_lock_irq(&hugetlb_lock);
			list_for_each_entry(folio, &h->hugepage_activelist, lru)
				hugetlb_cgroup_move_parent(hstate_index(h), h_cg, folio);

			spin_unlock_irq(&hugetlb_lock);
		}
		cond_resched();
	} while (hugetlb_cgroup_have_usage(h_cg));
}

static inline void hugetlb_event(struct hugetlb_cgroup *hugetlb, int idx,
				 enum hugetlb_memory_event event)
{
	atomic_long_inc(&hugetlb->events_local[idx][event]);
	cgroup_file_notify(&hugetlb->events_local_file[idx]);

	do {
		atomic_long_inc(&hugetlb->events[idx][event]);
		cgroup_file_notify(&hugetlb->events_file[idx]);
	} while ((hugetlb = parent_hugetlb_cgroup(hugetlb)) &&
		 !hugetlb_cgroup_is_root(hugetlb));
}

static int __hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
					  struct hugetlb_cgroup **ptr,
					  bool rsvd)
{
	int ret = 0;
	struct page_counter *counter;
	struct hugetlb_cgroup *h_cg = NULL;

	if (hugetlb_cgroup_disabled())
		goto done;
again:
	rcu_read_lock();
	h_cg = hugetlb_cgroup_from_task(current);
	if (!css_tryget(&h_cg->css)) {
		rcu_read_unlock();
		goto again;
	}
	rcu_read_unlock();

	if (!page_counter_try_charge(
		    __hugetlb_cgroup_counter_from_cgroup(h_cg, idx, rsvd),
		    nr_pages, &counter)) {
		ret = -ENOMEM;
		hugetlb_event(h_cg, idx, HUGETLB_MAX);
		css_put(&h_cg->css);
		goto done;
	}
	/* Reservations take a reference to the css because they do not get
	 * reparented.
	 */
	if (!rsvd)
		css_put(&h_cg->css);
done:
	*ptr = h_cg;
	return ret;
}

int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
				 struct hugetlb_cgroup **ptr)
{
	return __hugetlb_cgroup_charge_cgroup(idx, nr_pages, ptr, false);
}

int hugetlb_cgroup_charge_cgroup_rsvd(int idx, unsigned long nr_pages,
				      struct hugetlb_cgroup **ptr)
{
	return __hugetlb_cgroup_charge_cgroup(idx, nr_pages, ptr, true);
}

/* Should be called with hugetlb_lock held */
static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
					   struct hugetlb_cgroup *h_cg,
					   struct folio *folio, bool rsvd)
{
	if (hugetlb_cgroup_disabled() || !h_cg)
		return;
	lockdep_assert_held(&hugetlb_lock);
	__set_hugetlb_cgroup(folio, h_cg, rsvd);
	if (!rsvd) {
		unsigned long usage =
			h_cg->nodeinfo[folio_nid(folio)]->usage[idx];
		/*
		 * This write is not atomic due to fetching usage and writing
		 * to it, but that's fine because we call this with
		 * hugetlb_lock held anyway.
		 */
		WRITE_ONCE(h_cg->nodeinfo[folio_nid(folio)]->usage[idx],
			   usage + nr_pages);
	}
}

void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
				  struct hugetlb_cgroup *h_cg,
				  struct folio *folio)
{
	__hugetlb_cgroup_commit_charge(idx, nr_pages, h_cg, folio, false);
}

void hugetlb_cgroup_commit_charge_rsvd(int idx, unsigned long nr_pages,
				       struct hugetlb_cgroup *h_cg,
				       struct folio *folio)
{
	__hugetlb_cgroup_commit_charge(idx, nr_pages, h_cg, folio, true);
}

/*
 * Should be called with hugetlb_lock held
 */
static void __hugetlb_cgroup_uncharge_folio(int idx, unsigned long nr_pages,
					    struct folio *folio, bool rsvd)
{
	struct hugetlb_cgroup *h_cg;

	if (hugetlb_cgroup_disabled())
		return;
	lockdep_assert_held(&hugetlb_lock);
	h_cg = __hugetlb_cgroup_from_folio(folio, rsvd);
	if (unlikely(!h_cg))
		return;
	__set_hugetlb_cgroup(folio, NULL, rsvd);

	page_counter_uncharge(__hugetlb_cgroup_counter_from_cgroup(h_cg, idx,
								   rsvd),
			      nr_pages);

	if (rsvd)
		css_put(&h_cg->css);
	else {
		unsigned long usage =
			h_cg->nodeinfo[folio_nid(folio)]->usage[idx];
		/*
		 * This write is not atomic due to fetching usage and writing
		 * to it, but that's fine because we call this with
		 * hugetlb_lock held anyway.
		 */
		WRITE_ONCE(h_cg->nodeinfo[folio_nid(folio)]->usage[idx],
			   usage - nr_pages);
	}
}

void hugetlb_cgroup_uncharge_folio(int idx, unsigned long nr_pages,
				   struct folio *folio)
{
	__hugetlb_cgroup_uncharge_folio(idx, nr_pages, folio, false);
}

void hugetlb_cgroup_uncharge_folio_rsvd(int idx, unsigned long nr_pages,
					struct folio *folio)
{
	__hugetlb_cgroup_uncharge_folio(idx, nr_pages, folio, true);
}

static void __hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
					     struct hugetlb_cgroup *h_cg,
					     bool rsvd)
{
	if (hugetlb_cgroup_disabled() || !h_cg)
		return;

	page_counter_uncharge(__hugetlb_cgroup_counter_from_cgroup(h_cg, idx,
								   rsvd),
			      nr_pages);

	if (rsvd)
		css_put(&h_cg->css);
}

void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
				    struct hugetlb_cgroup *h_cg)
{
	__hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg, false);
}

void hugetlb_cgroup_uncharge_cgroup_rsvd(int idx, unsigned long nr_pages,
					 struct hugetlb_cgroup *h_cg)
{
	__hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg, true);
}

void hugetlb_cgroup_uncharge_counter(struct resv_map *resv, unsigned long start,
				     unsigned long end)
{
	if (hugetlb_cgroup_disabled() || !resv || !resv->reservation_counter ||
	    !resv->css)
		return;

	page_counter_uncharge(resv->reservation_counter,
			      (end - start) * resv->pages_per_hpage);
	css_put(resv->css);
}

void hugetlb_cgroup_uncharge_file_region(struct resv_map *resv,
					 struct file_region *rg,
					 unsigned long nr_pages,
					 bool region_del)
{
	if (hugetlb_cgroup_disabled() || !resv || !rg || !nr_pages)
		return;

	if (rg->reservation_counter && resv->pages_per_hpage &&
	    !resv->reservation_counter) {
		page_counter_uncharge(rg->reservation_counter,
				      nr_pages * resv->pages_per_hpage);
		/*
		 * Only do css_put(rg->css) when we delete the entire region
		 * because one file_region must hold exactly one css reference.
		 */
		if (region_del)
			css_put(rg->css);
	}
}

enum {
	RES_USAGE,
	RES_RSVD_USAGE,
	RES_LIMIT,
	RES_RSVD_LIMIT,
	RES_MAX_USAGE,
	RES_RSVD_MAX_USAGE,
	RES_FAILCNT,
	RES_RSVD_FAILCNT,
};

static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
{
	int nid;
	struct cftype *cft = seq_cft(seq);
	int idx = MEMFILE_IDX(cft->private);
	bool legacy = !cgroup_subsys_on_dfl(hugetlb_cgrp_subsys);
	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
	struct cgroup_subsys_state *css;
	unsigned long usage;

	if (legacy) {
		/* Add up usage across all nodes for the non-hierarchical total. */
		usage = 0;
		for_each_node_state(nid, N_MEMORY)
			usage += READ_ONCE(h_cg->nodeinfo[nid]->usage[idx]);
		seq_printf(seq, "total=%lu", usage * PAGE_SIZE);

		/* Simply print the per-node usage for the non-hierarchical total. */
		for_each_node_state(nid, N_MEMORY)
			seq_printf(seq, " N%d=%lu", nid,
				   READ_ONCE(h_cg->nodeinfo[nid]->usage[idx]) *
					   PAGE_SIZE);
		seq_putc(seq, '\n');
	}

	/*
	 * The hierarchical total is pretty much the value recorded by the
	 * counter, so use that.
	 */
	seq_printf(seq, "%stotal=%lu", legacy ? "hierarchical_" : "",
		   page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);

	/*
	 * For each node, traverse the css tree to obtain the hierarchical
	 * node usage.
	 */
	for_each_node_state(nid, N_MEMORY) {
		usage = 0;
		rcu_read_lock();
		css_for_each_descendant_pre(css, &h_cg->css) {
			usage += READ_ONCE(hugetlb_cgroup_from_css(css)
						   ->nodeinfo[nid]
						   ->usage[idx]);
		}
		rcu_read_unlock();
		seq_printf(seq, " N%d=%lu", nid, usage * PAGE_SIZE);
	}

	seq_putc(seq, '\n');

	return 0;
}

static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
				   struct cftype *cft)
{
	struct page_counter *counter;
	struct page_counter *rsvd_counter;
	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
	rsvd_counter = &h_cg->rsvd_hugepage[MEMFILE_IDX(cft->private)];

	switch (MEMFILE_ATTR(cft->private)) {
	case RES_USAGE:
		return (u64)page_counter_read(counter) * PAGE_SIZE;
	case RES_RSVD_USAGE:
		return (u64)page_counter_read(rsvd_counter) * PAGE_SIZE;
	case RES_LIMIT:
		return (u64)counter->max * PAGE_SIZE;
	case RES_RSVD_LIMIT:
		return (u64)rsvd_counter->max * PAGE_SIZE;
	case RES_MAX_USAGE:
		return (u64)counter->watermark * PAGE_SIZE;
	case RES_RSVD_MAX_USAGE:
		return (u64)rsvd_counter->watermark * PAGE_SIZE;
	case RES_FAILCNT:
		return counter->failcnt;
	case RES_RSVD_FAILCNT:
		return rsvd_counter->failcnt;
	default:
		BUG();
	}
}

static int hugetlb_cgroup_read_u64_max(struct seq_file *seq, void *v)
{
	int idx;
	u64 val;
	struct cftype *cft = seq_cft(seq);
	unsigned long limit;
	struct page_counter *counter;
	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));

	idx = MEMFILE_IDX(cft->private);
	counter = &h_cg->hugepage[idx];

	limit = round_down(PAGE_COUNTER_MAX,
			   pages_per_huge_page(&hstates[idx]));

	switch (MEMFILE_ATTR(cft->private)) {
	case RES_RSVD_USAGE:
		counter = &h_cg->rsvd_hugepage[idx];
		fallthrough;
	case RES_USAGE:
		val = (u64)page_counter_read(counter);
		seq_printf(seq, "%llu\n", val * PAGE_SIZE);
		break;
	case RES_RSVD_LIMIT:
		counter = &h_cg->rsvd_hugepage[idx];
		fallthrough;
	case RES_LIMIT:
		val = (u64)counter->max;
		if (val == limit)
			seq_puts(seq, "max\n");
		else
			seq_printf(seq, "%llu\n", val * PAGE_SIZE);
		break;
	default:
		BUG();
	}

	return 0;
}

static DEFINE_MUTEX(hugetlb_limit_mutex);

static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
				    char *buf, size_t nbytes, loff_t off,
				    const char *max)
{
	int ret, idx;
	unsigned long nr_pages;
	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
	bool rsvd = false;

	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
		return -EINVAL;

	buf = strstrip(buf);
	ret = page_counter_memparse(buf, max, &nr_pages);
	if (ret)
		return ret;

	idx = MEMFILE_IDX(of_cft(of)->private);
	nr_pages = round_down(nr_pages, pages_per_huge_page(&hstates[idx]));

	switch (MEMFILE_ATTR(of_cft(of)->private)) {
	case RES_RSVD_LIMIT:
		rsvd = true;
		fallthrough;
	case RES_LIMIT:
		mutex_lock(&hugetlb_limit_mutex);
		ret = page_counter_set_max(
			__hugetlb_cgroup_counter_from_cgroup(h_cg, idx, rsvd),
			nr_pages);
		mutex_unlock(&hugetlb_limit_mutex);
		break;
	default:
		ret = -EINVAL;
		break;
	}
	return ret ?: nbytes;
}

static ssize_t hugetlb_cgroup_write_legacy(struct kernfs_open_file *of,
					   char *buf, size_t nbytes, loff_t off)
{
	return hugetlb_cgroup_write(of, buf, nbytes, off, "-1");
}

static ssize_t hugetlb_cgroup_write_dfl(struct kernfs_open_file *of,
					char *buf, size_t nbytes, loff_t off)
{
	return hugetlb_cgroup_write(of, buf, nbytes, off, "max");
}

static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
				    char *buf, size_t nbytes, loff_t off)
{
	int ret = 0;
	struct page_counter *counter, *rsvd_counter;
	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));

	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
	rsvd_counter = &h_cg->rsvd_hugepage[MEMFILE_IDX(of_cft(of)->private)];

	switch (MEMFILE_ATTR(of_cft(of)->private)) {
	case RES_MAX_USAGE:
		page_counter_reset_watermark(counter);
		break;
	case RES_RSVD_MAX_USAGE:
		page_counter_reset_watermark(rsvd_counter);
		break;
	case RES_FAILCNT:
		counter->failcnt = 0;
		break;
	case RES_RSVD_FAILCNT:
		rsvd_counter->failcnt = 0;
		break;
	default:
		ret = -EINVAL;
		break;
	}
	return ret ?: nbytes;
}

static char *mem_fmt(char *buf, int size, unsigned long hsize)
{
	if (hsize >= SZ_1G)
		snprintf(buf, size, "%luGB", hsize / SZ_1G);
	else if (hsize >= SZ_1M)
		snprintf(buf, size, "%luMB", hsize / SZ_1M);
	else
		snprintf(buf, size, "%luKB", hsize / SZ_1K);
	return buf;
}

static int __hugetlb_events_show(struct seq_file *seq, bool local)
{
	int idx;
	long max;
	struct cftype *cft = seq_cft(seq);
	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));

	idx = MEMFILE_IDX(cft->private);

	if (local)
		max = atomic_long_read(&h_cg->events_local[idx][HUGETLB_MAX]);
	else
		max = atomic_long_read(&h_cg->events[idx][HUGETLB_MAX]);

	seq_printf(seq, "max %lu\n", max);

	return 0;
}

static int hugetlb_events_show(struct seq_file *seq, void *v)
{
	return __hugetlb_events_show(seq, false);
}

static int hugetlb_events_local_show(struct seq_file *seq, void *v)
{
	return __hugetlb_events_show(seq, true);
}

static struct cftype hugetlb_dfl_tmpl[] = {
	{
		.name = "max",
		.private = RES_LIMIT,
		.seq_show = hugetlb_cgroup_read_u64_max,
		.write = hugetlb_cgroup_write_dfl,
		.flags = CFTYPE_NOT_ON_ROOT,
	},
	{
		.name = "rsvd.max",
		.private = RES_RSVD_LIMIT,
		.seq_show = hugetlb_cgroup_read_u64_max,
		.write = hugetlb_cgroup_write_dfl,
		.flags = CFTYPE_NOT_ON_ROOT,
	},
	{
		.name = "current",
		.private = RES_USAGE,
		.seq_show = hugetlb_cgroup_read_u64_max,
		.flags = CFTYPE_NOT_ON_ROOT,
	},
	{
		.name = "rsvd.current",
		.private = RES_RSVD_USAGE,
		.seq_show = hugetlb_cgroup_read_u64_max,
		.flags = CFTYPE_NOT_ON_ROOT,
	},
	{
		.name = "events",
		.seq_show = hugetlb_events_show,
		.file_offset = MEMFILE_OFFSET(struct hugetlb_cgroup, events_file[0]),
		.flags = CFTYPE_NOT_ON_ROOT,
	},
	{
		.name = "events.local",
		.seq_show = hugetlb_events_local_show,
		.file_offset = MEMFILE_OFFSET(struct hugetlb_cgroup, events_local_file[0]),
		.flags = CFTYPE_NOT_ON_ROOT,
	},
	{
		.name = "numa_stat",
		.seq_show = hugetlb_cgroup_read_numa_stat,
		.flags = CFTYPE_NOT_ON_ROOT,
	},
	/* don't need terminator here */
};

static struct cftype hugetlb_legacy_tmpl[] = {
	{
		.name = "limit_in_bytes",
		.private = RES_LIMIT,
		.read_u64 = hugetlb_cgroup_read_u64,
		.write = hugetlb_cgroup_write_legacy,
	},
	{
		.name = "rsvd.limit_in_bytes",
		.private = RES_RSVD_LIMIT,
		.read_u64 = hugetlb_cgroup_read_u64,
		.write = hugetlb_cgroup_write_legacy,
	},
	{
		.name = "usage_in_bytes",
		.private = RES_USAGE,
		.read_u64 = hugetlb_cgroup_read_u64,
	},
	{
		.name = "rsvd.usage_in_bytes",
		.private = RES_RSVD_USAGE,
		.read_u64 = hugetlb_cgroup_read_u64,
	},
	{
		.name = "max_usage_in_bytes",
		.private = RES_MAX_USAGE,
		.write = hugetlb_cgroup_reset,
		.read_u64 = hugetlb_cgroup_read_u64,
	},
	{
		.name = "rsvd.max_usage_in_bytes",
		.private = RES_RSVD_MAX_USAGE,
		.write = hugetlb_cgroup_reset,
		.read_u64 = hugetlb_cgroup_read_u64,
	},
	{
		.name = "failcnt",
		.private = RES_FAILCNT,
		.write = hugetlb_cgroup_reset,
		.read_u64 = hugetlb_cgroup_read_u64,
	},
	{
		.name = "rsvd.failcnt",
		.private = RES_RSVD_FAILCNT,
		.write = hugetlb_cgroup_reset,
		.read_u64 = hugetlb_cgroup_read_u64,
	},
	{
		.name = "numa_stat",
		.seq_show = hugetlb_cgroup_read_numa_stat,
	},
	/* don't need terminator here */
};

static void __init
hugetlb_cgroup_cfttypes_init(struct hstate *h, struct cftype *cft,
			     struct cftype *tmpl, int tmpl_size)
{
	char buf[32];
	int i, idx = hstate_index(h);

	/* format the size */
	mem_fmt(buf, sizeof(buf), huge_page_size(h));

	for (i = 0; i < tmpl_size; cft++, tmpl++, i++) {
		*cft = *tmpl;
		/* rebuild the name */
		snprintf(cft->name, MAX_CFTYPE_NAME, "%s.%s", buf, tmpl->name);
		/* rebuild the private */
		cft->private = MEMFILE_PRIVATE(idx, tmpl->private);
		/* rebuild the file_offset */
		if (tmpl->file_offset) {
			unsigned int offset = tmpl->file_offset;

			cft->file_offset = MEMFILE_OFFSET0(offset) +
					   MEMFILE_FIELD_SIZE(offset) * idx;
		}

		lockdep_register_key(&cft->lockdep_key);
	}
}

static void __init __hugetlb_cgroup_file_dfl_init(struct hstate *h)
{
	int idx = hstate_index(h);

	hugetlb_cgroup_cfttypes_init(h, dfl_files + idx * DFL_TMPL_SIZE,
				     hugetlb_dfl_tmpl, DFL_TMPL_SIZE);
}

static void __init
__hugetlb_cgroup_file_legacy_init(struct hstate *h)
{
	int idx = hstate_index(h);

	hugetlb_cgroup_cfttypes_init(h, legacy_files + idx * LEGACY_TMPL_SIZE,
				     hugetlb_legacy_tmpl, LEGACY_TMPL_SIZE);
}

static void __init __hugetlb_cgroup_file_init(struct hstate *h)
{
	__hugetlb_cgroup_file_dfl_init(h);
	__hugetlb_cgroup_file_legacy_init(h);
}

static void __init __hugetlb_cgroup_file_pre_init(void)
{
	int cft_count;

	cft_count = hugetlb_max_hstate * DFL_TMPL_SIZE + 1; /* add terminator */
	dfl_files = kcalloc(cft_count, sizeof(struct cftype), GFP_KERNEL);
	BUG_ON(!dfl_files);
	cft_count = hugetlb_max_hstate * LEGACY_TMPL_SIZE + 1; /* add terminator */
	legacy_files = kcalloc(cft_count, sizeof(struct cftype), GFP_KERNEL);
	BUG_ON(!legacy_files);
}

static void __init __hugetlb_cgroup_file_post_init(void)
{
	WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
				       dfl_files));
	WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
					  legacy_files));
}

void __init hugetlb_cgroup_file_init(void)
{
	struct hstate *h;

	__hugetlb_cgroup_file_pre_init();
	for_each_hstate(h)
		__hugetlb_cgroup_file_init(h);
	__hugetlb_cgroup_file_post_init();
}

/*
 * hugetlb_lock will make sure a parallel cgroup rmdir won't happen
 * when we migrate hugepages
 */
void hugetlb_cgroup_migrate(struct folio *old_folio, struct folio *new_folio)
{
	struct hugetlb_cgroup *h_cg;
	struct hugetlb_cgroup *h_cg_rsvd;
	struct hstate *h = folio_hstate(old_folio);

	if (hugetlb_cgroup_disabled())
		return;

	spin_lock_irq(&hugetlb_lock);
	h_cg = hugetlb_cgroup_from_folio(old_folio);
	h_cg_rsvd = hugetlb_cgroup_from_folio_rsvd(old_folio);
	set_hugetlb_cgroup(old_folio, NULL);
	set_hugetlb_cgroup_rsvd(old_folio, NULL);

	/* move the h_cg details to new cgroup */
	set_hugetlb_cgroup(new_folio, h_cg);
	set_hugetlb_cgroup_rsvd(new_folio, h_cg_rsvd);
	list_move(&new_folio->lru, &h->hugepage_activelist);
	spin_unlock_irq(&hugetlb_lock);
}

static struct cftype hugetlb_files[] = {
	{} /* terminate */
};

struct cgroup_subsys hugetlb_cgrp_subsys = {
	.css_alloc	= hugetlb_cgroup_css_alloc,
	.css_offline	= hugetlb_cgroup_css_offline,
	.css_free	= hugetlb_cgroup_css_free,
	.dfl_cftypes	= hugetlb_files,
	.legacy_cftypes	= hugetlb_files,
};
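
The listing packs the hstate index and the resource attribute into one `private` value and builds each control file name from the huge page size plus a template name. A minimal userspace sketch of those two steps follows. It assumes plain C with no kernel headers; the `DEMO_SZ_*` constants and the `demo_mem_fmt()` and `demo_cft_name()` helpers are hypothetical stand-ins for `SZ_*`, `mem_fmt()` and the `snprintf()` call in `hugetlb_cgroup_cfttypes_init()`, not kernel API.

```c
#include <stdio.h>

/* Same bit layout as the MEMFILE_* macros in the listing: idx in the
 * upper 16 bits, attribute in the lower 16 bits. */
#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
#define MEMFILE_IDX(val)	(((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val)	((val) & 0xffff)

/* Hypothetical stand-ins for the kernel's SZ_1K, SZ_1M and SZ_1G. */
#define DEMO_SZ_1K	1024UL
#define DEMO_SZ_1M	(1024UL * 1024UL)
#define DEMO_SZ_1G	(1024UL * 1024UL * 1024UL)

/* Userspace copy of mem_fmt(): render a huge page size as "2MB", "1GB", ... */
static char *demo_mem_fmt(char *buf, int size, unsigned long hsize)
{
	if (hsize >= DEMO_SZ_1G)
		snprintf(buf, size, "%luGB", hsize / DEMO_SZ_1G);
	else if (hsize >= DEMO_SZ_1M)
		snprintf(buf, size, "%luMB", hsize / DEMO_SZ_1M);
	else
		snprintf(buf, size, "%luKB", hsize / DEMO_SZ_1K);
	return buf;
}

/* Build "<size>.<template name>", mirroring the "%s.%s" format used when
 * the per-hstate cftype name is rebuilt from its template. */
static char *demo_cft_name(char *out, int size, unsigned long hsize,
			   const char *tmpl_name)
{
	char buf[32];

	snprintf(out, size, "%s.%s",
		 demo_mem_fmt(buf, sizeof(buf), hsize), tmpl_name);
	return out;
}
```

Calling `demo_cft_name(name, sizeof(name), 2UL << 20, "max")` from a small `main()` fills `name` with the size-prefixed template name, and `MEMFILE_IDX()` / `MEMFILE_ATTR()` recover the two halves of a value built with `MEMFILE_PRIVATE()`.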
2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 
2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 
3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 
3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3773 3774 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 3841 3842 3843 3844 3845 3846 3847 3848 3849 3850 3851 3852 3853 3854 3855 3856 3857 3858 3859 3860 3861 3862 3863 3864 3865 3866 3867 3868 3869 3870 3871 3872 3873 3874 3875 3876 3877 3878 3879 3880 3881 3882 3883 3884 3885 3886 3887 3888 3889 3890 3891 3892 3893 3894 3895 3896 3897 3898 3899 3900 3901 3902 3903 3904 3905 3906 3907 3908 3909 3910 3911 3912 3913 3914 3915 3916 3917 3918 3919 3920 3921 3922 3923 3924 3925 3926 3927 3928 3929 3930 3931 3932 3933 3934 3935 3936 3937 3938 3939 3940 3941 3942 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 3953 3954 3955 3956 3957 3958 3959 3960 3961 3962 3963 3964 3965 3966 3967 3968 3969 3970 3971 3972 3973 3974 3975 3976 3977 3978 3979 3980 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009 4010 4011 4012 4013 4014 4015 4016 4017 4018 4019 4020 4021 4022 4023 4024 4025 4026 4027 4028 4029 4030 4031 4032 4033 4034 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046 4047 4048 4049 4050 4051 4052 4053 4054 4055 4056 4057 4058 4059 4060 4061 4062 4063 4064 4065 4066 4067 4068 4069 4070 4071 4072 4073 4074 4075 4076 4077 4078 4079 4080 4081 4082 4083 4084 4085 4086 4087 4088 4089 4090 4091 4092 4093 4094 4095 4096 4097 4098 4099 4100 4101 4102 4103 4104 4105 4106 4107 4108 4109 4110 4111 4112 4113 4114 4115 4116 4117 4118 4119 4120 4121 4122 4123 4124 4125 4126 4127 4128 4129 4130 4131 4132 4133 4134 4135 4136 4137 4138 4139 4140 4141 
4142 4143 4144 4145 4146 4147 4148 4149 4150 4151 4152 4153 4154 4155 4156 4157 4158 4159 4160 4161 4162 4163 4164 4165 4166 4167 4168 4169 4170 4171 4172 4173 4174 4175 4176 4177 4178 4179 4180 4181 4182 4183 4184 4185 4186 4187 4188 4189 4190 4191 4192 4193 4194 4195 4196 4197 4198 4199 4200 4201 4202 4203 4204 4205 4206 4207 4208 4209 4210 4211 4212 4213 4214 4215 4216 4217 4218 4219 4220 4221 4222 4223 4224 4225 4226 4227 4228 4229 4230 4231 4232 4233 4234 4235 4236 4237 4238 4239 4240 4241 4242 4243 4244 4245 4246 4247 4248 4249 4250 4251 4252 4253 4254 4255 4256 4257 4258 4259 4260 4261 4262 4263 4264 4265 4266 4267 4268 4269 4270 4271 4272 4273 4274 4275 4276 4277 4278 4279 4280 4281 4282 4283 4284 4285 4286 4287 4288 4289 4290 4291 4292 4293 4294 4295 4296 4297 4298 4299 4300 4301 4302 4303 4304 4305 4306 4307 4308 4309 4310 4311 4312 4313 4314 4315 4316 4317 4318 4319 4320 4321 4322 4323 4324 4325 4326 4327 4328 4329 4330 4331 4332 4333 4334 4335 4336 4337 4338 4339 4340 4341 4342 4343 4344 4345 4346 4347 4348 4349 4350 4351 4352 4353 4354 4355 4356 4357 4358 4359 4360 4361 4362 4363 4364 4365 4366 4367 4368 4369 4370 4371 4372 4373 4374 4375 4376 4377 4378 4379 4380 4381 4382 4383 4384 4385 4386 4387 4388 4389 4390 4391 4392 4393 4394 4395 4396 4397 4398 4399 4400 4401 4402 4403 4404 4405 4406 4407 4408 4409 4410 4411 4412 4413 4414 4415 4416 4417 4418 4419 4420 4421 4422 4423 4424 4425 4426 4427 4428 4429 4430 4431 4432 4433 4434 4435 4436 4437 4438 4439 4440 4441 4442 4443 4444 4445 4446 4447 4448 4449 4450 4451 4452 4453 4454 4455 4456 4457 4458 4459 4460 4461 4462 4463 4464 4465 4466 4467 4468 4469 4470 4471 4472 4473 4474 4475 4476 4477 4478 4479 4480 4481 4482 4483 4484 4485 4486 4487 4488 4489 4490 4491 4492 4493 4494 4495 4496 4497 4498 4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 4509 4510 4511 4512 4513 4514 4515 4516 4517 4518 4519 4520 4521 4522 4523 4524 4525 4526 4527 4528 4529 4530 4531 4532 4533 4534 4535 4536 4537 4538 4539 4540 4541 
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * INET		An implementation of the TCP/IP protocol suite for the LINUX
 *		operating system.  INET is implemented using the  BSD Socket
 *		interface as the means of communication with the user level.
 *
 *		PACKET - implements raw packet sockets.
 *
 * Authors:	Ross Biro
 *		Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 *		Alan Cox, <gw4pts@gw4pts.ampr.org>
 *
 * Fixes:
 *		Alan Cox	:	verify_area() now used correctly
 *		Alan Cox	:	new skbuff lists, look ma no backlogs!
 *		Alan Cox	:	tidied skbuff lists.
 *		Alan Cox	:	Now uses generic datagram routines I
 *					added. Also fixed the peek/read crash
 *					from all old Linux datagram code.
 *		Alan Cox	:	Uses the improved datagram code.
 *		Alan Cox	:	Added NULL's for socket options.
 *		Alan Cox	:	Re-commented the code.
 *		Alan Cox	:	Use new kernel side addressing
 *		Rob Janssen	:	Correct MTU usage.
 *		Dave Platt	:	Counter leaks caused by incorrect
 *					interrupt locking and some slightly
 *					dubious gcc output. Can you read
 *					compiler: it said _VOLATILE_
 *		Richard Kooijman :	Timestamp fixes.
 *		Alan Cox	:	New buffers. Use sk->mac.raw.
 *		Alan Cox	:	sendmsg/recvmsg support.
 *		Alan Cox	:	Protocol setting support
 *		Alexey Kuznetsov :	Untied from IPv4 stack.
 *		Cyrus Durgin	:	Fixed kerneld for kmod.
 *		Michal Ostrowski :	Module initialization cleanup.
 *		Ulises Alonso	:	Frame number limit removal and
 *					packet_set_ring memory leak.
 *		Eric Biederman	:	Allow for > 8 byte hardware addresses.
 *					The convention is that longer addresses
 *					will simply extend the hardware address
 *					byte arrays at the end of sockaddr_ll
 *					and packet_mreq.
 *		Johann Baudy	:	Added TX RING.
 *		Chetan Loke	:	Implemented TPACKET_V3 block abstraction
 *					layer.
 *					Copyright (C) 2011, <lokec@ccs.neu.edu>
 */

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/ethtool.h>
#include <linux/filter.h>
#include <linux/types.h>
#include <linux/mm.h>
#include <linux/capability.h>
#include <linux/fcntl.h>
#include <linux/socket.h>
#include <linux/in.h>
#include <linux/inet.h>
#include <linux/netdevice.h>
#include <linux/if_packet.h>
#include <linux/wireless.h>
#include <linux/kernel.h>
#include <linux/kmod.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <net/net_namespace.h>
#include <net/ip.h>
#include <net/protocol.h>
#include <linux/skbuff.h>
#include <net/sock.h>
#include <linux/errno.h>
#include <linux/timer.h>
#include <linux/uaccess.h>
#include <asm/ioctls.h>
#include <asm/page.h>
#include <asm/cacheflush.h>
#include <asm/io.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/poll.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/mutex.h>
#include <linux/if_vlan.h>
#include <linux/virtio_net.h>
#include <linux/errqueue.h>
#include <linux/net_tstamp.h>
#include <linux/percpu.h>
#ifdef CONFIG_INET
#include <net/inet_common.h>
#endif
#include <linux/bpf.h>
#include <net/compat.h>
#include <linux/netfilter_netdev.h>

#include "internal.h"

/*
   Assumptions:
   - If the device has no dev->header_ops->create, there is no LL header
     visible above the device. In this case, its hard_header_len should be 0.
     The device may prepend its own header internally. In this case, its
     needed_headroom should be set to the space needed for it to add its
     internal header.
     For example, a WiFi driver pretending to be an Ethernet driver should
     set its hard_header_len to be the Ethernet header length, and set its
     needed_headroom to be (the real WiFi header length - the fake Ethernet
     header length).
   - packet socket receives packets with pulled ll header,
     so that SOCK_RAW should push it back.

On receive:
-----------

Incoming, dev_has_header(dev) == true
   mac_header -> ll header
   data       -> data

Outgoing, dev_has_header(dev) == true
   mac_header -> ll header
   data       -> ll header

Incoming, dev_has_header(dev) == false
   mac_header -> data
     However drivers often make it point to the ll header.
     This is incorrect because the ll header should be invisible to us.
   data       -> data

Outgoing, dev_has_header(dev) == false
   mac_header -> data. ll header is invisible to us.
   data       -> data

Resume
  If dev_has_header(dev) == false we are unable to restore the ll header,
    because it is invisible to us.


On transmit:
------------

dev_has_header(dev) == true
   mac_header -> ll header
   data       -> ll header

dev_has_header(dev) == false (ll header is invisible to us)
   mac_header -> data
   data       -> data

   We should set network_header on output to the correct position,
   packet classifier depends on it.
 */

/* Private packet socket structures. */

/* identical to struct packet_mreq except it has
 * a longer address field.
 */
struct packet_mreq_max {
	int		mr_ifindex;
	unsigned short	mr_type;
	unsigned short	mr_alen;
	unsigned char	mr_address[MAX_ADDR_LEN];
};

union tpacket_uhdr {
	struct tpacket_hdr  *h1;
	struct tpacket2_hdr *h2;
	struct tpacket3_hdr *h3;
	void *raw;
};

static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
		int closing, int tx_ring);

#define V3_ALIGNMENT	(8)

#define BLK_HDR_LEN	(ALIGN(sizeof(struct tpacket_block_desc), V3_ALIGNMENT))

#define BLK_PLUS_PRIV(sz_of_priv) \
	(BLK_HDR_LEN + ALIGN((sz_of_priv), V3_ALIGNMENT))

#define BLOCK_STATUS(x)	((x)->hdr.bh1.block_status)
#define BLOCK_NUM_PKTS(x)	((x)->hdr.bh1.num_pkts)
#define BLOCK_O2FP(x)		((x)->hdr.bh1.offset_to_first_pkt)
#define BLOCK_LEN(x)		((x)->hdr.bh1.blk_len)
#define BLOCK_SNUM(x)		((x)->hdr.bh1.seq_num)
#define BLOCK_O2PRIV(x)	((x)->offset_to_priv)

struct packet_sock;
static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
		       struct packet_type *pt, struct net_device *orig_dev);

static void *packet_previous_frame(struct packet_sock *po,
		struct packet_ring_buffer *rb,
		int status);
static void packet_increment_head(struct packet_ring_buffer *buff);
static int prb_curr_blk_in_use(struct tpacket_block_desc *);
static void *prb_dispatch_next_block(struct tpacket_kbdq_core *,
			struct packet_sock *);
static void prb_retire_current_block(struct tpacket_kbdq_core *,
		struct packet_sock *, unsigned int status);
static int prb_queue_frozen(struct tpacket_kbdq_core *);
static void prb_open_block(struct tpacket_kbdq_core *,
		struct tpacket_block_desc *);
static void prb_retire_rx_blk_timer_expired(struct timer_list *);
static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *);
static void prb_fill_rxhash(struct tpacket_kbdq_core *, struct tpacket3_hdr *);
static void prb_clear_rxhash(struct tpacket_kbdq_core *,
		struct tpacket3_hdr *);
static void prb_fill_vlan_info(struct tpacket_kbdq_core *,
		struct tpacket3_hdr *);
static void packet_flush_mclist(struct sock *sk);
static u16 packet_pick_tx_queue(struct sk_buff *skb);

struct packet_skb_cb {
	union {
		struct sockaddr_pkt pkt;
		union {
			/* Trick: alias skb original length with
			 * ll.sll_family and ll.protocol in order
			 * to save room.
			 */
			unsigned int origlen;
			struct sockaddr_ll ll;
		};
	} sa;
};

#define vio_le() virtio_legacy_is_little_endian()

#define PACKET_SKB_CB(__skb)	((struct packet_skb_cb *)((__skb)->cb))

#define GET_PBDQC_FROM_RB(x)	((struct tpacket_kbdq_core *)(&(x)->prb_bdqc))
#define GET_PBLOCK_DESC(x, bid)	\
	((struct tpacket_block_desc *)((x)->pkbdq[(bid)].buffer))
#define GET_CURR_PBLOCK_DESC_FROM_CORE(x)	\
	((struct tpacket_block_desc *)((x)->pkbdq[(x)->kactive_blk_num].buffer))
#define GET_NEXT_PRB_BLK_NUM(x) \
	(((x)->kactive_blk_num < ((x)->knum_blocks-1)) ? \
	((x)->kactive_blk_num+1) : 0)

static void __fanout_unlink(struct sock *sk, struct packet_sock *po);
static void __fanout_link(struct sock *sk, struct packet_sock *po);

#ifdef CONFIG_NETFILTER_EGRESS
static noinline struct sk_buff *nf_hook_direct_egress(struct sk_buff *skb)
{
	struct sk_buff *next, *head = NULL, *tail;
	int rc;

	rcu_read_lock();
	for (; skb != NULL; skb = next) {
		next = skb->next;
		skb_mark_not_on_list(skb);

		if (!nf_hook_egress(skb, &rc, skb->dev))
			continue;

		if (!head)
			head = skb;
		else
			tail->next = skb;

		tail = skb;
	}
	rcu_read_unlock();

	return head;
}
#endif

static int packet_xmit(const struct packet_sock *po, struct sk_buff *skb)
{
	if (!packet_sock_flag(po, PACKET_SOCK_QDISC_BYPASS))
		return dev_queue_xmit(skb);

#ifdef CONFIG_NETFILTER_EGRESS
	if (nf_hook_egress_active()) {
		skb = nf_hook_direct_egress(skb);
		if (!skb)
			return NET_XMIT_DROP;
	}
#endif
	return dev_direct_xmit(skb, packet_pick_tx_queue(skb));
}

static struct net_device *packet_cached_dev_get(struct packet_sock *po)
{
	struct net_device *dev;

	rcu_read_lock();
	dev = rcu_dereference(po->cached_dev);
	dev_hold(dev);
	rcu_read_unlock();

	return dev;
}

static void packet_cached_dev_assign(struct packet_sock *po,
				     struct net_device *dev)
{
	rcu_assign_pointer(po->cached_dev, dev);
}

static void packet_cached_dev_reset(struct packet_sock *po)
{
	RCU_INIT_POINTER(po->cached_dev, NULL);
}

static u16 packet_pick_tx_queue(struct sk_buff *skb)
{
	struct net_device *dev = skb->dev;
	const struct net_device_ops *ops = dev->netdev_ops;
	int cpu = raw_smp_processor_id();
	u16 queue_index;

#ifdef CONFIG_XPS
	skb->sender_cpu = cpu + 1;
#endif
	skb_record_rx_queue(skb, cpu % dev->real_num_tx_queues);
	if (ops->ndo_select_queue) {
		queue_index = ops->ndo_select_queue(dev, skb, NULL);
		queue_index = netdev_cap_txqueue(dev, queue_index);
	} else {
		queue_index = netdev_pick_tx(dev, skb, NULL);
	}

	return queue_index;
}

/* __register_prot_hook must be invoked through register_prot_hook
 * or from a context in which asynchronous accesses to the packet
 * socket is not possible (packet_create()).
 */
static void __register_prot_hook(struct sock *sk)
{
	struct packet_sock *po = pkt_sk(sk);

	if (!packet_sock_flag(po, PACKET_SOCK_RUNNING)) {
		if (po->fanout)
			__fanout_link(sk, po);
		else
			dev_add_pack(&po->prot_hook);

		sock_hold(sk);
		packet_sock_flag_set(po, PACKET_SOCK_RUNNING, 1);
	}
}

static void register_prot_hook(struct sock *sk)
{
	lockdep_assert_held_once(&pkt_sk(sk)->bind_lock);
	__register_prot_hook(sk);
}

/* If the sync parameter is true, we will temporarily drop
 * the po->bind_lock and do a synchronize_net to make sure no
 * asynchronous packet processing paths still refer to the elements
 * of po->prot_hook.  If the sync parameter is false, it is the
 * callers responsibility to take care of this.
 */
static void __unregister_prot_hook(struct sock *sk, bool sync)
{
	struct packet_sock *po = pkt_sk(sk);

	lockdep_assert_held_once(&po->bind_lock);

	packet_sock_flag_set(po, PACKET_SOCK_RUNNING, 0);

	if (po->fanout)
		__fanout_unlink(sk, po);
	else
		__dev_remove_pack(&po->prot_hook);

	__sock_put(sk);

	if (sync) {
		spin_unlock(&po->bind_lock);
		synchronize_net();
		spin_lock(&po->bind_lock);
	}
}

static void unregister_prot_hook(struct sock *sk, bool sync)
{
	struct packet_sock *po = pkt_sk(sk);

	if (packet_sock_flag(po, PACKET_SOCK_RUNNING))
		__unregister_prot_hook(sk, sync);
}

static inline struct page * __pure pgv_to_page(void *addr)
{
	if (is_vmalloc_addr(addr))
		return vmalloc_to_page(addr);
	return virt_to_page(addr);
}

static void __packet_set_status(struct packet_sock *po, void *frame, int status)
{
	union tpacket_uhdr h;

	/* WRITE_ONCE() are paired with READ_ONCE() in __packet_get_status */

	h.raw = frame;
	switch (po->tp_version) {
	case TPACKET_V1:
		WRITE_ONCE(h.h1->tp_status, status);
		flush_dcache_page(pgv_to_page(&h.h1->tp_status));
		break;
	case TPACKET_V2:
		WRITE_ONCE(h.h2->tp_status, status);
		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
		break;
	case TPACKET_V3:
		WRITE_ONCE(h.h3->tp_status, status);
		flush_dcache_page(pgv_to_page(&h.h3->tp_status));
		break;
	default:
		WARN(1, "TPACKET version not supported.\n");
		BUG();
	}

	smp_wmb();
}

static int __packet_get_status(const struct packet_sock *po, void *frame)
{
	union tpacket_uhdr h;

	smp_rmb();

	/* READ_ONCE() are paired with WRITE_ONCE() in __packet_set_status */

	h.raw = frame;
	switch (po->tp_version) {
	case TPACKET_V1:
		flush_dcache_page(pgv_to_page(&h.h1->tp_status));
		return READ_ONCE(h.h1->tp_status);
	case TPACKET_V2:
		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
		return READ_ONCE(h.h2->tp_status);
	case TPACKET_V3:
		flush_dcache_page(pgv_to_page(&h.h3->tp_status));
		return READ_ONCE(h.h3->tp_status);
	default:
		WARN(1, "TPACKET version not supported.\n");
		BUG();
		return 0;
	}
}

static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
				   unsigned int flags)
{
	struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);

	if (shhwtstamps &&
	    (flags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
	    ktime_to_timespec64_cond(shhwtstamps->hwtstamp, ts))
		return TP_STATUS_TS_RAW_HARDWARE;

	if ((flags & SOF_TIMESTAMPING_SOFTWARE) &&
	    ktime_to_timespec64_cond(skb_tstamp(skb), ts))
		return TP_STATUS_TS_SOFTWARE;

	return 0;
}

static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame,
				    struct sk_buff *skb)
{
	union tpacket_uhdr h;
	struct timespec64 ts;
	__u32 ts_status;

	if (!(ts_status = tpacket_get_timestamp(skb, &ts, READ_ONCE(po->tp_tstamp))))
		return 0;

	h.raw = frame;
	/*
	 * versions 1 through 3 overflow the timestamps in y2106, since they
	 * all store the seconds in a 32-bit unsigned integer.
	 * If we create a version 4, that should have a 64-bit timestamp,
	 * either 64-bit seconds + 32-bit nanoseconds, or just 64-bit
	 * nanoseconds.
	 */
	switch (po->tp_version) {
	case TPACKET_V1:
		h.h1->tp_sec = ts.tv_sec;
		h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC;
		break;
	case TPACKET_V2:
		h.h2->tp_sec = ts.tv_sec;
		h.h2->tp_nsec = ts.tv_nsec;
		break;
	case TPACKET_V3:
		h.h3->tp_sec = ts.tv_sec;
		h.h3->tp_nsec = ts.tv_nsec;
		break;
	default:
		WARN(1, "TPACKET version not supported.\n");
		BUG();
	}

	/* one flush is safe, as both fields always lie on the same cacheline */
	flush_dcache_page(pgv_to_page(&h.h1->tp_sec));
	smp_wmb();

	return ts_status;
}

static void *packet_lookup_frame(const struct packet_sock *po,
				 const struct packet_ring_buffer *rb,
				 unsigned int position,
				 int status)
{
	unsigned int pg_vec_pos, frame_offset;
	union tpacket_uhdr h;

	pg_vec_pos = position / rb->frames_per_block;
	frame_offset = position % rb->frames_per_block;

	h.raw = rb->pg_vec[pg_vec_pos].buffer +
		(frame_offset * rb->frame_size);

	if (status != __packet_get_status(po, h.raw))
		return NULL;

	return h.raw;
}

static void *packet_current_frame(struct packet_sock *po,
		struct packet_ring_buffer *rb,
		int status)
{
	return packet_lookup_frame(po, rb, rb->head, status);
}

static u16 vlan_get_tci(const struct sk_buff *skb, struct net_device *dev)
{
	struct vlan_hdr vhdr, *vh;
	unsigned int header_len;

	if (!dev)
		return 0;

	/* In the SOCK_DGRAM scenario, skb data starts at the network
	 * protocol, which is after the VLAN headers. The outer VLAN
	 * header is at the hard_header_len offset in non-variable
	 * length link layer headers. If it's a VLAN device, the
	 * min_header_len should be used to exclude the VLAN header
	 * size.
	 */
	if (dev->min_header_len == dev->hard_header_len)
		header_len = dev->hard_header_len;
	else if (is_vlan_dev(dev))
		header_len = dev->min_header_len;
	else
		return 0;

	vh = skb_header_pointer(skb, skb_mac_offset(skb) + header_len,
				sizeof(vhdr), &vhdr);
	if (unlikely(!vh))
		return 0;

	return ntohs(vh->h_vlan_TCI);
}

static __be16 vlan_get_protocol_dgram(const struct sk_buff *skb)
{
	__be16 proto = skb->protocol;

	if (unlikely(eth_type_vlan(proto)))
		proto = __vlan_get_protocol_offset(skb, proto,
						   skb_mac_offset(skb), NULL);

	return proto;
}

static void prb_del_retire_blk_timer(struct tpacket_kbdq_core *pkc)
{
	timer_delete_sync(&pkc->retire_blk_timer);
}

static void prb_shutdown_retire_blk_timer(struct packet_sock *po,
		struct sk_buff_head *rb_queue)
{
	struct tpacket_kbdq_core *pkc;

	pkc = GET_PBDQC_FROM_RB(&po->rx_ring);

	spin_lock_bh(&rb_queue->lock);
	pkc->delete_blk_timer = 1;
	spin_unlock_bh(&rb_queue->lock);

	prb_del_retire_blk_timer(pkc);
}

static void prb_setup_retire_blk_timer(struct packet_sock *po)
{
	struct tpacket_kbdq_core *pkc;

	pkc = GET_PBDQC_FROM_RB(&po->rx_ring);
	timer_setup(&pkc->retire_blk_timer, prb_retire_rx_blk_timer_expired,
		    0);
	pkc->retire_blk_timer.expires = jiffies;
}

static int prb_calc_retire_blk_tmo(struct packet_sock *po,
				int blk_size_in_bytes)
{
	struct net_device *dev;
	unsigned int mbits, div;
	struct ethtool_link_ksettings ecmd;
	int err;

	rtnl_lock();
	dev = __dev_get_by_index(sock_net(&po->sk), po->ifindex);
	if (unlikely(!dev)) {
		rtnl_unlock();
		return DEFAULT_PRB_RETIRE_TOV;
	}
	err = __ethtool_get_link_ksettings(dev, &ecmd);
	rtnl_unlock();
	if (err)
		return DEFAULT_PRB_RETIRE_TOV;

	/* If the link speed is so slow you don't really
	 * need to worry about perf anyways
	 */
	if (ecmd.base.speed < SPEED_1000 ||
	    ecmd.base.speed == SPEED_UNKNOWN)
		return DEFAULT_PRB_RETIRE_TOV;

	div = ecmd.base.speed / 1000;
	mbits = (blk_size_in_bytes * 8) / (1024 * 1024);

	if (div)
		mbits /= div;

	if (div)
		return mbits + 1;
	return mbits;
}

static void prb_init_ft_ops(struct tpacket_kbdq_core *p1,
			union tpacket_req_u *req_u)
{
	p1->feature_req_word = req_u->req3.tp_feature_req_word;
}

static void init_prb_bdqc(struct packet_sock *po,
			struct packet_ring_buffer *rb,
			struct pgv *pg_vec,
			union tpacket_req_u *req_u)
{
	struct tpacket_kbdq_core *p1 = GET_PBDQC_FROM_RB(rb);
	struct tpacket_block_desc *pbd;

	memset(p1, 0x0, sizeof(*p1));

	p1->knxt_seq_num = 1;
	p1->pkbdq = pg_vec;
	pbd = (struct tpacket_block_desc *)pg_vec[0].buffer;
	p1->pkblk_start = pg_vec[0].buffer;
	p1->kblk_size = req_u->req3.tp_block_size;
	p1->knum_blocks = req_u->req3.tp_block_nr;
	p1->hdrlen = po->tp_hdrlen;
	p1->version = po->tp_version;
	p1->last_kactive_blk_num = 0;
	po->stats.stats3.tp_freeze_q_cnt = 0;
	if (req_u->req3.tp_retire_blk_tov)
		p1->retire_blk_tov = req_u->req3.tp_retire_blk_tov;
	else
		p1->retire_blk_tov = prb_calc_retire_blk_tmo(po,
						req_u->req3.tp_block_size);
	p1->tov_in_jiffies = msecs_to_jiffies(p1->retire_blk_tov);
	p1->blk_sizeof_priv = req_u->req3.tp_sizeof_priv;
	rwlock_init(&p1->blk_fill_in_prog_lock);

	p1->max_frame_len = p1->kblk_size - BLK_PLUS_PRIV(p1->blk_sizeof_priv);
	prb_init_ft_ops(p1, req_u);
	prb_setup_retire_blk_timer(po);
	prb_open_block(p1, pbd);
}

/*  Do NOT update the last_blk_num first.
 *  Assumes sk_buff_head lock is held.
 */
static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *pkc)
{
	mod_timer(&pkc->retire_blk_timer,
			jiffies + pkc->tov_in_jiffies);
	pkc->last_kactive_blk_num = pkc->kactive_blk_num;
}

/*
 * Timer logic:
 * 1) We refresh the timer only when we open a block.
 *    By doing this we don't waste cycles refreshing the timer
 *    on packet-by-packet basis.
 *
 * With a 1MB block-size, on a 1Gbps line, it will take
 * i) ~8 ms to fill a block + ii) memcpy etc.
 * In this cut we are not accounting for the memcpy time.
 *
 * So, if the user sets the 'tmo' to 10ms then the timer
 * will never fire while the block is still getting filled
 * (which is what we want). However, the user could choose
 * to close a block early and that's fine.
* * But when the timer does fire, we check whether or not to refresh it. * Since the tmo granularity is in msecs, it is not too expensive * to refresh the timer, lets say every '8' msecs. * Either the user can set the 'tmo' or we can derive it based on * a) line-speed and b) block-size. * prb_calc_retire_blk_tmo() calculates the tmo. * */ static void prb_retire_rx_blk_timer_expired(struct timer_list *t) { struct packet_sock *po = from_timer(po, t, rx_ring.prb_bdqc.retire_blk_timer); struct tpacket_kbdq_core *pkc = GET_PBDQC_FROM_RB(&po->rx_ring); unsigned int frozen; struct tpacket_block_desc *pbd; spin_lock(&po->sk.sk_receive_queue.lock); frozen = prb_queue_frozen(pkc); pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc); if (unlikely(pkc->delete_blk_timer)) goto out; /* We only need to plug the race when the block is partially filled. * tpacket_rcv: * lock(); increment BLOCK_NUM_PKTS; unlock() * copy_bits() is in progress ... * timer fires on other cpu: * we can't retire the current block because copy_bits * is in progress. * */ if (BLOCK_NUM_PKTS(pbd)) { /* Waiting for skb_copy_bits to finish... */ write_lock(&pkc->blk_fill_in_prog_lock); write_unlock(&pkc->blk_fill_in_prog_lock); } if (pkc->last_kactive_blk_num == pkc->kactive_blk_num) { if (!frozen) { if (!BLOCK_NUM_PKTS(pbd)) { /* An empty block. Just refresh the timer. */ goto refresh_timer; } prb_retire_current_block(pkc, po, TP_STATUS_BLK_TMO); if (!prb_dispatch_next_block(pkc, po)) goto refresh_timer; else goto out; } else { /* Case 1. Queue was frozen because user-space was * lagging behind. */ if (prb_curr_blk_in_use(pbd)) { /* * Ok, user-space is still behind. * So just refresh the timer. */ goto refresh_timer; } else { /* Case 2. queue was frozen,user-space caught up, * now the link went idle && the timer fired. * We don't have a block to close.So we open this * block and restart the timer. * opening a block thaws the queue,restarts timer * Thawing/timer-refresh is a side effect. 
				 */
				prb_open_block(pkc, pbd);
				goto out;
			}
		}
	}

refresh_timer:
	_prb_refresh_rx_retire_blk_timer(pkc);

out:
	spin_unlock(&po->sk.sk_receive_queue.lock);
}

static void prb_flush_block(struct tpacket_kbdq_core *pkc1,
		struct tpacket_block_desc *pbd1, __u32 status)
{
	/* Flush everything minus the block header */

#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 1
	u8 *start, *end;

	start = (u8 *)pbd1;

	/* Skip the block header(we know header WILL fit in 4K) */
	start += PAGE_SIZE;

	end = (u8 *)PAGE_ALIGN((unsigned long)pkc1->pkblk_end);
	for (; start < end; start += PAGE_SIZE)
		flush_dcache_page(pgv_to_page(start));

	smp_wmb();
#endif

	/* Now update the block status. */

	BLOCK_STATUS(pbd1) = status;

	/* Flush the block header */

#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 1
	start = (u8 *)pbd1;
	flush_dcache_page(pgv_to_page(start));

	smp_wmb();
#endif
}

/*
 * Side effect:
 *
 * 1) flush the block
 * 2) Increment active_blk_num
 *
 * Note:We DONT refresh the timer on purpose.
 *	Because almost always the next block will be opened.
 */
static void prb_close_block(struct tpacket_kbdq_core *pkc1,
		struct tpacket_block_desc *pbd1,
		struct packet_sock *po, unsigned int stat)
{
	__u32 status = TP_STATUS_USER | stat;

	struct tpacket3_hdr *last_pkt;
	struct tpacket_hdr_v1 *h1 = &pbd1->hdr.bh1;
	struct sock *sk = &po->sk;

	if (atomic_read(&po->tp_drops))
		status |= TP_STATUS_LOSING;

	last_pkt = (struct tpacket3_hdr *)pkc1->prev;
	last_pkt->tp_next_offset = 0;

	/* Get the ts of the last pkt */
	if (BLOCK_NUM_PKTS(pbd1)) {
		h1->ts_last_pkt.ts_sec = last_pkt->tp_sec;
		h1->ts_last_pkt.ts_nsec	= last_pkt->tp_nsec;
	} else {
		/* Ok, we tmo'd - so get the current time.
		 *
		 * It shouldn't really happen as we don't close empty
		 * blocks. See prb_retire_rx_blk_timer_expired().
*/ struct timespec64 ts; ktime_get_real_ts64(&ts); h1->ts_last_pkt.ts_sec = ts.tv_sec; h1->ts_last_pkt.ts_nsec = ts.tv_nsec; } smp_wmb(); /* Flush the block */ prb_flush_block(pkc1, pbd1, status); sk->sk_data_ready(sk); pkc1->kactive_blk_num = GET_NEXT_PRB_BLK_NUM(pkc1); } static void prb_thaw_queue(struct tpacket_kbdq_core *pkc) { pkc->reset_pending_on_curr_blk = 0; } /* * Side effect of opening a block: * * 1) prb_queue is thawed. * 2) retire_blk_timer is refreshed. * */ static void prb_open_block(struct tpacket_kbdq_core *pkc1, struct tpacket_block_desc *pbd1) { struct timespec64 ts; struct tpacket_hdr_v1 *h1 = &pbd1->hdr.bh1; smp_rmb(); /* We could have just memset this but we will lose the * flexibility of making the priv area sticky */ BLOCK_SNUM(pbd1) = pkc1->knxt_seq_num++; BLOCK_NUM_PKTS(pbd1) = 0; BLOCK_LEN(pbd1) = BLK_PLUS_PRIV(pkc1->blk_sizeof_priv); ktime_get_real_ts64(&ts); h1->ts_first_pkt.ts_sec = ts.tv_sec; h1->ts_first_pkt.ts_nsec = ts.tv_nsec; pkc1->pkblk_start = (char *)pbd1; pkc1->nxt_offset = pkc1->pkblk_start + BLK_PLUS_PRIV(pkc1->blk_sizeof_priv); BLOCK_O2FP(pbd1) = (__u32)BLK_PLUS_PRIV(pkc1->blk_sizeof_priv); BLOCK_O2PRIV(pbd1) = BLK_HDR_LEN; pbd1->version = pkc1->version; pkc1->prev = pkc1->nxt_offset; pkc1->pkblk_end = pkc1->pkblk_start + pkc1->kblk_size; prb_thaw_queue(pkc1); _prb_refresh_rx_retire_blk_timer(pkc1); smp_wmb(); } /* * Queue freeze logic: * 1) Assume tp_block_nr = 8 blocks. * 2) At time 't0', user opens Rx ring. * 3) Some time past 't0', kernel starts filling blocks starting from 0 .. 7 * 4) user-space is either sleeping or processing block '0'. * 5) tpacket_rcv is currently filling block '7', since there is no space left, * it will close block-7,loop around and try to fill block '0'. * call-flow: * __packet_lookup_frame_in_block * prb_retire_current_block() * prb_dispatch_next_block() * |->(BLOCK_STATUS == USER) evaluates to true * 5.1) Since block-0 is currently in-use, we just freeze the queue. 
 * 6)	Now there are two cases:
 *	6.1) Link goes idle right after the queue is frozen.
 *	     But remember, the last open_block() refreshed the timer.
 *	     When this timer expires,it will refresh itself so that we can
 *	     re-open block-0 in near future.
 *	6.2) Link is busy and keeps on receiving packets. This is a simple
 *	     case and __packet_lookup_frame_in_block will check if block-0
 *	     is free and can now be re-used.
 */
static void prb_freeze_queue(struct tpacket_kbdq_core *pkc,
				  struct packet_sock *po)
{
	pkc->reset_pending_on_curr_blk = 1;
	po->stats.stats3.tp_freeze_q_cnt++;
}

#define TOTAL_PKT_LEN_INCL_ALIGN(length) (ALIGN((length), V3_ALIGNMENT))

/*
 * If the next block is free then we will dispatch it
 * and return a good offset.
 * Else, we will freeze the queue.
 * So, caller must check the return value.
 */
static void *prb_dispatch_next_block(struct tpacket_kbdq_core *pkc,
		struct packet_sock *po)
{
	struct tpacket_block_desc *pbd;

	smp_rmb();

	/* 1. Get current block num */
	pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);

	/* 2. If this block is currently in_use then freeze the queue */
	if (TP_STATUS_USER & BLOCK_STATUS(pbd)) {
		prb_freeze_queue(pkc, po);
		return NULL;
	}

	/*
	 * 3.
	 * open this block and return the offset where the first packet
	 * needs to get stored.
	 */
	prb_open_block(pkc, pbd);
	return (void *)pkc->nxt_offset;
}

static void prb_retire_current_block(struct tpacket_kbdq_core *pkc,
		struct packet_sock *po, unsigned int status)
{
	struct tpacket_block_desc *pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);

	/* retire/close the current block */
	if (likely(TP_STATUS_KERNEL == BLOCK_STATUS(pbd))) {
		/*
		 * Plug the case where copy_bits() is in progress on
		 * cpu-0 and tpacket_rcv() got invoked on cpu-1, didn't
		 * have space to copy the pkt in the current block and
		 * called prb_retire_current_block()
		 *
		 * We don't need to worry about the TMO case because
		 * the timer-handler already handled this case.
		 */
		if (!(status & TP_STATUS_BLK_TMO)) {
			/* Waiting for skb_copy_bits to finish...
*/ write_lock(&pkc->blk_fill_in_prog_lock); write_unlock(&pkc->blk_fill_in_prog_lock); } prb_close_block(pkc, pbd, po, status); return; } } static int prb_curr_blk_in_use(struct tpacket_block_desc *pbd) { return TP_STATUS_USER & BLOCK_STATUS(pbd); } static int prb_queue_frozen(struct tpacket_kbdq_core *pkc) { return pkc->reset_pending_on_curr_blk; } static void prb_clear_blk_fill_status(struct packet_ring_buffer *rb) __releases(&pkc->blk_fill_in_prog_lock) { struct tpacket_kbdq_core *pkc = GET_PBDQC_FROM_RB(rb); read_unlock(&pkc->blk_fill_in_prog_lock); } static void prb_fill_rxhash(struct tpacket_kbdq_core *pkc, struct tpacket3_hdr *ppd) { ppd->hv1.tp_rxhash = skb_get_hash(pkc->skb); } static void prb_clear_rxhash(struct tpacket_kbdq_core *pkc, struct tpacket3_hdr *ppd) { ppd->hv1.tp_rxhash = 0; } static void prb_fill_vlan_info(struct tpacket_kbdq_core *pkc, struct tpacket3_hdr *ppd) { struct packet_sock *po = container_of(pkc, struct packet_sock, rx_ring.prb_bdqc); if (skb_vlan_tag_present(pkc->skb)) { ppd->hv1.tp_vlan_tci = skb_vlan_tag_get(pkc->skb); ppd->hv1.tp_vlan_tpid = ntohs(pkc->skb->vlan_proto); ppd->tp_status = TP_STATUS_VLAN_VALID | TP_STATUS_VLAN_TPID_VALID; } else if (unlikely(po->sk.sk_type == SOCK_DGRAM && eth_type_vlan(pkc->skb->protocol))) { ppd->hv1.tp_vlan_tci = vlan_get_tci(pkc->skb, pkc->skb->dev); ppd->hv1.tp_vlan_tpid = ntohs(pkc->skb->protocol); ppd->tp_status = TP_STATUS_VLAN_VALID | TP_STATUS_VLAN_TPID_VALID; } else { ppd->hv1.tp_vlan_tci = 0; ppd->hv1.tp_vlan_tpid = 0; ppd->tp_status = TP_STATUS_AVAILABLE; } } static void prb_run_all_ft_ops(struct tpacket_kbdq_core *pkc, struct tpacket3_hdr *ppd) { ppd->hv1.tp_padding = 0; prb_fill_vlan_info(pkc, ppd); if (pkc->feature_req_word & TP_FT_REQ_FILL_RXHASH) prb_fill_rxhash(pkc, ppd); else prb_clear_rxhash(pkc, ppd); } static void prb_fill_curr_block(char *curr, struct tpacket_kbdq_core *pkc, struct tpacket_block_desc *pbd, unsigned int len) __acquires(&pkc->blk_fill_in_prog_lock) { struct 
tpacket3_hdr *ppd; ppd = (struct tpacket3_hdr *)curr; ppd->tp_next_offset = TOTAL_PKT_LEN_INCL_ALIGN(len); pkc->prev = curr; pkc->nxt_offset += TOTAL_PKT_LEN_INCL_ALIGN(len); BLOCK_LEN(pbd) += TOTAL_PKT_LEN_INCL_ALIGN(len); BLOCK_NUM_PKTS(pbd) += 1; read_lock(&pkc->blk_fill_in_prog_lock); prb_run_all_ft_ops(pkc, ppd); } /* Assumes caller has the sk->rx_queue.lock */ static void *__packet_lookup_frame_in_block(struct packet_sock *po, struct sk_buff *skb, unsigned int len ) { struct tpacket_kbdq_core *pkc; struct tpacket_block_desc *pbd; char *curr, *end; pkc = GET_PBDQC_FROM_RB(&po->rx_ring); pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc); /* Queue is frozen when user space is lagging behind */ if (prb_queue_frozen(pkc)) { /* * Check if that last block which caused the queue to freeze, * is still in_use by user-space. */ if (prb_curr_blk_in_use(pbd)) { /* Can't record this packet */ return NULL; } else { /* * Ok, the block was released by user-space. * Now let's open that block. * opening a block also thaws the queue. * Thawing is a side effect. */ prb_open_block(pkc, pbd); } } smp_mb(); curr = pkc->nxt_offset; pkc->skb = skb; end = (char *)pbd + pkc->kblk_size; /* first try the current block */ if (curr+TOTAL_PKT_LEN_INCL_ALIGN(len) < end) { prb_fill_curr_block(curr, pkc, pbd, len); return (void *)curr; } /* Ok, close the current block */ prb_retire_current_block(pkc, po, 0); /* Now, try to dispatch the next block */ curr = (char *)prb_dispatch_next_block(pkc, po); if (curr) { pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc); prb_fill_curr_block(curr, pkc, pbd, len); return (void *)curr; } /* * No free blocks are available.user_space hasn't caught up yet. * Queue was just frozen and now this packet will get dropped. 
*/ return NULL; } static void *packet_current_rx_frame(struct packet_sock *po, struct sk_buff *skb, int status, unsigned int len) { char *curr = NULL; switch (po->tp_version) { case TPACKET_V1: case TPACKET_V2: curr = packet_lookup_frame(po, &po->rx_ring, po->rx_ring.head, status); return curr; case TPACKET_V3: return __packet_lookup_frame_in_block(po, skb, len); default: WARN(1, "TPACKET version not supported\n"); BUG(); return NULL; } } static void *prb_lookup_block(const struct packet_sock *po, const struct packet_ring_buffer *rb, unsigned int idx, int status) { struct tpacket_kbdq_core *pkc = GET_PBDQC_FROM_RB(rb); struct tpacket_block_desc *pbd = GET_PBLOCK_DESC(pkc, idx); if (status != BLOCK_STATUS(pbd)) return NULL; return pbd; } static int prb_previous_blk_num(struct packet_ring_buffer *rb) { unsigned int prev; if (rb->prb_bdqc.kactive_blk_num) prev = rb->prb_bdqc.kactive_blk_num-1; else prev = rb->prb_bdqc.knum_blocks-1; return prev; } /* Assumes caller has held the rx_queue.lock */ static void *__prb_previous_block(struct packet_sock *po, struct packet_ring_buffer *rb, int status) { unsigned int previous = prb_previous_blk_num(rb); return prb_lookup_block(po, rb, previous, status); } static void *packet_previous_rx_frame(struct packet_sock *po, struct packet_ring_buffer *rb, int status) { if (po->tp_version <= TPACKET_V2) return packet_previous_frame(po, rb, status); return __prb_previous_block(po, rb, status); } static void packet_increment_rx_head(struct packet_sock *po, struct packet_ring_buffer *rb) { switch (po->tp_version) { case TPACKET_V1: case TPACKET_V2: return packet_increment_head(rb); case TPACKET_V3: default: WARN(1, "TPACKET version not supported.\n"); BUG(); return; } } static void *packet_previous_frame(struct packet_sock *po, struct packet_ring_buffer *rb, int status) { unsigned int previous = rb->head ? 
		rb->head - 1 : rb->frame_max;
	return packet_lookup_frame(po, rb, previous, status);
}

static void packet_increment_head(struct packet_ring_buffer *buff)
{
	buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
}

static void packet_inc_pending(struct packet_ring_buffer *rb)
{
	this_cpu_inc(*rb->pending_refcnt);
}

static void packet_dec_pending(struct packet_ring_buffer *rb)
{
	this_cpu_dec(*rb->pending_refcnt);
}

static unsigned int packet_read_pending(const struct packet_ring_buffer *rb)
{
	unsigned int refcnt = 0;
	int cpu;

	/* We don't use pending refcount in rx_ring. */
	if (rb->pending_refcnt == NULL)
		return 0;

	for_each_possible_cpu(cpu)
		refcnt += *per_cpu_ptr(rb->pending_refcnt, cpu);

	return refcnt;
}

static int packet_alloc_pending(struct packet_sock *po)
{
	po->rx_ring.pending_refcnt = NULL;

	po->tx_ring.pending_refcnt = alloc_percpu(unsigned int);
	if (unlikely(po->tx_ring.pending_refcnt == NULL))
		return -ENOBUFS;

	return 0;
}

static void packet_free_pending(struct packet_sock *po)
{
	free_percpu(po->tx_ring.pending_refcnt);
}

#define ROOM_POW_OFF	2
#define ROOM_NONE	0x0
#define ROOM_LOW	0x1
#define ROOM_NORMAL	0x2

static bool __tpacket_has_room(const struct packet_sock *po, int pow_off)
{
	int idx, len;

	len = READ_ONCE(po->rx_ring.frame_max) + 1;
	idx = READ_ONCE(po->rx_ring.head);
	if (pow_off)
		idx += len >> pow_off;
	if (idx >= len)
		idx -= len;
	return packet_lookup_frame(po, &po->rx_ring, idx, TP_STATUS_KERNEL);
}

static bool __tpacket_v3_has_room(const struct packet_sock *po, int pow_off)
{
	int idx, len;

	len = READ_ONCE(po->rx_ring.prb_bdqc.knum_blocks);
	idx = READ_ONCE(po->rx_ring.prb_bdqc.kactive_blk_num);
	if (pow_off)
		idx += len >> pow_off;
	if (idx >= len)
		idx -= len;
	return prb_lookup_block(po, &po->rx_ring, idx, TP_STATUS_KERNEL);
}

static int __packet_rcv_has_room(const struct packet_sock *po,
				 const struct sk_buff *skb)
{
	const struct sock *sk = &po->sk;
	int ret = ROOM_NONE;

	if (po->prot_hook.func != tpacket_rcv) {
		int rcvbuf = READ_ONCE(sk->sk_rcvbuf);
int avail = rcvbuf - atomic_read(&sk->sk_rmem_alloc) - (skb ? skb->truesize : 0); if (avail > (rcvbuf >> ROOM_POW_OFF)) return ROOM_NORMAL; else if (avail > 0) return ROOM_LOW; else return ROOM_NONE; } if (po->tp_version == TPACKET_V3) { if (__tpacket_v3_has_room(po, ROOM_POW_OFF)) ret = ROOM_NORMAL; else if (__tpacket_v3_has_room(po, 0)) ret = ROOM_LOW; } else { if (__tpacket_has_room(po, ROOM_POW_OFF)) ret = ROOM_NORMAL; else if (__tpacket_has_room(po, 0)) ret = ROOM_LOW; } return ret; } static int packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb) { bool pressure; int ret; ret = __packet_rcv_has_room(po, skb); pressure = ret != ROOM_NORMAL; if (packet_sock_flag(po, PACKET_SOCK_PRESSURE) != pressure) packet_sock_flag_set(po, PACKET_SOCK_PRESSURE, pressure); return ret; } static void packet_rcv_try_clear_pressure(struct packet_sock *po) { if (packet_sock_flag(po, PACKET_SOCK_PRESSURE) && __packet_rcv_has_room(po, NULL) == ROOM_NORMAL) packet_sock_flag_set(po, PACKET_SOCK_PRESSURE, false); } static void packet_sock_destruct(struct sock *sk) { skb_queue_purge(&sk->sk_error_queue); WARN_ON(atomic_read(&sk->sk_rmem_alloc)); WARN_ON(refcount_read(&sk->sk_wmem_alloc)); if (!sock_flag(sk, SOCK_DEAD)) { pr_err("Attempt to release alive packet socket: %p\n", sk); return; } } static bool fanout_flow_is_huge(struct packet_sock *po, struct sk_buff *skb) { u32 *history = po->rollover->history; u32 victim, rxhash; int i, count = 0; rxhash = skb_get_hash(skb); for (i = 0; i < ROLLOVER_HLEN; i++) if (READ_ONCE(history[i]) == rxhash) count++; victim = get_random_u32_below(ROLLOVER_HLEN); /* Avoid dirtying the cache line if possible */ if (READ_ONCE(history[victim]) != rxhash) WRITE_ONCE(history[victim], rxhash); return count > (ROLLOVER_HLEN >> 1); } static unsigned int fanout_demux_hash(struct packet_fanout *f, struct sk_buff *skb, unsigned int num) { return reciprocal_scale(__skb_get_hash_symmetric(skb), num); } static unsigned int fanout_demux_lb(struct 
packet_fanout *f, struct sk_buff *skb, unsigned int num) { unsigned int val = atomic_inc_return(&f->rr_cur); return val % num; } static unsigned int fanout_demux_cpu(struct packet_fanout *f, struct sk_buff *skb, unsigned int num) { return smp_processor_id() % num; } static unsigned int fanout_demux_rnd(struct packet_fanout *f, struct sk_buff *skb, unsigned int num) { return get_random_u32_below(num); } static unsigned int fanout_demux_rollover(struct packet_fanout *f, struct sk_buff *skb, unsigned int idx, bool try_self, unsigned int num) { struct packet_sock *po, *po_next, *po_skip = NULL; unsigned int i, j, room = ROOM_NONE; po = pkt_sk(rcu_dereference(f->arr[idx])); if (try_self) { room = packet_rcv_has_room(po, skb); if (room == ROOM_NORMAL || (room == ROOM_LOW && !fanout_flow_is_huge(po, skb))) return idx; po_skip = po; } i = j = min_t(int, po->rollover->sock, num - 1); do { po_next = pkt_sk(rcu_dereference(f->arr[i])); if (po_next != po_skip && !packet_sock_flag(po_next, PACKET_SOCK_PRESSURE) && packet_rcv_has_room(po_next, skb) == ROOM_NORMAL) { if (i != j) po->rollover->sock = i; atomic_long_inc(&po->rollover->num); if (room == ROOM_LOW) atomic_long_inc(&po->rollover->num_huge); return i; } if (++i == num) i = 0; } while (i != j); atomic_long_inc(&po->rollover->num_failed); return idx; } static unsigned int fanout_demux_qm(struct packet_fanout *f, struct sk_buff *skb, unsigned int num) { return skb_get_queue_mapping(skb) % num; } static unsigned int fanout_demux_bpf(struct packet_fanout *f, struct sk_buff *skb, unsigned int num) { struct bpf_prog *prog; unsigned int ret = 0; rcu_read_lock(); prog = rcu_dereference(f->bpf_prog); if (prog) ret = bpf_prog_run_clear_cb(prog, skb) % num; rcu_read_unlock(); return ret; } static bool fanout_has_flag(struct packet_fanout *f, u16 flag) { return f->flags & (flag >> 8); } static int packet_rcv_fanout(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) { struct 
packet_fanout *f = pt->af_packet_priv; unsigned int num = READ_ONCE(f->num_members); struct net *net = read_pnet(&f->net); struct packet_sock *po; unsigned int idx; if (!net_eq(dev_net(dev), net) || !num) { kfree_skb(skb); return 0; } if (fanout_has_flag(f, PACKET_FANOUT_FLAG_DEFRAG)) { skb = ip_check_defrag(net, skb, IP_DEFRAG_AF_PACKET); if (!skb) return 0; } switch (f->type) { case PACKET_FANOUT_HASH: default: idx = fanout_demux_hash(f, skb, num); break; case PACKET_FANOUT_LB: idx = fanout_demux_lb(f, skb, num); break; case PACKET_FANOUT_CPU: idx = fanout_demux_cpu(f, skb, num); break; case PACKET_FANOUT_RND: idx = fanout_demux_rnd(f, skb, num); break; case PACKET_FANOUT_QM: idx = fanout_demux_qm(f, skb, num); break; case PACKET_FANOUT_ROLLOVER: idx = fanout_demux_rollover(f, skb, 0, false, num); break; case PACKET_FANOUT_CBPF: case PACKET_FANOUT_EBPF: idx = fanout_demux_bpf(f, skb, num); break; } if (fanout_has_flag(f, PACKET_FANOUT_FLAG_ROLLOVER)) idx = fanout_demux_rollover(f, skb, idx, true, num); po = pkt_sk(rcu_dereference(f->arr[idx])); return po->prot_hook.func(skb, dev, &po->prot_hook, orig_dev); } DEFINE_MUTEX(fanout_mutex); EXPORT_SYMBOL_GPL(fanout_mutex); static LIST_HEAD(fanout_list); static u16 fanout_next_id; static void __fanout_link(struct sock *sk, struct packet_sock *po) { struct packet_fanout *f = po->fanout; spin_lock(&f->lock); rcu_assign_pointer(f->arr[f->num_members], sk); smp_wmb(); f->num_members++; if (f->num_members == 1) dev_add_pack(&f->prot_hook); spin_unlock(&f->lock); } static void __fanout_unlink(struct sock *sk, struct packet_sock *po) { struct packet_fanout *f = po->fanout; int i; spin_lock(&f->lock); for (i = 0; i < f->num_members; i++) { if (rcu_dereference_protected(f->arr[i], lockdep_is_held(&f->lock)) == sk) break; } BUG_ON(i >= f->num_members); rcu_assign_pointer(f->arr[i], rcu_dereference_protected(f->arr[f->num_members - 1], lockdep_is_held(&f->lock))); f->num_members--; if (f->num_members == 0) 
__dev_remove_pack(&f->prot_hook); spin_unlock(&f->lock); } static bool match_fanout_group(struct packet_type *ptype, struct sock *sk) { if (sk->sk_family != PF_PACKET) return false; return ptype->af_packet_priv == pkt_sk(sk)->fanout; } static void fanout_init_data(struct packet_fanout *f) { switch (f->type) { case PACKET_FANOUT_LB: atomic_set(&f->rr_cur, 0); break; case PACKET_FANOUT_CBPF: case PACKET_FANOUT_EBPF: RCU_INIT_POINTER(f->bpf_prog, NULL); break; } } static void __fanout_set_data_bpf(struct packet_fanout *f, struct bpf_prog *new) { struct bpf_prog *old; spin_lock(&f->lock); old = rcu_dereference_protected(f->bpf_prog, lockdep_is_held(&f->lock)); rcu_assign_pointer(f->bpf_prog, new); spin_unlock(&f->lock); if (old) { synchronize_net(); bpf_prog_destroy(old); } } static int fanout_set_data_cbpf(struct packet_sock *po, sockptr_t data, unsigned int len) { struct bpf_prog *new; struct sock_fprog fprog; int ret; if (sock_flag(&po->sk, SOCK_FILTER_LOCKED)) return -EPERM; ret = copy_bpf_fprog_from_user(&fprog, data, len); if (ret) return ret; ret = bpf_prog_create_from_user(&new, &fprog, NULL, false); if (ret) return ret; __fanout_set_data_bpf(po->fanout, new); return 0; } static int fanout_set_data_ebpf(struct packet_sock *po, sockptr_t data, unsigned int len) { struct bpf_prog *new; u32 fd; if (sock_flag(&po->sk, SOCK_FILTER_LOCKED)) return -EPERM; if (len != sizeof(fd)) return -EINVAL; if (copy_from_sockptr(&fd, data, len)) return -EFAULT; new = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER); if (IS_ERR(new)) return PTR_ERR(new); __fanout_set_data_bpf(po->fanout, new); return 0; } static int fanout_set_data(struct packet_sock *po, sockptr_t data, unsigned int len) { switch (po->fanout->type) { case PACKET_FANOUT_CBPF: return fanout_set_data_cbpf(po, data, len); case PACKET_FANOUT_EBPF: return fanout_set_data_ebpf(po, data, len); default: return -EINVAL; } } static void fanout_release_data(struct packet_fanout *f) { switch (f->type) { case 
PACKET_FANOUT_CBPF: case PACKET_FANOUT_EBPF: __fanout_set_data_bpf(f, NULL); } } static bool __fanout_id_is_free(struct sock *sk, u16 candidate_id) { struct packet_fanout *f; list_for_each_entry(f, &fanout_list, list) { if (f->id == candidate_id && read_pnet(&f->net) == sock_net(sk)) { return false; } } return true; } static bool fanout_find_new_id(struct sock *sk, u16 *new_id) { u16 id = fanout_next_id; do { if (__fanout_id_is_free(sk, id)) { *new_id = id; fanout_next_id = id + 1; return true; } id++; } while (id != fanout_next_id); return false; } static int fanout_add(struct sock *sk, struct fanout_args *args) { struct packet_rollover *rollover = NULL; struct packet_sock *po = pkt_sk(sk); u16 type_flags = args->type_flags; struct packet_fanout *f, *match; u8 type = type_flags & 0xff; u8 flags = type_flags >> 8; u16 id = args->id; int err; switch (type) { case PACKET_FANOUT_ROLLOVER: if (type_flags & PACKET_FANOUT_FLAG_ROLLOVER) return -EINVAL; break; case PACKET_FANOUT_HASH: case PACKET_FANOUT_LB: case PACKET_FANOUT_CPU: case PACKET_FANOUT_RND: case PACKET_FANOUT_QM: case PACKET_FANOUT_CBPF: case PACKET_FANOUT_EBPF: break; default: return -EINVAL; } mutex_lock(&fanout_mutex); err = -EALREADY; if (po->fanout) goto out; if (type == PACKET_FANOUT_ROLLOVER || (type_flags & PACKET_FANOUT_FLAG_ROLLOVER)) { err = -ENOMEM; rollover = kzalloc(sizeof(*rollover), GFP_KERNEL); if (!rollover) goto out; atomic_long_set(&rollover->num, 0); atomic_long_set(&rollover->num_huge, 0); atomic_long_set(&rollover->num_failed, 0); } if (type_flags & PACKET_FANOUT_FLAG_UNIQUEID) { if (id != 0) { err = -EINVAL; goto out; } if (!fanout_find_new_id(sk, &id)) { err = -ENOMEM; goto out; } /* ephemeral flag for the first socket in the group: drop it */ flags &= ~(PACKET_FANOUT_FLAG_UNIQUEID >> 8); } match = NULL; list_for_each_entry(f, &fanout_list, list) { if (f->id == id && read_pnet(&f->net) == sock_net(sk)) { match = f; break; } } err = -EINVAL; if (match) { if (match->flags != flags) 
goto out; if (args->max_num_members && args->max_num_members != match->max_num_members) goto out; } else { if (args->max_num_members > PACKET_FANOUT_MAX) goto out; if (!args->max_num_members) /* legacy PACKET_FANOUT_MAX */ args->max_num_members = 256; err = -ENOMEM; match = kvzalloc(struct_size(match, arr, args->max_num_members), GFP_KERNEL); if (!match) goto out; write_pnet(&match->net, sock_net(sk)); match->id = id; match->type = type; match->flags = flags; INIT_LIST_HEAD(&match->list); spin_lock_init(&match->lock); refcount_set(&match->sk_ref, 0); fanout_init_data(match); match->prot_hook.type = po->prot_hook.type; match->prot_hook.dev = po->prot_hook.dev; match->prot_hook.func = packet_rcv_fanout; match->prot_hook.af_packet_priv = match; match->prot_hook.af_packet_net = read_pnet(&match->net); match->prot_hook.id_match = match_fanout_group; match->max_num_members = args->max_num_members; match->prot_hook.ignore_outgoing = type_flags & PACKET_FANOUT_FLAG_IGNORE_OUTGOING; list_add(&match->list, &fanout_list); } err = -EINVAL; spin_lock(&po->bind_lock); if (po->num && match->type == type && match->prot_hook.type == po->prot_hook.type && match->prot_hook.dev == po->prot_hook.dev) { err = -ENOSPC; if (refcount_read(&match->sk_ref) < match->max_num_members) { /* Paired with packet_setsockopt(PACKET_FANOUT_DATA) */ WRITE_ONCE(po->fanout, match); po->rollover = rollover; rollover = NULL; refcount_set(&match->sk_ref, refcount_read(&match->sk_ref) + 1); if (packet_sock_flag(po, PACKET_SOCK_RUNNING)) { __dev_remove_pack(&po->prot_hook); __fanout_link(sk, po); } err = 0; } } spin_unlock(&po->bind_lock); if (err && !refcount_read(&match->sk_ref)) { list_del(&match->list); kvfree(match); } out: kfree(rollover); mutex_unlock(&fanout_mutex); return err; } /* If pkt_sk(sk)->fanout->sk_ref is zero, this function removes * pkt_sk(sk)->fanout from fanout_list and returns pkt_sk(sk)->fanout. 
* It is the responsibility of the caller to call fanout_release_data() and * free the returned packet_fanout (after synchronize_net()) */ static struct packet_fanout *fanout_release(struct sock *sk) { struct packet_sock *po = pkt_sk(sk); struct packet_fanout *f; mutex_lock(&fanout_mutex); f = po->fanout; if (f) { po->fanout = NULL; if (refcount_dec_and_test(&f->sk_ref)) list_del(&f->list); else f = NULL; } mutex_unlock(&fanout_mutex); return f; } static bool packet_extra_vlan_len_allowed(const struct net_device *dev, struct sk_buff *skb) { /* Earlier code assumed this would be a VLAN pkt, double-check * this now that we have the actual packet in hand. We can only * do this check on Ethernet devices. */ if (unlikely(dev->type != ARPHRD_ETHER)) return false; skb_reset_mac_header(skb); return likely(eth_hdr(skb)->h_proto == htons(ETH_P_8021Q)); } static const struct proto_ops packet_ops; static const struct proto_ops packet_ops_spkt; static int packet_rcv_spkt(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) { struct sock *sk; struct sockaddr_pkt *spkt; /* * When we registered the protocol we saved the socket in the data * field for just this event. */ sk = pt->af_packet_priv; /* * Yank back the headers [hope the device set this * right or kerboom...] * * Incoming packets have ll header pulled, * push it back. * * For outgoing ones skb->data == skb_mac_header(skb) * so that this procedure is noop. */ if (skb->pkt_type == PACKET_LOOPBACK) goto out; if (!net_eq(dev_net(dev), sock_net(sk))) goto out; skb = skb_share_check(skb, GFP_ATOMIC); if (skb == NULL) goto oom; /* drop any routing info */ skb_dst_drop(skb); /* drop conntrack reference */ nf_reset_ct(skb); spkt = &PACKET_SKB_CB(skb)->sa.pkt; skb_push(skb, skb->data - skb_mac_header(skb)); /* * The SOCK_PACKET socket receives _all_ frames. 
*/ spkt->spkt_family = dev->type; strscpy(spkt->spkt_device, dev->name, sizeof(spkt->spkt_device)); spkt->spkt_protocol = skb->protocol; /* * Charge the memory to the socket. This is done specifically * to prevent sockets using all the memory up. */ if (sock_queue_rcv_skb(sk, skb) == 0) return 0; out: kfree_skb(skb); oom: return 0; } static void packet_parse_headers(struct sk_buff *skb, struct socket *sock) { int depth; if ((!skb->protocol || skb->protocol == htons(ETH_P_ALL)) && sock->type == SOCK_RAW) { skb_reset_mac_header(skb); skb->protocol = dev_parse_header_protocol(skb); } /* Move network header to the right position for VLAN tagged packets */ if (likely(skb->dev->type == ARPHRD_ETHER) && eth_type_vlan(skb->protocol) && vlan_get_protocol_and_depth(skb, skb->protocol, &depth) != 0) skb_set_network_header(skb, depth); skb_probe_transport_header(skb); } /* * Output a raw packet to a device layer. This bypasses all the other * protocol layers and you must therefore supply it with a complete frame */ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg, size_t len) { struct sock *sk = sock->sk; DECLARE_SOCKADDR(struct sockaddr_pkt *, saddr, msg->msg_name); struct sk_buff *skb = NULL; struct net_device *dev; struct sockcm_cookie sockc; __be16 proto = 0; int err; int extra_len = 0; /* * Get and verify the address. */ if (saddr) { if (msg->msg_namelen < sizeof(struct sockaddr)) return -EINVAL; if (msg->msg_namelen == sizeof(struct sockaddr_pkt)) proto = saddr->spkt_protocol; } else return -ENOTCONN; /* SOCK_PACKET must be sent giving an address */ /* * Find the device first to size check it */ saddr->spkt_device[sizeof(saddr->spkt_device) - 1] = 0; retry: rcu_read_lock(); dev = dev_get_by_name_rcu(sock_net(sk), saddr->spkt_device); err = -ENODEV; if (dev == NULL) goto out_unlock; err = -ENETDOWN; if (!(dev->flags & IFF_UP)) goto out_unlock; /* * You may not queue a frame bigger than the mtu. 
This is the lowest level * raw protocol and you must do your own fragmentation at this level. */ if (unlikely(sock_flag(sk, SOCK_NOFCS))) { if (!netif_supports_nofcs(dev)) { err = -EPROTONOSUPPORT; goto out_unlock; } extra_len = 4; /* We're doing our own CRC */ } err = -EMSGSIZE; if (len > dev->mtu + dev->hard_header_len + VLAN_HLEN + extra_len) goto out_unlock; if (!skb) { size_t reserved = LL_RESERVED_SPACE(dev); int tlen = dev->needed_tailroom; unsigned int hhlen = dev->header_ops ? dev->hard_header_len : 0; rcu_read_unlock(); skb = sock_wmalloc(sk, len + reserved + tlen, 0, GFP_KERNEL); if (skb == NULL) return -ENOBUFS; /* FIXME: Save some space for broken drivers that write a hard * header at transmission time by themselves. PPP is the notable * one here. This should really be fixed at the driver level. */ skb_reserve(skb, reserved); skb_reset_network_header(skb); /* Try to align data part correctly */ if (hhlen) { skb->data -= hhlen; skb->tail -= hhlen; if (len < hhlen) skb_reset_network_header(skb); } err = memcpy_from_msg(skb_put(skb, len), msg, len); if (err) goto out_free; goto retry; } if (!dev_validate_header(dev, skb->data, len) || !skb->len) { err = -EINVAL; goto out_unlock; } if (len > (dev->mtu + dev->hard_header_len + extra_len) && !packet_extra_vlan_len_allowed(dev, skb)) { err = -EMSGSIZE; goto out_unlock; } sockcm_init(&sockc, sk); if (msg->msg_controllen) { err = sock_cmsg_send(sk, msg, &sockc); if (unlikely(err)) goto out_unlock; } skb->protocol = proto; skb->dev = dev; skb->priority = sockc.priority; skb->mark = sockc.mark; skb_set_delivery_type_by_clockid(skb, sockc.transmit_time, sk->sk_clockid); skb_setup_tx_timestamp(skb, &sockc); if (unlikely(extra_len == 4)) skb->no_fcs = 1; packet_parse_headers(skb, sock); dev_queue_xmit(skb); rcu_read_unlock(); return len; out_unlock: rcu_read_unlock(); out_free: kfree_skb(skb); return err; } static unsigned int run_filter(struct sk_buff *skb, const struct sock *sk, unsigned int res) { struct 
sk_filter *filter; rcu_read_lock(); filter = rcu_dereference(sk->sk_filter); if (filter != NULL) res = bpf_prog_run_clear_cb(filter->prog, skb); rcu_read_unlock(); return res; } static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb, size_t *len, int vnet_hdr_sz) { struct virtio_net_hdr_mrg_rxbuf vnet_hdr = { .num_buffers = 0 }; if (*len < vnet_hdr_sz) return -EINVAL; *len -= vnet_hdr_sz; if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)&vnet_hdr, vio_le(), true, 0)) return -EINVAL; return memcpy_to_msg(msg, (void *)&vnet_hdr, vnet_hdr_sz); } /* * This function makes lazy skb cloning in hope that most of packets * are discarded by BPF. * * Note tricky part: we DO mangle shared skb! skb->data, skb->len * and skb->cb are mangled. It works because (and until) packets * falling here are owned by current CPU. Output packets are cloned * by dev_queue_xmit_nit(), input packets are processed by net_bh * sequentially, so that if we return skb to original state on exit, * we will not harm anyone. */ static int packet_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) { enum skb_drop_reason drop_reason = SKB_CONSUMED; struct sock *sk = NULL; struct sockaddr_ll *sll; struct packet_sock *po; u8 *skb_head = skb->data; int skb_len = skb->len; unsigned int snaplen, res; if (skb->pkt_type == PACKET_LOOPBACK) goto drop; sk = pt->af_packet_priv; po = pkt_sk(sk); if (!net_eq(dev_net(dev), sock_net(sk))) goto drop; skb->dev = dev; if (dev_has_header(dev)) { /* The device has an explicit notion of ll header, * exported to higher levels. * * Otherwise, the device hides details of its frame * structure, so that corresponding packet head is * never delivered to user. 
*/ if (sk->sk_type != SOCK_DGRAM) skb_push(skb, skb->data - skb_mac_header(skb)); else if (skb->pkt_type == PACKET_OUTGOING) { /* Special case: outgoing packets have ll header at head */ skb_pull(skb, skb_network_offset(skb)); } } snaplen = skb_frags_readable(skb) ? skb->len : skb_headlen(skb); res = run_filter(skb, sk, snaplen); if (!res) goto drop_n_restore; if (snaplen > res) snaplen = res; if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf) goto drop_n_acct; if (skb_shared(skb)) { struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC); if (nskb == NULL) goto drop_n_acct; if (skb_head != skb->data) { skb->data = skb_head; skb->len = skb_len; } consume_skb(skb); skb = nskb; } sock_skb_cb_check_size(sizeof(*PACKET_SKB_CB(skb)) + MAX_ADDR_LEN - 8); sll = &PACKET_SKB_CB(skb)->sa.ll; sll->sll_hatype = dev->type; sll->sll_pkttype = skb->pkt_type; if (unlikely(packet_sock_flag(po, PACKET_SOCK_ORIGDEV))) sll->sll_ifindex = orig_dev->ifindex; else sll->sll_ifindex = dev->ifindex; sll->sll_halen = dev_parse_header(skb, sll->sll_addr); /* sll->sll_family and sll->sll_protocol are set in packet_recvmsg(). * Use their space for storing the original skb length. 
*/ PACKET_SKB_CB(skb)->sa.origlen = skb->len; if (pskb_trim(skb, snaplen)) goto drop_n_acct; skb_set_owner_r(skb, sk); skb->dev = NULL; skb_dst_drop(skb); /* drop conntrack reference */ nf_reset_ct(skb); spin_lock(&sk->sk_receive_queue.lock); po->stats.stats1.tp_packets++; sock_skb_set_dropcount(sk, skb); skb_clear_delivery_time(skb); __skb_queue_tail(&sk->sk_receive_queue, skb); spin_unlock(&sk->sk_receive_queue.lock); sk->sk_data_ready(sk); return 0; drop_n_acct: atomic_inc(&po->tp_drops); atomic_inc(&sk->sk_drops); drop_reason = SKB_DROP_REASON_PACKET_SOCK_ERROR; drop_n_restore: if (skb_head != skb->data && skb_shared(skb)) { skb->data = skb_head; skb->len = skb_len; } drop: sk_skb_reason_drop(sk, skb, drop_reason); return 0; } static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) { enum skb_drop_reason drop_reason = SKB_CONSUMED; struct sock *sk = NULL; struct packet_sock *po; struct sockaddr_ll *sll; union tpacket_uhdr h; u8 *skb_head = skb->data; int skb_len = skb->len; unsigned int snaplen, res; unsigned long status = TP_STATUS_USER; unsigned short macoff, hdrlen; unsigned int netoff; struct sk_buff *copy_skb = NULL; struct timespec64 ts; __u32 ts_status; unsigned int slot_id = 0; int vnet_hdr_sz = 0; /* struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT. * We may add members to them until current aligned size without forcing * userspace to call getsockopt(..., PACKET_HDRLEN, ...). 
*/ BUILD_BUG_ON(TPACKET_ALIGN(sizeof(*h.h2)) != 32); BUILD_BUG_ON(TPACKET_ALIGN(sizeof(*h.h3)) != 48); if (skb->pkt_type == PACKET_LOOPBACK) goto drop; sk = pt->af_packet_priv; po = pkt_sk(sk); if (!net_eq(dev_net(dev), sock_net(sk))) goto drop; if (dev_has_header(dev)) { if (sk->sk_type != SOCK_DGRAM) skb_push(skb, skb->data - skb_mac_header(skb)); else if (skb->pkt_type == PACKET_OUTGOING) { /* Special case: outgoing packets have ll header at head */ skb_pull(skb, skb_network_offset(skb)); } } snaplen = skb_frags_readable(skb) ? skb->len : skb_headlen(skb); res = run_filter(skb, sk, snaplen); if (!res) goto drop_n_restore; /* If we are flooded, just give up */ if (__packet_rcv_has_room(po, skb) == ROOM_NONE) { atomic_inc(&po->tp_drops); goto drop_n_restore; } if (skb->ip_summed == CHECKSUM_PARTIAL) status |= TP_STATUS_CSUMNOTREADY; else if (skb->pkt_type != PACKET_OUTGOING && skb_csum_unnecessary(skb)) status |= TP_STATUS_CSUM_VALID; if (skb_is_gso(skb) && skb_is_gso_tcp(skb)) status |= TP_STATUS_GSO_TCP; if (snaplen > res) snaplen = res; if (sk->sk_type == SOCK_DGRAM) { macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 + po->tp_reserve; } else { unsigned int maclen = skb_network_offset(skb); netoff = TPACKET_ALIGN(po->tp_hdrlen + (maclen < 16 ? 
16 : maclen)) + po->tp_reserve; vnet_hdr_sz = READ_ONCE(po->vnet_hdr_sz); if (vnet_hdr_sz) netoff += vnet_hdr_sz; macoff = netoff - maclen; } if (netoff > USHRT_MAX) { atomic_inc(&po->tp_drops); goto drop_n_restore; } if (po->tp_version <= TPACKET_V2) { if (macoff + snaplen > po->rx_ring.frame_size) { if (READ_ONCE(po->copy_thresh) && atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf) { if (skb_shared(skb)) { copy_skb = skb_clone(skb, GFP_ATOMIC); } else { copy_skb = skb_get(skb); skb_head = skb->data; } if (copy_skb) { memset(&PACKET_SKB_CB(copy_skb)->sa.ll, 0, sizeof(PACKET_SKB_CB(copy_skb)->sa.ll)); skb_set_owner_r(copy_skb, sk); } } snaplen = po->rx_ring.frame_size - macoff; if ((int)snaplen < 0) { snaplen = 0; vnet_hdr_sz = 0; } } } else if (unlikely(macoff + snaplen > GET_PBDQC_FROM_RB(&po->rx_ring)->max_frame_len)) { u32 nval; nval = GET_PBDQC_FROM_RB(&po->rx_ring)->max_frame_len - macoff; pr_err_once("tpacket_rcv: packet too big, clamped from %u to %u. macoff=%u\n", snaplen, nval, macoff); snaplen = nval; if (unlikely((int)snaplen < 0)) { snaplen = 0; macoff = GET_PBDQC_FROM_RB(&po->rx_ring)->max_frame_len; vnet_hdr_sz = 0; } } spin_lock(&sk->sk_receive_queue.lock); h.raw = packet_current_rx_frame(po, skb, TP_STATUS_KERNEL, (macoff+snaplen)); if (!h.raw) goto drop_n_account; if (po->tp_version <= TPACKET_V2) { slot_id = po->rx_ring.head; if (test_bit(slot_id, po->rx_ring.rx_owner_map)) goto drop_n_account; __set_bit(slot_id, po->rx_ring.rx_owner_map); } if (vnet_hdr_sz && virtio_net_hdr_from_skb(skb, h.raw + macoff - sizeof(struct virtio_net_hdr), vio_le(), true, 0)) { if (po->tp_version == TPACKET_V3) prb_clear_blk_fill_status(&po->rx_ring); goto drop_n_account; } if (po->tp_version <= TPACKET_V2) { packet_increment_rx_head(po, &po->rx_ring); /* * LOSING will be reported till you read the stats, * because it's COR - Clear On Read. * Anyways, moving it for V1/V2 only as V3 doesn't need this * at packet level. 
*/ if (atomic_read(&po->tp_drops)) status |= TP_STATUS_LOSING; } po->stats.stats1.tp_packets++; if (copy_skb) { status |= TP_STATUS_COPY; skb_clear_delivery_time(copy_skb); __skb_queue_tail(&sk->sk_receive_queue, copy_skb); } spin_unlock(&sk->sk_receive_queue.lock); skb_copy_bits(skb, 0, h.raw + macoff, snaplen); /* Always timestamp; prefer an existing software timestamp taken * closer to the time of capture. */ ts_status = tpacket_get_timestamp(skb, &ts, READ_ONCE(po->tp_tstamp) | SOF_TIMESTAMPING_SOFTWARE); if (!ts_status) ktime_get_real_ts64(&ts); status |= ts_status; switch (po->tp_version) { case TPACKET_V1: h.h1->tp_len = skb->len; h.h1->tp_snaplen = snaplen; h.h1->tp_mac = macoff; h.h1->tp_net = netoff; h.h1->tp_sec = ts.tv_sec; h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC; hdrlen = sizeof(*h.h1); break; case TPACKET_V2: h.h2->tp_len = skb->len; h.h2->tp_snaplen = snaplen; h.h2->tp_mac = macoff; h.h2->tp_net = netoff; h.h2->tp_sec = ts.tv_sec; h.h2->tp_nsec = ts.tv_nsec; if (skb_vlan_tag_present(skb)) { h.h2->tp_vlan_tci = skb_vlan_tag_get(skb); h.h2->tp_vlan_tpid = ntohs(skb->vlan_proto); status |= TP_STATUS_VLAN_VALID | TP_STATUS_VLAN_TPID_VALID; } else if (unlikely(sk->sk_type == SOCK_DGRAM && eth_type_vlan(skb->protocol))) { h.h2->tp_vlan_tci = vlan_get_tci(skb, skb->dev); h.h2->tp_vlan_tpid = ntohs(skb->protocol); status |= TP_STATUS_VLAN_VALID | TP_STATUS_VLAN_TPID_VALID; } else { h.h2->tp_vlan_tci = 0; h.h2->tp_vlan_tpid = 0; } memset(h.h2->tp_padding, 0, sizeof(h.h2->tp_padding)); hdrlen = sizeof(*h.h2); break; case TPACKET_V3: /* tp_nxt_offset,vlan are already populated above. 
		 * So DONT clear those fields here
		 */
		h.h3->tp_status |= status;
		h.h3->tp_len = skb->len;
		h.h3->tp_snaplen = snaplen;
		h.h3->tp_mac = macoff;
		h.h3->tp_net = netoff;
		h.h3->tp_sec = ts.tv_sec;
		h.h3->tp_nsec = ts.tv_nsec;
		memset(h.h3->tp_padding, 0, sizeof(h.h3->tp_padding));
		hdrlen = sizeof(*h.h3);
		break;
	default:
		BUG();
	}

	sll = h.raw + TPACKET_ALIGN(hdrlen);
	sll->sll_halen = dev_parse_header(skb, sll->sll_addr);
	sll->sll_family = AF_PACKET;
	sll->sll_hatype = dev->type;
	sll->sll_protocol = (sk->sk_type == SOCK_DGRAM) ?
		vlan_get_protocol_dgram(skb) : skb->protocol;
	sll->sll_pkttype = skb->pkt_type;
	if (unlikely(packet_sock_flag(po, PACKET_SOCK_ORIGDEV)))
		sll->sll_ifindex = orig_dev->ifindex;
	else
		sll->sll_ifindex = dev->ifindex;

	smp_mb();

#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 1
	if (po->tp_version <= TPACKET_V2) {
		u8 *start, *end;

		end = (u8 *) PAGE_ALIGN((unsigned long) h.raw +
					macoff + snaplen);

		for (start = h.raw; start < end; start += PAGE_SIZE)
			flush_dcache_page(pgv_to_page(start));
	}
	smp_wmb();
#endif

	if (po->tp_version <= TPACKET_V2) {
		spin_lock(&sk->sk_receive_queue.lock);
		__packet_set_status(po, h.raw, status);
		__clear_bit(slot_id, po->rx_ring.rx_owner_map);
		spin_unlock(&sk->sk_receive_queue.lock);
		sk->sk_data_ready(sk);
	} else if (po->tp_version == TPACKET_V3) {
		prb_clear_blk_fill_status(&po->rx_ring);
	}

drop_n_restore:
	if (skb_head != skb->data && skb_shared(skb)) {
		skb->data = skb_head;
		skb->len = skb_len;
	}
drop:
	sk_skb_reason_drop(sk, skb, drop_reason);
	return 0;

drop_n_account:
	spin_unlock(&sk->sk_receive_queue.lock);
	atomic_inc(&po->tp_drops);
	drop_reason = SKB_DROP_REASON_PACKET_SOCK_ERROR;

	sk->sk_data_ready(sk);
	sk_skb_reason_drop(sk, copy_skb, drop_reason);
	goto drop_n_restore;
}

static void tpacket_destruct_skb(struct sk_buff *skb)
{
	struct packet_sock *po = pkt_sk(skb->sk);

	if (likely(po->tx_ring.pg_vec)) {
		void *ph;
		__u32 ts;

		ph = skb_zcopy_get_nouarg(skb);
		packet_dec_pending(&po->tx_ring);

		ts = __packet_set_timestamp(po, ph, skb);
__packet_set_status(po, ph, TP_STATUS_AVAILABLE | ts); complete(&po->skb_completion); } sock_wfree(skb); } static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len) { if ((vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && (__virtio16_to_cpu(vio_le(), vnet_hdr->csum_start) + __virtio16_to_cpu(vio_le(), vnet_hdr->csum_offset) + 2 > __virtio16_to_cpu(vio_le(), vnet_hdr->hdr_len))) vnet_hdr->hdr_len = __cpu_to_virtio16(vio_le(), __virtio16_to_cpu(vio_le(), vnet_hdr->csum_start) + __virtio16_to_cpu(vio_le(), vnet_hdr->csum_offset) + 2); if (__virtio16_to_cpu(vio_le(), vnet_hdr->hdr_len) > len) return -EINVAL; return 0; } static int packet_snd_vnet_parse(struct msghdr *msg, size_t *len, struct virtio_net_hdr *vnet_hdr, int vnet_hdr_sz) { int ret; if (*len < vnet_hdr_sz) return -EINVAL; *len -= vnet_hdr_sz; if (!copy_from_iter_full(vnet_hdr, sizeof(*vnet_hdr), &msg->msg_iter)) return -EFAULT; ret = __packet_snd_vnet_parse(vnet_hdr, *len); if (ret) return ret; /* move iter to point to the start of mac header */ if (vnet_hdr_sz != sizeof(struct virtio_net_hdr)) iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(struct virtio_net_hdr)); return 0; } static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb, void *frame, struct net_device *dev, void *data, int tp_len, __be16 proto, unsigned char *addr, int hlen, int copylen, const struct sockcm_cookie *sockc) { union tpacket_uhdr ph; int to_write, offset, len, nr_frags, len_max; struct socket *sock = po->sk.sk_socket; struct page *page; int err; ph.raw = frame; skb->protocol = proto; skb->dev = dev; skb->priority = sockc->priority; skb->mark = sockc->mark; skb_set_delivery_type_by_clockid(skb, sockc->transmit_time, po->sk.sk_clockid); skb_setup_tx_timestamp(skb, sockc); skb_zcopy_set_nouarg(skb, ph.raw); skb_reserve(skb, hlen); skb_reset_network_header(skb); to_write = tp_len; if (sock->type == SOCK_DGRAM) { err = dev_hard_header(skb, dev, ntohs(proto), addr, NULL, tp_len); if 
(unlikely(err < 0)) return -EINVAL; } else if (copylen) { int hdrlen = min_t(int, copylen, tp_len); skb_push(skb, dev->hard_header_len); skb_put(skb, copylen - dev->hard_header_len); err = skb_store_bits(skb, 0, data, hdrlen); if (unlikely(err)) return err; if (!dev_validate_header(dev, skb->data, hdrlen)) return -EINVAL; data += hdrlen; to_write -= hdrlen; } offset = offset_in_page(data); len_max = PAGE_SIZE - offset; len = ((to_write > len_max) ? len_max : to_write); skb->data_len = to_write; skb->len += to_write; skb->truesize += to_write; refcount_add(to_write, &po->sk.sk_wmem_alloc); while (likely(to_write)) { nr_frags = skb_shinfo(skb)->nr_frags; if (unlikely(nr_frags >= MAX_SKB_FRAGS)) { pr_err("Packet exceed the number of skb frags(%u)\n", (unsigned int)MAX_SKB_FRAGS); return -EFAULT; } page = pgv_to_page(data); data += len; flush_dcache_page(page); get_page(page); skb_fill_page_desc(skb, nr_frags, page, offset, len); to_write -= len; offset = 0; len_max = PAGE_SIZE; len = ((to_write > len_max) ? 
len_max : to_write); } packet_parse_headers(skb, sock); return tp_len; } static int tpacket_parse_header(struct packet_sock *po, void *frame, int size_max, void **data) { union tpacket_uhdr ph; int tp_len, off; ph.raw = frame; switch (po->tp_version) { case TPACKET_V3: if (ph.h3->tp_next_offset != 0) { pr_warn_once("variable sized slot not supported"); return -EINVAL; } tp_len = ph.h3->tp_len; break; case TPACKET_V2: tp_len = ph.h2->tp_len; break; default: tp_len = ph.h1->tp_len; break; } if (unlikely(tp_len > size_max)) { pr_err("packet size is too long (%d > %d)\n", tp_len, size_max); return -EMSGSIZE; } if (unlikely(packet_sock_flag(po, PACKET_SOCK_TX_HAS_OFF))) { int off_min, off_max; off_min = po->tp_hdrlen - sizeof(struct sockaddr_ll); off_max = po->tx_ring.frame_size - tp_len; if (po->sk.sk_type == SOCK_DGRAM) { switch (po->tp_version) { case TPACKET_V3: off = ph.h3->tp_net; break; case TPACKET_V2: off = ph.h2->tp_net; break; default: off = ph.h1->tp_net; break; } } else { switch (po->tp_version) { case TPACKET_V3: off = ph.h3->tp_mac; break; case TPACKET_V2: off = ph.h2->tp_mac; break; default: off = ph.h1->tp_mac; break; } } if (unlikely((off < off_min) || (off_max < off))) return -EINVAL; } else { off = po->tp_hdrlen - sizeof(struct sockaddr_ll); } *data = frame + off; return tp_len; } static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) { struct sk_buff *skb = NULL; struct net_device *dev; struct virtio_net_hdr *vnet_hdr = NULL; struct sockcm_cookie sockc; __be16 proto; int err, reserve = 0; void *ph; DECLARE_SOCKADDR(struct sockaddr_ll *, saddr, msg->msg_name); bool need_wait = !(msg->msg_flags & MSG_DONTWAIT); int vnet_hdr_sz = READ_ONCE(po->vnet_hdr_sz); unsigned char *addr = NULL; int tp_len, size_max; void *data; int len_sum = 0; int status = TP_STATUS_AVAILABLE; int hlen, tlen, copylen = 0; long timeo = 0; mutex_lock(&po->pg_vec_lock); /* packet_sendmsg() check on tx_ring.pg_vec was lockless, * we need to confirm it under protection 
of pg_vec_lock. */ if (unlikely(!po->tx_ring.pg_vec)) { err = -EBUSY; goto out; } if (likely(saddr == NULL)) { dev = packet_cached_dev_get(po); proto = READ_ONCE(po->num); } else { err = -EINVAL; if (msg->msg_namelen < sizeof(struct sockaddr_ll)) goto out; if (msg->msg_namelen < (saddr->sll_halen + offsetof(struct sockaddr_ll, sll_addr))) goto out; proto = saddr->sll_protocol; dev = dev_get_by_index(sock_net(&po->sk), saddr->sll_ifindex); if (po->sk.sk_socket->type == SOCK_DGRAM) { if (dev && msg->msg_namelen < dev->addr_len + offsetof(struct sockaddr_ll, sll_addr)) goto out_put; addr = saddr->sll_addr; } } err = -ENXIO; if (unlikely(dev == NULL)) goto out; err = -ENETDOWN; if (unlikely(!(dev->flags & IFF_UP))) goto out_put; sockcm_init(&sockc, &po->sk); if (msg->msg_controllen) { err = sock_cmsg_send(&po->sk, msg, &sockc); if (unlikely(err)) goto out_put; } if (po->sk.sk_socket->type == SOCK_RAW) reserve = dev->hard_header_len; size_max = po->tx_ring.frame_size - (po->tp_hdrlen - sizeof(struct sockaddr_ll)); if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !vnet_hdr_sz) size_max = dev->mtu + reserve + VLAN_HLEN; reinit_completion(&po->skb_completion); do { ph = packet_current_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST); if (unlikely(ph == NULL)) { if (need_wait && skb) { timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); if (timeo <= 0) { err = !timeo ? 
-ETIMEDOUT : -ERESTARTSYS; goto out_put; } } /* check for additional frames */ continue; } skb = NULL; tp_len = tpacket_parse_header(po, ph, size_max, &data); if (tp_len < 0) goto tpacket_error; status = TP_STATUS_SEND_REQUEST; hlen = LL_RESERVED_SPACE(dev); tlen = dev->needed_tailroom; if (vnet_hdr_sz) { vnet_hdr = data; data += vnet_hdr_sz; tp_len -= vnet_hdr_sz; if (tp_len < 0 || __packet_snd_vnet_parse(vnet_hdr, tp_len)) { tp_len = -EINVAL; goto tpacket_error; } copylen = __virtio16_to_cpu(vio_le(), vnet_hdr->hdr_len); } copylen = max_t(int, copylen, dev->hard_header_len); skb = sock_alloc_send_skb(&po->sk, hlen + tlen + sizeof(struct sockaddr_ll) + (copylen - dev->hard_header_len), !need_wait, &err); if (unlikely(skb == NULL)) { /* we assume the socket was initially writeable ... */ if (likely(len_sum > 0)) err = len_sum; goto out_status; } tp_len = tpacket_fill_skb(po, skb, ph, dev, data, tp_len, proto, addr, hlen, copylen, &sockc); if (likely(tp_len >= 0) && tp_len > dev->mtu + reserve && !vnet_hdr_sz && !packet_extra_vlan_len_allowed(dev, skb)) tp_len = -EMSGSIZE; if (unlikely(tp_len < 0)) { tpacket_error: if (packet_sock_flag(po, PACKET_SOCK_TP_LOSS)) { __packet_set_status(po, ph, TP_STATUS_AVAILABLE); packet_increment_head(&po->tx_ring); kfree_skb(skb); continue; } else { status = TP_STATUS_WRONG_FORMAT; err = tp_len; goto out_status; } } if (vnet_hdr_sz) { if (virtio_net_hdr_to_skb(skb, vnet_hdr, vio_le())) { tp_len = -EINVAL; goto tpacket_error; } virtio_net_hdr_set_proto(skb, vnet_hdr); } skb->destructor = tpacket_destruct_skb; __packet_set_status(po, ph, TP_STATUS_SENDING); packet_inc_pending(&po->tx_ring); status = TP_STATUS_SEND_REQUEST; err = packet_xmit(po, skb); if (unlikely(err != 0)) { if (err > 0) err = net_xmit_errno(err); if (err && __packet_get_status(po, ph) == TP_STATUS_AVAILABLE) { /* skb was destructed already */ skb = NULL; goto out_status; } /* * skb was dropped but not destructed yet; * let's treat it like congestion or err < 0 */ 
err = 0; } packet_increment_head(&po->tx_ring); len_sum += tp_len; } while (likely((ph != NULL) || /* Note: packet_read_pending() might be slow if we have * to call it as it's per_cpu variable, but in fast-path * we already short-circuit the loop with the first * condition, and luckily don't have to go that path * anyway. */ (need_wait && packet_read_pending(&po->tx_ring)))); err = len_sum; goto out_put; out_status: __packet_set_status(po, ph, status); kfree_skb(skb); out_put: dev_put(dev); out: mutex_unlock(&po->pg_vec_lock); return err; } static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad, size_t reserve, size_t len, size_t linear, int noblock, int *err) { struct sk_buff *skb; /* Under a page? Don't bother with paged skb. */ if (prepad + len < PAGE_SIZE || !linear) linear = len; if (len - linear > MAX_SKB_FRAGS * (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) linear = len - MAX_SKB_FRAGS * (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER); skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock, err, PAGE_ALLOC_COSTLY_ORDER); if (!skb) return NULL; skb_reserve(skb, reserve); skb_put(skb, linear); skb->data_len = len - linear; skb->len += len - linear; return skb; } static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) { struct sock *sk = sock->sk; DECLARE_SOCKADDR(struct sockaddr_ll *, saddr, msg->msg_name); struct sk_buff *skb; struct net_device *dev; __be16 proto; unsigned char *addr = NULL; int err, reserve = 0; struct sockcm_cookie sockc; struct virtio_net_hdr vnet_hdr = { 0 }; int offset = 0; struct packet_sock *po = pkt_sk(sk); int vnet_hdr_sz = READ_ONCE(po->vnet_hdr_sz); int hlen, tlen, linear; int extra_len = 0; /* * Get and verify the address. 
*/ if (likely(saddr == NULL)) { dev = packet_cached_dev_get(po); proto = READ_ONCE(po->num); } else { err = -EINVAL; if (msg->msg_namelen < sizeof(struct sockaddr_ll)) goto out; if (msg->msg_namelen < (saddr->sll_halen + offsetof(struct sockaddr_ll, sll_addr))) goto out; proto = saddr->sll_protocol; dev = dev_get_by_index(sock_net(sk), saddr->sll_ifindex); if (sock->type == SOCK_DGRAM) { if (dev && msg->msg_namelen < dev->addr_len + offsetof(struct sockaddr_ll, sll_addr)) goto out_unlock; addr = saddr->sll_addr; } } err = -ENXIO; if (unlikely(dev == NULL)) goto out_unlock; err = -ENETDOWN; if (unlikely(!(dev->flags & IFF_UP))) goto out_unlock; sockcm_init(&sockc, sk); if (msg->msg_controllen) { err = sock_cmsg_send(sk, msg, &sockc); if (unlikely(err)) goto out_unlock; } if (sock->type == SOCK_RAW) reserve = dev->hard_header_len; if (vnet_hdr_sz) { err = packet_snd_vnet_parse(msg, &len, &vnet_hdr, vnet_hdr_sz); if (err) goto out_unlock; } if (unlikely(sock_flag(sk, SOCK_NOFCS))) { if (!netif_supports_nofcs(dev)) { err = -EPROTONOSUPPORT; goto out_unlock; } extra_len = 4; /* We're doing our own CRC */ } err = -EMSGSIZE; if (!vnet_hdr.gso_type && (len > dev->mtu + reserve + VLAN_HLEN + extra_len)) goto out_unlock; err = -ENOBUFS; hlen = LL_RESERVED_SPACE(dev); tlen = dev->needed_tailroom; linear = __virtio16_to_cpu(vio_le(), vnet_hdr.hdr_len); linear = max(linear, min_t(int, len, dev->hard_header_len)); skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, linear, msg->msg_flags & MSG_DONTWAIT, &err); if (skb == NULL) goto out_unlock; skb_reset_network_header(skb); err = -EINVAL; if (sock->type == SOCK_DGRAM) { offset = dev_hard_header(skb, dev, ntohs(proto), addr, NULL, len); if (unlikely(offset < 0)) goto out_free; } else if (reserve) { skb_reserve(skb, -reserve); if (len < reserve + sizeof(struct ipv6hdr) && dev->min_header_len != dev->hard_header_len) skb_reset_network_header(skb); } /* Returns -EFAULT on error */ err = skb_copy_datagram_from_iter(skb, offset, 
&msg->msg_iter, len); if (err) goto out_free; if ((sock->type == SOCK_RAW && !dev_validate_header(dev, skb->data, len)) || !skb->len) { err = -EINVAL; goto out_free; } skb_setup_tx_timestamp(skb, &sockc); if (!vnet_hdr.gso_type && (len > dev->mtu + reserve + extra_len) && !packet_extra_vlan_len_allowed(dev, skb)) { err = -EMSGSIZE; goto out_free; } skb->protocol = proto; skb->dev = dev; skb->priority = sockc.priority; skb->mark = sockc.mark; skb_set_delivery_type_by_clockid(skb, sockc.transmit_time, sk->sk_clockid); if (unlikely(extra_len == 4)) skb->no_fcs = 1; packet_parse_headers(skb, sock); if (vnet_hdr_sz) { err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le()); if (err) goto out_free; len += vnet_hdr_sz; virtio_net_hdr_set_proto(skb, &vnet_hdr); } err = packet_xmit(po, skb); if (unlikely(err != 0)) { if (err > 0) err = net_xmit_errno(err); if (err) goto out_unlock; } dev_put(dev); return len; out_free: kfree_skb(skb); out_unlock: dev_put(dev); out: return err; } static int packet_sendmsg(struct socket *sock, struct msghdr *msg, size_t len) { struct sock *sk = sock->sk; struct packet_sock *po = pkt_sk(sk); /* Reading tx_ring.pg_vec without holding pg_vec_lock is racy. * tpacket_snd() will redo the check safely. */ if (data_race(po->tx_ring.pg_vec)) return tpacket_snd(po, msg); return packet_snd(sock, msg, len); } /* * Close a PACKET socket. This is fairly simple. We immediately go * to 'closed' state and remove our protocol entry in the device list. 
*/ static int packet_release(struct socket *sock) { struct sock *sk = sock->sk; struct packet_sock *po; struct packet_fanout *f; struct net *net; union tpacket_req_u req_u; if (!sk) return 0; net = sock_net(sk); po = pkt_sk(sk); mutex_lock(&net->packet.sklist_lock); sk_del_node_init_rcu(sk); mutex_unlock(&net->packet.sklist_lock); sock_prot_inuse_add(net, sk->sk_prot, -1); spin_lock(&po->bind_lock); unregister_prot_hook(sk, false); packet_cached_dev_reset(po); if (po->prot_hook.dev) { netdev_put(po->prot_hook.dev, &po->prot_hook.dev_tracker); po->prot_hook.dev = NULL; } spin_unlock(&po->bind_lock); packet_flush_mclist(sk); lock_sock(sk); if (po->rx_ring.pg_vec) { memset(&req_u, 0, sizeof(req_u)); packet_set_ring(sk, &req_u, 1, 0); } if (po->tx_ring.pg_vec) { memset(&req_u, 0, sizeof(req_u)); packet_set_ring(sk, &req_u, 1, 1); } release_sock(sk); f = fanout_release(sk); synchronize_net(); kfree(po->rollover); if (f) { fanout_release_data(f); kvfree(f); } /* * Now the socket is dead. No more input will appear. */ sock_orphan(sk); sock->sk = NULL; /* Purge queues */ skb_queue_purge(&sk->sk_receive_queue); packet_free_pending(po); sock_put(sk); return 0; } /* * Attach a packet hook. 
*/ static int packet_do_bind(struct sock *sk, const char *name, int ifindex, __be16 proto) { struct packet_sock *po = pkt_sk(sk); struct net_device *dev = NULL; bool unlisted = false; bool need_rehook; int ret = 0; lock_sock(sk); spin_lock(&po->bind_lock); if (!proto) proto = po->num; rcu_read_lock(); if (po->fanout) { ret = -EINVAL; goto out_unlock; } if (name) { dev = dev_get_by_name_rcu(sock_net(sk), name); if (!dev) { ret = -ENODEV; goto out_unlock; } } else if (ifindex) { dev = dev_get_by_index_rcu(sock_net(sk), ifindex); if (!dev) { ret = -ENODEV; goto out_unlock; } } need_rehook = po->prot_hook.type != proto || po->prot_hook.dev != dev; if (need_rehook) { dev_hold(dev); if (packet_sock_flag(po, PACKET_SOCK_RUNNING)) { rcu_read_unlock(); /* prevents packet_notifier() from calling * register_prot_hook() */ WRITE_ONCE(po->num, 0); __unregister_prot_hook(sk, true); rcu_read_lock(); if (dev) unlisted = !dev_get_by_index_rcu(sock_net(sk), dev->ifindex); } BUG_ON(packet_sock_flag(po, PACKET_SOCK_RUNNING)); WRITE_ONCE(po->num, proto); po->prot_hook.type = proto; netdev_put(po->prot_hook.dev, &po->prot_hook.dev_tracker); if (unlikely(unlisted)) { po->prot_hook.dev = NULL; WRITE_ONCE(po->ifindex, -1); packet_cached_dev_reset(po); } else { netdev_hold(dev, &po->prot_hook.dev_tracker, GFP_ATOMIC); po->prot_hook.dev = dev; WRITE_ONCE(po->ifindex, dev ? 
dev->ifindex : 0); packet_cached_dev_assign(po, dev); } dev_put(dev); } if (proto == 0 || !need_rehook) goto out_unlock; if (!unlisted && (!dev || (dev->flags & IFF_UP))) { register_prot_hook(sk); } else { sk->sk_err = ENETDOWN; if (!sock_flag(sk, SOCK_DEAD)) sk_error_report(sk); } out_unlock: rcu_read_unlock(); spin_unlock(&po->bind_lock); release_sock(sk); return ret; } /* * Bind a packet socket to a device */ static int packet_bind_spkt(struct socket *sock, struct sockaddr *uaddr, int addr_len) { struct sock *sk = sock->sk; char name[sizeof(uaddr->sa_data_min) + 1]; /* * Check legality */ if (addr_len != sizeof(struct sockaddr)) return -EINVAL; /* uaddr->sa_data comes from the userspace, it's not guaranteed to be * zero-terminated. */ memcpy(name, uaddr->sa_data, sizeof(uaddr->sa_data_min)); name[sizeof(uaddr->sa_data_min)] = 0; return packet_do_bind(sk, name, 0, 0); } static int packet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) { struct sockaddr_ll *sll = (struct sockaddr_ll *)uaddr; struct sock *sk = sock->sk; /* * Check legality */ if (addr_len < sizeof(struct sockaddr_ll)) return -EINVAL; if (sll->sll_family != AF_PACKET) return -EINVAL; return packet_do_bind(sk, NULL, sll->sll_ifindex, sll->sll_protocol); } static struct proto packet_proto = { .name = "PACKET", .owner = THIS_MODULE, .obj_size = sizeof(struct packet_sock), }; /* * Create a packet of type SOCK_PACKET. 
 */

static int packet_create(struct net *net, struct socket *sock, int protocol,
			 int kern)
{
	struct sock *sk;
	struct packet_sock *po;
	__be16 proto = (__force __be16)protocol; /* weird, but documented */
	int err;

	if (!ns_capable(net->user_ns, CAP_NET_RAW))
		return -EPERM;
	if (sock->type != SOCK_DGRAM && sock->type != SOCK_RAW &&
	    sock->type != SOCK_PACKET)
		return -ESOCKTNOSUPPORT;

	sock->state = SS_UNCONNECTED;

	err = -ENOBUFS;
	sk = sk_alloc(net, PF_PACKET, GFP_KERNEL, &packet_proto, kern);
	if (sk == NULL)
		goto out;

	sock->ops = &packet_ops;
	if (sock->type == SOCK_PACKET)
		sock->ops = &packet_ops_spkt;

	po = pkt_sk(sk);
	err = packet_alloc_pending(po);
	if (err)
		goto out_sk_free;

	sock_init_data(sock, sk);

	init_completion(&po->skb_completion);
	sk->sk_family = PF_PACKET;
	po->num = proto;

	packet_cached_dev_reset(po);

	sk->sk_destruct = packet_sock_destruct;

	/*
	 *	Attach a protocol block
	 */

	spin_lock_init(&po->bind_lock);
	mutex_init(&po->pg_vec_lock);
	po->rollover = NULL;
	po->prot_hook.func = packet_rcv;

	if (sock->type == SOCK_PACKET)
		po->prot_hook.func = packet_rcv_spkt;

	po->prot_hook.af_packet_priv = sk;
	po->prot_hook.af_packet_net = sock_net(sk);

	if (proto) {
		po->prot_hook.type = proto;
		__register_prot_hook(sk);
	}

	mutex_lock(&net->packet.sklist_lock);
	sk_add_node_tail_rcu(sk, &net->packet.sklist);
	mutex_unlock(&net->packet.sklist_lock);

	sock_prot_inuse_add(net, &packet_proto, 1);

	return 0;
out_sk_free:
	sk_free(sk);
out:
	return err;
}

/*
 *	Pull a packet from our receive queue and hand it to the user.
 *	If necessary we block.
 */

static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
			  int flags)
{
	struct sock *sk = sock->sk;
	struct sk_buff *skb;
	int copied, err;
	int vnet_hdr_len = READ_ONCE(pkt_sk(sk)->vnet_hdr_sz);
	unsigned int origlen = 0;

	err = -EINVAL;
	if (flags & ~(MSG_PEEK|MSG_DONTWAIT|MSG_TRUNC|MSG_CMSG_COMPAT|MSG_ERRQUEUE))
		goto out;

#if 0
	/* What error should we return now? EUNATTACH? */
	if (pkt_sk(sk)->ifindex < 0)
		return -ENODEV;
#endif

	if (flags & MSG_ERRQUEUE) {
		err = sock_recv_errqueue(sk, msg, len,
					 SOL_PACKET, PACKET_TX_TIMESTAMP);
		goto out;
	}

	/*
	 *	Call the generic datagram receiver. This handles all sorts
	 *	of horrible races and re-entrancy so we can forget about it
	 *	in the protocol layers.
	 *
	 *	Now it will return ENETDOWN if the device has just gone down,
	 *	but then it will block.
	 */

	skb = skb_recv_datagram(sk, flags, &err);

	/*
	 *	An error occurred so return it. Because skb_recv_datagram()
	 *	handles the blocking, we neither see nor worry about blocking
	 *	retries.
	 */

	if (skb == NULL)
		goto out;

	packet_rcv_try_clear_pressure(pkt_sk(sk));

	if (vnet_hdr_len) {
		err = packet_rcv_vnet(msg, skb, &len, vnet_hdr_len);
		if (err)
			goto out_free;
	}

	/* You lose any data beyond the buffer you gave. If it worries
	 * a user program they can ask the device for its MTU
	 * anyway.
	 */
	copied = skb->len;
	if (copied > len) {
		copied = len;
		msg->msg_flags |= MSG_TRUNC;
	}

	err = skb_copy_datagram_msg(skb, 0, msg, copied);
	if (err)
		goto out_free;

	if (sock->type != SOCK_PACKET) {
		struct sockaddr_ll *sll = &PACKET_SKB_CB(skb)->sa.ll;

		/* Original length was stored in sockaddr_ll fields */
		origlen = PACKET_SKB_CB(skb)->sa.origlen;
		sll->sll_family = AF_PACKET;
		sll->sll_protocol = (sock->type == SOCK_DGRAM) ?
			vlan_get_protocol_dgram(skb) : skb->protocol;
	}

	sock_recv_cmsgs(msg, sk, skb);

	if (msg->msg_name) {
		const size_t max_len = min(sizeof(skb->cb),
					   sizeof(struct sockaddr_storage));
		int copy_len;

		/* If the address length field is there to be filled
		 * in, we fill it in now.
*/ if (sock->type == SOCK_PACKET) { __sockaddr_check_size(sizeof(struct sockaddr_pkt)); msg->msg_namelen = sizeof(struct sockaddr_pkt); copy_len = msg->msg_namelen; } else { struct sockaddr_ll *sll = &PACKET_SKB_CB(skb)->sa.ll; msg->msg_namelen = sll->sll_halen + offsetof(struct sockaddr_ll, sll_addr); copy_len = msg->msg_namelen; if (msg->msg_namelen < sizeof(struct sockaddr_ll)) { memset(msg->msg_name + offsetof(struct sockaddr_ll, sll_addr), 0, sizeof(sll->sll_addr)); msg->msg_namelen = sizeof(struct sockaddr_ll); } } if (WARN_ON_ONCE(copy_len > max_len)) { copy_len = max_len; msg->msg_namelen = copy_len; } memcpy(msg->msg_name, &PACKET_SKB_CB(skb)->sa, copy_len); } if (packet_sock_flag(pkt_sk(sk), PACKET_SOCK_AUXDATA)) { struct tpacket_auxdata aux; aux.tp_status = TP_STATUS_USER; if (skb->ip_summed == CHECKSUM_PARTIAL) aux.tp_status |= TP_STATUS_CSUMNOTREADY; else if (skb->pkt_type != PACKET_OUTGOING && skb_csum_unnecessary(skb)) aux.tp_status |= TP_STATUS_CSUM_VALID; if (skb_is_gso(skb) && skb_is_gso_tcp(skb)) aux.tp_status |= TP_STATUS_GSO_TCP; aux.tp_len = origlen; aux.tp_snaplen = skb->len; aux.tp_mac = 0; aux.tp_net = skb_network_offset(skb); if (skb_vlan_tag_present(skb)) { aux.tp_vlan_tci = skb_vlan_tag_get(skb); aux.tp_vlan_tpid = ntohs(skb->vlan_proto); aux.tp_status |= TP_STATUS_VLAN_VALID | TP_STATUS_VLAN_TPID_VALID; } else if (unlikely(sock->type == SOCK_DGRAM && eth_type_vlan(skb->protocol))) { struct sockaddr_ll *sll = &PACKET_SKB_CB(skb)->sa.ll; struct net_device *dev; rcu_read_lock(); dev = dev_get_by_index_rcu(sock_net(sk), sll->sll_ifindex); if (dev) { aux.tp_vlan_tci = vlan_get_tci(skb, dev); aux.tp_vlan_tpid = ntohs(skb->protocol); aux.tp_status |= TP_STATUS_VLAN_VALID | TP_STATUS_VLAN_TPID_VALID; } else { aux.tp_vlan_tci = 0; aux.tp_vlan_tpid = 0; } rcu_read_unlock(); } else { aux.tp_vlan_tci = 0; aux.tp_vlan_tpid = 0; } put_cmsg(msg, SOL_PACKET, PACKET_AUXDATA, sizeof(aux), &aux); } /* * Free or return the buffer as appropriate. 
Again this * hides all the races and re-entrancy issues from us. */ err = vnet_hdr_len + ((flags&MSG_TRUNC) ? skb->len : copied); out_free: skb_free_datagram(sk, skb); out: return err; } static int packet_getname_spkt(struct socket *sock, struct sockaddr *uaddr, int peer) { struct net_device *dev; struct sock *sk = sock->sk; if (peer) return -EOPNOTSUPP; uaddr->sa_family = AF_PACKET; memset(uaddr->sa_data, 0, sizeof(uaddr->sa_data_min)); rcu_read_lock(); dev = dev_get_by_index_rcu(sock_net(sk), READ_ONCE(pkt_sk(sk)->ifindex)); if (dev) strscpy(uaddr->sa_data, dev->name, sizeof(uaddr->sa_data_min)); rcu_read_unlock(); return sizeof(*uaddr); } static int packet_getname(struct socket *sock, struct sockaddr *uaddr, int peer) { struct net_device *dev; struct sock *sk = sock->sk; struct packet_sock *po = pkt_sk(sk); DECLARE_SOCKADDR(struct sockaddr_ll *, sll, uaddr); int ifindex; if (peer) return -EOPNOTSUPP; ifindex = READ_ONCE(po->ifindex); sll->sll_family = AF_PACKET; sll->sll_ifindex = ifindex; sll->sll_protocol = READ_ONCE(po->num); sll->sll_pkttype = 0; rcu_read_lock(); dev = dev_get_by_index_rcu(sock_net(sk), ifindex); if (dev) { sll->sll_hatype = dev->type; sll->sll_halen = dev->addr_len; /* Let __fortify_memcpy_chk() know the actual buffer size. 
*/ memcpy(((struct sockaddr_storage *)sll)->__data + offsetof(struct sockaddr_ll, sll_addr) - offsetofend(struct sockaddr_ll, sll_family), dev->dev_addr, dev->addr_len); } else { sll->sll_hatype = 0; /* Bad: we have no ARPHRD_UNSPEC */ sll->sll_halen = 0; } rcu_read_unlock(); return offsetof(struct sockaddr_ll, sll_addr) + sll->sll_halen; } static int packet_dev_mc(struct net_device *dev, struct packet_mclist *i, int what) { switch (i->type) { case PACKET_MR_MULTICAST: if (i->alen != dev->addr_len) return -EINVAL; if (what > 0) return dev_mc_add(dev, i->addr); else return dev_mc_del(dev, i->addr); break; case PACKET_MR_PROMISC: return dev_set_promiscuity(dev, what); case PACKET_MR_ALLMULTI: return dev_set_allmulti(dev, what); case PACKET_MR_UNICAST: if (i->alen != dev->addr_len) return -EINVAL; if (what > 0) return dev_uc_add(dev, i->addr); else return dev_uc_del(dev, i->addr); break; default: break; } return 0; } static void packet_dev_mclist_delete(struct net_device *dev, struct packet_mclist **mlp) { struct packet_mclist *ml; while ((ml = *mlp) != NULL) { if (ml->ifindex == dev->ifindex) { packet_dev_mc(dev, ml, -1); *mlp = ml->next; kfree(ml); } else mlp = &ml->next; } } static int packet_mc_add(struct sock *sk, struct packet_mreq_max *mreq) { struct packet_sock *po = pkt_sk(sk); struct packet_mclist *ml, *i; struct net_device *dev; int err; rtnl_lock(); err = -ENODEV; dev = __dev_get_by_index(sock_net(sk), mreq->mr_ifindex); if (!dev) goto done; err = -EINVAL; if (mreq->mr_alen > dev->addr_len) goto done; err = -ENOBUFS; i = kmalloc(sizeof(*i), GFP_KERNEL); if (i == NULL) goto done; err = 0; for (ml = po->mclist; ml; ml = ml->next) { if (ml->ifindex == mreq->mr_ifindex && ml->type == mreq->mr_type && ml->alen == mreq->mr_alen && memcmp(ml->addr, mreq->mr_address, ml->alen) == 0) { ml->count++; /* Free the new element ... 
*/ kfree(i); goto done; } } i->type = mreq->mr_type; i->ifindex = mreq->mr_ifindex; i->alen = mreq->mr_alen; memcpy(i->addr, mreq->mr_address, i->alen); memset(i->addr + i->alen, 0, sizeof(i->addr) - i->alen); i->count = 1; i->next = po->mclist; po->mclist = i; err = packet_dev_mc(dev, i, 1); if (err) { po->mclist = i->next; kfree(i); } done: rtnl_unlock(); return err; } static int packet_mc_drop(struct sock *sk, struct packet_mreq_max *mreq) { struct packet_mclist *ml, **mlp; rtnl_lock(); for (mlp = &pkt_sk(sk)->mclist; (ml = *mlp) != NULL; mlp = &ml->next) { if (ml->ifindex == mreq->mr_ifindex && ml->type == mreq->mr_type && ml->alen == mreq->mr_alen && memcmp(ml->addr, mreq->mr_address, ml->alen) == 0) { if (--ml->count == 0) { struct net_device *dev; *mlp = ml->next; dev = __dev_get_by_index(sock_net(sk), ml->ifindex); if (dev) packet_dev_mc(dev, ml, -1); kfree(ml); } break; } } rtnl_unlock(); return 0; } static void packet_flush_mclist(struct sock *sk) { struct packet_sock *po = pkt_sk(sk); struct packet_mclist *ml; if (!po->mclist) return; rtnl_lock(); while ((ml = po->mclist) != NULL) { struct net_device *dev; po->mclist = ml->next; dev = __dev_get_by_index(sock_net(sk), ml->ifindex); if (dev != NULL) packet_dev_mc(dev, ml, -1); kfree(ml); } rtnl_unlock(); } static int packet_setsockopt(struct socket *sock, int level, int optname, sockptr_t optval, unsigned int optlen) { struct sock *sk = sock->sk; struct packet_sock *po = pkt_sk(sk); int ret; if (level != SOL_PACKET) return -ENOPROTOOPT; switch (optname) { case PACKET_ADD_MEMBERSHIP: case PACKET_DROP_MEMBERSHIP: { struct packet_mreq_max mreq; int len = optlen; memset(&mreq, 0, sizeof(mreq)); if (len < sizeof(struct packet_mreq)) return -EINVAL; if (len > sizeof(mreq)) len = sizeof(mreq); if (copy_from_sockptr(&mreq, optval, len)) return -EFAULT; if (len < (mreq.mr_alen + offsetof(struct packet_mreq, mr_address))) return -EINVAL; if (optname == PACKET_ADD_MEMBERSHIP) ret = packet_mc_add(sk, &mreq); else ret 
= packet_mc_drop(sk, &mreq); return ret; } case PACKET_RX_RING: case PACKET_TX_RING: { union tpacket_req_u req_u; ret = -EINVAL; lock_sock(sk); switch (po->tp_version) { case TPACKET_V1: case TPACKET_V2: if (optlen < sizeof(req_u.req)) break; ret = copy_from_sockptr(&req_u.req, optval, sizeof(req_u.req)) ? -EINVAL : 0; break; case TPACKET_V3: default: if (optlen < sizeof(req_u.req3)) break; ret = copy_from_sockptr(&req_u.req3, optval, sizeof(req_u.req3)) ? -EINVAL : 0; break; } if (!ret) ret = packet_set_ring(sk, &req_u, 0, optname == PACKET_TX_RING); release_sock(sk); return ret; } case PACKET_COPY_THRESH: { int val; if (optlen != sizeof(val)) return -EINVAL; if (copy_from_sockptr(&val, optval, sizeof(val))) return -EFAULT; WRITE_ONCE(pkt_sk(sk)->copy_thresh, val); return 0; } case PACKET_VERSION: { int val; if (optlen != sizeof(val)) return -EINVAL; if (copy_from_sockptr(&val, optval, sizeof(val))) return -EFAULT; switch (val) { case TPACKET_V1: case TPACKET_V2: case TPACKET_V3: break; default: return -EINVAL; } lock_sock(sk); if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) { ret = -EBUSY; } else { po->tp_version = val; ret = 0; } release_sock(sk); return ret; } case PACKET_RESERVE: { unsigned int val; if (optlen != sizeof(val)) return -EINVAL; if (copy_from_sockptr(&val, optval, sizeof(val))) return -EFAULT; if (val > INT_MAX) return -EINVAL; lock_sock(sk); if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) { ret = -EBUSY; } else { po->tp_reserve = val; ret = 0; } release_sock(sk); return ret; } case PACKET_LOSS: { unsigned int val; if (optlen != sizeof(val)) return -EINVAL; if (copy_from_sockptr(&val, optval, sizeof(val))) return -EFAULT; lock_sock(sk); if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) { ret = -EBUSY; } else { packet_sock_flag_set(po, PACKET_SOCK_TP_LOSS, val); ret = 0; } release_sock(sk); return ret; } case PACKET_AUXDATA: { int val; if (optlen < sizeof(val)) return -EINVAL; if (copy_from_sockptr(&val, optval, sizeof(val))) return -EFAULT; 
packet_sock_flag_set(po, PACKET_SOCK_AUXDATA, val); return 0; } case PACKET_ORIGDEV: { int val; if (optlen < sizeof(val)) return -EINVAL; if (copy_from_sockptr(&val, optval, sizeof(val))) return -EFAULT; packet_sock_flag_set(po, PACKET_SOCK_ORIGDEV, val); return 0; } case PACKET_VNET_HDR: case PACKET_VNET_HDR_SZ: { int val, hdr_len; if (sock->type != SOCK_RAW) return -EINVAL; if (optlen < sizeof(val)) return -EINVAL; if (copy_from_sockptr(&val, optval, sizeof(val))) return -EFAULT; if (optname == PACKET_VNET_HDR_SZ) { if (val && val != sizeof(struct virtio_net_hdr) && val != sizeof(struct virtio_net_hdr_mrg_rxbuf)) return -EINVAL; hdr_len = val; } else { hdr_len = val ? sizeof(struct virtio_net_hdr) : 0; } lock_sock(sk); if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) { ret = -EBUSY; } else { WRITE_ONCE(po->vnet_hdr_sz, hdr_len); ret = 0; } release_sock(sk); return ret; } case PACKET_TIMESTAMP: { int val; if (optlen != sizeof(val)) return -EINVAL; if (copy_from_sockptr(&val, optval, sizeof(val))) return -EFAULT; WRITE_ONCE(po->tp_tstamp, val); return 0; } case PACKET_FANOUT: { struct fanout_args args = { 0 }; if (optlen != sizeof(int) && optlen != sizeof(args)) return -EINVAL; if (copy_from_sockptr(&args, optval, optlen)) return -EFAULT; return fanout_add(sk, &args); } case PACKET_FANOUT_DATA: { /* Paired with the WRITE_ONCE() in fanout_add() */ if (!READ_ONCE(po->fanout)) return -EINVAL; return fanout_set_data(po, optval, optlen); } case PACKET_IGNORE_OUTGOING: { int val; if (optlen != sizeof(val)) return -EINVAL; if (copy_from_sockptr(&val, optval, sizeof(val))) return -EFAULT; if (val < 0 || val > 1) return -EINVAL; WRITE_ONCE(po->prot_hook.ignore_outgoing, !!val); return 0; } case PACKET_TX_HAS_OFF: { unsigned int val; if (optlen != sizeof(val)) return -EINVAL; if (copy_from_sockptr(&val, optval, sizeof(val))) return -EFAULT; lock_sock(sk); if (!po->rx_ring.pg_vec && !po->tx_ring.pg_vec) packet_sock_flag_set(po, PACKET_SOCK_TX_HAS_OFF, val); release_sock(sk); 
return 0; } case PACKET_QDISC_BYPASS: { int val; if (optlen != sizeof(val)) return -EINVAL; if (copy_from_sockptr(&val, optval, sizeof(val))) return -EFAULT; packet_sock_flag_set(po, PACKET_SOCK_QDISC_BYPASS, val); return 0; } default: return -ENOPROTOOPT; } } static int packet_getsockopt(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen) { int len; int val, lv = sizeof(val); struct sock *sk = sock->sk; struct packet_sock *po = pkt_sk(sk); void *data = &val; union tpacket_stats_u st; struct tpacket_rollover_stats rstats; int drops; if (level != SOL_PACKET) return -ENOPROTOOPT; if (get_user(len, optlen)) return -EFAULT; if (len < 0) return -EINVAL; switch (optname) { case PACKET_STATISTICS: spin_lock_bh(&sk->sk_receive_queue.lock); memcpy(&st, &po->stats, sizeof(st)); memset(&po->stats, 0, sizeof(po->stats)); spin_unlock_bh(&sk->sk_receive_queue.lock); drops = atomic_xchg(&po->tp_drops, 0); if (po->tp_version == TPACKET_V3) { lv = sizeof(struct tpacket_stats_v3); st.stats3.tp_drops = drops; st.stats3.tp_packets += drops; data = &st.stats3; } else { lv = sizeof(struct tpacket_stats); st.stats1.tp_drops = drops; st.stats1.tp_packets += drops; data = &st.stats1; } break; case PACKET_AUXDATA: val = packet_sock_flag(po, PACKET_SOCK_AUXDATA); break; case PACKET_ORIGDEV: val = packet_sock_flag(po, PACKET_SOCK_ORIGDEV); break; case PACKET_VNET_HDR: val = !!READ_ONCE(po->vnet_hdr_sz); break; case PACKET_VNET_HDR_SZ: val = READ_ONCE(po->vnet_hdr_sz); break; case PACKET_COPY_THRESH: val = READ_ONCE(pkt_sk(sk)->copy_thresh); break; case PACKET_VERSION: val = po->tp_version; break; case PACKET_HDRLEN: if (len > sizeof(int)) len = sizeof(int); if (len < sizeof(int)) return -EINVAL; if (copy_from_user(&val, optval, len)) return -EFAULT; switch (val) { case TPACKET_V1: val = sizeof(struct tpacket_hdr); break; case TPACKET_V2: val = sizeof(struct tpacket2_hdr); break; case TPACKET_V3: val = sizeof(struct tpacket3_hdr); break; default: return 
-EINVAL; } break; case PACKET_RESERVE: val = po->tp_reserve; break; case PACKET_LOSS: val = packet_sock_flag(po, PACKET_SOCK_TP_LOSS); break; case PACKET_TIMESTAMP: val = READ_ONCE(po->tp_tstamp); break; case PACKET_FANOUT: val = (po->fanout ? ((u32)po->fanout->id | ((u32)po->fanout->type << 16) | ((u32)po->fanout->flags << 24)) : 0); break; case PACKET_IGNORE_OUTGOING: val = READ_ONCE(po->prot_hook.ignore_outgoing); break; case PACKET_ROLLOVER_STATS: if (!po->rollover) return -EINVAL; rstats.tp_all = atomic_long_read(&po->rollover->num); rstats.tp_huge = atomic_long_read(&po->rollover->num_huge); rstats.tp_failed = atomic_long_read(&po->rollover->num_failed); data = &rstats; lv = sizeof(rstats); break; case PACKET_TX_HAS_OFF: val = packet_sock_flag(po, PACKET_SOCK_TX_HAS_OFF); break; case PACKET_QDISC_BYPASS: val = packet_sock_flag(po, PACKET_SOCK_QDISC_BYPASS); break; default: return -ENOPROTOOPT; } if (len > lv) len = lv; if (put_user(len, optlen)) return -EFAULT; if (copy_to_user(optval, data, len)) return -EFAULT; return 0; } static int packet_notifier(struct notifier_block *this, unsigned long msg, void *ptr) { struct sock *sk; struct net_device *dev = netdev_notifier_info_to_dev(ptr); struct net *net = dev_net(dev); rcu_read_lock(); sk_for_each_rcu(sk, &net->packet.sklist) { struct packet_sock *po = pkt_sk(sk); switch (msg) { case NETDEV_UNREGISTER: if (po->mclist) packet_dev_mclist_delete(dev, &po->mclist); fallthrough; case NETDEV_DOWN: if (dev->ifindex == po->ifindex) { spin_lock(&po->bind_lock); if (packet_sock_flag(po, PACKET_SOCK_RUNNING)) { __unregister_prot_hook(sk, false); sk->sk_err = ENETDOWN; if (!sock_flag(sk, SOCK_DEAD)) sk_error_report(sk); } if (msg == NETDEV_UNREGISTER) { packet_cached_dev_reset(po); WRITE_ONCE(po->ifindex, -1); netdev_put(po->prot_hook.dev, &po->prot_hook.dev_tracker); po->prot_hook.dev = NULL; } spin_unlock(&po->bind_lock); } break; case NETDEV_UP: if (dev->ifindex == po->ifindex) { spin_lock(&po->bind_lock); if (po->num) 
register_prot_hook(sk); spin_unlock(&po->bind_lock); } break; } } rcu_read_unlock(); return NOTIFY_DONE; } static int packet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) { struct sock *sk = sock->sk; switch (cmd) { case SIOCOUTQ: { int amount = sk_wmem_alloc_get(sk); return put_user(amount, (int __user *)arg); } case SIOCINQ: { struct sk_buff *skb; int amount = 0; spin_lock_bh(&sk->sk_receive_queue.lock); skb = skb_peek(&sk->sk_receive_queue); if (skb) amount = skb->len; spin_unlock_bh(&sk->sk_receive_queue.lock); return put_user(amount, (int __user *)arg); } #ifdef CONFIG_INET case SIOCADDRT: case SIOCDELRT: case SIOCDARP: case SIOCGARP: case SIOCSARP: case SIOCGIFADDR: case SIOCSIFADDR: case SIOCGIFBRDADDR: case SIOCSIFBRDADDR: case SIOCGIFNETMASK: case SIOCSIFNETMASK: case SIOCGIFDSTADDR: case SIOCSIFDSTADDR: case SIOCSIFFLAGS: return inet_dgram_ops.ioctl(sock, cmd, arg); #endif default: return -ENOIOCTLCMD; } return 0; } static __poll_t packet_poll(struct file *file, struct socket *sock, poll_table *wait) { struct sock *sk = sock->sk; struct packet_sock *po = pkt_sk(sk); __poll_t mask = datagram_poll(file, sock, wait); spin_lock_bh(&sk->sk_receive_queue.lock); if (po->rx_ring.pg_vec) { if (!packet_previous_rx_frame(po, &po->rx_ring, TP_STATUS_KERNEL)) mask |= EPOLLIN | EPOLLRDNORM; } packet_rcv_try_clear_pressure(po); spin_unlock_bh(&sk->sk_receive_queue.lock); spin_lock_bh(&sk->sk_write_queue.lock); if (po->tx_ring.pg_vec) { if (packet_current_frame(po, &po->tx_ring, TP_STATUS_AVAILABLE)) mask |= EPOLLOUT | EPOLLWRNORM; } spin_unlock_bh(&sk->sk_write_queue.lock); return mask; } /* Dirty? Well, I still did not learn better way to account * for user mmaps. 
*/ static void packet_mm_open(struct vm_area_struct *vma) { struct file *file = vma->vm_file; struct socket *sock = file->private_data; struct sock *sk = sock->sk; if (sk) atomic_long_inc(&pkt_sk(sk)->mapped); } static void packet_mm_close(struct vm_area_struct *vma) { struct file *file = vma->vm_file; struct socket *sock = file->private_data; struct sock *sk = sock->sk; if (sk) atomic_long_dec(&pkt_sk(sk)->mapped); } static const struct vm_operations_struct packet_mmap_ops = { .open = packet_mm_open, .close = packet_mm_close, }; static void free_pg_vec(struct pgv *pg_vec, unsigned int order, unsigned int len) { int i; for (i = 0; i < len; i++) { if (likely(pg_vec[i].buffer)) { if (is_vmalloc_addr(pg_vec[i].buffer)) vfree(pg_vec[i].buffer); else free_pages((unsigned long)pg_vec[i].buffer, order); pg_vec[i].buffer = NULL; } } kfree(pg_vec); } static char *alloc_one_pg_vec_page(unsigned long order) { char *buffer; gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY; buffer = (char *) __get_free_pages(gfp_flags, order); if (buffer) return buffer; /* __get_free_pages failed, fall back to vmalloc */ buffer = vzalloc(array_size((1 << order), PAGE_SIZE)); if (buffer) return buffer; /* vmalloc failed, lets dig into swap here */ gfp_flags &= ~__GFP_NORETRY; buffer = (char *) __get_free_pages(gfp_flags, order); if (buffer) return buffer; /* complete and utter failure */ return NULL; } static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order) { unsigned int block_nr = req->tp_block_nr; struct pgv *pg_vec; int i; pg_vec = kcalloc(block_nr, sizeof(struct pgv), GFP_KERNEL | __GFP_NOWARN); if (unlikely(!pg_vec)) goto out; for (i = 0; i < block_nr; i++) { pg_vec[i].buffer = alloc_one_pg_vec_page(order); if (unlikely(!pg_vec[i].buffer)) goto out_free_pgvec; } out: return pg_vec; out_free_pgvec: free_pg_vec(pg_vec, order, block_nr); pg_vec = NULL; goto out; } static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, int 
closing, int tx_ring) { struct pgv *pg_vec = NULL; struct packet_sock *po = pkt_sk(sk); unsigned long *rx_owner_map = NULL; int was_running, order = 0; struct packet_ring_buffer *rb; struct sk_buff_head *rb_queue; __be16 num; int err; /* Added to avoid minimal code churn */ struct tpacket_req *req = &req_u->req; rb = tx_ring ? &po->tx_ring : &po->rx_ring; rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue; err = -EBUSY; if (!closing) { if (atomic_long_read(&po->mapped)) goto out; if (packet_read_pending(rb)) goto out; } if (req->tp_block_nr) { unsigned int min_frame_size; /* Sanity tests and some calculations */ err = -EBUSY; if (unlikely(rb->pg_vec)) goto out; switch (po->tp_version) { case TPACKET_V1: po->tp_hdrlen = TPACKET_HDRLEN; break; case TPACKET_V2: po->tp_hdrlen = TPACKET2_HDRLEN; break; case TPACKET_V3: po->tp_hdrlen = TPACKET3_HDRLEN; break; } err = -EINVAL; if (unlikely((int)req->tp_block_size <= 0)) goto out; if (unlikely(!PAGE_ALIGNED(req->tp_block_size))) goto out; min_frame_size = po->tp_hdrlen + po->tp_reserve; if (po->tp_version >= TPACKET_V3 && req->tp_block_size < BLK_PLUS_PRIV((u64)req_u->req3.tp_sizeof_priv) + min_frame_size) goto out; if (unlikely(req->tp_frame_size < min_frame_size)) goto out; if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1))) goto out; rb->frames_per_block = req->tp_block_size / req->tp_frame_size; if (unlikely(rb->frames_per_block == 0)) goto out; if (unlikely(rb->frames_per_block > UINT_MAX / req->tp_block_nr)) goto out; if (unlikely((rb->frames_per_block * req->tp_block_nr) != req->tp_frame_nr)) goto out; err = -ENOMEM; order = get_order(req->tp_block_size); pg_vec = alloc_pg_vec(req, order); if (unlikely(!pg_vec)) goto out; switch (po->tp_version) { case TPACKET_V3: /* Block transmit is not supported yet */ if (!tx_ring) { init_prb_bdqc(po, rb, pg_vec, req_u); } else { struct tpacket_req3 *req3 = &req_u->req3; if (req3->tp_retire_blk_tov || req3->tp_sizeof_priv || req3->tp_feature_req_word) { err 
= -EINVAL; goto out_free_pg_vec; } } break; default: if (!tx_ring) { rx_owner_map = bitmap_alloc(req->tp_frame_nr, GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO); if (!rx_owner_map) goto out_free_pg_vec; } break; } } /* Done */ else { err = -EINVAL; if (unlikely(req->tp_frame_nr)) goto out; } /* Detach socket from network */ spin_lock(&po->bind_lock); was_running = packet_sock_flag(po, PACKET_SOCK_RUNNING); num = po->num; if (was_running) { WRITE_ONCE(po->num, 0); __unregister_prot_hook(sk, false); } spin_unlock(&po->bind_lock); synchronize_net(); err = -EBUSY; mutex_lock(&po->pg_vec_lock); if (closing || atomic_long_read(&po->mapped) == 0) { err = 0; spin_lock_bh(&rb_queue->lock); swap(rb->pg_vec, pg_vec); if (po->tp_version <= TPACKET_V2) swap(rb->rx_owner_map, rx_owner_map); rb->frame_max = (req->tp_frame_nr - 1); rb->head = 0; rb->frame_size = req->tp_frame_size; spin_unlock_bh(&rb_queue->lock); swap(rb->pg_vec_order, order); swap(rb->pg_vec_len, req->tp_block_nr); rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE; po->prot_hook.func = (po->rx_ring.pg_vec) ? 
tpacket_rcv : packet_rcv; skb_queue_purge(rb_queue); if (atomic_long_read(&po->mapped)) pr_err("packet_mmap: vma is busy: %ld\n", atomic_long_read(&po->mapped)); } mutex_unlock(&po->pg_vec_lock); spin_lock(&po->bind_lock); if (was_running) { WRITE_ONCE(po->num, num); register_prot_hook(sk); } spin_unlock(&po->bind_lock); if (pg_vec && (po->tp_version > TPACKET_V2)) { /* Because we don't support block-based V3 on tx-ring */ if (!tx_ring) prb_shutdown_retire_blk_timer(po, rb_queue); } out_free_pg_vec: if (pg_vec) { bitmap_free(rx_owner_map); free_pg_vec(pg_vec, order, req->tp_block_nr); } out: return err; } static int packet_mmap(struct file *file, struct socket *sock, struct vm_area_struct *vma) { struct sock *sk = sock->sk; struct packet_sock *po = pkt_sk(sk); unsigned long size, expected_size; struct packet_ring_buffer *rb; unsigned long start; int err = -EINVAL; int i; if (vma->vm_pgoff) return -EINVAL; mutex_lock(&po->pg_vec_lock); expected_size = 0; for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) { if (rb->pg_vec) { expected_size += rb->pg_vec_len * rb->pg_vec_pages * PAGE_SIZE; } } if (expected_size == 0) goto out; size = vma->vm_end - vma->vm_start; if (size != expected_size) goto out; start = vma->vm_start; for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) { if (rb->pg_vec == NULL) continue; for (i = 0; i < rb->pg_vec_len; i++) { struct page *page; void *kaddr = rb->pg_vec[i].buffer; int pg_num; for (pg_num = 0; pg_num < rb->pg_vec_pages; pg_num++) { page = pgv_to_page(kaddr); err = vm_insert_page(vma, start, page); if (unlikely(err)) goto out; start += PAGE_SIZE; kaddr += PAGE_SIZE; } } } atomic_long_inc(&po->mapped); vma->vm_ops = &packet_mmap_ops; err = 0; out: mutex_unlock(&po->pg_vec_lock); return err; } static const struct proto_ops packet_ops_spkt = { .family = PF_PACKET, .owner = THIS_MODULE, .release = packet_release, .bind = packet_bind_spkt, .connect = sock_no_connect, .socketpair = sock_no_socketpair, .accept = sock_no_accept, .getname = 
packet_getname_spkt, .poll = datagram_poll, .ioctl = packet_ioctl, .gettstamp = sock_gettstamp, .listen = sock_no_listen, .shutdown = sock_no_shutdown, .sendmsg = packet_sendmsg_spkt, .recvmsg = packet_recvmsg, .mmap = sock_no_mmap, }; static const struct proto_ops packet_ops = { .family = PF_PACKET, .owner = THIS_MODULE, .release = packet_release, .bind = packet_bind, .connect = sock_no_connect, .socketpair = sock_no_socketpair, .accept = sock_no_accept, .getname = packet_getname, .poll = packet_poll, .ioctl = packet_ioctl, .gettstamp = sock_gettstamp, .listen = sock_no_listen, .shutdown = sock_no_shutdown, .setsockopt = packet_setsockopt, .getsockopt = packet_getsockopt, .sendmsg = packet_sendmsg, .recvmsg = packet_recvmsg, .mmap = packet_mmap, }; static const struct net_proto_family packet_family_ops = { .family = PF_PACKET, .create = packet_create, .owner = THIS_MODULE, }; static struct notifier_block packet_netdev_notifier = { .notifier_call = packet_notifier, }; #ifdef CONFIG_PROC_FS static void *packet_seq_start(struct seq_file *seq, loff_t *pos) __acquires(RCU) { struct net *net = seq_file_net(seq); rcu_read_lock(); return seq_hlist_start_head_rcu(&net->packet.sklist, *pos); } static void *packet_seq_next(struct seq_file *seq, void *v, loff_t *pos) { struct net *net = seq_file_net(seq); return seq_hlist_next_rcu(v, &net->packet.sklist, pos); } static void packet_seq_stop(struct seq_file *seq, void *v) __releases(RCU) { rcu_read_unlock(); } static int packet_seq_show(struct seq_file *seq, void *v) { if (v == SEQ_START_TOKEN) seq_printf(seq, "%*sRefCnt Type Proto Iface R Rmem User Inode\n", IS_ENABLED(CONFIG_64BIT) ? 
-17 : -9, "sk"); else { struct sock *s = sk_entry(v); const struct packet_sock *po = pkt_sk(s); seq_printf(seq, "%pK %-6d %-4d %04x %-5d %1d %-6u %-6u %-6lu\n", s, refcount_read(&s->sk_refcnt), s->sk_type, ntohs(READ_ONCE(po->num)), READ_ONCE(po->ifindex), packet_sock_flag(po, PACKET_SOCK_RUNNING), atomic_read(&s->sk_rmem_alloc), from_kuid_munged(seq_user_ns(seq), sock_i_uid(s)), sock_i_ino(s)); } return 0; } static const struct seq_operations packet_seq_ops = { .start = packet_seq_start, .next = packet_seq_next, .stop = packet_seq_stop, .show = packet_seq_show, }; #endif static int __net_init packet_net_init(struct net *net) { mutex_init(&net->packet.sklist_lock); INIT_HLIST_HEAD(&net->packet.sklist); #ifdef CONFIG_PROC_FS if (!proc_create_net("packet", 0, net->proc_net, &packet_seq_ops, sizeof(struct seq_net_private))) return -ENOMEM; #endif /* CONFIG_PROC_FS */ return 0; } static void __net_exit packet_net_exit(struct net *net) { remove_proc_entry("packet", net->proc_net); WARN_ON_ONCE(!hlist_empty(&net->packet.sklist)); } static struct pernet_operations packet_net_ops = { .init = packet_net_init, .exit = packet_net_exit, }; static void __exit packet_exit(void) { sock_unregister(PF_PACKET); proto_unregister(&packet_proto); unregister_netdevice_notifier(&packet_netdev_notifier); unregister_pernet_subsys(&packet_net_ops); } static int __init packet_init(void) { int rc; rc = register_pernet_subsys(&packet_net_ops); if (rc) goto out; rc = register_netdevice_notifier(&packet_netdev_notifier); if (rc) goto out_pernet; rc = proto_register(&packet_proto, 0); if (rc) goto out_notifier; rc = sock_register(&packet_family_ops); if (rc) goto out_proto; return 0; out_proto: proto_unregister(&packet_proto); out_notifier: unregister_netdevice_notifier(&packet_netdev_notifier); out_pernet: unregister_pernet_subsys(&packet_net_ops); out: return rc; } module_init(packet_init); module_exit(packet_exit); MODULE_DESCRIPTION("Packet socket support (AF_PACKET)"); MODULE_LICENSE("GPL"); 
MODULE_ALIAS_NETPROTO(PF_PACKET); |
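The geometry checks that packet_set_ring() applies to a tpacket_req before it allocates a ring can be sketched in userspace roughly as follows. This is a minimal illustration and not kernel code: `struct ring_geometry`, `ring_geometry_ok()`, `SKETCH_PAGE_SIZE` and `SKETCH_FRAME_ALIGN` are hypothetical names, the 4096-byte page size is an assumption, and the TPACKET_V3 private-area check (BLK_PLUS_PRIV) is left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for this sketch only. The kernel uses PAGE_SIZE and
 * TPACKET_ALIGNMENT (16 in the uapi header) from its own headers. */
#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_FRAME_ALIGN 16u

/* Hypothetical mirror of the struct tpacket_req fields that are checked. */
struct ring_geometry {
	unsigned int block_size;
	unsigned int block_nr;
	unsigned int frame_size;
	unsigned int frame_nr;
};

/*
 * Return true if the geometry would pass the same sanity tests that
 * packet_set_ring() runs for a non-empty ring. min_frame_size stands in
 * for po->tp_hdrlen + po->tp_reserve.
 */
static bool ring_geometry_ok(const struct ring_geometry *g,
			     unsigned int min_frame_size)
{
	unsigned int frames_per_block;

	assert(g != NULL);

	/* block_nr == 0 means "tear the ring down" in the kernel; not a ring here */
	if (g->block_nr == 0)
		return false;
	/* equivalent to the kernel's (int)tp_block_size <= 0 test */
	if (g->block_size == 0 || g->block_size > (unsigned int)INT_MAX)
		return false;
	/* blocks must be page aligned */
	if (g->block_size % SKETCH_PAGE_SIZE)
		return false;
	/* a frame must hold at least the header plus the reserved area */
	if (g->frame_size == 0 || g->frame_size < min_frame_size)
		return false;
	/* frames must be aligned to the packet alignment */
	if (g->frame_size & (SKETCH_FRAME_ALIGN - 1))
		return false;

	frames_per_block = g->block_size / g->frame_size;
	if (frames_per_block == 0)
		return false;
	/* guard the multiplication below against overflow */
	if (frames_per_block > UINT_MAX / g->block_nr)
		return false;
	/* the caller-supplied frame count must match exactly */
	return frames_per_block * g->block_nr == g->frame_nr;
}
```

A caller could run such a check before issuing setsockopt(PACKET_RX_RING) to catch the most common cause of -EINVAL, a tp_frame_nr that does not equal frames per block times tp_block_nr.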
28 3 28 28 | // SPDX-License-Identifier: GPL-2.0 /* * buffered writeback throttling. loosely based on CoDel. We can't drop * packets for IO scheduling, so the logic is something like this: * * - Monitor latencies in a defined window of time. 
* - If the minimum latency in the above window exceeds some target, increment * scaling step and scale down queue depth by a factor of 2x. The monitoring * window is then shrunk to 100 / sqrt(scaling step + 1). * - For any window where we don't have solid data on what the latencies * look like, retain status quo. * - If latencies look good, decrement scaling step. * - If we're only doing writes, allow the scaling step to go negative. This * will temporarily boost write performance, snapping back to a stable * scaling step of 0 if reads show up or the heavy writers finish. Unlike * positive scaling steps where we shrink the monitoring window, a negative * scaling step retains the default step==0 window size. * * Copyright (C) 2016 Jens Axboe * */ #include <linux/kernel.h> #include <linux/blk_types.h> #include <linux/slab.h> #include <linux/backing-dev.h> #include <linux/swap.h> #include "blk-stat.h" #include "blk-wbt.h" #include "blk-rq-qos.h" #include "elevator.h" #include "blk.h" #define CREATE_TRACE_POINTS #include <trace/events/wbt.h> enum wbt_flags { WBT_TRACKED = 1, /* write, tracked for throttling */ WBT_READ = 2, /* read */ WBT_SWAP = 4, /* write, from swap_writepage() */ WBT_DISCARD = 8, /* discard */ WBT_NR_BITS = 4, /* number of bits */ }; enum { WBT_RWQ_BG = 0, WBT_RWQ_SWAP, WBT_RWQ_DISCARD, WBT_NUM_RWQ, }; /* * If the current state is WBT_STATE_ON/OFF_DEFAULT, it can be converted to any other * state; if the current state is WBT_STATE_ON/OFF_MANUAL, it can only be converted * to WBT_STATE_OFF/ON_MANUAL. 
 */
enum {
	WBT_STATE_ON_DEFAULT	= 1,	/* on by default */
	WBT_STATE_ON_MANUAL	= 2,	/* on manually by sysfs */
	WBT_STATE_OFF_DEFAULT	= 3,	/* off by default */
	WBT_STATE_OFF_MANUAL	= 4,	/* off manually by sysfs */
};

struct rq_wb {
	/*
	 * Settings that govern how we throttle
	 */
	unsigned int wb_background;		/* background writeback */
	unsigned int wb_normal;			/* normal writeback */

	short enable_state;			/* WBT_STATE_* */

	/*
	 * Number of consecutive periods where we don't have enough
	 * information to make a firm scale up/down decision.
	 */
	unsigned int unknown_cnt;

	u64 win_nsec;				/* default window size */
	u64 cur_win_nsec;			/* current window size */

	struct blk_stat_callback *cb;

	u64 sync_issue;
	void *sync_cookie;

	unsigned long last_issue;		/* last non-throttled issue */
	unsigned long last_comp;		/* last non-throttled comp */
	unsigned long min_lat_nsec;
	struct rq_qos rqos;
	struct rq_wait rq_wait[WBT_NUM_RWQ];
	struct rq_depth rq_depth;
};

static inline struct rq_wb *RQWB(struct rq_qos *rqos)
{
	return container_of(rqos, struct rq_wb, rqos);
}

static inline void wbt_clear_state(struct request *rq)
{
	rq->wbt_flags = 0;
}

static inline enum wbt_flags wbt_flags(struct request *rq)
{
	return rq->wbt_flags;
}

static inline bool wbt_is_tracked(struct request *rq)
{
	return rq->wbt_flags & WBT_TRACKED;
}

static inline bool wbt_is_read(struct request *rq)
{
	return rq->wbt_flags & WBT_READ;
}

enum {
	/*
	 * Default setting, we'll scale up (to 75% of QD max) or down (min 1)
	 * from here depending on device stats
	 */
	RWB_DEF_DEPTH	= 16,

	/*
	 * 100msec window
	 */
	RWB_WINDOW_NSEC		= 100 * 1000 * 1000ULL,

	/*
	 * Disregard stats, if we don't meet this minimum
	 */
	RWB_MIN_WRITE_SAMPLES	= 3,

	/*
	 * If we have this number of consecutive windows without enough
	 * information to scale up or down, slowly return to center state
	 * (step == 0).
	 */
	RWB_UNKNOWN_BUMP	= 5,
};

static inline bool rwb_enabled(struct rq_wb *rwb)
{
	return rwb && rwb->enable_state != WBT_STATE_OFF_DEFAULT &&
		      rwb->enable_state != WBT_STATE_OFF_MANUAL;
}

static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
{
	if (rwb_enabled(rwb)) {
		const unsigned long cur = jiffies;

		if (cur != *var)
			*var = cur;
	}
}

/*
 * If a task was rate throttled in balance_dirty_pages() within the last
 * second or so, use that to indicate a higher cleaning rate.
 */
static bool wb_recent_wait(struct rq_wb *rwb)
{
	struct backing_dev_info *bdi = rwb->rqos.disk->bdi;

	return time_before(jiffies, bdi->last_bdp_sleep + HZ);
}

static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb,
					  enum wbt_flags wb_acct)
{
	if (wb_acct & WBT_SWAP)
		return &rwb->rq_wait[WBT_RWQ_SWAP];
	else if (wb_acct & WBT_DISCARD)
		return &rwb->rq_wait[WBT_RWQ_DISCARD];

	return &rwb->rq_wait[WBT_RWQ_BG];
}

static void rwb_wake_all(struct rq_wb *rwb)
{
	int i;

	for (i = 0; i < WBT_NUM_RWQ; i++) {
		struct rq_wait *rqw = &rwb->rq_wait[i];

		if (wq_has_sleeper(&rqw->wait))
			wake_up_all(&rqw->wait);
	}
}

static void wbt_rqw_done(struct rq_wb *rwb, struct rq_wait *rqw,
			 enum wbt_flags wb_acct)
{
	int inflight, limit;

	inflight = atomic_dec_return(&rqw->inflight);

	/*
	 * For discards, our limit is always the background. For writes, if
	 * the device does write back caching, drop further down before we
	 * wake people up.
	 */
	if (wb_acct & WBT_DISCARD)
		limit = rwb->wb_background;
	else if (blk_queue_write_cache(rwb->rqos.disk->queue) &&
		 !wb_recent_wait(rwb))
		limit = 0;
	else
		limit = rwb->wb_normal;

	/*
	 * Don't wake anyone up if we are above the normal limit.
	 */
	if (inflight && inflight >= limit)
		return;

	if (wq_has_sleeper(&rqw->wait)) {
		int diff = limit - inflight;

		if (!inflight || diff >= rwb->wb_background / 2)
			wake_up_all(&rqw->wait);
	}
}

static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
{
	struct rq_wb *rwb = RQWB(rqos);
	struct rq_wait *rqw;

	if (!(wb_acct & WBT_TRACKED))
		return;

	rqw = get_rq_wait(rwb, wb_acct);
	wbt_rqw_done(rwb, rqw, wb_acct);
}

/*
 * Called on completion of a request. Note that it's also called when
 * a request is merged, when the request gets freed.
 */
static void wbt_done(struct rq_qos *rqos, struct request *rq)
{
	struct rq_wb *rwb = RQWB(rqos);

	if (!wbt_is_tracked(rq)) {
		if (rwb->sync_cookie == rq) {
			rwb->sync_issue = 0;
			rwb->sync_cookie = NULL;
		}

		if (wbt_is_read(rq))
			wb_timestamp(rwb, &rwb->last_comp);
	} else {
		WARN_ON_ONCE(rq == rwb->sync_cookie);
		__wbt_done(rqos, wbt_flags(rq));
	}
	wbt_clear_state(rq);
}

static inline bool stat_sample_valid(struct blk_rq_stat *stat)
{
	/*
	 * We need at least one read sample, and a minimum of
	 * RWB_MIN_WRITE_SAMPLES. We require some write samples to know
	 * that it's writes impacting us, and not just some sole read on
	 * a device that is in a lower power state.
	 */
	return (stat[READ].nr_samples >= 1 &&
		stat[WRITE].nr_samples >= RWB_MIN_WRITE_SAMPLES);
}

static u64 rwb_sync_issue_lat(struct rq_wb *rwb)
{
	u64 issue = READ_ONCE(rwb->sync_issue);

	if (!issue || !rwb->sync_cookie)
		return 0;

	return blk_time_get_ns() - issue;
}

static inline unsigned int wbt_inflight(struct rq_wb *rwb)
{
	unsigned int i, ret = 0;

	for (i = 0; i < WBT_NUM_RWQ; i++)
		ret += atomic_read(&rwb->rq_wait[i].inflight);

	return ret;
}

enum {
	LAT_OK = 1,
	LAT_UNKNOWN,
	LAT_UNKNOWN_WRITES,
	LAT_EXCEEDED,
};

static int latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
{
	struct backing_dev_info *bdi = rwb->rqos.disk->bdi;
	struct rq_depth *rqd = &rwb->rq_depth;
	u64 thislat;

	/*
	 * If our stored sync issue exceeds the window size, or it
	 * exceeds our min target AND we haven't logged any entries,
	 * flag the latency as exceeded. wbt works off completion latencies,
	 * but for a flooded device, a single sync IO can take a long time
	 * to complete after being issued. If this time exceeds our
	 * monitoring window AND we didn't see any other completions in that
	 * window, then count that sync IO as a violation of the latency.
	 */
	thislat = rwb_sync_issue_lat(rwb);
	if (thislat > rwb->cur_win_nsec ||
	    (thislat > rwb->min_lat_nsec && !stat[READ].nr_samples)) {
		trace_wbt_lat(bdi, thislat);
		return LAT_EXCEEDED;
	}

	/*
	 * No read/write mix, if stat isn't valid
	 */
	if (!stat_sample_valid(stat)) {
		/*
		 * If we had writes in this stat window and the window is
		 * current, we're only doing writes. If a task recently
		 * waited or still has writes in flight, consider us doing
		 * just writes as well.
		 */
		if (stat[WRITE].nr_samples || wb_recent_wait(rwb) ||
		    wbt_inflight(rwb))
			return LAT_UNKNOWN_WRITES;
		return LAT_UNKNOWN;
	}

	/*
	 * If the 'min' latency exceeds our target, step down.
	 */
	if (stat[READ].min > rwb->min_lat_nsec) {
		trace_wbt_lat(bdi, stat[READ].min);
		trace_wbt_stat(bdi, stat);
		return LAT_EXCEEDED;
	}

	if (rqd->scale_step)
		trace_wbt_stat(bdi, stat);

	return LAT_OK;
}

static void rwb_trace_step(struct rq_wb *rwb, const char *msg)
{
	struct backing_dev_info *bdi = rwb->rqos.disk->bdi;
	struct rq_depth *rqd = &rwb->rq_depth;

	trace_wbt_step(bdi, msg, rqd->scale_step, rwb->cur_win_nsec,
			rwb->wb_background, rwb->wb_normal, rqd->max_depth);
}

static void calc_wb_limits(struct rq_wb *rwb)
{
	if (rwb->min_lat_nsec == 0) {
		rwb->wb_normal = rwb->wb_background = 0;
	} else if (rwb->rq_depth.max_depth <= 2) {
		rwb->wb_normal = rwb->rq_depth.max_depth;
		rwb->wb_background = 1;
	} else {
		rwb->wb_normal = (rwb->rq_depth.max_depth + 1) / 2;
		rwb->wb_background = (rwb->rq_depth.max_depth + 3) / 4;
	}
}

static void scale_up(struct rq_wb *rwb)
{
	if (!rq_depth_scale_up(&rwb->rq_depth))
		return;
	calc_wb_limits(rwb);
	rwb->unknown_cnt = 0;
	rwb_wake_all(rwb);
	rwb_trace_step(rwb, tracepoint_string("scale up"));
}

static void scale_down(struct rq_wb *rwb, bool hard_throttle)
{
	if (!rq_depth_scale_down(&rwb->rq_depth, hard_throttle))
		return;
	calc_wb_limits(rwb);
	rwb->unknown_cnt = 0;
	rwb_trace_step(rwb, tracepoint_string("scale down"));
}

static void rwb_arm_timer(struct rq_wb *rwb)
{
	struct rq_depth *rqd = &rwb->rq_depth;

	if (rqd->scale_step > 0) {
		/*
		 * We should speed this up, using some variant of a fast
		 * integer inverse square root calculation. Since we only do
		 * this for every window expiration, it's not a huge deal,
		 * though.
		 */
		rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4,
					int_sqrt((rqd->scale_step + 1) << 8));
	} else {
		/*
		 * For step < 0, we don't want to increase/decrease the
		 * window size.
		 */
		rwb->cur_win_nsec = rwb->win_nsec;
	}

	blk_stat_activate_nsecs(rwb->cb, rwb->cur_win_nsec);
}

static void wb_timer_fn(struct blk_stat_callback *cb)
{
	struct rq_wb *rwb = cb->data;
	struct rq_depth *rqd = &rwb->rq_depth;
	unsigned int inflight = wbt_inflight(rwb);
	int status;

	if (!rwb->rqos.disk)
		return;

	status = latency_exceeded(rwb, cb->stat);

	trace_wbt_timer(rwb->rqos.disk->bdi, status, rqd->scale_step, inflight);

	/*
	 * If we exceeded the latency target, step down. If we did not,
	 * step one level up. If we don't know enough to say either exceeded
	 * or ok, then don't do anything.
	 */
	switch (status) {
	case LAT_EXCEEDED:
		scale_down(rwb, true);
		break;
	case LAT_OK:
		scale_up(rwb);
		break;
	case LAT_UNKNOWN_WRITES:
		/*
		 * We don't have a valid read/write sample, but we do have
		 * writes going on. Allow step to go negative, to increase
		 * write performance.
		 */
		scale_up(rwb);
		break;
	case LAT_UNKNOWN:
		if (++rwb->unknown_cnt < RWB_UNKNOWN_BUMP)
			break;
		/*
		 * We get here when the depth was previously scaled, and we
		 * currently don't have a valid read/write sample. For that
		 * case, slowly return to center state (step == 0).
		 */
		if (rqd->scale_step > 0)
			scale_up(rwb);
		else if (rqd->scale_step < 0)
			scale_down(rwb, false);
		break;
	default:
		break;
	}

	/*
	 * Re-arm timer, if we have IO in flight
	 */
	if (rqd->scale_step || inflight)
		rwb_arm_timer(rwb);
}

static void wbt_update_limits(struct rq_wb *rwb)
{
	struct rq_depth *rqd = &rwb->rq_depth;

	rqd->scale_step = 0;
	rqd->scaled_max = false;

	rq_depth_calc_max_depth(rqd);
	calc_wb_limits(rwb);

	rwb_wake_all(rwb);
}

bool wbt_disabled(struct request_queue *q)
{
	struct rq_qos *rqos = wbt_rq_qos(q);

	return !rqos || !rwb_enabled(RQWB(rqos));
}

u64 wbt_get_min_lat(struct request_queue *q)
{
	struct rq_qos *rqos = wbt_rq_qos(q);

	if (!rqos)
		return 0;
	return RQWB(rqos)->min_lat_nsec;
}

void wbt_set_min_lat(struct request_queue *q, u64 val)
{
	struct rq_qos *rqos = wbt_rq_qos(q);

	if (!rqos)
		return;

	RQWB(rqos)->min_lat_nsec = val;
	if (val)
		RQWB(rqos)->enable_state = WBT_STATE_ON_MANUAL;
	else
		RQWB(rqos)->enable_state = WBT_STATE_OFF_MANUAL;

	wbt_update_limits(RQWB(rqos));
}

static bool close_io(struct rq_wb *rwb)
{
	const unsigned long now = jiffies;

	return time_before(now, rwb->last_issue + HZ / 10) ||
		time_before(now, rwb->last_comp + HZ / 10);
}

#define REQ_HIPRIO	(REQ_SYNC | REQ_META | REQ_PRIO | REQ_SWAP)

static inline unsigned int get_limit(struct rq_wb *rwb, blk_opf_t opf)
{
	unsigned int limit;

	if ((opf & REQ_OP_MASK) == REQ_OP_DISCARD)
		return rwb->wb_background;

	/*
	 * At this point we know it's a buffered write. If this is
	 * swap trying to free memory, or REQ_SYNC is set, then
	 * it's WB_SYNC_ALL writeback, and we'll use the max limit for
	 * that. If the write is marked as a background write, then use
	 * the idle limit, or go to normal if we haven't had competing
	 * IO for a bit.
	 */
	if ((opf & REQ_HIPRIO) || wb_recent_wait(rwb))
		limit = rwb->rq_depth.max_depth;
	else if ((opf & REQ_BACKGROUND) || close_io(rwb)) {
		/*
		 * If less than 100ms since we completed unrelated IO,
		 * limit us to half the depth for background writeback.
		 */
		limit = rwb->wb_background;
	} else
		limit = rwb->wb_normal;

	return limit;
}

struct wbt_wait_data {
	struct rq_wb *rwb;
	enum wbt_flags wb_acct;
	blk_opf_t opf;
};

static bool wbt_inflight_cb(struct rq_wait *rqw, void *private_data)
{
	struct wbt_wait_data *data = private_data;

	return rq_wait_inc_below(rqw, get_limit(data->rwb, data->opf));
}

static void wbt_cleanup_cb(struct rq_wait *rqw, void *private_data)
{
	struct wbt_wait_data *data = private_data;

	wbt_rqw_done(data->rwb, rqw, data->wb_acct);
}

/*
 * Block if we will exceed our limit, or if we are currently waiting for
 * the timer to kick off queuing again.
 */
static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
		       blk_opf_t opf)
{
	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
	struct wbt_wait_data data = {
		.rwb = rwb,
		.wb_acct = wb_acct,
		.opf = opf,
	};

	rq_qos_wait(rqw, &data, wbt_inflight_cb, wbt_cleanup_cb);
}

static inline bool wbt_should_throttle(struct bio *bio)
{
	switch (bio_op(bio)) {
	case REQ_OP_WRITE:
		/*
		 * Don't throttle WRITE_ODIRECT
		 */
		if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
		    (REQ_SYNC | REQ_IDLE))
			return false;
		fallthrough;
	case REQ_OP_DISCARD:
		return true;
	default:
		return false;
	}
}

static enum wbt_flags bio_to_wbt_flags(struct rq_wb *rwb, struct bio *bio)
{
	enum wbt_flags flags = 0;

	if (!rwb_enabled(rwb))
		return 0;

	if (bio_op(bio) == REQ_OP_READ) {
		flags = WBT_READ;
	} else if (wbt_should_throttle(bio)) {
		if (bio->bi_opf & REQ_SWAP)
			flags |= WBT_SWAP;
		if (bio_op(bio) == REQ_OP_DISCARD)
			flags |= WBT_DISCARD;
		flags |= WBT_TRACKED;
	}
	return flags;
}

static void wbt_cleanup(struct rq_qos *rqos, struct bio *bio)
{
	struct rq_wb *rwb = RQWB(rqos);
	enum wbt_flags flags = bio_to_wbt_flags(rwb, bio);

	__wbt_done(rqos, flags);
}

/* May sleep, if we have exceeded the writeback limits.
 */
static void wbt_wait(struct rq_qos *rqos, struct bio *bio)
{
	struct rq_wb *rwb = RQWB(rqos);
	enum wbt_flags flags;

	flags = bio_to_wbt_flags(rwb, bio);
	if (!(flags & WBT_TRACKED)) {
		if (flags & WBT_READ)
			wb_timestamp(rwb, &rwb->last_issue);
		return;
	}

	__wbt_wait(rwb, flags, bio->bi_opf);

	if (!blk_stat_is_active(rwb->cb))
		rwb_arm_timer(rwb);
}

static void wbt_track(struct rq_qos *rqos, struct request *rq, struct bio *bio)
{
	struct rq_wb *rwb = RQWB(rqos);

	rq->wbt_flags |= bio_to_wbt_flags(rwb, bio);
}

static void wbt_issue(struct rq_qos *rqos, struct request *rq)
{
	struct rq_wb *rwb = RQWB(rqos);

	if (!rwb_enabled(rwb))
		return;

	/*
	 * Track sync issue, in case it takes a long time to complete. Allows us
	 * to react quicker, if a sync IO takes a long time to complete. Note
	 * that this is just a hint. The request can go away when it completes,
	 * so it's important we never dereference it. We only use the address to
	 * compare with, which is why we store the sync_issue time locally.
	 */
	if (wbt_is_read(rq) && !rwb->sync_issue) {
		rwb->sync_cookie = rq;
		rwb->sync_issue = rq->io_start_time_ns;
	}
}

static void wbt_requeue(struct rq_qos *rqos, struct request *rq)
{
	struct rq_wb *rwb = RQWB(rqos);

	if (!rwb_enabled(rwb))
		return;
	if (rq == rwb->sync_cookie) {
		rwb->sync_issue = 0;
		rwb->sync_cookie = NULL;
	}
}

/*
 * Enable wbt if defaults are configured that way
 */
void wbt_enable_default(struct gendisk *disk)
{
	struct request_queue *q = disk->queue;
	struct rq_qos *rqos;
	bool enable = IS_ENABLED(CONFIG_BLK_WBT_MQ);

	if (q->elevator &&
	    test_bit(ELEVATOR_FLAG_DISABLE_WBT, &q->elevator->flags))
		enable = false;

	/* Throttling already enabled? */
	rqos = wbt_rq_qos(q);
	if (rqos) {
		if (enable && RQWB(rqos)->enable_state == WBT_STATE_OFF_DEFAULT)
			RQWB(rqos)->enable_state = WBT_STATE_ON_DEFAULT;
		return;
	}

	/* Queue not registered? Maybe shutting down...
	 */
	if (!blk_queue_registered(q))
		return;

	if (queue_is_mq(q) && enable)
		wbt_init(disk);
}
EXPORT_SYMBOL_GPL(wbt_enable_default);

u64 wbt_default_latency_nsec(struct request_queue *q)
{
	/*
	 * We default to 2msec for non-rotational storage, and 75msec
	 * for rotational storage.
	 */
	if (blk_queue_nonrot(q))
		return 2000000ULL;
	else
		return 75000000ULL;
}

static int wbt_data_dir(const struct request *rq)
{
	const enum req_op op = req_op(rq);

	if (op == REQ_OP_READ)
		return READ;
	else if (op_is_write(op))
		return WRITE;

	/* don't account */
	return -1;
}

static void wbt_queue_depth_changed(struct rq_qos *rqos)
{
	RQWB(rqos)->rq_depth.queue_depth = blk_queue_depth(rqos->disk->queue);
	wbt_update_limits(RQWB(rqos));
}

static void wbt_exit(struct rq_qos *rqos)
{
	struct rq_wb *rwb = RQWB(rqos);

	blk_stat_remove_callback(rqos->disk->queue, rwb->cb);
	blk_stat_free_callback(rwb->cb);
	kfree(rwb);
}

/*
 * Disable wbt, if enabled by default.
 */
void wbt_disable_default(struct gendisk *disk)
{
	struct rq_qos *rqos = wbt_rq_qos(disk->queue);
	struct rq_wb *rwb;

	if (!rqos)
		return;
	rwb = RQWB(rqos);
	if (rwb->enable_state == WBT_STATE_ON_DEFAULT) {
		blk_stat_deactivate(rwb->cb);
		rwb->enable_state = WBT_STATE_OFF_DEFAULT;
	}
}
EXPORT_SYMBOL_GPL(wbt_disable_default);

#ifdef CONFIG_BLK_DEBUG_FS
static int wbt_curr_win_nsec_show(void *data, struct seq_file *m)
{
	struct rq_qos *rqos = data;
	struct rq_wb *rwb = RQWB(rqos);

	seq_printf(m, "%llu\n", rwb->cur_win_nsec);
	return 0;
}

static int wbt_enabled_show(void *data, struct seq_file *m)
{
	struct rq_qos *rqos = data;
	struct rq_wb *rwb = RQWB(rqos);

	seq_printf(m, "%d\n", rwb->enable_state);
	return 0;
}

static int wbt_id_show(void *data, struct seq_file *m)
{
	struct rq_qos *rqos = data;

	seq_printf(m, "%u\n", rqos->id);
	return 0;
}

static int wbt_inflight_show(void *data, struct seq_file *m)
{
	struct rq_qos *rqos = data;
	struct rq_wb *rwb = RQWB(rqos);
	int i;

	for (i = 0; i < WBT_NUM_RWQ; i++)
		seq_printf(m, "%d: inflight %d\n", i,
			   atomic_read(&rwb->rq_wait[i].inflight));
	return 0;
}

static int wbt_min_lat_nsec_show(void *data, struct seq_file *m)
{
	struct rq_qos *rqos = data;
	struct rq_wb *rwb = RQWB(rqos);

	seq_printf(m, "%lu\n", rwb->min_lat_nsec);
	return 0;
}

static int wbt_unknown_cnt_show(void *data, struct seq_file *m)
{
	struct rq_qos *rqos = data;
	struct rq_wb *rwb = RQWB(rqos);

	seq_printf(m, "%u\n", rwb->unknown_cnt);
	return 0;
}

static int wbt_normal_show(void *data, struct seq_file *m)
{
	struct rq_qos *rqos = data;
	struct rq_wb *rwb = RQWB(rqos);

	seq_printf(m, "%u\n", rwb->wb_normal);
	return 0;
}

static int wbt_background_show(void *data, struct seq_file *m)
{
	struct rq_qos *rqos = data;
	struct rq_wb *rwb = RQWB(rqos);

	seq_printf(m, "%u\n", rwb->wb_background);
	return 0;
}

static const struct blk_mq_debugfs_attr wbt_debugfs_attrs[] = {
	{"curr_win_nsec", 0400, wbt_curr_win_nsec_show},
	{"enabled", 0400, wbt_enabled_show},
	{"id", 0400, wbt_id_show},
	{"inflight", 0400, wbt_inflight_show},
	{"min_lat_nsec", 0400, wbt_min_lat_nsec_show},
	{"unknown_cnt", 0400, wbt_unknown_cnt_show},
	{"wb_normal", 0400, wbt_normal_show},
	{"wb_background", 0400, wbt_background_show},
	{},
};
#endif

static const struct rq_qos_ops wbt_rqos_ops = {
	.throttle = wbt_wait,
	.issue = wbt_issue,
	.track = wbt_track,
	.requeue = wbt_requeue,
	.done = wbt_done,
	.cleanup = wbt_cleanup,
	.queue_depth_changed = wbt_queue_depth_changed,
	.exit = wbt_exit,
#ifdef CONFIG_BLK_DEBUG_FS
	.debugfs_attrs = wbt_debugfs_attrs,
#endif
};

int wbt_init(struct gendisk *disk)
{
	struct request_queue *q = disk->queue;
	struct rq_wb *rwb;
	int i;
	int ret;

	rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
	if (!rwb)
		return -ENOMEM;

	rwb->cb = blk_stat_alloc_callback(wb_timer_fn, wbt_data_dir, 2, rwb);
	if (!rwb->cb) {
		kfree(rwb);
		return -ENOMEM;
	}

	for (i = 0; i < WBT_NUM_RWQ; i++)
		rq_wait_init(&rwb->rq_wait[i]);

	rwb->last_comp = rwb->last_issue = jiffies;
	rwb->win_nsec = RWB_WINDOW_NSEC;
	rwb->enable_state = WBT_STATE_ON_DEFAULT;
	rwb->rq_depth.default_depth = RWB_DEF_DEPTH;
	rwb->min_lat_nsec = wbt_default_latency_nsec(q);
	rwb->rq_depth.queue_depth = blk_queue_depth(q);
	wbt_update_limits(rwb);

	/*
	 * Assign rwb and add the stats callback.
	 */
	mutex_lock(&q->rq_qos_mutex);
	ret = rq_qos_add(&rwb->rqos, disk, RQ_QOS_WBT, &wbt_rqos_ops);
	mutex_unlock(&q->rq_qos_mutex);
	if (ret)
		goto err_free;

	blk_stat_add_callback(q, rwb->cb);

	return 0;

err_free:
	blk_stat_free_callback(rwb->cb);
	kfree(rwb);
	return ret;
}
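
The window and limit arithmetic in rwb_arm_timer() and calc_wb_limits() above is easy to misread, so here is a minimal userspace sketch of it, not kernel code. Every name prefixed with `sketch_` is hypothetical, and `sketch_int_sqrt()` is only an assumed stand-in for the kernel's int_sqrt() (a floor integer square root).

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's int_sqrt(): floor(sqrt(x)). */
static uint64_t sketch_int_sqrt(uint64_t x)
{
	uint64_t r = 0, bit = 1ULL << 62;

	/* Start from the largest power of four that does not exceed x. */
	while (bit > x)
		bit >>= 2;

	/* Classic digit-by-digit (base 4) square root. */
	while (bit) {
		if (x >= r + bit) {
			x -= r + bit;
			r = (r >> 1) + bit;
		} else {
			r >>= 1;
		}
		bit >>= 2;
	}
	return r;
}

/*
 * Same arithmetic as rwb_arm_timer(): a positive step shrinks the window to
 * roughly win / sqrt(step + 1), computed in fixed point (<< 4 over the
 * square root of a value shifted << 8). Step <= 0 keeps the default window.
 */
static uint64_t sketch_window_nsec(uint64_t win_nsec, int scale_step)
{
	if (scale_step > 0)
		return (win_nsec << 4) /
		       sketch_int_sqrt((uint64_t)(scale_step + 1) << 8);
	return win_nsec;
}

struct sketch_limits {
	unsigned int normal;		/* corresponds to wb_normal */
	unsigned int background;	/* corresponds to wb_background */
};

/* Same branches as calc_wb_limits(), taking max_depth as a plain argument. */
static struct sketch_limits sketch_wb_limits(unsigned int max_depth,
					     uint64_t min_lat_nsec)
{
	struct sketch_limits l = { 0, 0 };

	if (min_lat_nsec == 0)
		return l;		/* throttling is effectively off */

	if (max_depth <= 2) {
		l.normal = max_depth;
		l.background = 1;
	} else {
		l.normal = (max_depth + 1) / 2;		/* about half */
		l.background = (max_depth + 3) / 4;	/* about a quarter */
	}
	return l;
}
```

With the 100 msec default window, step 3 gives 50000000 nsec and step 1 gives 72727272 nsec, slightly above the ideal 70.7 msec because of the fixed-point rounding. A max_depth of 16 gives a normal limit of 8 and a background limit of 4.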
// SPDX-License-Identifier: GPL-2.0-only
/*
 * This is the linux wireless configuration interface.
 *
 * Copyright 2006-2010 Johannes Berg <johannes@sipsolutions.net>
 * Copyright 2013-2014 Intel Mobile Communications GmbH
 * Copyright 2015-2017 Intel Deutschland GmbH
 * Copyright (C) 2018-2025 Intel Corporation
 */

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/if.h>
#include <linux/module.h>
#include <linux/err.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/nl80211.h>
#include <linux/debugfs.h>
#include <linux/notifier.h>
#include <linux/device.h>
#include <linux/etherdevice.h>
#include <linux/rtnetlink.h>
#include <linux/sched.h>
#include <net/genetlink.h>
#include <net/cfg80211.h>
#include "nl80211.h"
#include "core.h"
#include "sysfs.h"
#include "debugfs.h"
#include "wext-compat.h"
#include "rdev-ops.h"

/* name for sysfs, %d is appended */
#define PHY_NAME "phy"

MODULE_AUTHOR("Johannes Berg");
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("wireless configuration support");
MODULE_ALIAS_GENL_FAMILY(NL80211_GENL_NAME);

/* RCU-protected (and RTNL for writers) */
LIST_HEAD(cfg80211_rdev_list);
int cfg80211_rdev_list_generation;

/* for debugfs */
static struct dentry *ieee80211_debugfs_dir;

/* for the cleanup, scan and event works */
struct workqueue_struct *cfg80211_wq;

static bool cfg80211_disable_40mhz_24ghz;
module_param(cfg80211_disable_40mhz_24ghz, bool, 0644);
MODULE_PARM_DESC(cfg80211_disable_40mhz_24ghz,
		 "Disable 40MHz support in the 2.4GHz band");

struct cfg80211_registered_device *cfg80211_rdev_by_wiphy_idx(int wiphy_idx)
{
	struct cfg80211_registered_device *result = NULL, *rdev;

	ASSERT_RTNL();

	for_each_rdev(rdev) {
		if (rdev->wiphy_idx == wiphy_idx) {
			result = rdev;
			break;
		}
	}

	return result;
}

int get_wiphy_idx(struct wiphy *wiphy)
{
	struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy);

	return rdev->wiphy_idx;
}

struct wiphy *wiphy_idx_to_wiphy(int wiphy_idx)
{
	struct cfg80211_registered_device *rdev;

	ASSERT_RTNL();

	rdev = cfg80211_rdev_by_wiphy_idx(wiphy_idx);
	if (!rdev)
		return NULL;
	return &rdev->wiphy;
}

static int cfg80211_dev_check_name(struct cfg80211_registered_device *rdev,
				   const char *newname)
{
	struct cfg80211_registered_device *rdev2;
	int wiphy_idx, taken = -1, digits;

	ASSERT_RTNL();

	if (strlen(newname) > NL80211_WIPHY_NAME_MAXLEN)
		return -EINVAL;

	/* prohibit calling the thing phy%d when %d is not its number */
	sscanf(newname, PHY_NAME "%d%n", &wiphy_idx, &taken);
	if (taken == strlen(newname) && wiphy_idx != rdev->wiphy_idx) {
		/* count number of places needed to print wiphy_idx */
		digits = 1;
		while (wiphy_idx /= 10)
			digits++;
		/*
		 * deny the name if it is phy<idx> where <idx> is printed
		 * without leading zeroes. taken == strlen(newname) here
		 */
		if (taken == strlen(PHY_NAME) + digits)
			return -EINVAL;
	}

	/* Ensure another device does not already have this name.
	 */
	for_each_rdev(rdev2)
		if (strcmp(newname, wiphy_name(&rdev2->wiphy)) == 0)
			return -EINVAL;

	return 0;
}

int cfg80211_dev_rename(struct cfg80211_registered_device *rdev,
			char *newname)
{
	int result;

	ASSERT_RTNL();
	lockdep_assert_wiphy(&rdev->wiphy);

	/* Ignore nop renames */
	if (strcmp(newname, wiphy_name(&rdev->wiphy)) == 0)
		return 0;

	result = cfg80211_dev_check_name(rdev, newname);
	if (result < 0)
		return result;

	result = device_rename(&rdev->wiphy.dev, newname);
	if (result)
		return result;

	debugfs_change_name(rdev->wiphy.debugfsdir, "%s", newname);

	nl80211_notify_wiphy(rdev, NL80211_CMD_NEW_WIPHY);

	return 0;
}

int cfg80211_switch_netns(struct cfg80211_registered_device *rdev,
			  struct net *net)
{
	struct wireless_dev *wdev;
	int err = 0;

	if (!(rdev->wiphy.flags & WIPHY_FLAG_NETNS_OK))
		return -EOPNOTSUPP;

	list_for_each_entry(wdev, &rdev->wiphy.wdev_list, list) {
		if (!wdev->netdev)
			continue;
		wdev->netdev->netns_immutable = false;
		err = dev_change_net_namespace(wdev->netdev, net, "wlan%d");
		if (err)
			break;
		wdev->netdev->netns_immutable = true;
	}

	if (err) {
		/* failed -- clean up to old netns */
		net = wiphy_net(&rdev->wiphy);

		list_for_each_entry_continue_reverse(wdev,
						     &rdev->wiphy.wdev_list,
						     list) {
			if (!wdev->netdev)
				continue;
			wdev->netdev->netns_immutable = false;
			err = dev_change_net_namespace(wdev->netdev, net,
							"wlan%d");
			WARN_ON(err);
			wdev->netdev->netns_immutable = true;
		}

		return err;
	}

	guard(wiphy)(&rdev->wiphy);

	list_for_each_entry(wdev, &rdev->wiphy.wdev_list, list) {
		if (!wdev->netdev)
			continue;
		nl80211_notify_iface(rdev, wdev, NL80211_CMD_DEL_INTERFACE);
	}

	nl80211_notify_wiphy(rdev, NL80211_CMD_DEL_WIPHY);

	wiphy_net_set(&rdev->wiphy, net);

	err = device_rename(&rdev->wiphy.dev, dev_name(&rdev->wiphy.dev));
	WARN_ON(err);

	nl80211_notify_wiphy(rdev, NL80211_CMD_NEW_WIPHY);

	list_for_each_entry(wdev, &rdev->wiphy.wdev_list, list) {
		if (!wdev->netdev)
			continue;
		nl80211_notify_iface(rdev, wdev, NL80211_CMD_NEW_INTERFACE);
	}

	return 0;
}

static void cfg80211_rfkill_poll(struct rfkill *rfkill, void *data)
{
	struct cfg80211_registered_device *rdev = data;

	guard(wiphy)(&rdev->wiphy);

	rdev_rfkill_poll(rdev);
}

void cfg80211_stop_p2p_device(struct cfg80211_registered_device *rdev,
			      struct wireless_dev *wdev)
{
	lockdep_assert_held(&rdev->wiphy.mtx);

	if (WARN_ON(wdev->iftype != NL80211_IFTYPE_P2P_DEVICE))
		return;

	if (!wdev_running(wdev))
		return;

	rdev_stop_p2p_device(rdev, wdev);
	wdev->is_running = false;

	rdev->opencount--;

	if (rdev->scan_req && rdev->scan_req->wdev == wdev) {
		if (WARN_ON(!rdev->scan_req->notified &&
			    (!rdev->int_scan_req ||
			     !rdev->int_scan_req->notified)))
			rdev->scan_req->info.aborted = true;
		___cfg80211_scan_done(rdev, false);
	}
}

void cfg80211_stop_nan(struct cfg80211_registered_device *rdev,
		       struct wireless_dev *wdev)
{
	lockdep_assert_held(&rdev->wiphy.mtx);

	if (WARN_ON(wdev->iftype != NL80211_IFTYPE_NAN))
		return;

	if (!wdev_running(wdev))
		return;

	rdev_stop_nan(rdev, wdev);
	wdev->is_running = false;

	rdev->opencount--;
}

void cfg80211_shutdown_all_interfaces(struct wiphy *wiphy)
{
	struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy);
	struct wireless_dev *wdev;

	ASSERT_RTNL();

	list_for_each_entry(wdev, &rdev->wiphy.wdev_list, list) {
		if (wdev->netdev) {
			dev_close(wdev->netdev);
			continue;
		}

		/* otherwise, check iftype */

		guard(wiphy)(wiphy);

		switch (wdev->iftype) {
		case NL80211_IFTYPE_P2P_DEVICE:
			cfg80211_stop_p2p_device(rdev, wdev);
			break;
		case NL80211_IFTYPE_NAN:
			cfg80211_stop_nan(rdev, wdev);
			break;
		default:
			break;
		}
	}
}
EXPORT_SYMBOL_GPL(cfg80211_shutdown_all_interfaces);

static int cfg80211_rfkill_set_block(void *data, bool blocked)
{
	struct cfg80211_registered_device *rdev = data;

	if (!blocked)
		return 0;

	rtnl_lock();
	cfg80211_shutdown_all_interfaces(&rdev->wiphy);
	rtnl_unlock();

	return 0;
}

static void cfg80211_rfkill_block_work(struct work_struct *work)
{
	struct cfg80211_registered_device *rdev;

	rdev = container_of(work, struct cfg80211_registered_device,
			    rfkill_block);
	cfg80211_rfkill_set_block(rdev, true);
}

static void cfg80211_event_work(struct work_struct *work)
{
	struct cfg80211_registered_device *rdev;

	rdev = container_of(work, struct cfg80211_registered_device,
			    event_work);

	guard(wiphy)(&rdev->wiphy);

	cfg80211_process_rdev_events(rdev);
}

void cfg80211_destroy_ifaces(struct cfg80211_registered_device *rdev)
{
	struct wireless_dev *wdev, *tmp;

	ASSERT_RTNL();

	list_for_each_entry_safe(wdev, tmp, &rdev->wiphy.wdev_list, list) {
		if (wdev->nl_owner_dead) {
			if (wdev->netdev)
				dev_close(wdev->netdev);

			guard(wiphy)(&rdev->wiphy);

			cfg80211_leave(rdev, wdev);
			cfg80211_remove_virtual_intf(rdev, wdev);
		}
	}
}

static void cfg80211_destroy_iface_wk(struct work_struct *work)
{
	struct cfg80211_registered_device *rdev;

	rdev = container_of(work, struct cfg80211_registered_device,
			    destroy_work);

	rtnl_lock();
	cfg80211_destroy_ifaces(rdev);
	rtnl_unlock();
}

static void cfg80211_sched_scan_stop_wk(struct wiphy *wiphy,
					struct wiphy_work *work)
{
	struct cfg80211_registered_device *rdev;
	struct cfg80211_sched_scan_request *req, *tmp;

	rdev = container_of(work, struct cfg80211_registered_device,
			    sched_scan_stop_wk);

	list_for_each_entry_safe(req, tmp, &rdev->sched_scan_req_list, list) {
		if (req->nl_owner_dead)
			cfg80211_stop_sched_scan_req(rdev, req, false);
	}
}

static void cfg80211_propagate_radar_detect_wk(struct work_struct *work)
{
	struct cfg80211_registered_device *rdev;

	rdev = container_of(work, struct cfg80211_registered_device,
			    propagate_radar_detect_wk);

	rtnl_lock();

	regulatory_propagate_dfs_state(&rdev->wiphy, &rdev->radar_chandef,
				       NL80211_DFS_UNAVAILABLE,
				       NL80211_RADAR_DETECTED);

	rtnl_unlock();
}

static void cfg80211_propagate_cac_done_wk(struct work_struct *work)
{
	struct cfg80211_registered_device *rdev;

	rdev = container_of(work, struct cfg80211_registered_device,
			    propagate_cac_done_wk);

	rtnl_lock();

	regulatory_propagate_dfs_state(&rdev->wiphy, &rdev->cac_done_chandef,
				       NL80211_DFS_AVAILABLE,
				       NL80211_RADAR_CAC_FINISHED);

	rtnl_unlock();
}
static void cfg80211_wiphy_work(struct work_struct *work) { struct cfg80211_registered_device *rdev; struct wiphy_work *wk; rdev = container_of(work, struct cfg80211_registered_device, wiphy_work); trace_wiphy_work_worker_start(&rdev->wiphy); guard(wiphy)(&rdev->wiphy); if (rdev->suspended) return; spin_lock_irq(&rdev->wiphy_work_lock); wk = list_first_entry_or_null(&rdev->wiphy_work_list, struct wiphy_work, entry); if (wk) { list_del_init(&wk->entry); if (!list_empty(&rdev->wiphy_work_list)) queue_work(system_unbound_wq, work); spin_unlock_irq(&rdev->wiphy_work_lock); trace_wiphy_work_run(&rdev->wiphy, wk); wk->func(&rdev->wiphy, wk); } else { spin_unlock_irq(&rdev->wiphy_work_lock); } } /* exported functions */ struct wiphy *wiphy_new_nm(const struct cfg80211_ops *ops, int sizeof_priv, const char *requested_name) { static atomic_t wiphy_counter = ATOMIC_INIT(0); struct cfg80211_registered_device *rdev; int alloc_size; WARN_ON(ops->add_key && (!ops->del_key || !ops->set_default_key)); WARN_ON(ops->auth && (!ops->assoc || !ops->deauth || !ops->disassoc)); WARN_ON(ops->connect && !ops->disconnect); WARN_ON(ops->join_ibss && !ops->leave_ibss); WARN_ON(ops->add_virtual_intf && !ops->del_virtual_intf); WARN_ON(ops->add_station && !ops->del_station); WARN_ON(ops->add_mpath && !ops->del_mpath); WARN_ON(ops->join_mesh && !ops->leave_mesh); WARN_ON(ops->start_p2p_device && !ops->stop_p2p_device); WARN_ON(ops->start_ap && !ops->stop_ap); WARN_ON(ops->join_ocb && !ops->leave_ocb); WARN_ON(ops->suspend && !ops->resume); WARN_ON(ops->sched_scan_start && !ops->sched_scan_stop); WARN_ON(ops->remain_on_channel && !ops->cancel_remain_on_channel); WARN_ON(ops->tdls_channel_switch && !ops->tdls_cancel_channel_switch); WARN_ON(ops->add_tx_ts && !ops->del_tx_ts); alloc_size = sizeof(*rdev) + sizeof_priv; rdev = kzalloc(alloc_size, GFP_KERNEL); if (!rdev) return NULL; rdev->ops = ops; rdev->wiphy_idx = atomic_inc_return(&wiphy_counter); if (unlikely(rdev->wiphy_idx < 0)) { /* ugh, 
wrapped! */ atomic_dec(&wiphy_counter); kfree(rdev); return NULL; } /* atomic_inc_return makes it start at 1, make it start at 0 */ rdev->wiphy_idx--; /* give it a proper name */ if (requested_name && requested_name[0]) { int rv; rtnl_lock(); rv = cfg80211_dev_check_name(rdev, requested_name); if (rv < 0) { rtnl_unlock(); goto use_default_name; } rv = dev_set_name(&rdev->wiphy.dev, "%s", requested_name); rtnl_unlock(); if (rv) goto use_default_name; } else { int rv; use_default_name: /* NOTE: This is *probably* safe w/out holding rtnl because of * the restrictions on phy names. Probably this call could * fail if some other part of the kernel (re)named a device * phyX. But, might should add some locking and check return * value, and use a different name if this one exists? */ rv = dev_set_name(&rdev->wiphy.dev, PHY_NAME "%d", rdev->wiphy_idx); if (rv < 0) { kfree(rdev); return NULL; } } mutex_init(&rdev->wiphy.mtx); INIT_LIST_HEAD(&rdev->wiphy.wdev_list); INIT_LIST_HEAD(&rdev->beacon_registrations); spin_lock_init(&rdev->beacon_registrations_lock); spin_lock_init(&rdev->bss_lock); INIT_LIST_HEAD(&rdev->bss_list); INIT_LIST_HEAD(&rdev->sched_scan_req_list); wiphy_work_init(&rdev->scan_done_wk, __cfg80211_scan_done); INIT_DELAYED_WORK(&rdev->dfs_update_channels_wk, cfg80211_dfs_channels_update_work); #ifdef CONFIG_CFG80211_WEXT rdev->wiphy.wext = &cfg80211_wext_handler; #endif device_initialize(&rdev->wiphy.dev); rdev->wiphy.dev.class = &ieee80211_class; rdev->wiphy.dev.platform_data = rdev; device_enable_async_suspend(&rdev->wiphy.dev); INIT_WORK(&rdev->destroy_work, cfg80211_destroy_iface_wk); wiphy_work_init(&rdev->sched_scan_stop_wk, cfg80211_sched_scan_stop_wk); INIT_WORK(&rdev->sched_scan_res_wk, cfg80211_sched_scan_results_wk); INIT_WORK(&rdev->propagate_radar_detect_wk, cfg80211_propagate_radar_detect_wk); INIT_WORK(&rdev->propagate_cac_done_wk, cfg80211_propagate_cac_done_wk); INIT_WORK(&rdev->mgmt_registrations_update_wk, 
cfg80211_mgmt_registrations_update_wk); spin_lock_init(&rdev->mgmt_registrations_lock); INIT_WORK(&rdev->wiphy_work, cfg80211_wiphy_work); INIT_LIST_HEAD(&rdev->wiphy_work_list); spin_lock_init(&rdev->wiphy_work_lock); #ifdef CONFIG_CFG80211_DEFAULT_PS rdev->wiphy.flags |= WIPHY_FLAG_PS_ON_BY_DEFAULT; #endif wiphy_net_set(&rdev->wiphy, &init_net); rdev->rfkill_ops.set_block = cfg80211_rfkill_set_block; rdev->wiphy.rfkill = rfkill_alloc(dev_name(&rdev->wiphy.dev), &rdev->wiphy.dev, RFKILL_TYPE_WLAN, &rdev->rfkill_ops, rdev); if (!rdev->wiphy.rfkill) { wiphy_free(&rdev->wiphy); return NULL; } INIT_WORK(&rdev->rfkill_block, cfg80211_rfkill_block_work); INIT_WORK(&rdev->conn_work, cfg80211_conn_work); INIT_WORK(&rdev->event_work, cfg80211_event_work); INIT_WORK(&rdev->background_cac_abort_wk, cfg80211_background_cac_abort_wk); INIT_DELAYED_WORK(&rdev->background_cac_done_wk, cfg80211_background_cac_done_wk); init_waitqueue_head(&rdev->dev_wait); /* * Initialize wiphy parameters to IEEE 802.11 MIB default values. * Fragmentation and RTS threshold are disabled by default with the * special -1 value. */ rdev->wiphy.retry_short = 7; rdev->wiphy.retry_long = 4; rdev->wiphy.frag_threshold = (u32) -1; rdev->wiphy.rts_threshold = (u32) -1; rdev->wiphy.coverage_class = 0; rdev->wiphy.max_num_csa_counters = 1; rdev->wiphy.max_sched_scan_plans = 1; rdev->wiphy.max_sched_scan_plan_interval = U32_MAX; return &rdev->wiphy; } EXPORT_SYMBOL(wiphy_new_nm); static int wiphy_verify_iface_combinations(struct wiphy *wiphy, const struct ieee80211_iface_combination *iface_comb, int n_iface_comb, bool combined_radio) { const struct ieee80211_iface_combination *c; int i, j; for (i = 0; i < n_iface_comb; i++) { u32 cnt = 0; u16 all_iftypes = 0; c = &iface_comb[i]; /* * Combinations with just one interface aren't real, * however we make an exception for DFS. 
*/ if (WARN_ON((c->max_interfaces < 2) && !c->radar_detect_widths)) return -EINVAL; /* Need at least one channel */ if (WARN_ON(!c->num_different_channels)) return -EINVAL; /* DFS only works on one channel. Avoid this check * for multi-radio global combination, since it holds * the capabilities of all radio combinations. */ if (!combined_radio && WARN_ON(c->radar_detect_widths && c->num_different_channels > 1)) return -EINVAL; if (WARN_ON(!c->n_limits)) return -EINVAL; for (j = 0; j < c->n_limits; j++) { u16 types = c->limits[j].types; /* interface types shouldn't overlap */ if (WARN_ON(types & all_iftypes)) return -EINVAL; all_iftypes |= types; if (WARN_ON(!c->limits[j].max)) return -EINVAL; /* Shouldn't list software iftypes in combinations! */ if (WARN_ON(wiphy->software_iftypes & types)) return -EINVAL; /* Only a single P2P_DEVICE can be allowed, avoid this * check for multi-radio global combination, since it * holds the capabilities of all radio combinations. */ if (!combined_radio && WARN_ON(types & BIT(NL80211_IFTYPE_P2P_DEVICE) && c->limits[j].max > 1)) return -EINVAL; /* Only a single NAN can be allowed, avoid this * check for multi-radio global combination, since it * holds the capabilities of all radio combinations. */ if (!combined_radio && WARN_ON(types & BIT(NL80211_IFTYPE_NAN) && c->limits[j].max > 1)) return -EINVAL; /* * This isn't well-defined right now. If you have an * IBSS interface, then its beacon interval may change * by joining other networks, and nothing prevents it * from doing that. * So technically we probably shouldn't even allow AP * and IBSS in the same interface, but it seems that * some drivers support that, possibly only with fixed * beacon intervals for IBSS. */ if (WARN_ON(types & BIT(NL80211_IFTYPE_ADHOC) && c->beacon_int_min_gcd)) { return -EINVAL; } cnt += c->limits[j].max; /* * Don't advertise an unsupported type * in a combination.
*/ if (WARN_ON((wiphy->interface_modes & types) != types)) return -EINVAL; } if (WARN_ON(all_iftypes & BIT(NL80211_IFTYPE_WDS))) return -EINVAL; /* You can't even choose that many! */ if (WARN_ON(cnt < c->max_interfaces)) return -EINVAL; } return 0; } static int wiphy_verify_combinations(struct wiphy *wiphy) { int i, ret; bool combined_radio = false; if (wiphy->n_radio) { for (i = 0; i < wiphy->n_radio; i++) { const struct wiphy_radio *radio = &wiphy->radio[i]; ret = wiphy_verify_iface_combinations(wiphy, radio->iface_combinations, radio->n_iface_combinations, false); if (ret) return ret; } combined_radio = true; } ret = wiphy_verify_iface_combinations(wiphy, wiphy->iface_combinations, wiphy->n_iface_combinations, combined_radio); return ret; } int wiphy_register(struct wiphy *wiphy) { struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy); int res; enum nl80211_band band; struct ieee80211_supported_band *sband; bool have_band = false; int i; u16 ifmodes = wiphy->interface_modes; #ifdef CONFIG_PM if (WARN_ON(wiphy->wowlan && (wiphy->wowlan->flags & WIPHY_WOWLAN_GTK_REKEY_FAILURE) && !(wiphy->wowlan->flags & WIPHY_WOWLAN_SUPPORTS_GTK_REKEY))) return -EINVAL; if (WARN_ON(wiphy->wowlan && !wiphy->wowlan->flags && !wiphy->wowlan->n_patterns && !wiphy->wowlan->tcp)) return -EINVAL; #endif if (WARN_ON((wiphy->features & NL80211_FEATURE_TDLS_CHANNEL_SWITCH) && (!rdev->ops->tdls_channel_switch || !rdev->ops->tdls_cancel_channel_switch))) return -EINVAL; if (WARN_ON((wiphy->interface_modes & BIT(NL80211_IFTYPE_NAN)) && (!rdev->ops->start_nan || !rdev->ops->stop_nan || !rdev->ops->add_nan_func || !rdev->ops->del_nan_func || !(wiphy->nan_supported_bands & BIT(NL80211_BAND_2GHZ))))) return -EINVAL; if (WARN_ON(wiphy->interface_modes & BIT(NL80211_IFTYPE_WDS))) return -EINVAL; if (WARN_ON(wiphy->pmsr_capa && !wiphy->pmsr_capa->ftm.supported)) return -EINVAL; if (wiphy->pmsr_capa && wiphy->pmsr_capa->ftm.supported) { if (WARN_ON(!wiphy->pmsr_capa->ftm.asap && 
!wiphy->pmsr_capa->ftm.non_asap)) return -EINVAL; if (WARN_ON(!wiphy->pmsr_capa->ftm.preambles || !wiphy->pmsr_capa->ftm.bandwidths)) return -EINVAL; if (WARN_ON(wiphy->pmsr_capa->ftm.preambles & ~(BIT(NL80211_PREAMBLE_LEGACY) | BIT(NL80211_PREAMBLE_HT) | BIT(NL80211_PREAMBLE_VHT) | BIT(NL80211_PREAMBLE_HE) | BIT(NL80211_PREAMBLE_DMG)))) return -EINVAL; if (WARN_ON((wiphy->pmsr_capa->ftm.trigger_based || wiphy->pmsr_capa->ftm.non_trigger_based) && !(wiphy->pmsr_capa->ftm.preambles & BIT(NL80211_PREAMBLE_HE)))) return -EINVAL; if (WARN_ON(wiphy->pmsr_capa->ftm.bandwidths & ~(BIT(NL80211_CHAN_WIDTH_20_NOHT) | BIT(NL80211_CHAN_WIDTH_20) | BIT(NL80211_CHAN_WIDTH_40) | BIT(NL80211_CHAN_WIDTH_80) | BIT(NL80211_CHAN_WIDTH_80P80) | BIT(NL80211_CHAN_WIDTH_160) | BIT(NL80211_CHAN_WIDTH_320) | BIT(NL80211_CHAN_WIDTH_5) | BIT(NL80211_CHAN_WIDTH_10)))) return -EINVAL; } if (WARN_ON((wiphy->regulatory_flags & REGULATORY_WIPHY_SELF_MANAGED) && (wiphy->regulatory_flags & (REGULATORY_CUSTOM_REG | REGULATORY_STRICT_REG | REGULATORY_COUNTRY_IE_FOLLOW_POWER | REGULATORY_COUNTRY_IE_IGNORE)))) return -EINVAL; if (WARN_ON(wiphy->coalesce && (!wiphy->coalesce->n_rules || !wiphy->coalesce->n_patterns) && (!wiphy->coalesce->pattern_min_len || wiphy->coalesce->pattern_min_len > wiphy->coalesce->pattern_max_len))) return -EINVAL; if (WARN_ON(wiphy->ap_sme_capa && !(wiphy->flags & WIPHY_FLAG_HAVE_AP_SME))) return -EINVAL; if (WARN_ON(wiphy->addresses && !wiphy->n_addresses)) return -EINVAL; if (WARN_ON(wiphy->addresses && !is_zero_ether_addr(wiphy->perm_addr) && memcmp(wiphy->perm_addr, wiphy->addresses[0].addr, ETH_ALEN))) return -EINVAL; if (WARN_ON(wiphy->max_acl_mac_addrs && (!(wiphy->flags & WIPHY_FLAG_HAVE_AP_SME) || !rdev->ops->set_mac_acl))) return -EINVAL; /* assure only valid behaviours are flagged by driver * hence subtract 2 as bit 0 is invalid. 
*/ if (WARN_ON(wiphy->bss_select_support && (wiphy->bss_select_support & ~(BIT(__NL80211_BSS_SELECT_ATTR_AFTER_LAST) - 2)))) return -EINVAL; if (WARN_ON(wiphy_ext_feature_isset(&rdev->wiphy, NL80211_EXT_FEATURE_4WAY_HANDSHAKE_STA_1X) && (!rdev->ops->set_pmk || !rdev->ops->del_pmk))) return -EINVAL; if (WARN_ON(!(rdev->wiphy.flags & WIPHY_FLAG_SUPPORTS_FW_ROAM) && rdev->ops->update_connect_params)) return -EINVAL; if (wiphy->addresses) memcpy(wiphy->perm_addr, wiphy->addresses[0].addr, ETH_ALEN); /* sanity check ifmodes */ WARN_ON(!ifmodes); ifmodes &= ((1 << NUM_NL80211_IFTYPES) - 1) & ~1; if (WARN_ON(ifmodes != wiphy->interface_modes)) wiphy->interface_modes = ifmodes; res = wiphy_verify_combinations(wiphy); if (res) return res; /* sanity check supported bands/channels */ for (band = 0; band < NUM_NL80211_BANDS; band++) { const struct ieee80211_sband_iftype_data *iftd; u16 types = 0; bool have_he = false; sband = wiphy->bands[band]; if (!sband) continue; sband->band = band; if (WARN_ON(!sband->n_channels)) return -EINVAL; /* * on 60GHz or sub-1Ghz band, there are no legacy rates, so * n_bitrates is 0 */ if (WARN_ON((band != NL80211_BAND_60GHZ && band != NL80211_BAND_S1GHZ) && !sband->n_bitrates)) return -EINVAL; if (WARN_ON(band == NL80211_BAND_6GHZ && (sband->ht_cap.ht_supported || sband->vht_cap.vht_supported))) return -EINVAL; /* * Since cfg80211_disable_40mhz_24ghz is global, we can * modify the sband's ht data even if the driver uses a * global structure for that. */ if (cfg80211_disable_40mhz_24ghz && band == NL80211_BAND_2GHZ && sband->ht_cap.ht_supported) { sband->ht_cap.cap &= ~IEEE80211_HT_CAP_SUP_WIDTH_20_40; sband->ht_cap.cap &= ~IEEE80211_HT_CAP_SGI_40; } /* * Since we use a u32 for rate bitmaps in * ieee80211_get_response_rate, we cannot * have more than 32 legacy rates. 
*/ if (WARN_ON(sband->n_bitrates > 32)) return -EINVAL; for (i = 0; i < sband->n_channels; i++) { sband->channels[i].orig_flags = sband->channels[i].flags; sband->channels[i].orig_mag = INT_MAX; sband->channels[i].orig_mpwr = sband->channels[i].max_power; sband->channels[i].band = band; if (WARN_ON(sband->channels[i].freq_offset >= 1000)) return -EINVAL; } for_each_sband_iftype_data(sband, i, iftd) { bool has_ap, has_non_ap; u32 ap_bits = BIT(NL80211_IFTYPE_AP) | BIT(NL80211_IFTYPE_P2P_GO); if (WARN_ON(!iftd->types_mask)) return -EINVAL; if (WARN_ON(types & iftd->types_mask)) return -EINVAL; /* at least one piece of information must be present */ if (WARN_ON(!iftd->he_cap.has_he)) return -EINVAL; types |= iftd->types_mask; if (i == 0) have_he = iftd->he_cap.has_he; else have_he = have_he && iftd->he_cap.has_he; has_ap = iftd->types_mask & ap_bits; has_non_ap = iftd->types_mask & ~ap_bits; /* * For EHT 20 MHz STA, the capabilities format differs * but to simplify, don't check 20 MHz but rather check * only if AP and non-AP were mentioned at the same time, * reject if so. */ if (WARN_ON(iftd->eht_cap.has_eht && has_ap && has_non_ap)) return -EINVAL; } if (WARN_ON(!have_he && band == NL80211_BAND_6GHZ)) return -EINVAL; have_band = true; } if (!have_band) { WARN_ON(1); return -EINVAL; } for (i = 0; i < rdev->wiphy.n_vendor_commands; i++) { /* * Validate we have a policy (can be explicitly set to * VENDOR_CMD_RAW_DATA which is non-NULL) and also that * we have at least one of doit/dumpit. 
*/ if (WARN_ON(!rdev->wiphy.vendor_commands[i].policy)) return -EINVAL; if (WARN_ON(!rdev->wiphy.vendor_commands[i].doit && !rdev->wiphy.vendor_commands[i].dumpit)) return -EINVAL; } #ifdef CONFIG_PM if (WARN_ON(rdev->wiphy.wowlan && rdev->wiphy.wowlan->n_patterns && (!rdev->wiphy.wowlan->pattern_min_len || rdev->wiphy.wowlan->pattern_min_len > rdev->wiphy.wowlan->pattern_max_len))) return -EINVAL; #endif if (!wiphy->max_num_akm_suites) wiphy->max_num_akm_suites = NL80211_MAX_NR_AKM_SUITES; else if (wiphy->max_num_akm_suites < NL80211_MAX_NR_AKM_SUITES || wiphy->max_num_akm_suites > CFG80211_MAX_NUM_AKM_SUITES) return -EINVAL; /* check and set up bitrates */ ieee80211_set_bitrate_flags(wiphy); rdev->wiphy.features |= NL80211_FEATURE_SCAN_FLUSH; rtnl_lock(); wiphy_lock(&rdev->wiphy); res = device_add(&rdev->wiphy.dev); if (res) { wiphy_unlock(&rdev->wiphy); rtnl_unlock(); return res; } list_add_rcu(&rdev->list, &cfg80211_rdev_list); cfg80211_rdev_list_generation++; /* add to debugfs */ rdev->wiphy.debugfsdir = debugfs_create_dir(wiphy_name(&rdev->wiphy), ieee80211_debugfs_dir); cfg80211_debugfs_rdev_add(rdev); nl80211_notify_wiphy(rdev, NL80211_CMD_NEW_WIPHY); wiphy_unlock(&rdev->wiphy); /* set up regulatory info */ wiphy_regulatory_register(wiphy); if (wiphy->regulatory_flags & REGULATORY_CUSTOM_REG) { struct regulatory_request request; request.wiphy_idx = get_wiphy_idx(wiphy); request.initiator = NL80211_REGDOM_SET_BY_DRIVER; request.alpha2[0] = '9'; request.alpha2[1] = '9'; nl80211_send_reg_change_event(&request); } /* Check that nobody globally advertises any capabilities they do not * advertise on all possible interface types. 
*/ if (wiphy->extended_capabilities_len && wiphy->num_iftype_ext_capab && wiphy->iftype_ext_capab) { u8 supported_on_all, j; const struct wiphy_iftype_ext_capab *capab; capab = wiphy->iftype_ext_capab; for (j = 0; j < wiphy->extended_capabilities_len; j++) { if (capab[0].extended_capabilities_len > j) supported_on_all = capab[0].extended_capabilities[j]; else supported_on_all = 0x00; for (i = 1; i < wiphy->num_iftype_ext_capab; i++) { if (j >= capab[i].extended_capabilities_len) { supported_on_all = 0x00; break; } supported_on_all &= capab[i].extended_capabilities[j]; } if (WARN_ON(wiphy->extended_capabilities[j] & ~supported_on_all)) break; } } rdev->wiphy.registered = true; rtnl_unlock(); res = rfkill_register(rdev->wiphy.rfkill); if (res) { rfkill_destroy(rdev->wiphy.rfkill); rdev->wiphy.rfkill = NULL; wiphy_unregister(&rdev->wiphy); return res; } return 0; } EXPORT_SYMBOL(wiphy_register); void wiphy_rfkill_start_polling(struct wiphy *wiphy) { struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy); if (!rdev->ops->rfkill_poll) return; rdev->rfkill_ops.poll = cfg80211_rfkill_poll; rfkill_resume_polling(wiphy->rfkill); } EXPORT_SYMBOL(wiphy_rfkill_start_polling); void cfg80211_process_wiphy_works(struct cfg80211_registered_device *rdev, struct wiphy_work *end) { unsigned int runaway_limit = 100; unsigned long flags; lockdep_assert_held(&rdev->wiphy.mtx); spin_lock_irqsave(&rdev->wiphy_work_lock, flags); while (!list_empty(&rdev->wiphy_work_list)) { struct wiphy_work *wk; wk = list_first_entry(&rdev->wiphy_work_list, struct wiphy_work, entry); list_del_init(&wk->entry); spin_unlock_irqrestore(&rdev->wiphy_work_lock, flags); trace_wiphy_work_run(&rdev->wiphy, wk); wk->func(&rdev->wiphy, wk); spin_lock_irqsave(&rdev->wiphy_work_lock, flags); if (wk == end) break; if (WARN_ON(--runaway_limit == 0)) INIT_LIST_HEAD(&rdev->wiphy_work_list); } spin_unlock_irqrestore(&rdev->wiphy_work_lock, flags); } void wiphy_unregister(struct wiphy *wiphy) { struct 
cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy); wait_event(rdev->dev_wait, ({ int __count; wiphy_lock(&rdev->wiphy); __count = rdev->opencount; wiphy_unlock(&rdev->wiphy); __count == 0; })); if (rdev->wiphy.rfkill) rfkill_unregister(rdev->wiphy.rfkill); rtnl_lock(); wiphy_lock(&rdev->wiphy); nl80211_notify_wiphy(rdev, NL80211_CMD_DEL_WIPHY); rdev->wiphy.registered = false; WARN_ON(!list_empty(&rdev->wiphy.wdev_list)); /* * First remove the hardware from everywhere, this makes * it impossible to find from userspace. */ debugfs_remove_recursive(rdev->wiphy.debugfsdir); list_del_rcu(&rdev->list); synchronize_rcu(); /* * If this device got a regulatory hint tell core it's * free to listen now to a new shiny device regulatory hint */ wiphy_regulatory_deregister(wiphy); cfg80211_rdev_list_generation++; device_del(&rdev->wiphy.dev); #ifdef CONFIG_PM if (rdev->wiphy.wowlan_config && rdev->ops->set_wakeup) rdev_set_wakeup(rdev, false); #endif /* surely nothing is reachable now, clean up work */ cfg80211_process_wiphy_works(rdev, NULL); wiphy_unlock(&rdev->wiphy); rtnl_unlock(); /* this has nothing to do now but make sure it's gone */ cancel_work_sync(&rdev->wiphy_work); cancel_work_sync(&rdev->conn_work); flush_work(&rdev->event_work); cancel_delayed_work_sync(&rdev->dfs_update_channels_wk); cancel_delayed_work_sync(&rdev->background_cac_done_wk); flush_work(&rdev->destroy_work); flush_work(&rdev->propagate_radar_detect_wk); flush_work(&rdev->propagate_cac_done_wk); flush_work(&rdev->mgmt_registrations_update_wk); flush_work(&rdev->background_cac_abort_wk); cfg80211_rdev_free_wowlan(rdev); cfg80211_free_coalesce(rdev->coalesce); rdev->coalesce = NULL; } EXPORT_SYMBOL(wiphy_unregister); void cfg80211_dev_free(struct cfg80211_registered_device *rdev) { struct cfg80211_internal_bss *scan, *tmp; struct cfg80211_beacon_registration *reg, *treg; unsigned long flags; spin_lock_irqsave(&rdev->wiphy_work_lock, flags); WARN_ON(!list_empty(&rdev->wiphy_work_list));
spin_unlock_irqrestore(&rdev->wiphy_work_lock, flags); cancel_work_sync(&rdev->wiphy_work); rfkill_destroy(rdev->wiphy.rfkill); list_for_each_entry_safe(reg, treg, &rdev->beacon_registrations, list) { list_del(&reg->list); kfree(reg); } list_for_each_entry_safe(scan, tmp, &rdev->bss_list, list) cfg80211_put_bss(&rdev->wiphy, &scan->pub); mutex_destroy(&rdev->wiphy.mtx); /* * The 'regd' can only be non-NULL if we never finished * initializing the wiphy and thus never went through the * unregister path - e.g. in failure scenarios. Thus, it * cannot have been visible to anyone if non-NULL, so we * can just free it here. */ kfree(rcu_dereference_raw(rdev->wiphy.regd)); kfree(rdev); } void wiphy_free(struct wiphy *wiphy) { put_device(&wiphy->dev); } EXPORT_SYMBOL(wiphy_free); void wiphy_rfkill_set_hw_state_reason(struct wiphy *wiphy, bool blocked, enum rfkill_hard_block_reasons reason) { struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy); if (rfkill_set_hw_state_reason(wiphy->rfkill, blocked, reason)) schedule_work(&rdev->rfkill_block); } EXPORT_SYMBOL(wiphy_rfkill_set_hw_state_reason); static void _cfg80211_unregister_wdev(struct wireless_dev *wdev, bool unregister_netdev) { struct cfg80211_registered_device *rdev = wiphy_to_rdev(wdev->wiphy); struct cfg80211_cqm_config *cqm_config; unsigned int link_id; ASSERT_RTNL(); lockdep_assert_held(&rdev->wiphy.mtx); nl80211_notify_iface(rdev, wdev, NL80211_CMD_DEL_INTERFACE); wdev->registered = false; if (wdev->netdev) { sysfs_remove_link(&wdev->netdev->dev.kobj, "phy80211"); if (unregister_netdev) unregister_netdevice(wdev->netdev); } list_del_rcu(&wdev->list); synchronize_net(); rdev->devlist_generation++; cfg80211_mlme_purge_registrations(wdev); switch (wdev->iftype) { case NL80211_IFTYPE_P2P_DEVICE: cfg80211_stop_p2p_device(rdev, wdev); break; case NL80211_IFTYPE_NAN: cfg80211_stop_nan(rdev, wdev); break; default: break; } #ifdef CONFIG_CFG80211_WEXT kfree_sensitive(wdev->wext.keys); wdev->wext.keys = NULL; #endif
wiphy_work_cancel(wdev->wiphy, &wdev->cqm_rssi_work); /* deleted from the list, so can't be found from nl80211 any more */ cqm_config = rcu_access_pointer(wdev->cqm_config); kfree_rcu(cqm_config, rcu_head); RCU_INIT_POINTER(wdev->cqm_config, NULL); /* * Ensure that all events have been processed and * freed. */ cfg80211_process_wdev_events(wdev); if (wdev->iftype == NL80211_IFTYPE_STATION || wdev->iftype == NL80211_IFTYPE_P2P_CLIENT) { for (link_id = 0; link_id < ARRAY_SIZE(wdev->links); link_id++) { struct cfg80211_internal_bss *curbss; curbss = wdev->links[link_id].client.current_bss; if (WARN_ON(curbss)) { cfg80211_unhold_bss(curbss); cfg80211_put_bss(wdev->wiphy, &curbss->pub); wdev->links[link_id].client.current_bss = NULL; } } } wdev->connected = false; } void cfg80211_unregister_wdev(struct wireless_dev *wdev) { _cfg80211_unregister_wdev(wdev, true); } EXPORT_SYMBOL(cfg80211_unregister_wdev); static const struct device_type wiphy_type = { .name = "wlan", }; void cfg80211_update_iface_num(struct cfg80211_registered_device *rdev, enum nl80211_iftype iftype, int num) { lockdep_assert_held(&rdev->wiphy.mtx); rdev->num_running_ifaces += num; if (iftype == NL80211_IFTYPE_MONITOR) rdev->num_running_monitor_ifaces += num; } void cfg80211_leave(struct cfg80211_registered_device *rdev, struct wireless_dev *wdev) { struct net_device *dev = wdev->netdev; struct cfg80211_sched_scan_request *pos, *tmp; lockdep_assert_held(&rdev->wiphy.mtx); cfg80211_pmsr_wdev_down(wdev); cfg80211_stop_background_radar_detection(wdev); switch (wdev->iftype) { case NL80211_IFTYPE_ADHOC: cfg80211_leave_ibss(rdev, dev, true); break; case NL80211_IFTYPE_P2P_CLIENT: case NL80211_IFTYPE_STATION: list_for_each_entry_safe(pos, tmp, &rdev->sched_scan_req_list, list) { if (dev == pos->dev) cfg80211_stop_sched_scan_req(rdev, pos, false); } #ifdef CONFIG_CFG80211_WEXT kfree(wdev->wext.ie); wdev->wext.ie = NULL; wdev->wext.ie_len = 0; wdev->wext.connect.auth_type = NL80211_AUTHTYPE_AUTOMATIC; #endif 
cfg80211_disconnect(rdev, dev, WLAN_REASON_DEAUTH_LEAVING, true); break; case NL80211_IFTYPE_MESH_POINT: cfg80211_leave_mesh(rdev, dev); break; case NL80211_IFTYPE_AP: case NL80211_IFTYPE_P2P_GO: cfg80211_stop_ap(rdev, dev, -1, true); break; case NL80211_IFTYPE_OCB: cfg80211_leave_ocb(rdev, dev); break; case NL80211_IFTYPE_P2P_DEVICE: case NL80211_IFTYPE_NAN: /* cannot happen, has no netdev */ break; case NL80211_IFTYPE_AP_VLAN: case NL80211_IFTYPE_MONITOR: /* nothing to do */ break; case NL80211_IFTYPE_UNSPECIFIED: case NL80211_IFTYPE_WDS: case NUM_NL80211_IFTYPES: /* invalid */ break; } } void cfg80211_stop_iface(struct wiphy *wiphy, struct wireless_dev *wdev, gfp_t gfp) { struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy); struct cfg80211_event *ev; unsigned long flags; trace_cfg80211_stop_iface(wiphy, wdev); ev = kzalloc(sizeof(*ev), gfp); if (!ev) return; ev->type = EVENT_STOPPED; spin_lock_irqsave(&wdev->event_lock, flags); list_add_tail(&ev->list, &wdev->event_list); spin_unlock_irqrestore(&wdev->event_lock, flags); queue_work(cfg80211_wq, &rdev->event_work); } EXPORT_SYMBOL(cfg80211_stop_iface); void cfg80211_init_wdev(struct wireless_dev *wdev) { INIT_LIST_HEAD(&wdev->event_list); spin_lock_init(&wdev->event_lock); INIT_LIST_HEAD(&wdev->mgmt_registrations); INIT_LIST_HEAD(&wdev->pmsr_list); spin_lock_init(&wdev->pmsr_lock); INIT_WORK(&wdev->pmsr_free_wk, cfg80211_pmsr_free_wk); #ifdef CONFIG_CFG80211_WEXT wdev->wext.default_key = -1; wdev->wext.default_mgmt_key = -1; wdev->wext.connect.auth_type = NL80211_AUTHTYPE_AUTOMATIC; #endif wiphy_work_init(&wdev->cqm_rssi_work, cfg80211_cqm_rssi_notify_work); if (wdev->wiphy->flags & WIPHY_FLAG_PS_ON_BY_DEFAULT) wdev->ps = true; else wdev->ps = false; /* allow mac80211 to determine the timeout */ wdev->ps_timeout = -1; wdev->radio_mask = BIT(wdev->wiphy->n_radio) - 1; if ((wdev->iftype == NL80211_IFTYPE_STATION || wdev->iftype == NL80211_IFTYPE_P2P_CLIENT || wdev->iftype == NL80211_IFTYPE_ADHOC) && 
!wdev->use_4addr) wdev->netdev->priv_flags |= IFF_DONT_BRIDGE; INIT_WORK(&wdev->disconnect_wk, cfg80211_autodisconnect_wk); } void cfg80211_register_wdev(struct cfg80211_registered_device *rdev, struct wireless_dev *wdev) { ASSERT_RTNL(); lockdep_assert_held(&rdev->wiphy.mtx); /* * We get here also when the interface changes network namespaces, * as it's registered into the new one, but we don't want it to * change ID in that case. Checking if the ID is already assigned * works, because 0 isn't considered a valid ID and the memory is * 0-initialized. */ if (!wdev->identifier) wdev->identifier = ++rdev->wdev_id; list_add_rcu(&wdev->list, &rdev->wiphy.wdev_list); rdev->devlist_generation++; wdev->registered = true; if (wdev->netdev && sysfs_create_link(&wdev->netdev->dev.kobj, &rdev->wiphy.dev.kobj, "phy80211")) pr_err("failed to add phy80211 symlink to netdev!\n"); nl80211_notify_iface(rdev, wdev, NL80211_CMD_NEW_INTERFACE); } int cfg80211_register_netdevice(struct net_device *dev) { struct wireless_dev *wdev = dev->ieee80211_ptr; struct cfg80211_registered_device *rdev; int ret; ASSERT_RTNL(); if (WARN_ON(!wdev)) return -EINVAL; rdev = wiphy_to_rdev(wdev->wiphy); lockdep_assert_held(&rdev->wiphy.mtx); /* we'll take care of this */ wdev->registered = true; wdev->registering = true; ret = register_netdevice(dev); if (ret) goto out; cfg80211_register_wdev(rdev, wdev); ret = 0; out: wdev->registering = false; if (ret) wdev->registered = false; return ret; } EXPORT_SYMBOL(cfg80211_register_netdevice); static int cfg80211_netdev_notifier_call(struct notifier_block *nb, unsigned long state, void *ptr) { struct net_device *dev = netdev_notifier_info_to_dev(ptr); struct wireless_dev *wdev = dev->ieee80211_ptr; struct cfg80211_registered_device *rdev; struct cfg80211_sched_scan_request *pos, *tmp; if (!wdev) return NOTIFY_DONE; rdev = wiphy_to_rdev(wdev->wiphy); WARN_ON(wdev->iftype == NL80211_IFTYPE_UNSPECIFIED); switch (state) { case NETDEV_POST_INIT: 
SET_NETDEV_DEVTYPE(dev, &wiphy_type); wdev->netdev = dev; /* can only change netns with wiphy */ dev->netns_immutable = true; cfg80211_init_wdev(wdev); break; case NETDEV_REGISTER: if (!wdev->registered) { guard(wiphy)(&rdev->wiphy); cfg80211_register_wdev(rdev, wdev); } break; case NETDEV_UNREGISTER: /* * It is possible to get NETDEV_UNREGISTER multiple times, * so check wdev->registered. */ if (wdev->registered && !wdev->registering) { guard(wiphy)(&rdev->wiphy); _cfg80211_unregister_wdev(wdev, false); } break; case NETDEV_GOING_DOWN: scoped_guard(wiphy, &rdev->wiphy) { cfg80211_leave(rdev, wdev); cfg80211_remove_links(wdev); } /* since we just did cfg80211_leave() nothing to do there */ cancel_work_sync(&wdev->disconnect_wk); cancel_work_sync(&wdev->pmsr_free_wk); break; case NETDEV_DOWN: wiphy_lock(&rdev->wiphy); cfg80211_update_iface_num(rdev, wdev->iftype, -1); if (rdev->scan_req && rdev->scan_req->wdev == wdev) { if (WARN_ON(!rdev->scan_req->notified && (!rdev->int_scan_req || !rdev->int_scan_req->notified))) rdev->scan_req->info.aborted = true; ___cfg80211_scan_done(rdev, false); } list_for_each_entry_safe(pos, tmp, &rdev->sched_scan_req_list, list) { if (WARN_ON(pos->dev == wdev->netdev)) cfg80211_stop_sched_scan_req(rdev, pos, false); } rdev->opencount--; wiphy_unlock(&rdev->wiphy); wake_up(&rdev->dev_wait); break; case NETDEV_UP: wiphy_lock(&rdev->wiphy); cfg80211_update_iface_num(rdev, wdev->iftype, 1); switch (wdev->iftype) { #ifdef CONFIG_CFG80211_WEXT case NL80211_IFTYPE_ADHOC: cfg80211_ibss_wext_join(rdev, wdev); break; case NL80211_IFTYPE_STATION: cfg80211_mgd_wext_connect(rdev, wdev); break; #endif #ifdef CONFIG_MAC80211_MESH case NL80211_IFTYPE_MESH_POINT: { /* backward compat code... 
*/ struct mesh_setup setup; memcpy(&setup, &default_mesh_setup, sizeof(setup)); /* back compat only needed for mesh_id */ setup.mesh_id = wdev->u.mesh.id; setup.mesh_id_len = wdev->u.mesh.id_up_len; if (wdev->u.mesh.id_up_len) __cfg80211_join_mesh(rdev, dev, &setup, &default_mesh_config); break; } #endif default: break; } rdev->opencount++; /* * Configure power management to the driver here so that its * correctly set also after interface type changes etc. */ if ((wdev->iftype == NL80211_IFTYPE_STATION || wdev->iftype == NL80211_IFTYPE_P2P_CLIENT) && rdev->ops->set_power_mgmt && rdev_set_power_mgmt(rdev, dev, wdev->ps, wdev->ps_timeout)) { /* assume this means it's off */ wdev->ps = false; } wiphy_unlock(&rdev->wiphy); break; case NETDEV_PRE_UP: if (!cfg80211_iftype_allowed(wdev->wiphy, wdev->iftype, wdev->use_4addr, 0)) return notifier_from_errno(-EOPNOTSUPP); if (rfkill_blocked(rdev->wiphy.rfkill)) return notifier_from_errno(-ERFKILL); break; default: return NOTIFY_DONE; } wireless_nlevent_flush(); return NOTIFY_OK; } static struct notifier_block cfg80211_netdev_notifier = { .notifier_call = cfg80211_netdev_notifier_call, }; static void __net_exit cfg80211_pernet_exit(struct net *net) { struct cfg80211_registered_device *rdev; rtnl_lock(); for_each_rdev(rdev) { if (net_eq(wiphy_net(&rdev->wiphy), net)) WARN_ON(cfg80211_switch_netns(rdev, &init_net)); } rtnl_unlock(); } static struct pernet_operations cfg80211_pernet_ops = { .exit = cfg80211_pernet_exit, }; void wiphy_work_queue(struct wiphy *wiphy, struct wiphy_work *work) { struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy); unsigned long flags; trace_wiphy_work_queue(wiphy, work); spin_lock_irqsave(&rdev->wiphy_work_lock, flags); if (list_empty(&work->entry)) list_add_tail(&work->entry, &rdev->wiphy_work_list); spin_unlock_irqrestore(&rdev->wiphy_work_lock, flags); queue_work(system_unbound_wq, &rdev->wiphy_work); } EXPORT_SYMBOL_GPL(wiphy_work_queue); void wiphy_work_cancel(struct wiphy *wiphy, 
struct wiphy_work *work) { struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy); unsigned long flags; lockdep_assert_held(&wiphy->mtx); trace_wiphy_work_cancel(wiphy, work); spin_lock_irqsave(&rdev->wiphy_work_lock, flags); if (!list_empty(&work->entry)) list_del_init(&work->entry); spin_unlock_irqrestore(&rdev->wiphy_work_lock, flags); } EXPORT_SYMBOL_GPL(wiphy_work_cancel); void wiphy_work_flush(struct wiphy *wiphy, struct wiphy_work *work) { struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy); unsigned long flags; bool run; trace_wiphy_work_flush(wiphy, work); spin_lock_irqsave(&rdev->wiphy_work_lock, flags); run = !work || !list_empty(&work->entry); spin_unlock_irqrestore(&rdev->wiphy_work_lock, flags); if (run) cfg80211_process_wiphy_works(rdev, work); } EXPORT_SYMBOL_GPL(wiphy_work_flush); void wiphy_delayed_work_timer(struct timer_list *t) { struct wiphy_delayed_work *dwork = from_timer(dwork, t, timer); wiphy_work_queue(dwork->wiphy, &dwork->work); } EXPORT_SYMBOL(wiphy_delayed_work_timer); void wiphy_delayed_work_queue(struct wiphy *wiphy, struct wiphy_delayed_work *dwork, unsigned long delay) { trace_wiphy_delayed_work_queue(wiphy, &dwork->work, delay); if (!delay) { timer_delete(&dwork->timer); wiphy_work_queue(wiphy, &dwork->work); return; } dwork->wiphy = wiphy; mod_timer(&dwork->timer, jiffies + delay); } EXPORT_SYMBOL_GPL(wiphy_delayed_work_queue); void wiphy_delayed_work_cancel(struct wiphy *wiphy, struct wiphy_delayed_work *dwork) { lockdep_assert_held(&wiphy->mtx); timer_delete_sync(&dwork->timer); wiphy_work_cancel(wiphy, &dwork->work); } EXPORT_SYMBOL_GPL(wiphy_delayed_work_cancel); void wiphy_delayed_work_flush(struct wiphy *wiphy, struct wiphy_delayed_work *dwork) { lockdep_assert_held(&wiphy->mtx); timer_delete_sync(&dwork->timer); wiphy_work_flush(wiphy, &dwork->work); } EXPORT_SYMBOL_GPL(wiphy_delayed_work_flush); bool wiphy_delayed_work_pending(struct wiphy *wiphy, struct wiphy_delayed_work *dwork) { return 
timer_pending(&dwork->timer); } EXPORT_SYMBOL_GPL(wiphy_delayed_work_pending); static int __init cfg80211_init(void) { int err; err = register_pernet_device(&cfg80211_pernet_ops); if (err) goto out_fail_pernet; err = wiphy_sysfs_init(); if (err) goto out_fail_sysfs; err = register_netdevice_notifier(&cfg80211_netdev_notifier); if (err) goto out_fail_notifier; err = nl80211_init(); if (err) goto out_fail_nl80211; ieee80211_debugfs_dir = debugfs_create_dir("ieee80211", NULL); err = regulatory_init(); if (err) goto out_fail_reg; cfg80211_wq = alloc_ordered_workqueue("cfg80211", WQ_MEM_RECLAIM); if (!cfg80211_wq) { err = -ENOMEM; goto out_fail_wq; } return 0; out_fail_wq: regulatory_exit(); out_fail_reg: debugfs_remove(ieee80211_debugfs_dir); nl80211_exit(); out_fail_nl80211: unregister_netdevice_notifier(&cfg80211_netdev_notifier); out_fail_notifier: wiphy_sysfs_exit(); out_fail_sysfs: unregister_pernet_device(&cfg80211_pernet_ops); out_fail_pernet: return err; } fs_initcall(cfg80211_init); static void __exit cfg80211_exit(void) { debugfs_remove(ieee80211_debugfs_dir); nl80211_exit(); unregister_netdevice_notifier(&cfg80211_netdev_notifier); wiphy_sysfs_exit(); regulatory_exit(); unregister_pernet_device(&cfg80211_pernet_ops); destroy_workqueue(cfg80211_wq); } module_exit(cfg80211_exit); |
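The reverse-order error unwinding used by cfg80211_init() above can be sketched in isolation. This is a minimal userspace model under stated assumptions: `step_init()`, `step_exit()`, `fail_at`, `live` and `demo_init()` are hypothetical names invented for illustration and do not exist in the kernel; only the goto-label structure follows the original.

```c
#include <assert.h>

/* Hypothetical test knob: which init step should fail (0 = none). */
static int fail_at;
/* Hypothetical bitmask of steps that are currently initialized. */
static int live;

/* Pretend to bring up subsystem n; fails when fail_at == n. */
static int step_init(int n)
{
	if (fail_at == n)
		return -1;
	live |= 1 << n;
	return 0;
}

/* Tear down subsystem n. */
static void step_exit(int n)
{
	live &= ~(1 << n);
}

/*
 * Each failure jumps to a label that undoes only the steps that
 * already succeeded, in reverse order of initialization.
 */
static int demo_init(void)
{
	int err;

	err = step_init(1);
	if (err)
		goto out_fail_1;

	err = step_init(2);
	if (err)
		goto out_fail_2;

	err = step_init(3);
	if (err)
		goto out_fail_3;

	return 0;

out_fail_3:
	step_exit(2);
out_fail_2:
	step_exit(1);
out_fail_1:
	return err;
}
```

Whichever step fails, no earlier step is left initialized once demo_init() returns the error.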
// SPDX-License-Identifier: GPL-2.0-only #include "cgroup-internal.h" #include <linux/sched/cputime.h> #include <linux/bpf.h> #include <linux/btf.h> #include <linux/btf_ids.h> #include <trace/events/cgroup.h> static DEFINE_SPINLOCK(cgroup_rstat_lock); static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock); static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu); static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu) { return per_cpu_ptr(cgrp->rstat_cpu, cpu); } /* * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock). * * This makes it easier to diagnose locking issues and contention in * production environments. The parameter @fast_path determines the * tracepoints being added, allowing us to diagnose "flush" related * operations without handling high-frequency fast-path "update" events. */ static __always_inline unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu, struct cgroup *cgrp, const bool fast_path) { unsigned long flags; bool contended; /* * The _irqsave() is needed because cgroup_rstat_lock is * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring * this lock with the _irq() suffix only disables interrupts on * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables * interrupts on both configurations.
The _irqsave() ensures * that interrupts are always disabled and later restored. */ contended = !raw_spin_trylock_irqsave(cpu_lock, flags); if (contended) { if (fast_path) trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended); else trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended); raw_spin_lock_irqsave(cpu_lock, flags); } if (fast_path) trace_cgroup_rstat_cpu_locked_fastpath(cgrp, cpu, contended); else trace_cgroup_rstat_cpu_locked(cgrp, cpu, contended); return flags; } static __always_inline void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu, struct cgroup *cgrp, unsigned long flags, const bool fast_path) { if (fast_path) trace_cgroup_rstat_cpu_unlock_fastpath(cgrp, cpu, false); else trace_cgroup_rstat_cpu_unlock(cgrp, cpu, false); raw_spin_unlock_irqrestore(cpu_lock, flags); } /** * cgroup_rstat_updated - keep track of updated rstat_cpu * @cgrp: target cgroup * @cpu: cpu on which rstat_cpu was updated * * @cgrp's rstat_cpu on @cpu was updated. Put it on the parent's matching * rstat_cpu->updated_children list. See the comment on top of * cgroup_rstat_cpu definition for details. */ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) { raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu); unsigned long flags; /* * Speculative already-on-list test. This may race leading to * temporary inaccuracies, which is fine. * * Because @parent's updated_children is terminated with @parent * instead of NULL, we can tell whether @cgrp is on the list by * testing the next pointer for NULL. */ if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next)) return; flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true); /* put @cgrp and all ancestors on the corresponding updated lists */ while (true) { struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu); struct cgroup *parent = cgroup_parent(cgrp); struct cgroup_rstat_cpu *prstatc; /* * Both additions and removals are bottom-up. 
If a cgroup * is already in the tree, all ancestors are. */ if (rstatc->updated_next) break; /* Root has no parent to link it to, but mark it busy */ if (!parent) { rstatc->updated_next = cgrp; break; } prstatc = cgroup_rstat_cpu(parent, cpu); rstatc->updated_next = prstatc->updated_children; prstatc->updated_children = cgrp; cgrp = parent; } _cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true); } /** * cgroup_rstat_push_children - push children cgroups into the given list * @head: current head of the list (= subtree root) * @child: first child of the root * @cpu: target cpu * Return: A new singly linked list of cgroups to be flushed * * Iteratively traverse down the cgroup_rstat_cpu updated tree level by * level and push all the parents first before their next level children * into a singly linked list built from the tail backward like "pushing" * cgroups into a stack. The root is pushed by the caller. */ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head, struct cgroup *child, int cpu) { struct cgroup *chead = child; /* Head of child cgroup level */ struct cgroup *ghead = NULL; /* Head of grandchild cgroup level */ struct cgroup *parent, *grandchild; struct cgroup_rstat_cpu *crstatc; child->rstat_flush_next = NULL; next_level: while (chead) { child = chead; chead = child->rstat_flush_next; parent = cgroup_parent(child); /* updated_next is parent cgroup terminated */ while (child != parent) { child->rstat_flush_next = head; head = child; crstatc = cgroup_rstat_cpu(child, cpu); grandchild = crstatc->updated_children; if (grandchild != child) { /* Push the grand child to the next level */ crstatc->updated_children = child; grandchild->rstat_flush_next = ghead; ghead = grandchild; } child = crstatc->updated_next; crstatc->updated_next = NULL; } } if (ghead) { chead = ghead; ghead = NULL; goto next_level; } return head; } /** * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed * @root: root of the cgroup subtree to
traverse * @cpu: target cpu * Return: A singly linked list of cgroups to be flushed * * Walks the updated rstat_cpu tree on @cpu from @root. During traversal, * each returned cgroup is unlinked from the updated tree. * * The only ordering guarantee is that, for a parent and a child pair * covered by a given traversal, the child is before its parent in * the list. * * Note that updated_children is self terminated and points to a list of * child cgroups if not empty. Whereas updated_next is like a sibling link * within the children list and terminated by the parent cgroup. An exception * here is the cgroup root whose updated_next can be self terminated. */ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu) { raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu); struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu); struct cgroup *head = NULL, *parent, *child; unsigned long flags; flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root, false); /* Return NULL if this subtree is not on-list */ if (!rstatc->updated_next) goto unlock_ret; /* * Unlink @root from its parent. As the updated_children list is * singly linked, we have to walk it to find the removal point. 
*/ parent = cgroup_parent(root); if (parent) { struct cgroup_rstat_cpu *prstatc; struct cgroup **nextp; prstatc = cgroup_rstat_cpu(parent, cpu); nextp = &prstatc->updated_children; while (*nextp != root) { struct cgroup_rstat_cpu *nrstatc; nrstatc = cgroup_rstat_cpu(*nextp, cpu); WARN_ON_ONCE(*nextp == parent); nextp = &nrstatc->updated_next; } *nextp = rstatc->updated_next; } rstatc->updated_next = NULL; /* Push @root to the list first before pushing the children */ head = root; root->rstat_flush_next = NULL; child = rstatc->updated_children; rstatc->updated_children = root; if (child != root) head = cgroup_rstat_push_children(head, child, cpu); unlock_ret: _cgroup_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false); return head; } /* * A hook for bpf stat collectors to attach to and flush their stats. * Together with providing bpf kfuncs for cgroup_rstat_updated() and * cgroup_rstat_flush(), this enables a complete workflow where bpf progs that * collect cgroup stats can integrate with rstat for efficient flushing. * * A static noinline declaration here could cause the compiler to optimize away * the function. A global noinline declaration will keep the definition, but may * optimize away the callsite. Therefore, __weak is needed to ensure that the * call is still emitted, by telling the compiler that we don't know what the * function might eventually be. */ __bpf_hook_start(); __weak noinline void bpf_rstat_flush(struct cgroup *cgrp, struct cgroup *parent, int cpu) { } __bpf_hook_end(); /* * Helper functions for locking cgroup_rstat_lock. * * This makes it easier to diagnose locking issues and contention in * production environments. The parameter @cpu_in_loop indicates that the lock * was released and re-taken when collecting data from the CPUs. The * value -1 is used when obtaining the main lock; otherwise it is the CPU * number processed last.
*/ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop) __acquires(&cgroup_rstat_lock) { bool contended; contended = !spin_trylock_irq(&cgroup_rstat_lock); if (contended) { trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended); spin_lock_irq(&cgroup_rstat_lock); } trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended); } static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) __releases(&cgroup_rstat_lock) { trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false); spin_unlock_irq(&cgroup_rstat_lock); } /** * cgroup_rstat_flush - flush stats in @cgrp's subtree * @cgrp: target cgroup * * Collect all per-cpu stats in @cgrp's subtree into the global counters * and propagate them upwards. After this function returns, all cgroups in * the subtree have up-to-date ->stat. * * This also gets all cgroups in the subtree including @cgrp off the * ->updated_children lists. * * This function may block. */ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) { int cpu; might_sleep(); for_each_possible_cpu(cpu) { struct cgroup *pos; /* Reacquire for each CPU to avoid disabling IRQs too long */ __cgroup_rstat_lock(cgrp, cpu); pos = cgroup_rstat_updated_list(cgrp, cpu); for (; pos; pos = pos->rstat_flush_next) { struct cgroup_subsys_state *css; cgroup_base_stat_flush(pos, cpu); bpf_rstat_flush(pos, cgroup_parent(pos), cpu); rcu_read_lock(); list_for_each_entry_rcu(css, &pos->rstat_css_list, rstat_css_node) css->ss->css_rstat_flush(css, cpu); rcu_read_unlock(); } __cgroup_rstat_unlock(cgrp, cpu); if (!cond_resched()) cpu_relax(); } } int cgroup_rstat_init(struct cgroup *cgrp) { int cpu; /* the root cgrp has rstat_cpu preallocated */ if (!cgrp->rstat_cpu) { cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu); if (!cgrp->rstat_cpu) return -ENOMEM; } /* ->updated_children list is self terminated */ for_each_possible_cpu(cpu) { struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu); rstatc->updated_children = cgrp; 
u64_stats_init(&rstatc->bsync); } return 0; } void cgroup_rstat_exit(struct cgroup *cgrp) { int cpu; cgroup_rstat_flush(cgrp); /* sanity check */ for_each_possible_cpu(cpu) { struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu); if (WARN_ON_ONCE(rstatc->updated_children != cgrp) || WARN_ON_ONCE(rstatc->updated_next)) return; } free_percpu(cgrp->rstat_cpu); cgrp->rstat_cpu = NULL; } void __init cgroup_rstat_boot(void) { int cpu; for_each_possible_cpu(cpu) raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu)); } /* * Functions for cgroup basic resource statistics implemented on top of * rstat. */ static void cgroup_base_stat_add(struct cgroup_base_stat *dst_bstat, struct cgroup_base_stat *src_bstat) { dst_bstat->cputime.utime += src_bstat->cputime.utime; dst_bstat->cputime.stime += src_bstat->cputime.stime; dst_bstat->cputime.sum_exec_runtime += src_bstat->cputime.sum_exec_runtime; #ifdef CONFIG_SCHED_CORE dst_bstat->forceidle_sum += src_bstat->forceidle_sum; #endif dst_bstat->ntime += src_bstat->ntime; } static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat, struct cgroup_base_stat *src_bstat) { dst_bstat->cputime.utime -= src_bstat->cputime.utime; dst_bstat->cputime.stime -= src_bstat->cputime.stime; dst_bstat->cputime.sum_exec_runtime -= src_bstat->cputime.sum_exec_runtime; #ifdef CONFIG_SCHED_CORE dst_bstat->forceidle_sum -= src_bstat->forceidle_sum; #endif dst_bstat->ntime -= src_bstat->ntime; } static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu) { struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu); struct cgroup *parent = cgroup_parent(cgrp); struct cgroup_rstat_cpu *prstatc; struct cgroup_base_stat delta; unsigned seq; /* Root-level stats are sourced from system-wide CPU stats */ if (!parent) return; /* fetch the current per-cpu values */ do { seq = __u64_stats_fetch_begin(&rstatc->bsync); delta = rstatc->bstat; } while (__u64_stats_fetch_retry(&rstatc->bsync, seq)); /* propagate per-cpu delta to cgroup and 
per-cpu global statistics */ cgroup_base_stat_sub(&delta, &rstatc->last_bstat); cgroup_base_stat_add(&cgrp->bstat, &delta); cgroup_base_stat_add(&rstatc->last_bstat, &delta); cgroup_base_stat_add(&rstatc->subtree_bstat, &delta); /* propagate cgroup and per-cpu global delta to parent (unless that's root) */ if (cgroup_parent(parent)) { delta = cgrp->bstat; cgroup_base_stat_sub(&delta, &cgrp->last_bstat); cgroup_base_stat_add(&parent->bstat, &delta); cgroup_base_stat_add(&cgrp->last_bstat, &delta); delta = rstatc->subtree_bstat; prstatc = cgroup_rstat_cpu(parent, cpu); cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat); cgroup_base_stat_add(&prstatc->subtree_bstat, &delta); cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta); } } static struct cgroup_rstat_cpu * cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags) { struct cgroup_rstat_cpu *rstatc; rstatc = get_cpu_ptr(cgrp->rstat_cpu); *flags = u64_stats_update_begin_irqsave(&rstatc->bsync); return rstatc; } static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp, struct cgroup_rstat_cpu *rstatc, unsigned long flags) { u64_stats_update_end_irqrestore(&rstatc->bsync, flags); cgroup_rstat_updated(cgrp, smp_processor_id()); put_cpu_ptr(rstatc); } void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec) { struct cgroup_rstat_cpu *rstatc; unsigned long flags; rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags); rstatc->bstat.cputime.sum_exec_runtime += delta_exec; cgroup_base_stat_cputime_account_end(cgrp, rstatc, flags); } void __cgroup_account_cputime_field(struct cgroup *cgrp, enum cpu_usage_stat index, u64 delta_exec) { struct cgroup_rstat_cpu *rstatc; unsigned long flags; rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags); switch (index) { case CPUTIME_NICE: rstatc->bstat.ntime += delta_exec; fallthrough; case CPUTIME_USER: rstatc->bstat.cputime.utime += delta_exec; break; case CPUTIME_SYSTEM: case CPUTIME_IRQ: case 
CPUTIME_SOFTIRQ: rstatc->bstat.cputime.stime += delta_exec; break; #ifdef CONFIG_SCHED_CORE case CPUTIME_FORCEIDLE: rstatc->bstat.forceidle_sum += delta_exec; break; #endif default: break; } cgroup_base_stat_cputime_account_end(cgrp, rstatc, flags); } /* * compute the cputime for the root cgroup by getting the per cpu data * at a global level, then categorizing the fields in a manner consistent * with how it is done by __cgroup_account_cputime_field for each bit of * cpu time attributed to a cgroup. */ static void root_cgroup_cputime(struct cgroup_base_stat *bstat) { struct task_cputime *cputime = &bstat->cputime; int i; memset(bstat, 0, sizeof(*bstat)); for_each_possible_cpu(i) { struct kernel_cpustat kcpustat; u64 *cpustat = kcpustat.cpustat; u64 user = 0; u64 sys = 0; kcpustat_cpu_fetch(&kcpustat, i); user += cpustat[CPUTIME_USER]; user += cpustat[CPUTIME_NICE]; cputime->utime += user; sys += cpustat[CPUTIME_SYSTEM]; sys += cpustat[CPUTIME_IRQ]; sys += cpustat[CPUTIME_SOFTIRQ]; cputime->stime += sys; cputime->sum_exec_runtime += user; cputime->sum_exec_runtime += sys; #ifdef CONFIG_SCHED_CORE bstat->forceidle_sum += cpustat[CPUTIME_FORCEIDLE]; #endif bstat->ntime += cpustat[CPUTIME_NICE]; } } static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat *bstat) { #ifdef CONFIG_SCHED_CORE u64 forceidle_time = bstat->forceidle_sum; do_div(forceidle_time, NSEC_PER_USEC); seq_printf(seq, "core_sched.force_idle_usec %llu\n", forceidle_time); #endif } void cgroup_base_stat_cputime_show(struct seq_file *seq) { struct cgroup *cgrp = seq_css(seq)->cgroup; struct cgroup_base_stat bstat; if (cgroup_parent(cgrp)) { cgroup_rstat_flush(cgrp); __cgroup_rstat_lock(cgrp, -1); bstat = cgrp->bstat; cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime, &bstat.cputime.utime, &bstat.cputime.stime); __cgroup_rstat_unlock(cgrp, -1); } else { root_cgroup_cputime(&bstat); } do_div(bstat.cputime.sum_exec_runtime, NSEC_PER_USEC); do_div(bstat.cputime.utime, 
NSEC_PER_USEC); do_div(bstat.cputime.stime, NSEC_PER_USEC); do_div(bstat.ntime, NSEC_PER_USEC); seq_printf(seq, "usage_usec %llu\n" "user_usec %llu\n" "system_usec %llu\n" "nice_usec %llu\n", bstat.cputime.sum_exec_runtime, bstat.cputime.utime, bstat.cputime.stime, bstat.ntime); cgroup_force_idle_show(seq, &bstat); } /* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */ BTF_KFUNCS_START(bpf_rstat_kfunc_ids) BTF_ID_FLAGS(func, cgroup_rstat_updated) BTF_ID_FLAGS(func, cgroup_rstat_flush, KF_SLEEPABLE) BTF_KFUNCS_END(bpf_rstat_kfunc_ids) static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = { .owner = THIS_MODULE, .set = &bpf_rstat_kfunc_ids, }; static int __init bpf_rstat_kfunc_init(void) { return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_rstat_kfunc_set); } late_initcall(bpf_rstat_kfunc_init); |
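The delta bookkeeping in cgroup_base_stat_flush() above can be sketched in userspace on a single counter. This is a simplified, hypothetical model rather than the kernel code: `struct node`, `flush_one()` and the four `uint64_t` fields are illustrative stand-ins for `struct cgroup`, the per-CPU `bstat`/`last_bstat` pair and the cgroup-level `bstat`/`last_bstat` pair. Locking, per-CPU storage, the subtree counters and the special handling of the root are all omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one cgroup with a single per-CPU counter. */
struct node {
	struct node *parent;
	uint64_t percpu;      /* ever-growing per-CPU value */
	uint64_t last_percpu; /* part of percpu already folded into total */
	uint64_t total;       /* aggregated value for this node's subtree */
	uint64_t last_total;  /* part of total already pushed to the parent */
};

/*
 * Fold the not-yet-seen per-CPU delta into the node, then push the
 * not-yet-propagated part of the node's total to its parent. This is
 * the same sub/add/add sequence as in the original, on one field.
 * Children are assumed to be flushed before their parents.
 */
static void flush_one(struct node *n)
{
	uint64_t delta = n->percpu - n->last_percpu;

	n->total += delta;
	n->last_percpu += delta;

	if (n->parent) {
		delta = n->total - n->last_total;
		n->parent->total += delta;
		n->last_total += delta;
	}
}
```

Because each level remembers how much it has already forwarded, flushing the same node twice without new updates adds nothing to the parent.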
// SPDX-License-Identifier: GPL-2.0-only /* * * Generic part shared by ipv4 and ipv6 backends. */ #include <linux/kernel.h> #include <linux/init.h> #include <linux/module.h> #include <linux/netlink.h> #include <linux/netfilter.h> #include <linux/netfilter/nf_tables.h> #include <net/netfilter/nf_tables_core.h> #include <net/netfilter/nf_tables.h> #include <net/netfilter/nft_fib.h> #define NFTA_FIB_F_ALL (NFTA_FIB_F_SADDR | NFTA_FIB_F_DADDR | \ NFTA_FIB_F_MARK | NFTA_FIB_F_IIF | NFTA_FIB_F_OIF | \ NFTA_FIB_F_PRESENT) const struct nla_policy nft_fib_policy[NFTA_FIB_MAX + 1] = { [NFTA_FIB_DREG] = { .type = NLA_U32 }, [NFTA_FIB_RESULT] = { .type = NLA_U32 }, [NFTA_FIB_FLAGS] = NLA_POLICY_MASK(NLA_BE32, NFTA_FIB_F_ALL), }; EXPORT_SYMBOL(nft_fib_policy); int nft_fib_validate(const struct nft_ctx *ctx, const struct nft_expr *expr) { const struct nft_fib *priv = nft_expr_priv(expr); unsigned int hooks; switch (priv->result) { case NFT_FIB_RESULT_OIF: case NFT_FIB_RESULT_OIFNAME: hooks = (1 << NF_INET_PRE_ROUTING) | (1 << NF_INET_LOCAL_IN) | (1 << NF_INET_FORWARD); break; case NFT_FIB_RESULT_ADDRTYPE: if (priv->flags & NFTA_FIB_F_IIF) hooks = (1 << NF_INET_PRE_ROUTING) | (1 << NF_INET_LOCAL_IN) | (1 << 
NF_INET_FORWARD); else if (priv->flags & NFTA_FIB_F_OIF) hooks = (1 << NF_INET_LOCAL_OUT) | (1 << NF_INET_POST_ROUTING) | (1 << NF_INET_FORWARD); else hooks = (1 << NF_INET_LOCAL_IN) | (1 << NF_INET_LOCAL_OUT) | (1 << NF_INET_FORWARD) | (1 << NF_INET_PRE_ROUTING) | (1 << NF_INET_POST_ROUTING); break; default: return -EINVAL; } return nft_chain_validate_hooks(ctx->chain, hooks); } EXPORT_SYMBOL_GPL(nft_fib_validate); int nft_fib_init(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nlattr * const tb[]) { struct nft_fib *priv = nft_expr_priv(expr); unsigned int len; int err; if (!tb[NFTA_FIB_DREG] || !tb[NFTA_FIB_RESULT] || !tb[NFTA_FIB_FLAGS]) return -EINVAL; priv->flags = ntohl(nla_get_be32(tb[NFTA_FIB_FLAGS])); if (priv->flags == 0) return -EINVAL; if ((priv->flags & (NFTA_FIB_F_SADDR | NFTA_FIB_F_DADDR)) == (NFTA_FIB_F_SADDR | NFTA_FIB_F_DADDR)) return -EINVAL; if ((priv->flags & (NFTA_FIB_F_IIF | NFTA_FIB_F_OIF)) == (NFTA_FIB_F_IIF | NFTA_FIB_F_OIF)) return -EINVAL; if ((priv->flags & (NFTA_FIB_F_SADDR | NFTA_FIB_F_DADDR)) == 0) return -EINVAL; priv->result = ntohl(nla_get_be32(tb[NFTA_FIB_RESULT])); switch (priv->result) { case NFT_FIB_RESULT_OIF: if (priv->flags & NFTA_FIB_F_OIF) return -EINVAL; len = sizeof(int); break; case NFT_FIB_RESULT_OIFNAME: if (priv->flags & NFTA_FIB_F_OIF) return -EINVAL; len = IFNAMSIZ; break; case NFT_FIB_RESULT_ADDRTYPE: len = sizeof(u32); break; default: return -EINVAL; } err = nft_parse_register_store(ctx, tb[NFTA_FIB_DREG], &priv->dreg, NULL, NFT_DATA_VALUE, len); if (err < 0) return err; return 0; } EXPORT_SYMBOL_GPL(nft_fib_init); int nft_fib_dump(struct sk_buff *skb, const struct nft_expr *expr, bool reset) { const struct nft_fib *priv = nft_expr_priv(expr); if (nft_dump_register(skb, NFTA_FIB_DREG, priv->dreg)) return -1; if (nla_put_be32(skb, NFTA_FIB_RESULT, htonl(priv->result))) return -1; if (nla_put_be32(skb, NFTA_FIB_FLAGS, htonl(priv->flags))) return -1; return 0; } 
EXPORT_SYMBOL_GPL(nft_fib_dump); void nft_fib_store_result(void *reg, const struct nft_fib *priv, const struct net_device *dev) { u32 *dreg = reg; int index; switch (priv->result) { case NFT_FIB_RESULT_OIF: index = dev ? dev->ifindex : 0; if (priv->flags & NFTA_FIB_F_PRESENT) nft_reg_store8(dreg, !!index); else *dreg = index; break; case NFT_FIB_RESULT_OIFNAME: if (priv->flags & NFTA_FIB_F_PRESENT) nft_reg_store8(dreg, !!dev); else strscpy_pad(reg, dev ? dev->name : "", IFNAMSIZ); break; default: WARN_ON_ONCE(1); *dreg = 0; break; } } EXPORT_SYMBOL_GPL(nft_fib_store_result); bool nft_fib_reduce(struct nft_regs_track *track, const struct nft_expr *expr) { const struct nft_fib *priv = nft_expr_priv(expr); unsigned int len = NFT_REG32_SIZE; const struct nft_fib *fib; switch (priv->result) { case NFT_FIB_RESULT_OIF: break; case NFT_FIB_RESULT_OIFNAME: if (priv->flags & NFTA_FIB_F_PRESENT) len = NFT_REG32_SIZE; else len = IFNAMSIZ; break; case NFT_FIB_RESULT_ADDRTYPE: break; default: WARN_ON_ONCE(1); break; } if (!nft_reg_track_cmp(track, expr, priv->dreg)) { nft_reg_track_update(track, expr, priv->dreg, len); return false; } fib = nft_expr_priv(track->regs[priv->dreg].selector); if (priv->result != fib->result || priv->flags != fib->flags) { nft_reg_track_update(track, expr, priv->dreg, len); return false; } if (!track->regs[priv->dreg].bitwise) return true; return false; } EXPORT_SYMBOL_GPL(nft_fib_reduce); MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("Query routing table from nftables"); MODULE_AUTHOR("Florian Westphal <fw@strlen.de>"); |
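The flag sanity checks at the top of nft_fib_init() above can be restated as a standalone predicate. This is a sketch under assumptions: `fib_flags_valid()` and the `FIB_F_*` constants are hypothetical names, and the bit values are chosen for illustration only (the real ones come from the nf_tables UAPI header). The four rejection rules follow the original checks in the same order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits; assumed values, not taken from the UAPI header. */
enum {
	FIB_F_SADDR = 1u << 0,
	FIB_F_DADDR = 1u << 1,
	FIB_F_MARK  = 1u << 2,
	FIB_F_IIF   = 1u << 3,
	FIB_F_OIF   = 1u << 4,
};

/* Returns true when the flag combination would pass the init-time checks. */
static bool fib_flags_valid(uint32_t flags)
{
	const uint32_t addr = FIB_F_SADDR | FIB_F_DADDR;
	const uint32_t ifs = FIB_F_IIF | FIB_F_OIF;

	if (flags == 0)
		return false;            /* at least one flag is required */
	if ((flags & addr) == addr)
		return false;            /* saddr and daddr are mutually exclusive */
	if ((flags & ifs) == ifs)
		return false;            /* iif and oif are mutually exclusive */
	if ((flags & addr) == 0)
		return false;            /* exactly one address selector is needed */
	return true;
}
```

For example, `FIB_F_SADDR | FIB_F_IIF` is accepted, while `FIB_F_MARK` alone is rejected because no address selector is set.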
/* SPDX-License-Identifier: GPL-2.0 */ #ifndef _ASM_X86_TLB_H #define _ASM_X86_TLB_H #define tlb_flush tlb_flush static inline void tlb_flush(struct mmu_gather *tlb); #include <asm-generic/tlb.h> #include <linux/kernel.h> #include <vdso/bits.h> #include <vdso/page.h> static inline void tlb_flush(struct mmu_gather *tlb) { unsigned long start = 0UL, end = TLB_FLUSH_ALL; unsigned int stride_shift = tlb_get_unmap_shift(tlb); if (!tlb->fullmm && !tlb->need_flush_all) { start = tlb->start; end = tlb->end; } flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables); } static inline void invlpg(unsigned long addr) { asm volatile("invlpg (%0)" ::"r" (addr) : "memory"); } enum addr_stride { PTE_STRIDE = 0, PMD_STRIDE = 1 }; /* * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination * of the three. For example: * - FLAG_VA | FLAG_INCLUDE_GLOBAL: invalidate all TLB entries at the address * - FLAG_PCID: invalidate all TLB entries matching the PCID * * The first is used to invalidate (kernel) mappings at a particular * address across all processes. * * The latter invalidates all TLB entries matching a PCID.
*/ #define INVLPGB_FLAG_VA BIT(0) #define INVLPGB_FLAG_PCID BIT(1) #define INVLPGB_FLAG_ASID BIT(2) #define INVLPGB_FLAG_INCLUDE_GLOBAL BIT(3) #define INVLPGB_FLAG_FINAL_ONLY BIT(4) #define INVLPGB_FLAG_INCLUDE_NESTED BIT(5) /* The implied mode when all bits are clear: */ #define INVLPGB_MODE_ALL_NONGLOBALS 0UL #ifdef CONFIG_BROADCAST_TLB_FLUSH /* * INVLPGB does broadcast TLB invalidation across all the CPUs in the system. * * The INVLPGB instruction is weakly ordered, and a batch of invalidations can * be done in a parallel fashion. * * The instruction takes the number of extra pages to invalidate, beyond the * first page, while __invlpgb gets the more human readable number of pages to * invalidate. * * The bits in rax[0:2] determine respectively which components of the address * (VA, PCID, ASID) get compared when flushing. If none of these bits is set, *any* * address in the specified range matches. * * Since it is desired to only flush TLB entries for the ASID that is executing * the instruction (a host/hypervisor or a guest), the ASID valid bit should * always be set. On a host/hypervisor, the hardware will use the ASID value * specified in EDX[15:0] (which should be 0). On a guest, the hardware will * use the actual ASID value of the guest. * * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from * this CPU have completed. */ static inline void __invlpgb(unsigned long asid, unsigned long pcid, unsigned long addr, u16 nr_pages, enum addr_stride stride, u8 flags) { u64 rax = addr | flags | INVLPGB_FLAG_ASID; u32 ecx = (stride << 31) | (nr_pages - 1); u32 edx = (pcid << 16) | asid; /* The low bits in rax are for flags. Verify addr is clean. */ VM_WARN_ON_ONCE(addr & ~PAGE_MASK); /* INVLPGB; supported in binutils >= 2.36.
*/ asm volatile(".byte 0x0f, 0x01, 0xfe" :: "a" (rax), "c" (ecx), "d" (edx)); } static inline void __invlpgb_all(unsigned long asid, unsigned long pcid, u8 flags) { __invlpgb(asid, pcid, 0, 1, 0, flags); } static inline void __tlbsync(void) { /* * TLBSYNC waits for INVLPGB instructions originating on the same CPU * to have completed. Print a warning if the task has been migrated, * and might not be waiting on all the INVLPGBs issued during this TLB * invalidation sequence. */ cant_migrate(); /* TLBSYNC: supported in binutils >= 2.36. */ asm volatile(".byte 0x0f, 0x01, 0xff" ::: "memory"); } #else /* Some compilers (I'm looking at you clang!) simply can't do DCE */ static inline void __invlpgb(unsigned long asid, unsigned long pcid, unsigned long addr, u16 nr_pages, enum addr_stride s, u8 flags) { } static inline void __invlpgb_all(unsigned long asid, unsigned long pcid, u8 flags) { } static inline void __tlbsync(void) { } #endif static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid, unsigned long addr, u16 nr, bool stride) { enum addr_stride str = stride ? PMD_STRIDE : PTE_STRIDE; u8 flags = INVLPGB_FLAG_PCID | INVLPGB_FLAG_VA; __invlpgb(0, pcid, addr, nr, str, flags); } /* Flush all mappings for a given PCID, not including globals. */ static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid) { __invlpgb_all(0, pcid, INVLPGB_FLAG_PCID); } /* Flush all mappings, including globals, for all PCIDs. */ static inline void invlpgb_flush_all(void) { /* * TLBSYNC at the end needs to make sure all flushes done on the * current CPU have been executed system-wide. Therefore, make * sure nothing gets migrated in-between but disable preemption * as it is cheaper. */ guard(preempt)(); __invlpgb_all(0, 0, INVLPGB_FLAG_INCLUDE_GLOBAL); __tlbsync(); } /* Flush addr, including globals, for all PCIDs.
*/ static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr) { __invlpgb(0, 0, addr, nr, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL); } /* Flush all mappings for all PCIDs except globals. */ static inline void invlpgb_flush_all_nonglobals(void) { guard(preempt)(); __invlpgb_all(0, 0, INVLPGB_MODE_ALL_NONGLOBALS); __tlbsync(); } #endif /* _ASM_X86_TLB_H */ |
// SPDX-License-Identifier: GPL-2.0 /* * linux/mm/page_io.c * * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds * * Swap reorganised 29.12.95, * Asynchronous swapping added 30.12.95. Stephen Tweedie * Removed race in async swapping. 14.4.1996. Bruno Haible * Add swap of shared pages through the page cache. 20.2.1998. Stephen Tweedie * Always use brw_page, life becomes simpler. 12 May 1998 Eric Biederman */ #include <linux/mm.h> #include <linux/kernel_stat.h> #include <linux/gfp.h> #include <linux/pagemap.h> #include <linux/swap.h> #include <linux/bio.h> #include <linux/swapops.h> #include <linux/writeback.h> #include <linux/blkdev.h> #include <linux/psi.h> #include <linux/uio.h> #include <linux/sched/task.h> #include <linux/delayacct.h> #include <linux/zswap.h> #include "swap.h" static void __end_swap_bio_write(struct bio *bio) { struct folio *folio = bio_first_folio_all(bio); if (bio->bi_status) { /* * We failed to write the page out to swap-space. * Re-dirty the page in order to avoid it being reclaimed. * Also print a dire warning that things will go BAD (tm) * very quickly.
* * Also clear PG_reclaim to avoid folio_rotate_reclaimable() */ folio_mark_dirty(folio); pr_alert_ratelimited("Write-error on swap-device (%u:%u:%llu)\n", MAJOR(bio_dev(bio)), MINOR(bio_dev(bio)), (unsigned long long)bio->bi_iter.bi_sector); folio_clear_reclaim(folio); } folio_end_writeback(folio); } static void end_swap_bio_write(struct bio *bio) { __end_swap_bio_write(bio); bio_put(bio); } static void __end_swap_bio_read(struct bio *bio) { struct folio *folio = bio_first_folio_all(bio); if (bio->bi_status) { pr_alert_ratelimited("Read-error on swap-device (%u:%u:%llu)\n", MAJOR(bio_dev(bio)), MINOR(bio_dev(bio)), (unsigned long long)bio->bi_iter.bi_sector); } else { folio_mark_uptodate(folio); } folio_unlock(folio); } static void end_swap_bio_read(struct bio *bio) { __end_swap_bio_read(bio); bio_put(bio); } int generic_swapfile_activate(struct swap_info_struct *sis, struct file *swap_file, sector_t *span) { struct address_space *mapping = swap_file->f_mapping; struct inode *inode = mapping->host; unsigned blocks_per_page; unsigned long page_no; unsigned blkbits; sector_t probe_block; sector_t last_block; sector_t lowest_block = -1; sector_t highest_block = 0; int nr_extents = 0; int ret; blkbits = inode->i_blkbits; blocks_per_page = PAGE_SIZE >> blkbits; /* * Map all the blocks into the extent tree. This code doesn't try * to be very smart. 
*/ probe_block = 0; page_no = 0; last_block = i_size_read(inode) >> blkbits; while ((probe_block + blocks_per_page) <= last_block && page_no < sis->max) { unsigned block_in_page; sector_t first_block; cond_resched(); first_block = probe_block; ret = bmap(inode, &first_block); if (ret || !first_block) goto bad_bmap; /* * It must be PAGE_SIZE aligned on-disk */ if (first_block & (blocks_per_page - 1)) { probe_block++; goto reprobe; } for (block_in_page = 1; block_in_page < blocks_per_page; block_in_page++) { sector_t block; block = probe_block + block_in_page; ret = bmap(inode, &block); if (ret || !block) goto bad_bmap; if (block != first_block + block_in_page) { /* Discontiguity */ probe_block++; goto reprobe; } } first_block >>= (PAGE_SHIFT - blkbits); if (page_no) { /* exclude the header page */ if (first_block < lowest_block) lowest_block = first_block; if (first_block > highest_block) highest_block = first_block; } /* * We found a PAGE_SIZE-length, PAGE_SIZE-aligned run of blocks */ ret = add_swap_extent(sis, page_no, 1, first_block); if (ret < 0) goto out; nr_extents += ret; page_no++; probe_block += blocks_per_page; reprobe: continue; } ret = nr_extents; *span = 1 + highest_block - lowest_block; if (page_no == 0) page_no = 1; /* force Empty message */ sis->max = page_no; sis->pages = page_no - 1; out: return ret; bad_bmap: pr_err("swapon: swapfile has holes\n"); ret = -EINVAL; goto out; } static bool is_folio_zero_filled(struct folio *folio) { unsigned int pos, last_pos; unsigned long *data; unsigned int i; last_pos = PAGE_SIZE / sizeof(*data) - 1; for (i = 0; i < folio_nr_pages(folio); i++) { data = kmap_local_folio(folio, i * PAGE_SIZE); /* * Check last word first, in case the page is zero-filled at * the start and has non-zero data at the end, which is common * in real-world workloads.
*/ if (data[last_pos]) { kunmap_local(data); return false; } for (pos = 0; pos < last_pos; pos++) { if (data[pos]) { kunmap_local(data); return false; } } kunmap_local(data); } return true; } static void swap_zeromap_folio_set(struct folio *folio) { struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio); struct swap_info_struct *sis = swp_swap_info(folio->swap); int nr_pages = folio_nr_pages(folio); swp_entry_t entry; unsigned int i; for (i = 0; i < folio_nr_pages(folio); i++) { entry = page_swap_entry(folio_page(folio, i)); set_bit(swp_offset(entry), sis->zeromap); } count_vm_events(SWPOUT_ZERO, nr_pages); if (objcg) { count_objcg_events(objcg, SWPOUT_ZERO, nr_pages); obj_cgroup_put(objcg); } } static void swap_zeromap_folio_clear(struct folio *folio) { struct swap_info_struct *sis = swp_swap_info(folio->swap); swp_entry_t entry; unsigned int i; for (i = 0; i < folio_nr_pages(folio); i++) { entry = page_swap_entry(folio_page(folio, i)); clear_bit(swp_offset(entry), sis->zeromap); } } /* * We may have stale swap cache pages in memory: notice * them here and get rid of the unnecessary final write. */ int swap_writepage(struct page *page, struct writeback_control *wbc) { struct folio *folio = page_folio(page); int ret; if (folio_free_swap(folio)) { folio_unlock(folio); return 0; } /* * Arch code may have to preserve more data than just the page * contents, e.g. memory tags. */ ret = arch_prepare_to_swap(folio); if (ret) { folio_mark_dirty(folio); folio_unlock(folio); return ret; } /* * Use a bitmap (zeromap) to avoid doing IO for zero-filled pages. * The bits in zeromap are protected by the locked swapcache folio * and atomic updates are used to protect against read-modify-write * corruption due to other zero swap entries seeing concurrent updates. 
*/ if (is_folio_zero_filled(folio)) { swap_zeromap_folio_set(folio); folio_unlock(folio); return 0; } else { /* * Clear bits this folio occupies in the zeromap to prevent * zero data being read in from any previous zero writes that * occupied the same swap entries. */ swap_zeromap_folio_clear(folio); } if (zswap_store(folio)) { count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT); folio_unlock(folio); return 0; } if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) { folio_mark_dirty(folio); return AOP_WRITEPAGE_ACTIVATE; } __swap_writepage(folio, wbc); return 0; } static inline void count_swpout_vm_event(struct folio *folio) { #ifdef CONFIG_TRANSPARENT_HUGEPAGE if (unlikely(folio_test_pmd_mappable(folio))) { count_memcg_folio_events(folio, THP_SWPOUT, 1); count_vm_event(THP_SWPOUT); } #endif count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT); count_memcg_folio_events(folio, PSWPOUT, folio_nr_pages(folio)); count_vm_events(PSWPOUT, folio_nr_pages(folio)); } #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP) static void bio_associate_blkg_from_page(struct bio *bio, struct folio *folio) { struct cgroup_subsys_state *css; struct mem_cgroup *memcg; memcg = folio_memcg(folio); if (!memcg) return; rcu_read_lock(); css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys); bio_associate_blkg_from_css(bio, css); rcu_read_unlock(); } #else #define bio_associate_blkg_from_page(bio, folio) do { } while (0) #endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */ struct swap_iocb { struct kiocb iocb; struct bio_vec bvec[SWAP_CLUSTER_MAX]; int pages; int len; }; static mempool_t *sio_pool; int sio_pool_init(void) { if (!sio_pool) { mempool_t *pool = mempool_create_kmalloc_pool( SWAP_CLUSTER_MAX, sizeof(struct swap_iocb)); if (cmpxchg(&sio_pool, NULL, pool)) mempool_destroy(pool); } if (!sio_pool) return -ENOMEM; return 0; } static void sio_write_complete(struct kiocb *iocb, long ret) { struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb); struct page *page = 
sio->bvec[0].bv_page; int p; if (ret != sio->len) { /* * In the case of swap-over-nfs, this can be a * temporary failure if the system has limited * memory for allocating transmit buffers. * Mark the page dirty and avoid * folio_rotate_reclaimable but rate-limit the * messages. */ pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n", ret, swap_dev_pos(page_swap_entry(page))); for (p = 0; p < sio->pages; p++) { page = sio->bvec[p].bv_page; set_page_dirty(page); ClearPageReclaim(page); } } for (p = 0; p < sio->pages; p++) end_page_writeback(sio->bvec[p].bv_page); mempool_free(sio, sio_pool); } static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc) { struct swap_iocb *sio = NULL; struct swap_info_struct *sis = swp_swap_info(folio->swap); struct file *swap_file = sis->swap_file; loff_t pos = swap_dev_pos(folio->swap); count_swpout_vm_event(folio); folio_start_writeback(folio); folio_unlock(folio); if (wbc->swap_plug) sio = *wbc->swap_plug; if (sio) { if (sio->iocb.ki_filp != swap_file || sio->iocb.ki_pos + sio->len != pos) { swap_write_unplug(sio); sio = NULL; } } if (!sio) { sio = mempool_alloc(sio_pool, GFP_NOIO); init_sync_kiocb(&sio->iocb, swap_file); sio->iocb.ki_complete = sio_write_complete; sio->iocb.ki_pos = pos; sio->pages = 0; sio->len = 0; } bvec_set_folio(&sio->bvec[sio->pages], folio, folio_size(folio), 0); sio->len += folio_size(folio); sio->pages += 1; if (sio->pages == ARRAY_SIZE(sio->bvec) || !wbc->swap_plug) { swap_write_unplug(sio); sio = NULL; } if (wbc->swap_plug) *wbc->swap_plug = sio; } static void swap_writepage_bdev_sync(struct folio *folio, struct writeback_control *wbc, struct swap_info_struct *sis) { struct bio_vec bv; struct bio bio; bio_init(&bio, sis->bdev, &bv, 1, REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc)); bio.bi_iter.bi_sector = swap_folio_sector(folio); bio_add_folio_nofail(&bio, folio, folio_size(folio), 0); bio_associate_blkg_from_page(&bio, folio); count_swpout_vm_event(folio); 
folio_start_writeback(folio); folio_unlock(folio); submit_bio_wait(&bio); __end_swap_bio_write(&bio); } static void swap_writepage_bdev_async(struct folio *folio, struct writeback_control *wbc, struct swap_info_struct *sis) { struct bio *bio; bio = bio_alloc(sis->bdev, 1, REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc), GFP_NOIO); bio->bi_iter.bi_sector = swap_folio_sector(folio); bio->bi_end_io = end_swap_bio_write; bio_add_folio_nofail(bio, folio, folio_size(folio), 0); bio_associate_blkg_from_page(bio, folio); count_swpout_vm_event(folio); folio_start_writeback(folio); folio_unlock(folio); submit_bio(bio); } void __swap_writepage(struct folio *folio, struct writeback_control *wbc) { struct swap_info_struct *sis = swp_swap_info(folio->swap); VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio); /* * ->flags can be updated non-atomically (scan_swap_map_slots), * but that will never affect SWP_FS_OPS, so the data_race * is safe. */ if (data_race(sis->flags & SWP_FS_OPS)) swap_writepage_fs(folio, wbc); /* * ->flags can be updated non-atomically (scan_swap_map_slots), * but that will never affect SWP_SYNCHRONOUS_IO, so the data_race * is safe.
*/ else if (data_race(sis->flags & SWP_SYNCHRONOUS_IO)) swap_writepage_bdev_sync(folio, wbc, sis); else swap_writepage_bdev_async(folio, wbc, sis); } void swap_write_unplug(struct swap_iocb *sio) { struct iov_iter from; struct address_space *mapping = sio->iocb.ki_filp->f_mapping; int ret; iov_iter_bvec(&from, ITER_SOURCE, sio->bvec, sio->pages, sio->len); ret = mapping->a_ops->swap_rw(&sio->iocb, &from); if (ret != -EIOCBQUEUED) sio_write_complete(&sio->iocb, ret); } static void sio_read_complete(struct kiocb *iocb, long ret) { struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb); int p; if (ret == sio->len) { for (p = 0; p < sio->pages; p++) { struct folio *folio = page_folio(sio->bvec[p].bv_page); count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN); count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio)); folio_mark_uptodate(folio); folio_unlock(folio); } count_vm_events(PSWPIN, sio->pages); } else { for (p = 0; p < sio->pages; p++) { struct folio *folio = page_folio(sio->bvec[p].bv_page); folio_unlock(folio); } pr_alert_ratelimited("Read-error on swap-device\n"); } mempool_free(sio, sio_pool); } static bool swap_read_folio_zeromap(struct folio *folio) { int nr_pages = folio_nr_pages(folio); struct obj_cgroup *objcg; bool is_zeromap; /* * Swapping in a large folio that is partially in the zeromap is not * currently handled. Return true without marking the folio uptodate so * that an IO error is emitted (e.g. do_swap_page() will sigbus). 
*/ if (WARN_ON_ONCE(swap_zeromap_batch(folio->swap, nr_pages, &is_zeromap) != nr_pages)) return true; if (!is_zeromap) return false; objcg = get_obj_cgroup_from_folio(folio); count_vm_events(SWPIN_ZERO, nr_pages); if (objcg) { count_objcg_events(objcg, SWPIN_ZERO, nr_pages); obj_cgroup_put(objcg); } folio_zero_range(folio, 0, folio_size(folio)); folio_mark_uptodate(folio); return true; } static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug) { struct swap_info_struct *sis = swp_swap_info(folio->swap); struct swap_iocb *sio = NULL; loff_t pos = swap_dev_pos(folio->swap); if (plug) sio = *plug; if (sio) { if (sio->iocb.ki_filp != sis->swap_file || sio->iocb.ki_pos + sio->len != pos) { swap_read_unplug(sio); sio = NULL; } } if (!sio) { sio = mempool_alloc(sio_pool, GFP_KERNEL); init_sync_kiocb(&sio->iocb, sis->swap_file); sio->iocb.ki_pos = pos; sio->iocb.ki_complete = sio_read_complete; sio->pages = 0; sio->len = 0; } bvec_set_folio(&sio->bvec[sio->pages], folio, folio_size(folio), 0); sio->len += folio_size(folio); sio->pages += 1; if (sio->pages == ARRAY_SIZE(sio->bvec) || !plug) { swap_read_unplug(sio); sio = NULL; } if (plug) *plug = sio; } static void swap_read_folio_bdev_sync(struct folio *folio, struct swap_info_struct *sis) { struct bio_vec bv; struct bio bio; bio_init(&bio, sis->bdev, &bv, 1, REQ_OP_READ); bio.bi_iter.bi_sector = swap_folio_sector(folio); bio_add_folio_nofail(&bio, folio, folio_size(folio), 0); /* * Keep this task valid during swap readpage because the oom killer may * attempt to access it in the page fault retry time check. 
*/ get_task_struct(current); count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN); count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio)); count_vm_events(PSWPIN, folio_nr_pages(folio)); submit_bio_wait(&bio); __end_swap_bio_read(&bio); put_task_struct(current); } static void swap_read_folio_bdev_async(struct folio *folio, struct swap_info_struct *sis) { struct bio *bio; bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL); bio->bi_iter.bi_sector = swap_folio_sector(folio); bio->bi_end_io = end_swap_bio_read; bio_add_folio_nofail(bio, folio, folio_size(folio), 0); count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN); count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio)); count_vm_events(PSWPIN, folio_nr_pages(folio)); submit_bio(bio); } void swap_read_folio(struct folio *folio, struct swap_iocb **plug) { struct swap_info_struct *sis = swp_swap_info(folio->swap); bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO; bool workingset = folio_test_workingset(folio); unsigned long pflags; bool in_thrashing; VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio); /* * Count submission time as memory stall and delay. When the device * is congested, or the submitting cgroup IO-throttled, submission * can be a significant part of overall IO time. */ if (workingset) { delayacct_thrashing_start(&in_thrashing); psi_memstall_enter(&pflags); } delayacct_swapin_start(); if (swap_read_folio_zeromap(folio)) { folio_unlock(folio); goto finish; } if (zswap_load(folio) != -ENOENT) goto finish; /* We have to read from slower devices. Increase zswap protection. 
*/ zswap_folio_swapin(folio); if (data_race(sis->flags & SWP_FS_OPS)) { swap_read_folio_fs(folio, plug); } else if (synchronous) { swap_read_folio_bdev_sync(folio, sis); } else { swap_read_folio_bdev_async(folio, sis); } finish: if (workingset) { delayacct_thrashing_end(&in_thrashing); psi_memstall_leave(&pflags); } delayacct_swapin_end(); } void __swap_read_unplug(struct swap_iocb *sio) { struct iov_iter from; struct address_space *mapping = sio->iocb.ki_filp->f_mapping; int ret; iov_iter_bvec(&from, ITER_DEST, sio->bvec, sio->pages, sio->len); ret = mapping->a_ops->swap_rw(&sio->iocb, &from); if (ret != -EIOCBQUEUED) sio_read_complete(&sio->iocb, ret); } |
/* SPDX-License-Identifier: GPL-2.0 */ #undef TRACE_SYSTEM #define TRACE_SYSTEM cgroup #if !defined(_TRACE_CGROUP_H) || defined(TRACE_HEADER_MULTI_READ) #define _TRACE_CGROUP_H #include <linux/cgroup.h> #include <linux/tracepoint.h> DECLARE_EVENT_CLASS(cgroup_root, TP_PROTO(struct cgroup_root *root), TP_ARGS(root), TP_STRUCT__entry( __field( int, root ) __field( u16, ss_mask ) __string( name, root->name ) ), TP_fast_assign( __entry->root = root->hierarchy_id; __entry->ss_mask = root->subsys_mask; __assign_str(name); ), TP_printk("root=%d ss_mask=%#x name=%s", __entry->root, __entry->ss_mask, __get_str(name)) ); DEFINE_EVENT(cgroup_root, cgroup_setup_root, TP_PROTO(struct cgroup_root *root), TP_ARGS(root) ); DEFINE_EVENT(cgroup_root, cgroup_destroy_root, TP_PROTO(struct cgroup_root *root), TP_ARGS(root) ); DEFINE_EVENT(cgroup_root,
cgroup_remount, TP_PROTO(struct cgroup_root *root), TP_ARGS(root) ); DECLARE_EVENT_CLASS(cgroup, TP_PROTO(struct cgroup *cgrp, const char *path), TP_ARGS(cgrp, path), TP_STRUCT__entry( __field( int, root ) __field( int, level ) __field( u64, id ) __string( path, path ) ), TP_fast_assign( __entry->root = cgrp->root->hierarchy_id; __entry->id = cgroup_id(cgrp); __entry->level = cgrp->level; __assign_str(path); ), TP_printk("root=%d id=%llu level=%d path=%s", __entry->root, __entry->id, __entry->level, __get_str(path)) ); DEFINE_EVENT(cgroup, cgroup_mkdir, TP_PROTO(struct cgroup *cgrp, const char *path), TP_ARGS(cgrp, path) ); DEFINE_EVENT(cgroup, cgroup_rmdir, TP_PROTO(struct cgroup *cgrp, const char *path), TP_ARGS(cgrp, path) ); DEFINE_EVENT(cgroup, cgroup_release, TP_PROTO(struct cgroup *cgrp, const char *path), TP_ARGS(cgrp, path) ); DEFINE_EVENT(cgroup, cgroup_rename, TP_PROTO(struct cgroup *cgrp, const char *path), TP_ARGS(cgrp, path) ); DEFINE_EVENT(cgroup, cgroup_freeze, TP_PROTO(struct cgroup *cgrp, const char *path), TP_ARGS(cgrp, path) ); DEFINE_EVENT(cgroup, cgroup_unfreeze, TP_PROTO(struct cgroup *cgrp, const char *path), TP_ARGS(cgrp, path) ); DECLARE_EVENT_CLASS(cgroup_migrate, TP_PROTO(struct cgroup *dst_cgrp, const char *path, struct task_struct *task, bool threadgroup), TP_ARGS(dst_cgrp, path, task, threadgroup), TP_STRUCT__entry( __field( int, dst_root ) __field( int, dst_level ) __field( u64, dst_id ) __field( int, pid ) __string( dst_path, path ) __string( comm, task->comm ) ), TP_fast_assign( __entry->dst_root = dst_cgrp->root->hierarchy_id; __entry->dst_id = cgroup_id(dst_cgrp); __entry->dst_level = dst_cgrp->level; __assign_str(dst_path); __entry->pid = task->pid; __assign_str(comm); ), TP_printk("dst_root=%d dst_id=%llu dst_level=%d dst_path=%s pid=%d comm=%s", __entry->dst_root, __entry->dst_id, __entry->dst_level, __get_str(dst_path), __entry->pid, __get_str(comm)) ); DEFINE_EVENT(cgroup_migrate, cgroup_attach_task, TP_PROTO(struct cgroup 
*dst_cgrp, const char *path, struct task_struct *task, bool threadgroup), TP_ARGS(dst_cgrp, path, task, threadgroup) ); DEFINE_EVENT(cgroup_migrate, cgroup_transfer_tasks, TP_PROTO(struct cgroup *dst_cgrp, const char *path, struct task_struct *task, bool threadgroup), TP_ARGS(dst_cgrp, path, task, threadgroup) ); DECLARE_EVENT_CLASS(cgroup_event, TP_PROTO(struct cgroup *cgrp, const char *path, int val), TP_ARGS(cgrp, path, val), TP_STRUCT__entry( __field( int, root ) __field( int, level ) __field( u64, id ) __string( path, path ) __field( int, val ) ), TP_fast_assign( __entry->root = cgrp->root->hierarchy_id; __entry->id = cgroup_id(cgrp); __entry->level = cgrp->level; __assign_str(path); __entry->val = val; ), TP_printk("root=%d id=%llu level=%d path=%s val=%d", __entry->root, __entry->id, __entry->level, __get_str(path), __entry->val) ); DEFINE_EVENT(cgroup_event, cgroup_notify_populated, TP_PROTO(struct cgroup *cgrp, const char *path, int val), TP_ARGS(cgrp, path, val) ); DEFINE_EVENT(cgroup_event, cgroup_notify_frozen, TP_PROTO(struct cgroup *cgrp, const char *path, int val), TP_ARGS(cgrp, path, val) ); DECLARE_EVENT_CLASS(cgroup_rstat, TP_PROTO(struct cgroup *cgrp, int cpu, bool contended), TP_ARGS(cgrp, cpu, contended), TP_STRUCT__entry( __field( int, root ) __field( int, level ) __field( u64, id ) __field( int, cpu ) __field( bool, contended ) ), TP_fast_assign( __entry->root = cgrp->root->hierarchy_id; __entry->id = cgroup_id(cgrp); __entry->level = cgrp->level; __entry->cpu = cpu; __entry->contended = contended; ), TP_printk("root=%d id=%llu level=%d cpu=%d lock contended:%d", __entry->root, __entry->id, __entry->level, __entry->cpu, __entry->contended) ); /* Related to global: cgroup_rstat_lock */ DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended, TP_PROTO(struct cgroup *cgrp, int cpu, bool contended), TP_ARGS(cgrp, cpu, contended) ); DEFINE_EVENT(cgroup_rstat, cgroup_rstat_locked, TP_PROTO(struct cgroup *cgrp, int cpu, bool contended), TP_ARGS(cgrp, 
cpu, contended) ); DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock, TP_PROTO(struct cgroup *cgrp, int cpu, bool contended), TP_ARGS(cgrp, cpu, contended) ); /* Related to per CPU: cgroup_rstat_cpu_lock */ DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended, TP_PROTO(struct cgroup *cgrp, int cpu, bool contended), TP_ARGS(cgrp, cpu, contended) ); DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended_fastpath, TP_PROTO(struct cgroup *cgrp, int cpu, bool contended), TP_ARGS(cgrp, cpu, contended) ); DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked, TP_PROTO(struct cgroup *cgrp, int cpu, bool contended), TP_ARGS(cgrp, cpu, contended) ); DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked_fastpath, TP_PROTO(struct cgroup *cgrp, int cpu, bool contended), TP_ARGS(cgrp, cpu, contended) ); DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock, TP_PROTO(struct cgroup *cgrp, int cpu, bool contended), TP_ARGS(cgrp, cpu, contended) ); DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock_fastpath, TP_PROTO(struct cgroup *cgrp, int cpu, bool contended), TP_ARGS(cgrp, cpu, contended) ); #endif /* _TRACE_CGROUP_H */ /* This part must be outside protection */ #include <trace/define_trace.h> |
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * IPVS         An implementation of the IP virtual server support for the
 *              LINUX operating system.  IPVS is now implemented as a module
 *              over the Netfilter framework. IPVS can be used to build a
 *              high-performance and highly available server based on a
 *              cluster of servers.
 *
 * Authors:     Wensong Zhang <wensong@linuxvirtualserver.org>
 *              Peter Kese <peter.kese@ijs.si>
 *              Julian Anastasov <ja@ssi.bg>
 *
 * The IPVS code for kernel 2.2 was done by Wensong Zhang and Peter Kese,
 * with changes/fixes from Julian Anastasov, Lars Marowsky-Bree, Horms
 * and others. Much of the code here is taken from the IP MASQ code of kernel 2.2.
* * Changes: */ #define KMSG_COMPONENT "IPVS" #define pr_fmt(fmt) KMSG_COMPONENT ": " fmt #include <linux/interrupt.h> #include <linux/in.h> #include <linux/inet.h> #include <linux/net.h> #include <linux/kernel.h> #include <linux/module.h> #include <linux/proc_fs.h> /* for proc_net_* */ #include <linux/slab.h> #include <linux/seq_file.h> #include <linux/jhash.h> #include <linux/random.h> #include <linux/rcupdate_wait.h> #include <net/net_namespace.h> #include <net/ip_vs.h> #ifndef CONFIG_IP_VS_TAB_BITS #define CONFIG_IP_VS_TAB_BITS 12 #endif /* * Connection hash size. Default is what was selected at compile time. */ static int ip_vs_conn_tab_bits = CONFIG_IP_VS_TAB_BITS; module_param_named(conn_tab_bits, ip_vs_conn_tab_bits, int, 0444); MODULE_PARM_DESC(conn_tab_bits, "Set connections' hash size"); /* size and mask values */ int ip_vs_conn_tab_size __read_mostly; static int ip_vs_conn_tab_mask __read_mostly; /* * Connection hash table: for input and output packets lookups of IPVS */ static struct hlist_head *ip_vs_conn_tab __read_mostly; /* SLAB cache for IPVS connections */ static struct kmem_cache *ip_vs_conn_cachep __read_mostly; /* counter for no client port connections */ static atomic_t ip_vs_conn_no_cport_cnt = ATOMIC_INIT(0); /* random value for IPVS connection hash */ static unsigned int ip_vs_conn_rnd __read_mostly; /* * Fine locking granularity for big connection hash table */ #define CT_LOCKARRAY_BITS 5 #define CT_LOCKARRAY_SIZE (1<<CT_LOCKARRAY_BITS) #define CT_LOCKARRAY_MASK (CT_LOCKARRAY_SIZE-1) /* We need an addrstrlen that works with or without v6 */ #ifdef CONFIG_IP_VS_IPV6 #define IP_VS_ADDRSTRLEN INET6_ADDRSTRLEN #else #define IP_VS_ADDRSTRLEN (8+1) #endif struct ip_vs_aligned_lock { spinlock_t l; } __attribute__((__aligned__(SMP_CACHE_BYTES))); /* lock array for conn table */ static struct ip_vs_aligned_lock __ip_vs_conntbl_lock_array[CT_LOCKARRAY_SIZE] __cacheline_aligned; static inline void ct_write_lock_bh(unsigned int key) { 
spin_lock_bh(&__ip_vs_conntbl_lock_array[key&CT_LOCKARRAY_MASK].l); } static inline void ct_write_unlock_bh(unsigned int key) { spin_unlock_bh(&__ip_vs_conntbl_lock_array[key&CT_LOCKARRAY_MASK].l); } static void ip_vs_conn_expire(struct timer_list *t); /* * Returns hash value for IPVS connection entry */ static unsigned int ip_vs_conn_hashkey(struct netns_ipvs *ipvs, int af, unsigned int proto, const union nf_inet_addr *addr, __be16 port) { #ifdef CONFIG_IP_VS_IPV6 if (af == AF_INET6) return (jhash_3words(jhash(addr, 16, ip_vs_conn_rnd), (__force u32)port, proto, ip_vs_conn_rnd) ^ ((size_t)ipvs>>8)) & ip_vs_conn_tab_mask; #endif return (jhash_3words((__force u32)addr->ip, (__force u32)port, proto, ip_vs_conn_rnd) ^ ((size_t)ipvs>>8)) & ip_vs_conn_tab_mask; } static unsigned int ip_vs_conn_hashkey_param(const struct ip_vs_conn_param *p, bool inverse) { const union nf_inet_addr *addr; __be16 port; if (p->pe_data && p->pe->hashkey_raw) return p->pe->hashkey_raw(p, ip_vs_conn_rnd, inverse) & ip_vs_conn_tab_mask; if (likely(!inverse)) { addr = p->caddr; port = p->cport; } else { addr = p->vaddr; port = p->vport; } return ip_vs_conn_hashkey(p->ipvs, p->af, p->protocol, addr, port); } static unsigned int ip_vs_conn_hashkey_conn(const struct ip_vs_conn *cp) { struct ip_vs_conn_param p; ip_vs_conn_fill_param(cp->ipvs, cp->af, cp->protocol, &cp->caddr, cp->cport, NULL, 0, &p); if (cp->pe) { p.pe = cp->pe; p.pe_data = cp->pe_data; p.pe_data_len = cp->pe_data_len; } return ip_vs_conn_hashkey_param(&p, false); } /* * Hashes ip_vs_conn in ip_vs_conn_tab by netns,proto,addr,port. * returns bool success. 
*/ static inline int ip_vs_conn_hash(struct ip_vs_conn *cp) { unsigned int hash; int ret; if (cp->flags & IP_VS_CONN_F_ONE_PACKET) return 0; /* Hash by protocol, client address and port */ hash = ip_vs_conn_hashkey_conn(cp); ct_write_lock_bh(hash); spin_lock(&cp->lock); if (!(cp->flags & IP_VS_CONN_F_HASHED)) { cp->flags |= IP_VS_CONN_F_HASHED; refcount_inc(&cp->refcnt); hlist_add_head_rcu(&cp->c_list, &ip_vs_conn_tab[hash]); ret = 1; } else { pr_err("%s(): request for already hashed, called from %pS\n", __func__, __builtin_return_address(0)); ret = 0; } spin_unlock(&cp->lock); ct_write_unlock_bh(hash); return ret; } /* * UNhashes ip_vs_conn from ip_vs_conn_tab. * returns bool success. Caller should hold conn reference. */ static inline int ip_vs_conn_unhash(struct ip_vs_conn *cp) { unsigned int hash; int ret; /* unhash it and decrease its reference counter */ hash = ip_vs_conn_hashkey_conn(cp); ct_write_lock_bh(hash); spin_lock(&cp->lock); if (cp->flags & IP_VS_CONN_F_HASHED) { hlist_del_rcu(&cp->c_list); cp->flags &= ~IP_VS_CONN_F_HASHED; refcount_dec(&cp->refcnt); ret = 1; } else ret = 0; spin_unlock(&cp->lock); ct_write_unlock_bh(hash); return ret; } /* Try to unlink ip_vs_conn from ip_vs_conn_tab. * returns bool success. */ static inline bool ip_vs_conn_unlink(struct ip_vs_conn *cp) { unsigned int hash; bool ret = false; if (cp->flags & IP_VS_CONN_F_ONE_PACKET) return refcount_dec_if_one(&cp->refcnt); hash = ip_vs_conn_hashkey_conn(cp); ct_write_lock_bh(hash); spin_lock(&cp->lock); if (cp->flags & IP_VS_CONN_F_HASHED) { /* Decrease refcnt and unlink conn only if we are last user */ if (refcount_dec_if_one(&cp->refcnt)) { hlist_del_rcu(&cp->c_list); cp->flags &= ~IP_VS_CONN_F_HASHED; ret = true; } } spin_unlock(&cp->lock); ct_write_unlock_bh(hash); return ret; } /* * Gets ip_vs_conn associated with supplied parameters in the ip_vs_conn_tab. * Called for pkts coming from OUTside-to-INside. 
* p->caddr, p->cport: pkt source address (foreign host) * p->vaddr, p->vport: pkt dest address (load balancer) */ static inline struct ip_vs_conn * __ip_vs_conn_in_get(const struct ip_vs_conn_param *p) { unsigned int hash; struct ip_vs_conn *cp; hash = ip_vs_conn_hashkey_param(p, false); rcu_read_lock(); hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[hash], c_list) { if (p->cport == cp->cport && p->vport == cp->vport && cp->af == p->af && ip_vs_addr_equal(p->af, p->caddr, &cp->caddr) && ip_vs_addr_equal(p->af, p->vaddr, &cp->vaddr) && ((!p->cport) ^ (!(cp->flags & IP_VS_CONN_F_NO_CPORT))) && p->protocol == cp->protocol && cp->ipvs == p->ipvs) { if (!__ip_vs_conn_get(cp)) continue; /* HIT */ rcu_read_unlock(); return cp; } } rcu_read_unlock(); return NULL; } struct ip_vs_conn *ip_vs_conn_in_get(const struct ip_vs_conn_param *p) { struct ip_vs_conn *cp; cp = __ip_vs_conn_in_get(p); if (!cp && atomic_read(&ip_vs_conn_no_cport_cnt)) { struct ip_vs_conn_param cport_zero_p = *p; cport_zero_p.cport = 0; cp = __ip_vs_conn_in_get(&cport_zero_p); } IP_VS_DBG_BUF(9, "lookup/in %s %s:%d->%s:%d %s\n", ip_vs_proto_name(p->protocol), IP_VS_DBG_ADDR(p->af, p->caddr), ntohs(p->cport), IP_VS_DBG_ADDR(p->af, p->vaddr), ntohs(p->vport), cp ? 
"hit" : "not hit"); return cp; } static int ip_vs_conn_fill_param_proto(struct netns_ipvs *ipvs, int af, const struct sk_buff *skb, const struct ip_vs_iphdr *iph, struct ip_vs_conn_param *p) { __be16 _ports[2], *pptr; pptr = frag_safe_skb_hp(skb, iph->len, sizeof(_ports), _ports); if (pptr == NULL) return 1; if (likely(!ip_vs_iph_inverse(iph))) ip_vs_conn_fill_param(ipvs, af, iph->protocol, &iph->saddr, pptr[0], &iph->daddr, pptr[1], p); else ip_vs_conn_fill_param(ipvs, af, iph->protocol, &iph->daddr, pptr[1], &iph->saddr, pptr[0], p); return 0; } struct ip_vs_conn * ip_vs_conn_in_get_proto(struct netns_ipvs *ipvs, int af, const struct sk_buff *skb, const struct ip_vs_iphdr *iph) { struct ip_vs_conn_param p; if (ip_vs_conn_fill_param_proto(ipvs, af, skb, iph, &p)) return NULL; return ip_vs_conn_in_get(&p); } EXPORT_SYMBOL_GPL(ip_vs_conn_in_get_proto); /* Get reference to connection template */ struct ip_vs_conn *ip_vs_ct_in_get(const struct ip_vs_conn_param *p) { unsigned int hash; struct ip_vs_conn *cp; hash = ip_vs_conn_hashkey_param(p, false); rcu_read_lock(); hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[hash], c_list) { if (unlikely(p->pe_data && p->pe->ct_match)) { if (cp->ipvs != p->ipvs) continue; if (p->pe == cp->pe && p->pe->ct_match(p, cp)) { if (__ip_vs_conn_get(cp)) goto out; } continue; } if (cp->af == p->af && ip_vs_addr_equal(p->af, p->caddr, &cp->caddr) && /* protocol should only be IPPROTO_IP if * p->vaddr is a fwmark */ ip_vs_addr_equal(p->protocol == IPPROTO_IP ? AF_UNSPEC : p->af, p->vaddr, &cp->vaddr) && p->vport == cp->vport && p->cport == cp->cport && cp->flags & IP_VS_CONN_F_TEMPLATE && p->protocol == cp->protocol && cp->ipvs == p->ipvs) { if (__ip_vs_conn_get(cp)) goto out; } } cp = NULL; out: rcu_read_unlock(); IP_VS_DBG_BUF(9, "template lookup/in %s %s:%d->%s:%d %s\n", ip_vs_proto_name(p->protocol), IP_VS_DBG_ADDR(p->af, p->caddr), ntohs(p->cport), IP_VS_DBG_ADDR(p->af, p->vaddr), ntohs(p->vport), cp ? 
"hit" : "not hit"); return cp; } /* Gets ip_vs_conn associated with supplied parameters in the ip_vs_conn_tab. * Called for pkts coming from inside-to-OUTside. * p->caddr, p->cport: pkt source address (inside host) * p->vaddr, p->vport: pkt dest address (foreign host) */ struct ip_vs_conn *ip_vs_conn_out_get(const struct ip_vs_conn_param *p) { unsigned int hash; struct ip_vs_conn *cp, *ret=NULL; const union nf_inet_addr *saddr; __be16 sport; /* * Check for "full" addressed entries */ hash = ip_vs_conn_hashkey_param(p, true); rcu_read_lock(); hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[hash], c_list) { if (p->vport != cp->cport) continue; if (IP_VS_FWD_METHOD(cp) != IP_VS_CONN_F_MASQ) { sport = cp->vport; saddr = &cp->vaddr; } else { sport = cp->dport; saddr = &cp->daddr; } if (p->cport == sport && cp->af == p->af && ip_vs_addr_equal(p->af, p->vaddr, &cp->caddr) && ip_vs_addr_equal(p->af, p->caddr, saddr) && p->protocol == cp->protocol && cp->ipvs == p->ipvs) { if (!__ip_vs_conn_get(cp)) continue; /* HIT */ ret = cp; break; } } rcu_read_unlock(); IP_VS_DBG_BUF(9, "lookup/out %s %s:%d->%s:%d %s\n", ip_vs_proto_name(p->protocol), IP_VS_DBG_ADDR(p->af, p->caddr), ntohs(p->cport), IP_VS_DBG_ADDR(p->af, p->vaddr), ntohs(p->vport), ret ? "hit" : "not hit"); return ret; } struct ip_vs_conn * ip_vs_conn_out_get_proto(struct netns_ipvs *ipvs, int af, const struct sk_buff *skb, const struct ip_vs_iphdr *iph) { struct ip_vs_conn_param p; if (ip_vs_conn_fill_param_proto(ipvs, af, skb, iph, &p)) return NULL; return ip_vs_conn_out_get(&p); } EXPORT_SYMBOL_GPL(ip_vs_conn_out_get_proto); /* * Put back the conn and restart its timer with its timeout */ static void __ip_vs_conn_put_timer(struct ip_vs_conn *cp) { unsigned long t = (cp->flags & IP_VS_CONN_F_ONE_PACKET) ? 
0 : cp->timeout; mod_timer(&cp->timer, jiffies+t); __ip_vs_conn_put(cp); } void ip_vs_conn_put(struct ip_vs_conn *cp) { if ((cp->flags & IP_VS_CONN_F_ONE_PACKET) && (refcount_read(&cp->refcnt) == 1) && !timer_pending(&cp->timer)) /* expire connection immediately */ ip_vs_conn_expire(&cp->timer); else __ip_vs_conn_put_timer(cp); } /* * Fill a no_client_port connection with a client port number */ void ip_vs_conn_fill_cport(struct ip_vs_conn *cp, __be16 cport) { if (ip_vs_conn_unhash(cp)) { spin_lock_bh(&cp->lock); if (cp->flags & IP_VS_CONN_F_NO_CPORT) { atomic_dec(&ip_vs_conn_no_cport_cnt); cp->flags &= ~IP_VS_CONN_F_NO_CPORT; cp->cport = cport; } spin_unlock_bh(&cp->lock); /* hash on new dport */ ip_vs_conn_hash(cp); } } /* * Bind a connection entry with the corresponding packet_xmit. * Called by ip_vs_conn_new. */ static inline void ip_vs_bind_xmit(struct ip_vs_conn *cp) { switch (IP_VS_FWD_METHOD(cp)) { case IP_VS_CONN_F_MASQ: cp->packet_xmit = ip_vs_nat_xmit; break; case IP_VS_CONN_F_TUNNEL: #ifdef CONFIG_IP_VS_IPV6 if (cp->daf == AF_INET6) cp->packet_xmit = ip_vs_tunnel_xmit_v6; else #endif cp->packet_xmit = ip_vs_tunnel_xmit; break; case IP_VS_CONN_F_DROUTE: cp->packet_xmit = ip_vs_dr_xmit; break; case IP_VS_CONN_F_LOCALNODE: cp->packet_xmit = ip_vs_null_xmit; break; case IP_VS_CONN_F_BYPASS: cp->packet_xmit = ip_vs_bypass_xmit; break; } } #ifdef CONFIG_IP_VS_IPV6 static inline void ip_vs_bind_xmit_v6(struct ip_vs_conn *cp) { switch (IP_VS_FWD_METHOD(cp)) { case IP_VS_CONN_F_MASQ: cp->packet_xmit = ip_vs_nat_xmit_v6; break; case IP_VS_CONN_F_TUNNEL: if (cp->daf == AF_INET6) cp->packet_xmit = ip_vs_tunnel_xmit_v6; else cp->packet_xmit = ip_vs_tunnel_xmit; break; case IP_VS_CONN_F_DROUTE: cp->packet_xmit = ip_vs_dr_xmit_v6; break; case IP_VS_CONN_F_LOCALNODE: cp->packet_xmit = ip_vs_null_xmit; break; case IP_VS_CONN_F_BYPASS: cp->packet_xmit = ip_vs_bypass_xmit_v6; break; } } #endif static inline int ip_vs_dest_totalconns(struct ip_vs_dest *dest) { return 
atomic_read(&dest->activeconns) + atomic_read(&dest->inactconns); } /* * Bind a connection entry with a virtual service destination * Called just after a new connection entry is created. */ static inline void ip_vs_bind_dest(struct ip_vs_conn *cp, struct ip_vs_dest *dest) { unsigned int conn_flags; __u32 flags; /* if dest is NULL, then return directly */ if (!dest) return; /* Increase the refcnt counter of the dest */ ip_vs_dest_hold(dest); conn_flags = atomic_read(&dest->conn_flags); if (cp->protocol != IPPROTO_UDP) conn_flags &= ~IP_VS_CONN_F_ONE_PACKET; flags = cp->flags; /* Bind with the destination and its corresponding transmitter */ if (flags & IP_VS_CONN_F_SYNC) { /* if the connection is not template and is created * by sync, preserve the activity flag. */ if (!(flags & IP_VS_CONN_F_TEMPLATE)) conn_flags &= ~IP_VS_CONN_F_INACTIVE; /* connections inherit forwarding method from dest */ flags &= ~(IP_VS_CONN_F_FWD_MASK | IP_VS_CONN_F_NOOUTPUT); } flags |= conn_flags; cp->flags = flags; cp->dest = dest; IP_VS_DBG_BUF(7, "Bind-dest %s c:%s:%d v:%s:%d " "d:%s:%d fwd:%c s:%u conn->flags:%X conn->refcnt:%d " "dest->refcnt:%d\n", ip_vs_proto_name(cp->protocol), IP_VS_DBG_ADDR(cp->af, &cp->caddr), ntohs(cp->cport), IP_VS_DBG_ADDR(cp->af, &cp->vaddr), ntohs(cp->vport), IP_VS_DBG_ADDR(cp->daf, &cp->daddr), ntohs(cp->dport), ip_vs_fwd_tag(cp), cp->state, cp->flags, refcount_read(&cp->refcnt), refcount_read(&dest->refcnt)); /* Update the connection counters */ if (!(flags & IP_VS_CONN_F_TEMPLATE)) { /* It is a normal connection, so modify the counters * according to the flags, later the protocol can * update them on state change */ if (!(flags & IP_VS_CONN_F_INACTIVE)) atomic_inc(&dest->activeconns); else atomic_inc(&dest->inactconns); } else { /* It is a persistent connection/template, so increase the persistent connection counter */ atomic_inc(&dest->persistconns); } if (dest->u_threshold != 0 && ip_vs_dest_totalconns(dest) >= dest->u_threshold) dest->flags |= 
IP_VS_DEST_F_OVERLOAD; } /* * Check if there is a destination for the connection, if so * bind the connection to the destination. */ void ip_vs_try_bind_dest(struct ip_vs_conn *cp) { struct ip_vs_dest *dest; rcu_read_lock(); /* This function is only invoked by the synchronization code. We do * not currently support heterogeneous pools with synchronization, * so we can make the assumption that the svc_af is the same as the * dest_af */ dest = ip_vs_find_dest(cp->ipvs, cp->af, cp->af, &cp->daddr, cp->dport, &cp->vaddr, cp->vport, cp->protocol, cp->fwmark, cp->flags); if (dest) { struct ip_vs_proto_data *pd; spin_lock_bh(&cp->lock); if (cp->dest) { spin_unlock_bh(&cp->lock); rcu_read_unlock(); return; } /* Applications work depending on the forwarding method * but better to reassign them always when binding dest */ if (cp->app) ip_vs_unbind_app(cp); ip_vs_bind_dest(cp, dest); spin_unlock_bh(&cp->lock); /* Update its packet transmitter */ cp->packet_xmit = NULL; #ifdef CONFIG_IP_VS_IPV6 if (cp->af == AF_INET6) ip_vs_bind_xmit_v6(cp); else #endif ip_vs_bind_xmit(cp); pd = ip_vs_proto_data_get(cp->ipvs, cp->protocol); if (pd && atomic_read(&pd->appcnt)) ip_vs_bind_app(cp, pd->pp); } rcu_read_unlock(); } /* * Unbind a connection entry with its VS destination * Called by the ip_vs_conn_expire function. 
*/ static inline void ip_vs_unbind_dest(struct ip_vs_conn *cp) { struct ip_vs_dest *dest = cp->dest; if (!dest) return; IP_VS_DBG_BUF(7, "Unbind-dest %s c:%s:%d v:%s:%d " "d:%s:%d fwd:%c s:%u conn->flags:%X conn->refcnt:%d " "dest->refcnt:%d\n", ip_vs_proto_name(cp->protocol), IP_VS_DBG_ADDR(cp->af, &cp->caddr), ntohs(cp->cport), IP_VS_DBG_ADDR(cp->af, &cp->vaddr), ntohs(cp->vport), IP_VS_DBG_ADDR(cp->daf, &cp->daddr), ntohs(cp->dport), ip_vs_fwd_tag(cp), cp->state, cp->flags, refcount_read(&cp->refcnt), refcount_read(&dest->refcnt)); /* Update the connection counters */ if (!(cp->flags & IP_VS_CONN_F_TEMPLATE)) { /* It is a normal connection, so decrease the inactconns or activeconns counter */ if (cp->flags & IP_VS_CONN_F_INACTIVE) { atomic_dec(&dest->inactconns); } else { atomic_dec(&dest->activeconns); } } else { /* It is a persistent connection/template, so decrease the persistent connection counter */ atomic_dec(&dest->persistconns); } if (dest->l_threshold != 0) { if (ip_vs_dest_totalconns(dest) < dest->l_threshold) dest->flags &= ~IP_VS_DEST_F_OVERLOAD; } else if (dest->u_threshold != 0) { if (ip_vs_dest_totalconns(dest) * 4 < dest->u_threshold * 3) dest->flags &= ~IP_VS_DEST_F_OVERLOAD; } else { if (dest->flags & IP_VS_DEST_F_OVERLOAD) dest->flags &= ~IP_VS_DEST_F_OVERLOAD; } ip_vs_dest_put(dest); } static int expire_quiescent_template(struct netns_ipvs *ipvs, struct ip_vs_dest *dest) { #ifdef CONFIG_SYSCTL return ipvs->sysctl_expire_quiescent_template && (atomic_read(&dest->weight) == 0); #else return 0; #endif } /* * Checking if the destination of a connection template is available. * If available, return 1, otherwise invalidate this connection * template and return 0. */ int ip_vs_check_template(struct ip_vs_conn *ct, struct ip_vs_dest *cdest) { struct ip_vs_dest *dest = ct->dest; struct netns_ipvs *ipvs = ct->ipvs; /* * Checking the dest server status. 
*/ if ((dest == NULL) || !(dest->flags & IP_VS_DEST_F_AVAILABLE) || expire_quiescent_template(ipvs, dest) || (cdest && (dest != cdest))) { IP_VS_DBG_BUF(9, "check_template: dest not available for " "protocol %s s:%s:%d v:%s:%d " "-> d:%s:%d\n", ip_vs_proto_name(ct->protocol), IP_VS_DBG_ADDR(ct->af, &ct->caddr), ntohs(ct->cport), IP_VS_DBG_ADDR(ct->af, &ct->vaddr), ntohs(ct->vport), IP_VS_DBG_ADDR(ct->daf, &ct->daddr), ntohs(ct->dport)); /* * Invalidate the connection template */ if (ct->vport != htons(0xffff)) { if (ip_vs_conn_unhash(ct)) { ct->dport = htons(0xffff); ct->vport = htons(0xffff); ct->cport = 0; ip_vs_conn_hash(ct); } } /* * Simply decrease the refcnt of the template, * don't restart its timer. */ __ip_vs_conn_put(ct); return 0; } return 1; } static void ip_vs_conn_rcu_free(struct rcu_head *head) { struct ip_vs_conn *cp = container_of(head, struct ip_vs_conn, rcu_head); ip_vs_pe_put(cp->pe); kfree(cp->pe_data); kmem_cache_free(ip_vs_conn_cachep, cp); } /* Try to delete connection while not holding reference */ static void ip_vs_conn_del(struct ip_vs_conn *cp) { if (timer_delete(&cp->timer)) { /* Drop cp->control chain too */ if (cp->control) cp->timeout = 0; ip_vs_conn_expire(&cp->timer); } } /* Try to delete connection while holding reference */ static void ip_vs_conn_del_put(struct ip_vs_conn *cp) { if (timer_delete(&cp->timer)) { /* Drop cp->control chain too */ if (cp->control) cp->timeout = 0; __ip_vs_conn_put(cp); ip_vs_conn_expire(&cp->timer); } else { __ip_vs_conn_put(cp); } } static void ip_vs_conn_expire(struct timer_list *t) { struct ip_vs_conn *cp = from_timer(cp, t, timer); struct netns_ipvs *ipvs = cp->ipvs; /* * do I control anybody? */ if (atomic_read(&cp->n_control)) goto expire_later; /* Unlink conn if not referenced anymore */ if (likely(ip_vs_conn_unlink(cp))) { struct ip_vs_conn *ct = cp->control; /* delete the timer if it is activated by other users */ timer_delete(&cp->timer); /* does anybody control me? 
*/ if (ct) { bool has_ref = !cp->timeout && __ip_vs_conn_get(ct); ip_vs_control_del(cp); /* Drop CTL or non-assured TPL if not used anymore */ if (has_ref && !atomic_read(&ct->n_control) && (!(ct->flags & IP_VS_CONN_F_TEMPLATE) || !(ct->state & IP_VS_CTPL_S_ASSURED))) { IP_VS_DBG(4, "drop controlling connection\n"); ip_vs_conn_del_put(ct); } else if (has_ref) { __ip_vs_conn_put(ct); } } if ((cp->flags & IP_VS_CONN_F_NFCT) && !(cp->flags & IP_VS_CONN_F_ONE_PACKET)) { /* Do not access conntracks during subsys cleanup * because nf_conntrack_find_get can not be used after * conntrack cleanup for the net. */ smp_rmb(); if (ipvs->enable) ip_vs_conn_drop_conntrack(cp); } if (unlikely(cp->app != NULL)) ip_vs_unbind_app(cp); ip_vs_unbind_dest(cp); if (cp->flags & IP_VS_CONN_F_NO_CPORT) atomic_dec(&ip_vs_conn_no_cport_cnt); if (cp->flags & IP_VS_CONN_F_ONE_PACKET) ip_vs_conn_rcu_free(&cp->rcu_head); else call_rcu(&cp->rcu_head, ip_vs_conn_rcu_free); atomic_dec(&ipvs->conn_count); return; } expire_later: IP_VS_DBG(7, "delayed: conn->refcnt=%d conn->n_control=%d\n", refcount_read(&cp->refcnt), atomic_read(&cp->n_control)); refcount_inc(&cp->refcnt); cp->timeout = 60*HZ; if (ipvs->sync_state & IP_VS_STATE_MASTER) ip_vs_sync_conn(ipvs, cp, sysctl_sync_threshold(ipvs)); __ip_vs_conn_put_timer(cp); } /* Modify timer, so that it expires as soon as possible. * Can be called without reference only if under RCU lock. * We can have such chain of conns linked with ->control: DATA->CTL->TPL * - DATA (eg. FTP) and TPL (persistence) can be present depending on setup * - cp->timeout=0 indicates all conns from chain should be dropped but * TPL is not dropped if in assured state */ void ip_vs_conn_expire_now(struct ip_vs_conn *cp) { /* Using mod_timer_pending will ensure the timer is not * modified after the final del_timer in ip_vs_conn_expire. 
*/ if (timer_pending(&cp->timer) && time_after(cp->timer.expires, jiffies)) mod_timer_pending(&cp->timer, jiffies); } /* * Create a new connection entry and hash it into the ip_vs_conn_tab */ struct ip_vs_conn * ip_vs_conn_new(const struct ip_vs_conn_param *p, int dest_af, const union nf_inet_addr *daddr, __be16 dport, unsigned int flags, struct ip_vs_dest *dest, __u32 fwmark) { struct ip_vs_conn *cp; struct netns_ipvs *ipvs = p->ipvs; struct ip_vs_proto_data *pd = ip_vs_proto_data_get(p->ipvs, p->protocol); cp = kmem_cache_alloc(ip_vs_conn_cachep, GFP_ATOMIC); if (cp == NULL) { IP_VS_ERR_RL("%s(): no memory\n", __func__); return NULL; } INIT_HLIST_NODE(&cp->c_list); timer_setup(&cp->timer, ip_vs_conn_expire, 0); cp->ipvs = ipvs; cp->af = p->af; cp->daf = dest_af; cp->protocol = p->protocol; ip_vs_addr_set(p->af, &cp->caddr, p->caddr); cp->cport = p->cport; /* proto should only be IPPROTO_IP if p->vaddr is a fwmark */ ip_vs_addr_set(p->protocol == IPPROTO_IP ? AF_UNSPEC : p->af, &cp->vaddr, p->vaddr); cp->vport = p->vport; ip_vs_addr_set(cp->daf, &cp->daddr, daddr); cp->dport = dport; cp->flags = flags; cp->fwmark = fwmark; if (flags & IP_VS_CONN_F_TEMPLATE && p->pe) { ip_vs_pe_get(p->pe); cp->pe = p->pe; cp->pe_data = p->pe_data; cp->pe_data_len = p->pe_data_len; } else { cp->pe = NULL; cp->pe_data = NULL; cp->pe_data_len = 0; } spin_lock_init(&cp->lock); /* * Set the entry is referenced by the current thread before hashing * it in the table, so that other thread run ip_vs_random_dropentry * but cannot drop this entry. 
*/ refcount_set(&cp->refcnt, 1); cp->control = NULL; atomic_set(&cp->n_control, 0); atomic_set(&cp->in_pkts, 0); cp->packet_xmit = NULL; cp->app = NULL; cp->app_data = NULL; /* reset struct ip_vs_seq */ cp->in_seq.delta = 0; cp->out_seq.delta = 0; atomic_inc(&ipvs->conn_count); if (flags & IP_VS_CONN_F_NO_CPORT) atomic_inc(&ip_vs_conn_no_cport_cnt); /* Bind the connection with a destination server */ cp->dest = NULL; ip_vs_bind_dest(cp, dest); /* Set its state and timeout */ cp->state = 0; cp->old_state = 0; cp->timeout = 3*HZ; cp->sync_endtime = jiffies & ~3UL; /* Bind its packet transmitter */ #ifdef CONFIG_IP_VS_IPV6 if (p->af == AF_INET6) ip_vs_bind_xmit_v6(cp); else #endif ip_vs_bind_xmit(cp); if (unlikely(pd && atomic_read(&pd->appcnt))) ip_vs_bind_app(cp, pd->pp); /* * Allow conntrack to be preserved. By default, conntrack * is created and destroyed for every packet. * Sometimes keeping conntrack can be useful for * IP_VS_CONN_F_ONE_PACKET too. */ if (ip_vs_conntrack_enabled(ipvs)) cp->flags |= IP_VS_CONN_F_NFCT; /* Hash it in the ip_vs_conn_tab finally */ ip_vs_conn_hash(cp); return cp; } /* * /proc/net/ip_vs_conn entries */ #ifdef CONFIG_PROC_FS struct ip_vs_iter_state { struct seq_net_private p; unsigned int bucket; unsigned int skip_elems; }; static void *ip_vs_conn_array(struct ip_vs_iter_state *iter) { int idx; struct ip_vs_conn *cp; for (idx = iter->bucket; idx < ip_vs_conn_tab_size; idx++) { unsigned int skip = 0; hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) { /* __ip_vs_conn_get() is not needed by * ip_vs_conn_seq_show and ip_vs_conn_sync_seq_show */ if (skip >= iter->skip_elems) { iter->bucket = idx; return cp; } ++skip; } iter->skip_elems = 0; cond_resched_rcu(); } iter->bucket = idx; return NULL; } static void *ip_vs_conn_seq_start(struct seq_file *seq, loff_t *pos) __acquires(RCU) { struct ip_vs_iter_state *iter = seq->private; rcu_read_lock(); if (*pos == 0) { iter->skip_elems = 0; iter->bucket = 0; return SEQ_START_TOKEN; } 
return ip_vs_conn_array(iter); } static void *ip_vs_conn_seq_next(struct seq_file *seq, void *v, loff_t *pos) { struct ip_vs_conn *cp = v; struct ip_vs_iter_state *iter = seq->private; struct hlist_node *e; ++*pos; if (v == SEQ_START_TOKEN) return ip_vs_conn_array(iter); /* more on same hash chain? */ e = rcu_dereference(hlist_next_rcu(&cp->c_list)); if (e) { iter->skip_elems++; return hlist_entry(e, struct ip_vs_conn, c_list); } iter->skip_elems = 0; iter->bucket++; return ip_vs_conn_array(iter); } static void ip_vs_conn_seq_stop(struct seq_file *seq, void *v) __releases(RCU) { rcu_read_unlock(); } static int ip_vs_conn_seq_show(struct seq_file *seq, void *v) { if (v == SEQ_START_TOKEN) seq_puts(seq, "Pro FromIP FPrt ToIP TPrt DestIP DPrt State Expires PEName PEData\n"); else { const struct ip_vs_conn *cp = v; struct net *net = seq_file_net(seq); char pe_data[IP_VS_PENAME_MAXLEN + IP_VS_PEDATA_MAXLEN + 3]; size_t len = 0; char dbuf[IP_VS_ADDRSTRLEN]; if (!net_eq(cp->ipvs->net, net)) return 0; if (cp->pe_data) { pe_data[0] = ' '; len = strlen(cp->pe->name); memcpy(pe_data + 1, cp->pe->name, len); pe_data[len + 1] = ' '; len += 2; len += cp->pe->show_pe_data(cp, pe_data + len); } pe_data[len] = '\0'; #ifdef CONFIG_IP_VS_IPV6 if (cp->daf == AF_INET6) snprintf(dbuf, sizeof(dbuf), "%pI6", &cp->daddr.in6); else #endif snprintf(dbuf, sizeof(dbuf), "%08X", ntohl(cp->daddr.ip)); #ifdef CONFIG_IP_VS_IPV6 if (cp->af == AF_INET6) seq_printf(seq, "%-3s %pI6 %04X %pI6 %04X " "%s %04X %-11s %7u%s\n", ip_vs_proto_name(cp->protocol), &cp->caddr.in6, ntohs(cp->cport), &cp->vaddr.in6, ntohs(cp->vport), dbuf, ntohs(cp->dport), ip_vs_state_name(cp), jiffies_delta_to_msecs(cp->timer.expires - jiffies) / 1000, pe_data); else #endif seq_printf(seq, "%-3s %08X %04X %08X %04X" " %s %04X %-11s %7u%s\n", ip_vs_proto_name(cp->protocol), ntohl(cp->caddr.ip), ntohs(cp->cport), ntohl(cp->vaddr.ip), ntohs(cp->vport), dbuf, ntohs(cp->dport), ip_vs_state_name(cp), 
jiffies_delta_to_msecs(cp->timer.expires - jiffies) / 1000, pe_data); } return 0; } static const struct seq_operations ip_vs_conn_seq_ops = { .start = ip_vs_conn_seq_start, .next = ip_vs_conn_seq_next, .stop = ip_vs_conn_seq_stop, .show = ip_vs_conn_seq_show, }; static const char *ip_vs_origin_name(unsigned int flags) { if (flags & IP_VS_CONN_F_SYNC) return "SYNC"; else return "LOCAL"; } static int ip_vs_conn_sync_seq_show(struct seq_file *seq, void *v) { char dbuf[IP_VS_ADDRSTRLEN]; if (v == SEQ_START_TOKEN) seq_puts(seq, "Pro FromIP FPrt ToIP TPrt DestIP DPrt State Origin Expires\n"); else { const struct ip_vs_conn *cp = v; struct net *net = seq_file_net(seq); if (!net_eq(cp->ipvs->net, net)) return 0; #ifdef CONFIG_IP_VS_IPV6 if (cp->daf == AF_INET6) snprintf(dbuf, sizeof(dbuf), "%pI6", &cp->daddr.in6); else #endif snprintf(dbuf, sizeof(dbuf), "%08X", ntohl(cp->daddr.ip)); #ifdef CONFIG_IP_VS_IPV6 if (cp->af == AF_INET6) seq_printf(seq, "%-3s %pI6 %04X %pI6 %04X " "%s %04X %-11s %-6s %7u\n", ip_vs_proto_name(cp->protocol), &cp->caddr.in6, ntohs(cp->cport), &cp->vaddr.in6, ntohs(cp->vport), dbuf, ntohs(cp->dport), ip_vs_state_name(cp), ip_vs_origin_name(cp->flags), jiffies_delta_to_msecs(cp->timer.expires - jiffies) / 1000); else #endif seq_printf(seq, "%-3s %08X %04X %08X %04X " "%s %04X %-11s %-6s %7u\n", ip_vs_proto_name(cp->protocol), ntohl(cp->caddr.ip), ntohs(cp->cport), ntohl(cp->vaddr.ip), ntohs(cp->vport), dbuf, ntohs(cp->dport), ip_vs_state_name(cp), ip_vs_origin_name(cp->flags), jiffies_delta_to_msecs(cp->timer.expires - jiffies) / 1000); } return 0; } static const struct seq_operations ip_vs_conn_sync_seq_ops = { .start = ip_vs_conn_seq_start, .next = ip_vs_conn_seq_next, .stop = ip_vs_conn_seq_stop, .show = ip_vs_conn_sync_seq_show, }; #endif /* Randomly drop connection entries before running out of memory * Can be used for DATA and CTL conns. 
For TPL conns there are exceptions: * - traffic for services in OPS mode increases ct->in_pkts, so it is supported * - traffic for services not in OPS mode does not increase ct->in_pkts in * all cases, so it is not supported */ static inline int todrop_entry(struct ip_vs_conn *cp) { /* * The drop rate array needs tuning for real environments. * Called from timer bh only => no locking */ static const signed char todrop_rate[9] = {0, 1, 2, 3, 4, 5, 6, 7, 8}; static signed char todrop_counter[9] = {0}; int i; /* if the conn entry hasn't lasted for 60 seconds, don't drop it. This will leave enough time for normal connection to get through. */ if (time_before(cp->timeout + jiffies, cp->timer.expires + 60*HZ)) return 0; /* Don't drop the entry if its number of incoming packets is not located in [0, 8] */ i = atomic_read(&cp->in_pkts); if (i > 8 || i < 0) return 0; if (!todrop_rate[i]) return 0; if (--todrop_counter[i] > 0) return 0; todrop_counter[i] = todrop_rate[i]; return 1; } static inline bool ip_vs_conn_ops_mode(struct ip_vs_conn *cp) { struct ip_vs_service *svc; if (!cp->dest) return false; svc = rcu_dereference(cp->dest->svc); return svc && (svc->flags & IP_VS_SVC_F_ONEPACKET); } /* Called from keventd and must protect itself from softirqs */ void ip_vs_random_dropentry(struct netns_ipvs *ipvs) { int idx; struct ip_vs_conn *cp; rcu_read_lock(); /* * Randomly scan 1/32 of the whole table every second */ for (idx = 0; idx < (ip_vs_conn_tab_size>>5); idx++) { unsigned int hash = get_random_u32() & ip_vs_conn_tab_mask; hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[hash], c_list) { if (cp->ipvs != ipvs) continue; if (atomic_read(&cp->n_control)) continue; if (cp->flags & IP_VS_CONN_F_TEMPLATE) { /* connection template of OPS */ if (ip_vs_conn_ops_mode(cp)) goto try_drop; if (!(cp->state & IP_VS_CTPL_S_ASSURED)) goto drop; continue; } if (cp->protocol == IPPROTO_TCP) { switch(cp->state) { case IP_VS_TCP_S_SYN_RECV: case IP_VS_TCP_S_SYNACK: break; case 
IP_VS_TCP_S_ESTABLISHED: if (todrop_entry(cp)) break; continue; default: continue; } } else if (cp->protocol == IPPROTO_SCTP) { switch (cp->state) { case IP_VS_SCTP_S_INIT1: case IP_VS_SCTP_S_INIT: break; case IP_VS_SCTP_S_ESTABLISHED: if (todrop_entry(cp)) break; continue; default: continue; } } else { try_drop: if (!todrop_entry(cp)) continue; } drop: IP_VS_DBG(4, "drop connection\n"); ip_vs_conn_del(cp); } cond_resched_rcu(); } rcu_read_unlock(); } /* * Flush all the connection entries in the ip_vs_conn_tab */ static void ip_vs_conn_flush(struct netns_ipvs *ipvs) { int idx; struct ip_vs_conn *cp, *cp_c; flush_again: rcu_read_lock(); for (idx = 0; idx < ip_vs_conn_tab_size; idx++) { hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) { if (cp->ipvs != ipvs) continue; if (atomic_read(&cp->n_control)) continue; cp_c = cp->control; IP_VS_DBG(4, "del connection\n"); ip_vs_conn_del(cp); if (cp_c && !atomic_read(&cp_c->n_control)) { IP_VS_DBG(4, "del controlling connection\n"); ip_vs_conn_del(cp_c); } } cond_resched_rcu(); } rcu_read_unlock(); /* the counter may be not NULL, because maybe some conn entries are run by slow timer handler or unhashed but still referred */ if (atomic_read(&ipvs->conn_count) != 0) { schedule(); goto flush_again; } } #ifdef CONFIG_SYSCTL void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs) { int idx; struct ip_vs_conn *cp, *cp_c; struct ip_vs_dest *dest; rcu_read_lock(); for (idx = 0; idx < ip_vs_conn_tab_size; idx++) { hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) { if (cp->ipvs != ipvs) continue; dest = cp->dest; if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE)) continue; if (atomic_read(&cp->n_control)) continue; cp_c = cp->control; IP_VS_DBG(4, "del connection\n"); ip_vs_conn_del(cp); if (cp_c && !atomic_read(&cp_c->n_control)) { IP_VS_DBG(4, "del controlling connection\n"); ip_vs_conn_del(cp_c); } } cond_resched_rcu(); /* netns clean up started, abort delayed work */ if (!ipvs->enable) break; } 
rcu_read_unlock(); } #endif /* * per netns init and exit */ int __net_init ip_vs_conn_net_init(struct netns_ipvs *ipvs) { atomic_set(&ipvs->conn_count, 0); #ifdef CONFIG_PROC_FS if (!proc_create_net("ip_vs_conn", 0, ipvs->net->proc_net, &ip_vs_conn_seq_ops, sizeof(struct ip_vs_iter_state))) goto err_conn; if (!proc_create_net("ip_vs_conn_sync", 0, ipvs->net->proc_net, &ip_vs_conn_sync_seq_ops, sizeof(struct ip_vs_iter_state))) goto err_conn_sync; #endif return 0; #ifdef CONFIG_PROC_FS err_conn_sync: remove_proc_entry("ip_vs_conn", ipvs->net->proc_net); err_conn: return -ENOMEM; #endif } void __net_exit ip_vs_conn_net_cleanup(struct netns_ipvs *ipvs) { /* flush all the connection entries first */ ip_vs_conn_flush(ipvs); #ifdef CONFIG_PROC_FS remove_proc_entry("ip_vs_conn", ipvs->net->proc_net); remove_proc_entry("ip_vs_conn_sync", ipvs->net->proc_net); #endif } int __init ip_vs_conn_init(void) { size_t tab_array_size; int max_avail; #if BITS_PER_LONG > 32 int max = 27; #else int max = 20; #endif int min = 8; int idx; max_avail = order_base_2(totalram_pages()) + PAGE_SHIFT; max_avail -= 2; /* ~4 in hash row */ max_avail -= 1; /* IPVS up to 1/2 of mem */ max_avail -= order_base_2(sizeof(struct ip_vs_conn)); max = clamp(max_avail, min, max); ip_vs_conn_tab_bits = clamp(ip_vs_conn_tab_bits, min, max); ip_vs_conn_tab_size = 1 << ip_vs_conn_tab_bits; ip_vs_conn_tab_mask = ip_vs_conn_tab_size - 1; /* * Allocate the connection hash table and initialize its list heads */ tab_array_size = array_size(ip_vs_conn_tab_size, sizeof(*ip_vs_conn_tab)); ip_vs_conn_tab = kvmalloc_array(ip_vs_conn_tab_size, sizeof(*ip_vs_conn_tab), GFP_KERNEL); if (!ip_vs_conn_tab) return -ENOMEM; /* Allocate ip_vs_conn slab cache */ ip_vs_conn_cachep = KMEM_CACHE(ip_vs_conn, SLAB_HWCACHE_ALIGN); if (!ip_vs_conn_cachep) { kvfree(ip_vs_conn_tab); return -ENOMEM; } pr_info("Connection hash table configured (size=%d, memory=%zdKbytes)\n", ip_vs_conn_tab_size, tab_array_size / 1024); IP_VS_DBG(0, "Each 
connection entry needs %zd bytes at least\n",
		  sizeof(struct ip_vs_conn));

	for (idx = 0; idx < ip_vs_conn_tab_size; idx++)
		INIT_HLIST_HEAD(&ip_vs_conn_tab[idx]);

	for (idx = 0; idx < CT_LOCKARRAY_SIZE; idx++) {
		spin_lock_init(&__ip_vs_conntbl_lock_array[idx].l);
	}

	/* calculate the random value for connection hash */
	get_random_bytes(&ip_vs_conn_rnd, sizeof(ip_vs_conn_rnd));

	return 0;
}

void ip_vs_conn_cleanup(void)
{
	/* Wait for all ip_vs_conn_rcu_free() callbacks to complete */
	rcu_barrier();
	/* Release the empty cache */
	kmem_cache_destroy(ip_vs_conn_cachep);
	kvfree(ip_vs_conn_tab);
} |
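The hash table sizing done in ip_vs_conn_init() can be tried out in user space. The following is a minimal sketch, not kernel code: `conn_tab_bits()`, `order_base_2_u64()` and `clamp_int()` are hypothetical stand-ins, and the page count, page shift, connection size and word size are passed in as parameters instead of being read from the running system.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for order_base_2(): ceil(log2(n)), 0 for n <= 1. */
static int order_base_2_u64(uint64_t n)
{
	int bits = 0;

	while (bits < 63 && (UINT64_C(1) << bits) < n)
		bits++;
	return bits;
}

/* Hypothetical stand-in for clamp() on plain ints. */
static int clamp_int(int val, int lo, int hi)
{
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/*
 * Mirror the arithmetic of ip_vs_conn_init(): start from log2 of total
 * memory in bytes, subtract 2 (about 4 entries per hash row), 1 (IPVS may
 * use up to half of memory) and log2 of the entry size, then clamp the
 * requested table bits into [8, max].
 */
static int conn_tab_bits(int requested_bits, uint64_t total_ram_pages,
			 int page_shift, uint64_t conn_size, int long_bits)
{
	int min = 8;
	int max = long_bits > 32 ? 27 : 20;
	int max_avail;

	assert(total_ram_pages > 0 && conn_size > 0);

	max_avail = order_base_2_u64(total_ram_pages) + page_shift;
	max_avail -= 2;	/* ~4 in hash row */
	max_avail -= 1;	/* IPVS up to 1/2 of mem */
	max_avail -= order_base_2_u64(conn_size);
	max = clamp_int(max_avail, min, max);

	return clamp_int(requested_bits, min, max);
}
```

With an assumed 4 GiB of RAM (2^20 pages of 4 KiB) and an assumed 320-byte connection entry, the ceiling works out to 20 bits, so a request for 24 bits is reduced to 20 while a request for 12 bits is kept as is.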
// SPDX-License-Identifier: GPL-2.0
/*
 * Block rq-qos policy for assigning an I/O priority class to requests.
 *
 * Using an rq-qos policy for assigning I/O priority class has two advantages
 * over using the ioprio_set() system call:
 *
 * - This policy is cgroup based so it has all the advantages of cgroups.
 * - While ioprio_set() does not affect page cache writeback I/O, this rq-qos
 *   controller affects page cache writeback I/O for filesystems that support
 *   associating a cgroup with writeback I/O. See also
 *   Documentation/admin-guide/cgroup-v2.rst.
 */

#include <linux/blk-mq.h>
#include <linux/blk_types.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include "blk-cgroup.h"
#include "blk-ioprio.h"
#include "blk-rq-qos.h"

/**
 * enum prio_policy - I/O priority class policy.
 * @POLICY_NO_CHANGE: (default) do not modify the I/O priority class.
 * @POLICY_PROMOTE_TO_RT: modify non-IOPRIO_CLASS_RT to IOPRIO_CLASS_RT.
 * @POLICY_RESTRICT_TO_BE: modify IOPRIO_CLASS_NONE and IOPRIO_CLASS_RT into
 *		IOPRIO_CLASS_BE.
 * @POLICY_ALL_TO_IDLE: change the I/O priority class into IOPRIO_CLASS_IDLE.
 * @POLICY_NONE_TO_RT: an alias for POLICY_PROMOTE_TO_RT.
 *
 * See also <linux/ioprio.h>.
 */
enum prio_policy {
	POLICY_NO_CHANGE = 0,
	POLICY_PROMOTE_TO_RT = 1,
	POLICY_RESTRICT_TO_BE = 2,
	POLICY_ALL_TO_IDLE = 3,
	POLICY_NONE_TO_RT = 4,
};

static const char *policy_name[] = {
	[POLICY_NO_CHANGE] = "no-change",
	[POLICY_PROMOTE_TO_RT] = "promote-to-rt",
	[POLICY_RESTRICT_TO_BE] = "restrict-to-be",
	[POLICY_ALL_TO_IDLE] = "idle",
	[POLICY_NONE_TO_RT] = "none-to-rt",
};

static struct blkcg_policy ioprio_policy;

/**
 * struct ioprio_blkcg - Per cgroup data.
 * @cpd: blkcg_policy_data structure.
 * @prio_policy: One of the POLICY_* values of enum prio_policy. Values 1 to 3
 *		match the IOPRIO_CLASS_* constants in <linux/ioprio.h>.
 */
struct ioprio_blkcg {
	struct blkcg_policy_data cpd;
	enum prio_policy prio_policy;
};

static struct ioprio_blkcg *blkcg_to_ioprio_blkcg(struct blkcg *blkcg)
{
	return container_of(blkcg_to_cpd(blkcg, &ioprio_policy),
			    struct ioprio_blkcg, cpd);
}

static struct ioprio_blkcg *
ioprio_blkcg_from_css(struct cgroup_subsys_state *css)
{
	return blkcg_to_ioprio_blkcg(css_to_blkcg(css));
}

static int ioprio_show_prio_policy(struct seq_file *sf, void *v)
{
	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(seq_css(sf));

	seq_printf(sf, "%s\n", policy_name[blkcg->prio_policy]);
	return 0;
}

static ssize_t ioprio_set_prio_policy(struct kernfs_open_file *of, char *buf,
				      size_t nbytes, loff_t off)
{
	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(of_css(of));
	int ret;

	if (off != 0)
		return -EIO;
	/* kernfs_fop_write_iter() terminates 'buf' with '\0'.
*/ ret = sysfs_match_string(policy_name, buf); if (ret < 0) return ret; blkcg->prio_policy = ret; return nbytes; } static struct blkcg_policy_data *ioprio_alloc_cpd(gfp_t gfp) { struct ioprio_blkcg *blkcg; blkcg = kzalloc(sizeof(*blkcg), gfp); if (!blkcg) return NULL; blkcg->prio_policy = POLICY_NO_CHANGE; return &blkcg->cpd; } static void ioprio_free_cpd(struct blkcg_policy_data *cpd) { struct ioprio_blkcg *blkcg = container_of(cpd, typeof(*blkcg), cpd); kfree(blkcg); } static struct cftype ioprio_files[] = { { .name = "prio.class", .seq_show = ioprio_show_prio_policy, .write = ioprio_set_prio_policy, }, { } /* sentinel */ }; static struct blkcg_policy ioprio_policy = { .dfl_cftypes = ioprio_files, .legacy_cftypes = ioprio_files, .cpd_alloc_fn = ioprio_alloc_cpd, .cpd_free_fn = ioprio_free_cpd, }; void blkcg_set_ioprio(struct bio *bio) { struct ioprio_blkcg *blkcg = blkcg_to_ioprio_blkcg(bio->bi_blkg->blkcg); u16 prio; if (!blkcg || blkcg->prio_policy == POLICY_NO_CHANGE) return; if (blkcg->prio_policy == POLICY_PROMOTE_TO_RT || blkcg->prio_policy == POLICY_NONE_TO_RT) { /* * For RT threads, the default priority level is 4 because * task_nice is 0. By promoting non-RT io-priority to RT-class * and default level 4, those requests that are already * RT-class but need a higher io-priority can use ioprio_set() * to achieve this. */ if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) != IOPRIO_CLASS_RT) bio->bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4); return; } /* * Except for IOPRIO_CLASS_NONE, higher I/O priority numbers * correspond to a lower priority. Hence, the max_t() below selects * the lower priority of bi_ioprio and the cgroup I/O priority class. * If the bio I/O priority equals IOPRIO_CLASS_NONE, the cgroup I/O * priority is assigned to the bio. 
*/ prio = max_t(u16, bio->bi_ioprio, IOPRIO_PRIO_VALUE(blkcg->prio_policy, 0)); if (prio > bio->bi_ioprio) bio->bi_ioprio = prio; } static int __init ioprio_init(void) { return blkcg_policy_register(&ioprio_policy); } static void __exit ioprio_exit(void) { blkcg_policy_unregister(&ioprio_policy); } module_init(ioprio_init); module_exit(ioprio_exit); |
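The class selection performed by blkcg_set_ioprio() can be followed with a small user-space sketch. Everything prefixed `sketch_` or `SK_` is hypothetical. The bit layout (class in the bits above a 13-bit shift, level in the low bits) is an assumption modelled on <linux/ioprio.h>, and the real macros may encode additional fields.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: priority class above bit 13, priority level below it. */
#define SKETCH_CLASS_SHIFT 13

enum sketch_class { SK_NONE = 0, SK_RT = 1, SK_BE = 2, SK_IDLE = 3 };

enum sketch_policy {
	SK_NO_CHANGE = 0,
	SK_PROMOTE_TO_RT = 1,
	SK_RESTRICT_TO_BE = 2,
	SK_ALL_TO_IDLE = 3,
	SK_NONE_TO_RT = 4,
};

/* Pack a class and a level into one 16-bit priority value. */
static uint16_t sketch_prio_value(unsigned int cls, unsigned int level)
{
	assert(cls <= SK_IDLE && level < (1u << SKETCH_CLASS_SHIFT));
	return (uint16_t)((cls << SKETCH_CLASS_SHIFT) | level);
}

/* Extract the class from a packed priority value. */
static unsigned int sketch_prio_class(uint16_t prio)
{
	return prio >> SKETCH_CLASS_SHIFT;
}

/* Return the priority a bio would carry after the cgroup policy is applied. */
static uint16_t sketch_apply_policy(uint16_t bio_prio, enum sketch_policy policy)
{
	uint16_t floor;

	if (policy == SK_NO_CHANGE)
		return bio_prio;

	if (policy == SK_PROMOTE_TO_RT || policy == SK_NONE_TO_RT) {
		/* Anything that is not already RT becomes RT at level 4. */
		if (sketch_prio_class(bio_prio) != SK_RT)
			return sketch_prio_value(SK_RT, 4);
		return bio_prio;
	}

	/*
	 * Policies 2 and 3 double as class numbers. A larger packed value
	 * means a lower priority, so taking the maximum never raises it.
	 */
	floor = sketch_prio_value((unsigned int)policy, 0);
	return bio_prio > floor ? bio_prio : floor;
}
```

Under these assumptions an RT request in a restrict-to-be cgroup is demoted to BE level 0, a BE request at level 5 is left alone because its packed value is already larger, and an IDLE request stays IDLE.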
// SPDX-License-Identifier: GPL-2.0-only
/*
 * Page Attribute Table (PAT) support: handle memory caching attributes in page tables.
 *
 * Authors: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
 *          Suresh B Siddha <suresh.b.siddha@intel.com>
 *
 * Loosely based on earlier PAT patchset from Eric Biederman and Andi Kleen.
 *
 * Basic principles:
 *
 * PAT is a CPU feature supported by all modern x86 CPUs, to allow the firmware and
 * the kernel to set one of a handful of 'caching type' attributes for physical
 * memory ranges: uncached, write-combining, write-through, write-protected,
 * and the most commonly used and default attribute: write-back caching.
* * PAT support supersedes and augments MTRR support in a compatible fashion: MTRR is * a hardware interface to enumerate a limited number of physical memory ranges * and set their caching attributes explicitly, programmed into the CPU via MSRs. * Even modern CPUs have MTRRs enabled - but these are typically not touched * by the kernel or by user-space (such as the X server), we rely on PAT for any * additional cache attribute logic. * * PAT doesn't work via explicit memory ranges, but uses page table entries to add * cache attribute information to the mapped memory range: there's 3 bits used, * (_PAGE_PWT, _PAGE_PCD, _PAGE_PAT), with the 8 possible values mapped by the * CPU to actual cache attributes via an MSR loaded into the CPU (MSR_IA32_CR_PAT). * * ( There's a metric ton of finer details, such as compatibility with CPU quirks * that only support 4 types of PAT entries, and interaction with MTRRs, see * below for details. ) */ #include <linux/seq_file.h> #include <linux/memblock.h> #include <linux/debugfs.h> #include <linux/ioport.h> #include <linux/kernel.h> #include <linux/pfn_t.h> #include <linux/slab.h> #include <linux/mm.h> #include <linux/highmem.h> #include <linux/fs.h> #include <linux/rbtree.h> #include <asm/cpu_device_id.h> #include <asm/cacheflush.h> #include <asm/cacheinfo.h> #include <asm/processor.h> #include <asm/tlbflush.h> #include <asm/x86_init.h> #include <asm/fcntl.h> #include <asm/e820/api.h> #include <asm/mtrr.h> #include <asm/page.h> #include <asm/msr.h> #include <asm/memtype.h> #include <asm/io.h> #include "memtype.h" #include "../mm_internal.h" #undef pr_fmt #define pr_fmt(fmt) "" fmt static bool __read_mostly pat_disabled = !IS_ENABLED(CONFIG_X86_PAT); static u64 __ro_after_init pat_msr_val; /* * PAT support is enabled by default, but can be disabled for * various user-requested or hardware-forced reasons: */ static void __init pat_disable(const char *msg_reason) { if (pat_disabled) return; pat_disabled = true; pr_info("x86/PAT: 
%s\n", msg_reason); memory_caching_control &= ~CACHE_PAT; } static int __init nopat(char *str) { pat_disable("PAT support disabled via boot option."); return 0; } early_param("nopat", nopat); bool pat_enabled(void) { return !pat_disabled; } EXPORT_SYMBOL_GPL(pat_enabled); int pat_debug_enable; static int __init pat_debug_setup(char *str) { pat_debug_enable = 1; return 1; } __setup("debugpat", pat_debug_setup); #ifdef CONFIG_X86_PAT /* * X86 PAT uses page flags arch_1 and arch_2 together to keep track of * memory type of pages that have backing page struct. * * X86 PAT supports 4 different memory types: * - _PAGE_CACHE_MODE_WB * - _PAGE_CACHE_MODE_WC * - _PAGE_CACHE_MODE_UC_MINUS * - _PAGE_CACHE_MODE_WT * * _PAGE_CACHE_MODE_WB is the default type. */ #define _PGMT_WB 0 #define _PGMT_WC (1UL << PG_arch_1) #define _PGMT_UC_MINUS (1UL << PG_arch_2) #define _PGMT_WT (1UL << PG_arch_2 | 1UL << PG_arch_1) #define _PGMT_MASK (1UL << PG_arch_2 | 1UL << PG_arch_1) #define _PGMT_CLEAR_MASK (~_PGMT_MASK) static inline enum page_cache_mode get_page_memtype(struct page *pg) { unsigned long pg_flags = pg->flags & _PGMT_MASK; if (pg_flags == _PGMT_WB) return _PAGE_CACHE_MODE_WB; else if (pg_flags == _PGMT_WC) return _PAGE_CACHE_MODE_WC; else if (pg_flags == _PGMT_UC_MINUS) return _PAGE_CACHE_MODE_UC_MINUS; else return _PAGE_CACHE_MODE_WT; } static inline void set_page_memtype(struct page *pg, enum page_cache_mode memtype) { unsigned long memtype_flags; unsigned long old_flags; unsigned long new_flags; switch (memtype) { case _PAGE_CACHE_MODE_WC: memtype_flags = _PGMT_WC; break; case _PAGE_CACHE_MODE_UC_MINUS: memtype_flags = _PGMT_UC_MINUS; break; case _PAGE_CACHE_MODE_WT: memtype_flags = _PGMT_WT; break; case _PAGE_CACHE_MODE_WB: default: memtype_flags = _PGMT_WB; break; } old_flags = READ_ONCE(pg->flags); do { new_flags = (old_flags & _PGMT_CLEAR_MASK) | memtype_flags; } while (!try_cmpxchg(&pg->flags, &old_flags, new_flags)); } #else static inline enum page_cache_mode 
get_page_memtype(struct page *pg) { return -1; } static inline void set_page_memtype(struct page *pg, enum page_cache_mode memtype) { } #endif #define CM(c) (_PAGE_CACHE_MODE_ ## c) static enum page_cache_mode __init pat_get_cache_mode(unsigned int pat_val, char *msg) { enum page_cache_mode cache; char *cache_mode; switch (pat_val) { case X86_MEMTYPE_UC: cache = CM(UC); cache_mode = "UC "; break; case X86_MEMTYPE_WC: cache = CM(WC); cache_mode = "WC "; break; case X86_MEMTYPE_WT: cache = CM(WT); cache_mode = "WT "; break; case X86_MEMTYPE_WP: cache = CM(WP); cache_mode = "WP "; break; case X86_MEMTYPE_WB: cache = CM(WB); cache_mode = "WB "; break; case X86_MEMTYPE_UC_MINUS: cache = CM(UC_MINUS); cache_mode = "UC- "; break; default: cache = CM(WB); cache_mode = "WB "; break; } memcpy(msg, cache_mode, 4); return cache; } #undef CM /* * Update the cache mode to pgprot translation tables according to PAT * configuration. * Using lower indices is preferred, so we start with highest index. */ static void __init init_cache_modes(u64 pat) { enum page_cache_mode cache; char pat_msg[33]; int i; pat_msg[32] = 0; for (i = 7; i >= 0; i--) { cache = pat_get_cache_mode((pat >> (i * 8)) & 7, pat_msg + 4 * i); update_cache_mode_entry(i, cache); } pr_info("x86/PAT: Configuration [0-7]: %s\n", pat_msg); } void pat_cpu_init(void) { if (!boot_cpu_has(X86_FEATURE_PAT)) { /* * If this happens we are on a secondary CPU, but switched to * PAT on the boot CPU. We have no way to undo PAT. */ panic("x86/PAT: PAT enabled, but not supported by secondary CPU\n"); } wrmsrl(MSR_IA32_CR_PAT, pat_msr_val); __flush_tlb_all(); } /** * pat_bp_init - Initialize the PAT MSR value and PAT table * * This function initializes PAT MSR value and PAT table with an OS-defined * value to enable additional cache attributes, WC, WT and WP. * * This function prepares the calls of pat_cpu_init() via cache_cpu_init() * on all CPUs. 
*/ void __init pat_bp_init(void) { struct cpuinfo_x86 *c = &boot_cpu_data; if (!IS_ENABLED(CONFIG_X86_PAT)) pr_info_once("x86/PAT: PAT support disabled because CONFIG_X86_PAT is disabled in the kernel.\n"); if (!cpu_feature_enabled(X86_FEATURE_PAT)) pat_disable("PAT not supported by the CPU."); else rdmsrl(MSR_IA32_CR_PAT, pat_msr_val); if (!pat_msr_val) { pat_disable("PAT support disabled by the firmware."); /* * No PAT. Emulate the PAT table that corresponds to the two * cache bits, PWT (Write Through) and PCD (Cache Disable). * This setup is also the same as the BIOS default setup. * * PTE encoding: * * PCD * |PWT PAT * || slot * 00 0 WB : _PAGE_CACHE_MODE_WB * 01 1 WT : _PAGE_CACHE_MODE_WT * 10 2 UC-: _PAGE_CACHE_MODE_UC_MINUS * 11 3 UC : _PAGE_CACHE_MODE_UC * * NOTE: When WC or WP is used, it is redirected to UC- per * the default setup in __cachemode2pte_tbl[]. */ pat_msr_val = PAT_VALUE(WB, WT, UC_MINUS, UC, WB, WT, UC_MINUS, UC); } /* * Xen PV doesn't allow to set PAT MSR, but all cache modes are * supported. */ if (pat_disabled || cpu_feature_enabled(X86_FEATURE_XENPV)) { init_cache_modes(pat_msr_val); return; } if ((c->x86_vfm >= INTEL_PENTIUM_PRO && c->x86_vfm <= INTEL_PENTIUM_M_DOTHAN) || (c->x86_vfm >= INTEL_P4_WILLAMETTE && c->x86_vfm <= INTEL_P4_CEDARMILL)) { /* * PAT support with the lower four entries. Intel Pentium 2, * 3, M, and 4 are affected by PAT errata, which makes the * upper four entries unusable. To be on the safe side, we don't * use those. * * PTE encoding: * PAT * |PCD * ||PWT PAT * ||| slot * 000 0 WB : _PAGE_CACHE_MODE_WB * 001 1 WC : _PAGE_CACHE_MODE_WC * 010 2 UC-: _PAGE_CACHE_MODE_UC_MINUS * 011 3 UC : _PAGE_CACHE_MODE_UC * PAT bit unused * * NOTE: When WT or WP is used, it is redirected to UC- per * the default setup in __cachemode2pte_tbl[]. */ pat_msr_val = PAT_VALUE(WB, WC, UC_MINUS, UC, WB, WC, UC_MINUS, UC); } else { /* * Full PAT support. 
We put WT in slot 7 to improve * robustness in the presence of errata that might cause * the high PAT bit to be ignored. This way, a buggy slot 7 * access will hit slot 3, and slot 3 is UC, so at worst * we lose performance without causing a correctness issue. * Pentium 4 erratum N46 is an example for such an erratum, * although we try not to use PAT at all on affected CPUs. * * PTE encoding: * PAT * |PCD * ||PWT PAT * ||| slot * 000 0 WB : _PAGE_CACHE_MODE_WB * 001 1 WC : _PAGE_CACHE_MODE_WC * 010 2 UC-: _PAGE_CACHE_MODE_UC_MINUS * 011 3 UC : _PAGE_CACHE_MODE_UC * 100 4 WB : Reserved * 101 5 WP : _PAGE_CACHE_MODE_WP * 110 6 UC-: Reserved * 111 7 WT : _PAGE_CACHE_MODE_WT * * The reserved slots are unused, but mapped to their * corresponding types in the presence of PAT errata. */ pat_msr_val = PAT_VALUE(WB, WC, UC_MINUS, UC, WB, WP, UC_MINUS, WT); } memory_caching_control |= CACHE_PAT; init_cache_modes(pat_msr_val); } static DEFINE_SPINLOCK(memtype_lock); /* protects memtype accesses */ /* * Does intersection of PAT memory type and MTRR memory type and returns * the resulting memory type as PAT understands it. * (Type in pat and mtrr will not have same value) * The intersection is based on "Effective Memory Type" tables in IA-32 * SDM vol 3a */ static unsigned long pat_x_mtrr_type(u64 start, u64 end, enum page_cache_mode req_type) { /* * Look for MTRR hint to get the effective type in case where PAT * request is for WB. 
*/ if (req_type == _PAGE_CACHE_MODE_WB) { u8 mtrr_type, uniform; mtrr_type = mtrr_type_lookup(start, end, &uniform); if (mtrr_type != MTRR_TYPE_WRBACK) return _PAGE_CACHE_MODE_UC_MINUS; return _PAGE_CACHE_MODE_WB; } return req_type; } struct pagerange_state { unsigned long cur_pfn; int ram; int not_ram; }; static int pagerange_is_ram_callback(unsigned long initial_pfn, unsigned long total_nr_pages, void *arg) { struct pagerange_state *state = arg; state->not_ram |= initial_pfn > state->cur_pfn; state->ram |= total_nr_pages > 0; state->cur_pfn = initial_pfn + total_nr_pages; return state->ram && state->not_ram; } static int pat_pagerange_is_ram(resource_size_t start, resource_size_t end) { int ret = 0; unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long end_pfn = (end + PAGE_SIZE - 1) >> PAGE_SHIFT; struct pagerange_state state = {start_pfn, 0, 0}; /* * For legacy reasons, physical address range in the legacy ISA * region is tracked as non-RAM. This will allow users of * /dev/mem to map portions of legacy ISA region, even when * some of those portions are listed(or not even listed) with * different e820 types(RAM/reserved/..) */ if (start_pfn < ISA_END_ADDRESS >> PAGE_SHIFT) start_pfn = ISA_END_ADDRESS >> PAGE_SHIFT; if (start_pfn < end_pfn) { ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, &state, pagerange_is_ram_callback); } return (ret > 0) ? -1 : (state.ram ? 1 : 0); } /* * For RAM pages, we use page flags to mark the pages with appropriate type. * The page flags are limited to four types, WB (default), WC, WT and UC-. * WP request fails with -EINVAL, and UC gets redirected to UC-. Setting * a new memory type is only allowed for a page mapped with the default WB * type. * * Here we do two passes: * - Find the memtype of all the pages in the range, look for any conflicts. * - In case of no conflicts, set the new memtype for pages in the range. 
 */
static int reserve_ram_pages_type(u64 start, u64 end,
				  enum page_cache_mode req_type,
				  enum page_cache_mode *new_type)
{
	struct page *page;
	u64 pfn;

	if (req_type == _PAGE_CACHE_MODE_WP) {
		if (new_type)
			*new_type = _PAGE_CACHE_MODE_UC_MINUS;
		return -EINVAL;
	}

	if (req_type == _PAGE_CACHE_MODE_UC) {
		/* We do not support strong UC */
		WARN_ON_ONCE(1);
		req_type = _PAGE_CACHE_MODE_UC_MINUS;
	}

	for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) {
		enum page_cache_mode type;

		page = pfn_to_page(pfn);
		type = get_page_memtype(page);
		if (type != _PAGE_CACHE_MODE_WB) {
			pr_info("x86/PAT: reserve_ram_pages_type failed [mem %#010Lx-%#010Lx], track 0x%x, req 0x%x\n",
				start, end - 1, type, req_type);
			if (new_type)
				*new_type = type;

			return -EBUSY;
		}
	}

	if (new_type)
		*new_type = req_type;

	for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) {
		page = pfn_to_page(pfn);
		set_page_memtype(page, req_type);
	}
	return 0;
}

static int free_ram_pages_type(u64 start, u64 end)
{
	struct page *page;
	u64 pfn;

	for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) {
		page = pfn_to_page(pfn);
		set_page_memtype(page, _PAGE_CACHE_MODE_WB);
	}
	return 0;
}

static u64 sanitize_phys(u64 address)
{
	/*
	 * When changing the memtype for pages containing poison allow
	 * for a "decoy" virtual address (bit 63 clear) passed to
	 * set_memory_X(). __pa() on a "decoy" address results in a
	 * physical address with bit 63 set.
	 *
	 * Decoy addresses are not present for 32-bit builds, see
	 * set_mce_nospec().
	 */
	if (IS_ENABLED(CONFIG_X86_64))
		return address & __PHYSICAL_MASK;
	return address;
}

/*
 * req_type typically has one of the:
 * - _PAGE_CACHE_MODE_WB
 * - _PAGE_CACHE_MODE_WC
 * - _PAGE_CACHE_MODE_UC_MINUS
 * - _PAGE_CACHE_MODE_UC
 * - _PAGE_CACHE_MODE_WT
 *
 * If new_type is NULL, function will return an error if it cannot reserve the
 * region with req_type. If new_type is non-NULL, function will return
 * available type in new_type in case of no error. In case of any error
 * it will return a negative return value.
 */
int memtype_reserve(u64 start, u64 end, enum page_cache_mode req_type,
		    enum page_cache_mode *new_type)
{
	struct memtype *entry_new;
	enum page_cache_mode actual_type;
	int is_range_ram;
	int err = 0;

	start = sanitize_phys(start);

	/*
	 * The end address passed into this function is exclusive, but
	 * sanitize_phys() expects an inclusive address.
	 */
	end = sanitize_phys(end - 1) + 1;
	if (start >= end) {
		WARN(1, "%s failed: [mem %#010Lx-%#010Lx], req %s\n", __func__,
		     start, end - 1, cattr_name(req_type));
		return -EINVAL;
	}

	if (!pat_enabled()) {
		/* This is identical to page table setting without PAT */
		if (new_type)
			*new_type = req_type;
		return 0;
	}

	/* Low ISA region is always mapped WB in page table. No need to track */
	if (x86_platform.is_untracked_pat_range(start, end)) {
		if (new_type)
			*new_type = _PAGE_CACHE_MODE_WB;
		return 0;
	}

	/*
	 * Call mtrr_lookup to get the type hint. This is an
	 * optimization for /dev/mem mmap'ers into WB memory (BIOS
	 * tools and ACPI tools). Use WB request for WB memory and use
	 * UC_MINUS otherwise.
	 */
	actual_type = pat_x_mtrr_type(start, end, req_type);

	if (new_type)
		*new_type = actual_type;

	is_range_ram = pat_pagerange_is_ram(start, end);
	if (is_range_ram == 1) {
		err = reserve_ram_pages_type(start, end, req_type, new_type);

		return err;
	} else if (is_range_ram < 0) {
		return -EINVAL;
	}

	entry_new = kzalloc(sizeof(struct memtype), GFP_KERNEL);
	if (!entry_new)
		return -ENOMEM;

	entry_new->start = start;
	entry_new->end = end;
	entry_new->type = actual_type;

	spin_lock(&memtype_lock);

	err = memtype_check_insert(entry_new, new_type);
	if (err) {
		pr_info("x86/PAT: memtype_reserve failed [mem %#010Lx-%#010Lx], track %s, req %s\n",
			start, end - 1,
			cattr_name(entry_new->type), cattr_name(req_type));
		kfree(entry_new);
		spin_unlock(&memtype_lock);

		return err;
	}

	spin_unlock(&memtype_lock);

	dprintk("memtype_reserve added [mem %#010Lx-%#010Lx], track %s, req %s, ret %s\n",
		start, end - 1, cattr_name(entry_new->type), cattr_name(req_type),
		new_type ? cattr_name(*new_type) : "-");

	return err;
}

int memtype_free(u64 start, u64 end)
{
	int is_range_ram;
	struct memtype *entry_old;

	if (!pat_enabled())
		return 0;

	start = sanitize_phys(start);
	end = sanitize_phys(end);

	/* Low ISA region is always mapped WB. No need to track */
	if (x86_platform.is_untracked_pat_range(start, end))
		return 0;

	is_range_ram = pat_pagerange_is_ram(start, end);
	if (is_range_ram == 1)
		return free_ram_pages_type(start, end);
	if (is_range_ram < 0)
		return -EINVAL;

	spin_lock(&memtype_lock);
	entry_old = memtype_erase(start, end);
	spin_unlock(&memtype_lock);

	if (IS_ERR(entry_old)) {
		pr_info("x86/PAT: %s:%d freeing invalid memtype [mem %#010Lx-%#010Lx]\n",
			current->comm, current->pid, start, end - 1);
		return -EINVAL;
	}

	kfree(entry_old);

	dprintk("memtype_free request [mem %#010Lx-%#010Lx]\n", start, end - 1);

	return 0;
}

/**
 * lookup_memtype - Looks up the memory type for a physical address
 * @paddr: physical address of which memory type needs to be looked up
 *
 * Only to be called when PAT is enabled
 *
 * Returns _PAGE_CACHE_MODE_WB, _PAGE_CACHE_MODE_WC, _PAGE_CACHE_MODE_UC_MINUS
 * or _PAGE_CACHE_MODE_WT.
 */
static enum page_cache_mode lookup_memtype(u64 paddr)
{
	enum page_cache_mode rettype = _PAGE_CACHE_MODE_WB;
	struct memtype *entry;

	if (x86_platform.is_untracked_pat_range(paddr, paddr + PAGE_SIZE))
		return rettype;

	if (pat_pagerange_is_ram(paddr, paddr + PAGE_SIZE)) {
		struct page *page;

		page = pfn_to_page(paddr >> PAGE_SHIFT);
		return get_page_memtype(page);
	}

	spin_lock(&memtype_lock);

	entry = memtype_lookup(paddr);
	if (entry != NULL)
		rettype = entry->type;
	else
		rettype = _PAGE_CACHE_MODE_UC_MINUS;

	spin_unlock(&memtype_lock);

	return rettype;
}

/**
 * pat_pfn_immune_to_uc_mtrr - Check whether the PAT memory type
 * of @pfn cannot be overridden by UC MTRR memory type.
 *
 * Only to be called when PAT is enabled.
 *
 * Returns true, if the PAT memory type of @pfn is UC, UC-, or WC.
 * Returns false in other cases.
 */
bool pat_pfn_immune_to_uc_mtrr(unsigned long pfn)
{
	enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));

	return cm == _PAGE_CACHE_MODE_UC ||
	       cm == _PAGE_CACHE_MODE_UC_MINUS ||
	       cm == _PAGE_CACHE_MODE_WC;
}
EXPORT_SYMBOL_GPL(pat_pfn_immune_to_uc_mtrr);

/**
 * memtype_reserve_io - Request a memory type mapping for a region of memory
 * @start: start (physical address) of the region
 * @end: end (physical address) of the region
 * @type: A pointer to memtype, with requested type. On success, requested
 * or any other compatible type that was available for the region is returned
 *
 * On success, returns 0
 * On failure, returns non-zero
 */
int memtype_reserve_io(resource_size_t start, resource_size_t end,
		       enum page_cache_mode *type)
{
	resource_size_t size = end - start;
	enum page_cache_mode req_type = *type;
	enum page_cache_mode new_type;
	int ret;

	WARN_ON_ONCE(iomem_map_sanity_check(start, size));

	ret = memtype_reserve(start, end, req_type, &new_type);
	if (ret)
		goto out_err;

	if (!is_new_memtype_allowed(start, size, req_type, new_type))
		goto out_free;

	if (memtype_kernel_map_sync(start, size, new_type) < 0)
		goto out_free;

	*type = new_type;
	return 0;

out_free:
	memtype_free(start, end);
	ret = -EBUSY;
out_err:
	return ret;
}

/**
 * memtype_free_io - Release a memory type mapping for a region of memory
 * @start: start (physical address) of the region
 * @end: end (physical address) of the region
 */
void memtype_free_io(resource_size_t start, resource_size_t end)
{
	memtype_free(start, end);
}

#ifdef CONFIG_X86_PAT
int arch_io_reserve_memtype_wc(resource_size_t start, resource_size_t size)
{
	enum page_cache_mode type = _PAGE_CACHE_MODE_WC;

	return memtype_reserve_io(start, start + size, &type);
}
EXPORT_SYMBOL(arch_io_reserve_memtype_wc);

void arch_io_free_memtype_wc(resource_size_t start, resource_size_t size)
{
	memtype_free_io(start, start + size);
}
EXPORT_SYMBOL(arch_io_free_memtype_wc);
#endif

pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
			      unsigned long size, pgprot_t vma_prot)
{
	if (!phys_mem_access_encrypted(pfn << PAGE_SHIFT, size))
		vma_prot = pgprot_decrypted(vma_prot);

	return vma_prot;
}

#ifdef CONFIG_STRICT_DEVMEM
/* This check is done in drivers/char/mem.c in case of STRICT_DEVMEM */
static inline int range_is_allowed(unsigned long pfn, unsigned long size)
{
	return 1;
}
#else
/* This check is needed to avoid cache aliasing when PAT is enabled */
static inline int range_is_allowed(unsigned long pfn, unsigned long size)
{
	u64 from = ((u64)pfn) << PAGE_SHIFT;
	u64 to = from + size;
	u64 cursor = from;

	if (!pat_enabled())
		return 1;

	while (cursor < to) {
		if (!devmem_is_allowed(pfn))
			return 0;
		cursor += PAGE_SIZE;
		pfn++;
	}
	return 1;
}
#endif /* CONFIG_STRICT_DEVMEM */

int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
				 unsigned long size, pgprot_t *vma_prot)
{
	enum page_cache_mode pcm = _PAGE_CACHE_MODE_WB;

	if (!range_is_allowed(pfn, size))
		return 0;

	if (file->f_flags & O_DSYNC)
		pcm = _PAGE_CACHE_MODE_UC_MINUS;

	*vma_prot = __pgprot((pgprot_val(*vma_prot) & ~_PAGE_CACHE_MASK) |
			     cachemode2protval(pcm));
	return 1;
}

/*
 * Change the memory type for the physical address range in kernel identity
 * mapping space if that range is a part of identity map.
 */
int memtype_kernel_map_sync(u64 base, unsigned long size,
			    enum page_cache_mode pcm)
{
	unsigned long id_sz;

	if (base > __pa(high_memory-1))
		return 0;

	/*
	 * Some areas in the middle of the kernel identity range
	 * are not mapped, for example the PCI space.
	 */
	if (!page_is_ram(base >> PAGE_SHIFT))
		return 0;

	id_sz = (__pa(high_memory-1) <= base + size) ?
				__pa(high_memory) - base : size;

	if (ioremap_change_attr((unsigned long)__va(base), id_sz, pcm) < 0) {
		pr_info("x86/PAT: %s:%d ioremap_change_attr failed %s for [mem %#010Lx-%#010Lx]\n",
			current->comm, current->pid,
			cattr_name(pcm),
			base, (unsigned long long)(base + size-1));
		return -EINVAL;
	}
	return 0;
}

/*
 * Internal interface to reserve a range of physical memory with prot.
 * Reserved non RAM regions only and after successful memtype_reserve,
 * this func also keeps identity mapping (if any) in sync with this new prot.
 */
static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
			     int strict_prot)
{
	int is_ram = 0;
	int ret;
	enum page_cache_mode want_pcm = pgprot2cachemode(*vma_prot);
	enum page_cache_mode pcm = want_pcm;

	is_ram = pat_pagerange_is_ram(paddr, paddr + size);

	/*
	 * reserve_pfn_range() for RAM pages. We do not refcount to keep
	 * track of number of mappings of RAM pages. We can assert that
	 * the type requested matches the type of first page in the range.
	 */
	if (is_ram) {
		if (!pat_enabled())
			return 0;

		pcm = lookup_memtype(paddr);
		if (want_pcm != pcm) {
			pr_warn("x86/PAT: %s:%d map pfn RAM range req %s for [mem %#010Lx-%#010Lx], got %s\n",
				current->comm, current->pid,
				cattr_name(want_pcm),
				(unsigned long long)paddr,
				(unsigned long long)(paddr + size - 1),
				cattr_name(pcm));
			*vma_prot = __pgprot((pgprot_val(*vma_prot) &
					      (~_PAGE_CACHE_MASK)) |
					     cachemode2protval(pcm));
		}
		return 0;
	}

	ret = memtype_reserve(paddr, paddr + size, want_pcm, &pcm);
	if (ret)
		return ret;

	if (pcm != want_pcm) {
		if (strict_prot ||
		    !is_new_memtype_allowed(paddr, size, want_pcm, pcm)) {
			memtype_free(paddr, paddr + size);
			pr_err("x86/PAT: %s:%d map pfn expected mapping type %s for [mem %#010Lx-%#010Lx], got %s\n",
			       current->comm, current->pid,
			       cattr_name(want_pcm),
			       (unsigned long long)paddr,
			       (unsigned long long)(paddr + size - 1),
			       cattr_name(pcm));
			return -EINVAL;
		}
		/*
		 * We allow returning different type than the one requested in
		 * non strict case.
		 */
		*vma_prot = __pgprot((pgprot_val(*vma_prot) &
				      (~_PAGE_CACHE_MASK)) |
				     cachemode2protval(pcm));
	}

	if (memtype_kernel_map_sync(paddr, size, pcm) < 0) {
		memtype_free(paddr, paddr + size);
		return -EINVAL;
	}
	return 0;
}

/*
 * Internal interface to free a range of physical memory.
 * Frees non RAM regions only.
 */
static void free_pfn_range(u64 paddr, unsigned long size)
{
	int is_ram;

	is_ram = pat_pagerange_is_ram(paddr, paddr + size);
	if (is_ram == 0)
		memtype_free(paddr, paddr + size);
}

static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
		       resource_size_t *phys)
{
	struct follow_pfnmap_args args = { .vma = vma, .address = vma->vm_start };

	if (follow_pfnmap_start(&args))
		return -EINVAL;

	/* Never return PFNs of anon folios in COW mappings. */
	if (!args.special) {
		follow_pfnmap_end(&args);
		return -EINVAL;
	}

	*prot = pgprot_val(args.pgprot);
	*phys = (resource_size_t)args.pfn << PAGE_SHIFT;
	follow_pfnmap_end(&args);
	return 0;
}

static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
			pgprot_t *pgprot)
{
	unsigned long prot;

	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_PAT));

	/*
	 * We need the starting PFN and cachemode used for track_pfn_remap()
	 * that covered the whole VMA. For most mappings, we can obtain that
	 * information from the page tables. For COW mappings, we might now
	 * suddenly have anon folios mapped and follow_phys() will fail.
	 *
	 * Fallback to using vma->vm_pgoff, see remap_pfn_range_notrack(), to
	 * detect the PFN. If we need the cachemode as well, we're out of luck
	 * for now and have to fail fork().
	 */
	if (!follow_phys(vma, &prot, paddr)) {
		if (pgprot)
			*pgprot = __pgprot(prot);
		return 0;
	}
	if (is_cow_mapping(vma->vm_flags)) {
		if (pgprot)
			return -EINVAL;
		*paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT;
		return 0;
	}
	WARN_ON_ONCE(1);
	return -EINVAL;
}

int track_pfn_copy(struct vm_area_struct *dst_vma,
		   struct vm_area_struct *src_vma, unsigned long *pfn)
{
	const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
	resource_size_t paddr;
	pgprot_t pgprot;
	int rc;

	if (!(src_vma->vm_flags & VM_PAT))
		return 0;

	/*
	 * Duplicate the PAT information for the dst VMA based on the src
	 * VMA.
	 */
	if (get_pat_info(src_vma, &paddr, &pgprot))
		return -EINVAL;
	rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
	if (rc)
		return rc;

	/* Reservation for the destination VMA succeeded. */
	vm_flags_set(dst_vma, VM_PAT);
	*pfn = PHYS_PFN(paddr);
	return 0;
}

void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
{
	untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
	/*
	 * Reservation was freed, any copied page tables will get cleaned
	 * up later, but without getting PAT involved again.
	 */
}

/*
 * prot is passed in as a parameter for the new mapping. If the vma has
 * a linear pfn mapping for the entire range, or no vma is provided,
 * reserve the entire pfn + size range with single reserve_pfn_range
 * call.
 */
int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
		    unsigned long pfn, unsigned long addr, unsigned long size)
{
	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
	enum page_cache_mode pcm;

	/* reserve the whole chunk starting from paddr */
	if (!vma || (addr == vma->vm_start &&
		     size == (vma->vm_end - vma->vm_start))) {
		int ret;

		ret = reserve_pfn_range(paddr, size, prot, 0);
		if (ret == 0 && vma)
			vm_flags_set(vma, VM_PAT);
		return ret;
	}

	if (!pat_enabled())
		return 0;

	/*
	 * For anything smaller than the vma size we set prot based on the
	 * lookup.
	 */
	pcm = lookup_memtype(paddr);

	/* Check memtype for the remaining pages */
	while (size > PAGE_SIZE) {
		size -= PAGE_SIZE;
		paddr += PAGE_SIZE;
		if (pcm != lookup_memtype(paddr))
			return -EINVAL;
	}

	*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
			 cachemode2protval(pcm));

	return 0;
}

void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
{
	enum page_cache_mode pcm;

	if (!pat_enabled())
		return;

	/* Set prot based on lookup */
	pcm = lookup_memtype(pfn_t_to_phys(pfn));
	*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
			 cachemode2protval(pcm));
}

/*
 * untrack_pfn is called while unmapping a pfnmap for a region.
 * untrack can be called for a specific region indicated by pfn and size or
 * can be for the entire vma (in which case pfn, size are zero).
 */
void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
		 unsigned long size, bool mm_wr_locked)
{
	resource_size_t paddr;

	if (vma && !(vma->vm_flags & VM_PAT))
		return;

	/* free the chunk starting from pfn or the whole chunk */
	paddr = (resource_size_t)pfn << PAGE_SHIFT;
	if (!paddr && !size) {
		if (get_pat_info(vma, &paddr, NULL))
			return;
		size = vma->vm_end - vma->vm_start;
	}
	free_pfn_range(paddr, size);
	if (vma) {
		if (mm_wr_locked)
			vm_flags_clear(vma, VM_PAT);
		else
			__vm_flags_mod(vma, 0, VM_PAT);
	}
}

void untrack_pfn_clear(struct vm_area_struct *vma)
{
	vm_flags_clear(vma, VM_PAT);
}

pgprot_t pgprot_writecombine(pgprot_t prot)
{
	return __pgprot(pgprot_val(prot) |
			cachemode2protval(_PAGE_CACHE_MODE_WC));
}
EXPORT_SYMBOL_GPL(pgprot_writecombine);

pgprot_t pgprot_writethrough(pgprot_t prot)
{
	return __pgprot(pgprot_val(prot) |
			cachemode2protval(_PAGE_CACHE_MODE_WT));
}
EXPORT_SYMBOL_GPL(pgprot_writethrough);

#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_X86_PAT)

/*
 * We are allocating a temporary printout-entry to be passed
 * between seq_start()/next() and seq_show():
 */
static struct memtype *memtype_get_idx(loff_t pos)
{
	struct memtype *entry_print;
	int ret;

	entry_print = kzalloc(sizeof(struct memtype), GFP_KERNEL);
	if (!entry_print)
		return NULL;

	spin_lock(&memtype_lock);
	ret = memtype_copy_nth_element(entry_print, pos);
	spin_unlock(&memtype_lock);

	/* Free it on error: */
	if (ret) {
		kfree(entry_print);
		return NULL;
	}

	return entry_print;
}

static void *memtype_seq_start(struct seq_file *seq, loff_t *pos)
{
	if (*pos == 0) {
		++*pos;
		seq_puts(seq, "PAT memtype list:\n");
	}

	return memtype_get_idx(*pos);
}

static void *memtype_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
	kfree(v);
	++*pos;
	return memtype_get_idx(*pos);
}

static void memtype_seq_stop(struct seq_file *seq, void *v)
{
	kfree(v);
}

static int memtype_seq_show(struct seq_file *seq, void *v)
{
	struct memtype *entry_print = (struct memtype *)v;

	seq_printf(seq, "PAT: [mem 0x%016Lx-0x%016Lx] %s\n",
		   entry_print->start,
		   entry_print->end,
		   cattr_name(entry_print->type));

	return 0;
}

static const struct seq_operations memtype_seq_ops = {
	.start = memtype_seq_start,
	.next = memtype_seq_next,
	.stop = memtype_seq_stop,
	.show = memtype_seq_show,
};

static int memtype_seq_open(struct inode *inode, struct file *file)
{
	return seq_open(file, &memtype_seq_ops);
}

static const struct file_operations memtype_fops = {
	.open = memtype_seq_open,
	.read = seq_read,
	.llseek = seq_lseek,
	.release = seq_release,
};

static int __init pat_memtype_list_init(void)
{
	if (pat_enabled()) {
		debugfs_create_file("pat_memtype_list", S_IRUSR,
				    arch_debugfs_dir, NULL, &memtype_fops);
	}
	return 0;
}
late_initcall(pat_memtype_list_init);

#endif /* CONFIG_DEBUG_FS && CONFIG_X86_PAT */
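/*
 * Illustrative sketch, not part of the kernel source. The "PTE encoding"
 * table in the PAT setup comment earlier in this file can be read as a
 * 3-bit index (PAT:PCD:PWT) into eight PAT slots. The user-space fragment
 * below models that lookup, plus the erratum case the comment describes,
 * where the high PAT bit is ignored and a slot 7 (WT) access lands on
 * slot 3 (UC). All demo_* names are hypothetical and exist only for this
 * example; the slot contents are copied from the table.
 */

```c
#include <assert.h>

/* Hypothetical stand-ins for the memory types named in the table. */
enum demo_memtype {
	DEMO_WB, DEMO_WC, DEMO_UC_MINUS, DEMO_UC, DEMO_WP, DEMO_WT
};

/* Slot contents 0..7, in the order given by the PTE encoding table. */
static const enum demo_memtype demo_pat_slots[8] = {
	DEMO_WB, DEMO_WC, DEMO_UC_MINUS, DEMO_UC,
	DEMO_WB, DEMO_WP, DEMO_UC_MINUS, DEMO_WT,
};

/* The slot index is the 3-bit value formed as PAT:PCD:PWT. */
static unsigned int demo_pat_slot(unsigned int pat, unsigned int pcd,
				  unsigned int pwt)
{
	return ((pat & 1u) << 2) | ((pcd & 1u) << 1) | (pwt & 1u);
}

/* Memory type selected by a PTE carrying the given three bits. */
static enum demo_memtype demo_pat_lookup(unsigned int pat, unsigned int pcd,
					 unsigned int pwt)
{
	return demo_pat_slots[demo_pat_slot(pat, pcd, pwt)];
}

/*
 * Memory type seen when an erratum makes the CPU ignore the high PAT
 * bit: the lookup behaves as if PAT were 0.
 */
static enum demo_memtype demo_pat_lookup_buggy(unsigned int pat,
					       unsigned int pcd,
					       unsigned int pwt)
{
	(void)pat;
	return demo_pat_lookup(0, pcd, pwt);
}

/* Self-check of the two claims made in the PAT setup comment. */
static void demo_pat_selfcheck(void)
{
	assert(demo_pat_lookup(1, 1, 1) == DEMO_WT);       /* slot 7 is WT */
	assert(demo_pat_lookup_buggy(1, 1, 1) == DEMO_UC); /* falls to slot 3 */
}
```

/*
 * Under these assumptions, a buggy WT access degrades to UC, which costs
 * performance but keeps the access uncached, matching the reasoning in
 * the comment.
 */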
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * drivers/net/bond/bond_netlink.c - Netlink interface for bonding
 * Copyright (c) 2013 Jiri Pirko <jiri@resnulli.us>
 * Copyright (c) 2013 Scott Feldman <sfeldma@cumulusnetworks.com>
 */

#include <linux/module.h>
#include <linux/errno.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/if_link.h>
#include <linux/if_ether.h>
#include <net/netlink.h>
#include <net/rtnetlink.h>
#include <net/bonding.h>
#include <net/ipv6.h>

static size_t bond_get_slave_size(const struct net_device *bond_dev,
				  const struct net_device *slave_dev)
{
	return nla_total_size(sizeof(u8)) +	/* IFLA_BOND_SLAVE_STATE */
		nla_total_size(sizeof(u8)) +	/* IFLA_BOND_SLAVE_MII_STATUS */
		nla_total_size(sizeof(u32)) +	/* IFLA_BOND_SLAVE_LINK_FAILURE_COUNT */
		nla_total_size(MAX_ADDR_LEN) +	/* IFLA_BOND_SLAVE_PERM_HWADDR */
		nla_total_size(sizeof(u16)) +	/* IFLA_BOND_SLAVE_QUEUE_ID */
		nla_total_size(sizeof(u16)) +	/* IFLA_BOND_SLAVE_AD_AGGREGATOR_ID */
		nla_total_size(sizeof(u8)) +	/* IFLA_BOND_SLAVE_AD_ACTOR_OPER_PORT_STATE */
		nla_total_size(sizeof(u16)) +	/* IFLA_BOND_SLAVE_AD_PARTNER_OPER_PORT_STATE */
		nla_total_size(sizeof(s32)) +	/* IFLA_BOND_SLAVE_PRIO */
		0;
}

static int bond_fill_slave_info(struct sk_buff *skb,
				const struct net_device *bond_dev,
				const struct net_device *slave_dev)
{
	struct slave *slave = bond_slave_get_rtnl(slave_dev);

	if (nla_put_u8(skb, IFLA_BOND_SLAVE_STATE, bond_slave_state(slave)))
		goto nla_put_failure;

	if (nla_put_u8(skb, IFLA_BOND_SLAVE_MII_STATUS, slave->link))
		goto nla_put_failure;

	if (nla_put_u32(skb, IFLA_BOND_SLAVE_LINK_FAILURE_COUNT,
			slave->link_failure_count))
		goto nla_put_failure;

	if (nla_put(skb, IFLA_BOND_SLAVE_PERM_HWADDR,
		    slave_dev->addr_len, slave->perm_hwaddr))
		goto nla_put_failure;

	if (nla_put_u16(skb, IFLA_BOND_SLAVE_QUEUE_ID,
			READ_ONCE(slave->queue_id)))
		goto nla_put_failure;

	if (nla_put_s32(skb, IFLA_BOND_SLAVE_PRIO, slave->prio))
		goto nla_put_failure;

	if (BOND_MODE(slave->bond) == BOND_MODE_8023AD) {
		const struct aggregator *agg;
		const struct port *ad_port;

		ad_port = &SLAVE_AD_INFO(slave)->port;
		agg = SLAVE_AD_INFO(slave)->port.aggregator;
		if (agg) {
			if (nla_put_u16(skb, IFLA_BOND_SLAVE_AD_AGGREGATOR_ID,
					agg->aggregator_identifier))
				goto nla_put_failure;
			if (nla_put_u8(skb,
				       IFLA_BOND_SLAVE_AD_ACTOR_OPER_PORT_STATE,
				       ad_port->actor_oper_port_state))
				goto nla_put_failure;
			if (nla_put_u16(skb,
					IFLA_BOND_SLAVE_AD_PARTNER_OPER_PORT_STATE,
					ad_port->partner_oper.port_state))
				goto nla_put_failure;
		}
	}

	return 0;

nla_put_failure:
	return -EMSGSIZE;
}

/* Limit the max delay range to 300s */
static const struct netlink_range_validation delay_range = {
	.max = 300000,
};

static const struct nla_policy bond_policy[IFLA_BOND_MAX + 1] = {
	[IFLA_BOND_MODE] = { .type = NLA_U8 },
	[IFLA_BOND_ACTIVE_SLAVE] = { .type = NLA_U32 },
	[IFLA_BOND_MIIMON] = { .type = NLA_U32 },
	[IFLA_BOND_UPDELAY] = { .type = NLA_U32 },
	[IFLA_BOND_DOWNDELAY] = { .type = NLA_U32 },
	[IFLA_BOND_USE_CARRIER] = { .type = NLA_U8 },
	[IFLA_BOND_ARP_INTERVAL] = { .type = NLA_U32 },
	[IFLA_BOND_ARP_IP_TARGET] = { .type = NLA_NESTED },
	[IFLA_BOND_ARP_VALIDATE] = { .type = NLA_U32 },
	[IFLA_BOND_ARP_ALL_TARGETS] = { .type = NLA_U32 },
	[IFLA_BOND_PRIMARY] = { .type = NLA_U32 },
	[IFLA_BOND_PRIMARY_RESELECT] = { .type = NLA_U8 },
	[IFLA_BOND_FAIL_OVER_MAC] = { .type = NLA_U8 },
	[IFLA_BOND_XMIT_HASH_POLICY] = { .type = NLA_U8 },
	[IFLA_BOND_RESEND_IGMP] = { .type = NLA_U32 },
	[IFLA_BOND_NUM_PEER_NOTIF] = { .type = NLA_U8 },
	[IFLA_BOND_ALL_SLAVES_ACTIVE] = { .type = NLA_U8 },
	[IFLA_BOND_MIN_LINKS] = { .type = NLA_U32 },
	[IFLA_BOND_LP_INTERVAL] = { .type = NLA_U32 },
	[IFLA_BOND_PACKETS_PER_SLAVE] = { .type = NLA_U32 },
	[IFLA_BOND_AD_LACP_ACTIVE] = { .type = NLA_U8 },
	[IFLA_BOND_AD_LACP_RATE] = { .type = NLA_U8 },
	[IFLA_BOND_AD_SELECT] = { .type = NLA_U8 },
	[IFLA_BOND_AD_INFO] = { .type = NLA_NESTED },
	[IFLA_BOND_AD_ACTOR_SYS_PRIO] = { .type = NLA_U16 },
	[IFLA_BOND_AD_USER_PORT_KEY] = { .type = NLA_U16 },
	[IFLA_BOND_AD_ACTOR_SYSTEM] = { .type = NLA_BINARY,
					.len = ETH_ALEN },
	[IFLA_BOND_TLB_DYNAMIC_LB] = { .type = NLA_U8 },
	[IFLA_BOND_PEER_NOTIF_DELAY] = NLA_POLICY_FULL_RANGE(NLA_U32, &delay_range),
	[IFLA_BOND_MISSED_MAX] = { .type = NLA_U8 },
	[IFLA_BOND_NS_IP6_TARGET] = { .type = NLA_NESTED },
	[IFLA_BOND_COUPLED_CONTROL] = { .type = NLA_U8 },
};

static const struct nla_policy bond_slave_policy[IFLA_BOND_SLAVE_MAX + 1] = {
	[IFLA_BOND_SLAVE_QUEUE_ID] = { .type = NLA_U16 },
	[IFLA_BOND_SLAVE_PRIO] = { .type = NLA_S32 },
};

static int bond_validate(struct nlattr *tb[], struct nlattr *data[],
			 struct netlink_ext_ack *extack)
{
	if (tb[IFLA_ADDRESS]) {
		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
			return -EINVAL;
		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
			return -EADDRNOTAVAIL;
	}
	return 0;
}

static int bond_slave_changelink(struct net_device *bond_dev,
				 struct net_device *slave_dev,
				 struct nlattr *tb[], struct nlattr *data[],
				 struct netlink_ext_ack *extack)
{
	struct bonding *bond = netdev_priv(bond_dev);
	struct bond_opt_value newval;
	int err;

	if (!data)
		return 0;

	if (data[IFLA_BOND_SLAVE_QUEUE_ID]) {
		u16 queue_id = nla_get_u16(data[IFLA_BOND_SLAVE_QUEUE_ID]);
		char queue_id_str[IFNAMSIZ + 7];

		/* queue_id option setting expects slave_name:queue_id */
		snprintf(queue_id_str, sizeof(queue_id_str), "%s:%u\n",
			 slave_dev->name, queue_id);
		bond_opt_initstr(&newval, queue_id_str);
		err = __bond_opt_set(bond, BOND_OPT_QUEUE_ID, &newval,
				     data[IFLA_BOND_SLAVE_QUEUE_ID], extack);
		if (err)
			return err;
	}

	if (data[IFLA_BOND_SLAVE_PRIO]) {
		int prio = nla_get_s32(data[IFLA_BOND_SLAVE_PRIO]);

		bond_opt_slave_initval(&newval, &slave_dev, prio);
		err = __bond_opt_set(bond, BOND_OPT_PRIO, &newval,
				     data[IFLA_BOND_SLAVE_PRIO], extack);
		if (err)
			return err;
	}

	return 0;
}

static int bond_changelink(struct net_device *bond_dev, struct nlattr *tb[],
			   struct nlattr *data[],
			   struct netlink_ext_ack *extack)
{
	struct bonding *bond = netdev_priv(bond_dev);
	struct bond_opt_value newval;
	int miimon = 0;
	int err;

	if (!data)
		return 0;

	if (data[IFLA_BOND_MODE]) {
		int mode = nla_get_u8(data[IFLA_BOND_MODE]);

		bond_opt_initval(&newval, mode);
		err = __bond_opt_set(bond, BOND_OPT_MODE, &newval,
				     data[IFLA_BOND_MODE], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_ACTIVE_SLAVE]) {
		int ifindex = nla_get_u32(data[IFLA_BOND_ACTIVE_SLAVE]);
		struct net_device *slave_dev;
		char *active_slave = "";

		if (ifindex != 0) {
			slave_dev = __dev_get_by_index(dev_net(bond_dev),
						       ifindex);
			if (!slave_dev)
				return -ENODEV;
			active_slave = slave_dev->name;
		}
		bond_opt_initstr(&newval, active_slave);
		err = __bond_opt_set(bond, BOND_OPT_ACTIVE_SLAVE, &newval,
				     data[IFLA_BOND_ACTIVE_SLAVE], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_MIIMON]) {
		miimon = nla_get_u32(data[IFLA_BOND_MIIMON]);

		bond_opt_initval(&newval, miimon);
		err = __bond_opt_set(bond, BOND_OPT_MIIMON, &newval,
				     data[IFLA_BOND_MIIMON], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_UPDELAY]) {
		int updelay = nla_get_u32(data[IFLA_BOND_UPDELAY]);

		bond_opt_initval(&newval, updelay);
		err = __bond_opt_set(bond, BOND_OPT_UPDELAY, &newval,
				     data[IFLA_BOND_UPDELAY], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_DOWNDELAY]) {
		int downdelay = nla_get_u32(data[IFLA_BOND_DOWNDELAY]);

		bond_opt_initval(&newval, downdelay);
		err = __bond_opt_set(bond, BOND_OPT_DOWNDELAY, &newval,
				     data[IFLA_BOND_DOWNDELAY], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_PEER_NOTIF_DELAY]) {
		int delay = nla_get_u32(data[IFLA_BOND_PEER_NOTIF_DELAY]);

		bond_opt_initval(&newval, delay);
		err = __bond_opt_set(bond, BOND_OPT_PEER_NOTIF_DELAY, &newval,
				     data[IFLA_BOND_PEER_NOTIF_DELAY], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_USE_CARRIER]) {
		int use_carrier = nla_get_u8(data[IFLA_BOND_USE_CARRIER]);
		bond_opt_initval(&newval, use_carrier);
		err = __bond_opt_set(bond, BOND_OPT_USE_CARRIER, &newval,
				     data[IFLA_BOND_USE_CARRIER], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_ARP_INTERVAL]) {
		int arp_interval = nla_get_u32(data[IFLA_BOND_ARP_INTERVAL]);

		if (arp_interval && miimon) {
			NL_SET_ERR_MSG_ATTR(extack, data[IFLA_BOND_ARP_INTERVAL],
					    "ARP monitoring cannot be used with MII monitoring");
			return -EINVAL;
		}

		bond_opt_initval(&newval, arp_interval);
		err = __bond_opt_set(bond, BOND_OPT_ARP_INTERVAL, &newval,
				     data[IFLA_BOND_ARP_INTERVAL], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_ARP_IP_TARGET]) {
		struct nlattr *attr;
		int i = 0, rem;

		bond_option_arp_ip_targets_clear(bond);
		nla_for_each_nested(attr, data[IFLA_BOND_ARP_IP_TARGET], rem) {
			__be32 target;

			if (nla_len(attr) < sizeof(target))
				return -EINVAL;

			target = nla_get_be32(attr);

			bond_opt_initval(&newval, (__force u64)target);
			err = __bond_opt_set(bond, BOND_OPT_ARP_TARGETS,
					     &newval,
					     data[IFLA_BOND_ARP_IP_TARGET],
					     extack);
			if (err)
				break;
			i++;
		}
		if (i == 0 && bond->params.arp_interval)
			netdev_warn(bond->dev, "Removing last arp target with arp_interval on\n");
		if (err)
			return err;
	}
#if IS_ENABLED(CONFIG_IPV6)
	if (data[IFLA_BOND_NS_IP6_TARGET]) {
		struct nlattr *attr;
		int i = 0, rem;

		bond_option_ns_ip6_targets_clear(bond);
		nla_for_each_nested(attr, data[IFLA_BOND_NS_IP6_TARGET], rem) {
			struct in6_addr addr6;

			if (nla_len(attr) < sizeof(addr6)) {
				NL_SET_ERR_MSG(extack, "Invalid IPv6 address");
				return -EINVAL;
			}

			addr6 = nla_get_in6_addr(attr);

			bond_opt_initextra(&newval, &addr6, sizeof(addr6));
			err = __bond_opt_set(bond, BOND_OPT_NS_TARGETS,
					     &newval,
					     data[IFLA_BOND_NS_IP6_TARGET],
					     extack);
			if (err)
				break;
			i++;
		}
		if (i == 0 && bond->params.arp_interval)
			netdev_warn(bond->dev, "Removing last ns target with arp_interval on\n");
		if (err)
			return err;
	}
#endif
	if (data[IFLA_BOND_ARP_VALIDATE]) {
		int arp_validate = nla_get_u32(data[IFLA_BOND_ARP_VALIDATE]);

		if (arp_validate && miimon) {
			NL_SET_ERR_MSG_ATTR(extack, data[IFLA_BOND_ARP_INTERVAL],
					    "ARP validating cannot be used with MII monitoring");
			return -EINVAL;
		}

		bond_opt_initval(&newval, arp_validate);
		err = __bond_opt_set(bond, BOND_OPT_ARP_VALIDATE, &newval,
				     data[IFLA_BOND_ARP_VALIDATE], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_ARP_ALL_TARGETS]) {
		int arp_all_targets =
			nla_get_u32(data[IFLA_BOND_ARP_ALL_TARGETS]);

		bond_opt_initval(&newval, arp_all_targets);
		err = __bond_opt_set(bond, BOND_OPT_ARP_ALL_TARGETS, &newval,
				     data[IFLA_BOND_ARP_ALL_TARGETS], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_PRIMARY]) {
		int ifindex = nla_get_u32(data[IFLA_BOND_PRIMARY]);
		struct net_device *dev;
		char *primary = "";

		dev = __dev_get_by_index(dev_net(bond_dev), ifindex);
		if (dev)
			primary = dev->name;

		bond_opt_initstr(&newval, primary);
		err = __bond_opt_set(bond, BOND_OPT_PRIMARY, &newval,
				     data[IFLA_BOND_PRIMARY], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_PRIMARY_RESELECT]) {
		int primary_reselect =
			nla_get_u8(data[IFLA_BOND_PRIMARY_RESELECT]);

		bond_opt_initval(&newval, primary_reselect);
		err = __bond_opt_set(bond, BOND_OPT_PRIMARY_RESELECT, &newval,
				     data[IFLA_BOND_PRIMARY_RESELECT], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_FAIL_OVER_MAC]) {
		int fail_over_mac =
			nla_get_u8(data[IFLA_BOND_FAIL_OVER_MAC]);

		bond_opt_initval(&newval, fail_over_mac);
		err = __bond_opt_set(bond, BOND_OPT_FAIL_OVER_MAC, &newval,
				     data[IFLA_BOND_FAIL_OVER_MAC], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_XMIT_HASH_POLICY]) {
		int xmit_hash_policy =
			nla_get_u8(data[IFLA_BOND_XMIT_HASH_POLICY]);

		bond_opt_initval(&newval, xmit_hash_policy);
		err = __bond_opt_set(bond, BOND_OPT_XMIT_HASH, &newval,
				     data[IFLA_BOND_XMIT_HASH_POLICY], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_RESEND_IGMP]) {
		int resend_igmp =
			nla_get_u32(data[IFLA_BOND_RESEND_IGMP]);

		bond_opt_initval(&newval, resend_igmp);
		err = __bond_opt_set(bond, BOND_OPT_RESEND_IGMP, &newval,
				     data[IFLA_BOND_RESEND_IGMP], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_NUM_PEER_NOTIF]) {
		int num_peer_notif =
			nla_get_u8(data[IFLA_BOND_NUM_PEER_NOTIF]);

		bond_opt_initval(&newval, num_peer_notif);
		err = __bond_opt_set(bond, BOND_OPT_NUM_PEER_NOTIF, &newval,
				     data[IFLA_BOND_NUM_PEER_NOTIF], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_ALL_SLAVES_ACTIVE]) {
		int all_slaves_active =
			nla_get_u8(data[IFLA_BOND_ALL_SLAVES_ACTIVE]);

		bond_opt_initval(&newval, all_slaves_active);
		err = __bond_opt_set(bond, BOND_OPT_ALL_SLAVES_ACTIVE, &newval,
				     data[IFLA_BOND_ALL_SLAVES_ACTIVE], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_MIN_LINKS]) {
		int min_links =
			nla_get_u32(data[IFLA_BOND_MIN_LINKS]);

		bond_opt_initval(&newval, min_links);
		err = __bond_opt_set(bond, BOND_OPT_MINLINKS, &newval,
				     data[IFLA_BOND_MIN_LINKS], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_LP_INTERVAL]) {
		int lp_interval =
			nla_get_u32(data[IFLA_BOND_LP_INTERVAL]);

		bond_opt_initval(&newval, lp_interval);
		err = __bond_opt_set(bond, BOND_OPT_LP_INTERVAL, &newval,
				     data[IFLA_BOND_LP_INTERVAL], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_PACKETS_PER_SLAVE]) {
		int packets_per_slave =
			nla_get_u32(data[IFLA_BOND_PACKETS_PER_SLAVE]);

		bond_opt_initval(&newval, packets_per_slave);
		err = __bond_opt_set(bond, BOND_OPT_PACKETS_PER_SLAVE, &newval,
				     data[IFLA_BOND_PACKETS_PER_SLAVE], extack);
		if (err)
			return err;
	}

	if (data[IFLA_BOND_AD_LACP_ACTIVE]) {
		int lacp_active = nla_get_u8(data[IFLA_BOND_AD_LACP_ACTIVE]);

		bond_opt_initval(&newval, lacp_active);
		err = __bond_opt_set(bond, BOND_OPT_LACP_ACTIVE, &newval,
				     data[IFLA_BOND_AD_LACP_ACTIVE], extack);
		if (err)
			return err;
	}

	if (data[IFLA_BOND_AD_LACP_RATE]) {
		int lacp_rate =
			nla_get_u8(data[IFLA_BOND_AD_LACP_RATE]);

		bond_opt_initval(&newval, lacp_rate);
		err = __bond_opt_set(bond, BOND_OPT_LACP_RATE, &newval,
				     data[IFLA_BOND_AD_LACP_RATE], extack);
		if (err)
			return err;
	}
	if (data[IFLA_BOND_AD_SELECT]) {
		int ad_select =
			nla_get_u8(data[IFLA_BOND_AD_SELECT]);

		bond_opt_initval(&newval, ad_select);
		err =
__bond_opt_set(bond, BOND_OPT_AD_SELECT, &newval, data[IFLA_BOND_AD_SELECT], extack); if (err) return err; } if (data[IFLA_BOND_AD_ACTOR_SYS_PRIO]) { int actor_sys_prio = nla_get_u16(data[IFLA_BOND_AD_ACTOR_SYS_PRIO]); bond_opt_initval(&newval, actor_sys_prio); err = __bond_opt_set(bond, BOND_OPT_AD_ACTOR_SYS_PRIO, &newval, data[IFLA_BOND_AD_ACTOR_SYS_PRIO], extack); if (err) return err; } if (data[IFLA_BOND_AD_USER_PORT_KEY]) { int port_key = nla_get_u16(data[IFLA_BOND_AD_USER_PORT_KEY]); bond_opt_initval(&newval, port_key); err = __bond_opt_set(bond, BOND_OPT_AD_USER_PORT_KEY, &newval, data[IFLA_BOND_AD_USER_PORT_KEY], extack); if (err) return err; } if (data[IFLA_BOND_AD_ACTOR_SYSTEM]) { if (nla_len(data[IFLA_BOND_AD_ACTOR_SYSTEM]) != ETH_ALEN) return -EINVAL; bond_opt_initval(&newval, nla_get_u64(data[IFLA_BOND_AD_ACTOR_SYSTEM])); err = __bond_opt_set(bond, BOND_OPT_AD_ACTOR_SYSTEM, &newval, data[IFLA_BOND_AD_ACTOR_SYSTEM], extack); if (err) return err; } if (data[IFLA_BOND_TLB_DYNAMIC_LB]) { int dynamic_lb = nla_get_u8(data[IFLA_BOND_TLB_DYNAMIC_LB]); bond_opt_initval(&newval, dynamic_lb); err = __bond_opt_set(bond, BOND_OPT_TLB_DYNAMIC_LB, &newval, data[IFLA_BOND_TLB_DYNAMIC_LB], extack); if (err) return err; } if (data[IFLA_BOND_MISSED_MAX]) { int missed_max = nla_get_u8(data[IFLA_BOND_MISSED_MAX]); bond_opt_initval(&newval, missed_max); err = __bond_opt_set(bond, BOND_OPT_MISSED_MAX, &newval, data[IFLA_BOND_MISSED_MAX], extack); if (err) return err; } if (data[IFLA_BOND_COUPLED_CONTROL]) { int coupled_control = nla_get_u8(data[IFLA_BOND_COUPLED_CONTROL]); bond_opt_initval(&newval, coupled_control); err = __bond_opt_set(bond, BOND_OPT_COUPLED_CONTROL, &newval, data[IFLA_BOND_COUPLED_CONTROL], extack); if (err) return err; } return 0; } static int bond_newlink(struct net_device *bond_dev, struct rtnl_newlink_params *params, struct netlink_ext_ack *extack) { struct nlattr **data = params->data; struct nlattr **tb = params->tb; int err; err = 
bond_changelink(bond_dev, tb, data, extack); if (err < 0) return err; err = register_netdevice(bond_dev); if (!err) { struct bonding *bond = netdev_priv(bond_dev); netif_carrier_off(bond_dev); bond_work_init_all(bond); } return err; } static size_t bond_get_size(const struct net_device *bond_dev) { return nla_total_size(sizeof(u8)) + /* IFLA_BOND_MODE */ nla_total_size(sizeof(u32)) + /* IFLA_BOND_ACTIVE_SLAVE */ nla_total_size(sizeof(u32)) + /* IFLA_BOND_MIIMON */ nla_total_size(sizeof(u32)) + /* IFLA_BOND_UPDELAY */ nla_total_size(sizeof(u32)) + /* IFLA_BOND_DOWNDELAY */ nla_total_size(sizeof(u8)) + /* IFLA_BOND_USE_CARRIER */ nla_total_size(sizeof(u32)) + /* IFLA_BOND_ARP_INTERVAL */ /* IFLA_BOND_ARP_IP_TARGET */ nla_total_size(sizeof(struct nlattr)) + nla_total_size(sizeof(u32)) * BOND_MAX_ARP_TARGETS + nla_total_size(sizeof(u32)) + /* IFLA_BOND_ARP_VALIDATE */ nla_total_size(sizeof(u32)) + /* IFLA_BOND_ARP_ALL_TARGETS */ nla_total_size(sizeof(u32)) + /* IFLA_BOND_PRIMARY */ nla_total_size(sizeof(u8)) + /* IFLA_BOND_PRIMARY_RESELECT */ nla_total_size(sizeof(u8)) + /* IFLA_BOND_FAIL_OVER_MAC */ nla_total_size(sizeof(u8)) + /* IFLA_BOND_XMIT_HASH_POLICY */ nla_total_size(sizeof(u32)) + /* IFLA_BOND_RESEND_IGMP */ nla_total_size(sizeof(u8)) + /* IFLA_BOND_NUM_PEER_NOTIF */ nla_total_size(sizeof(u8)) + /* IFLA_BOND_ALL_SLAVES_ACTIVE */ nla_total_size(sizeof(u32)) + /* IFLA_BOND_MIN_LINKS */ nla_total_size(sizeof(u32)) + /* IFLA_BOND_LP_INTERVAL */ nla_total_size(sizeof(u32)) + /* IFLA_BOND_PACKETS_PER_SLAVE */ nla_total_size(sizeof(u8)) + /* IFLA_BOND_AD_LACP_ACTIVE */ nla_total_size(sizeof(u8)) + /* IFLA_BOND_AD_LACP_RATE */ nla_total_size(sizeof(u8)) + /* IFLA_BOND_AD_SELECT */ nla_total_size(sizeof(struct nlattr)) + /* IFLA_BOND_AD_INFO */ nla_total_size(sizeof(u16)) + /* IFLA_BOND_AD_INFO_AGGREGATOR */ nla_total_size(sizeof(u16)) + /* IFLA_BOND_AD_INFO_NUM_PORTS */ nla_total_size(sizeof(u16)) + /* IFLA_BOND_AD_INFO_ACTOR_KEY */ nla_total_size(sizeof(u16)) + /* 
IFLA_BOND_AD_INFO_PARTNER_KEY*/ nla_total_size(ETH_ALEN) + /* IFLA_BOND_AD_INFO_PARTNER_MAC*/ nla_total_size(sizeof(u16)) + /* IFLA_BOND_AD_ACTOR_SYS_PRIO */ nla_total_size(sizeof(u16)) + /* IFLA_BOND_AD_USER_PORT_KEY */ nla_total_size(ETH_ALEN) + /* IFLA_BOND_AD_ACTOR_SYSTEM */ nla_total_size(sizeof(u8)) + /* IFLA_BOND_TLB_DYNAMIC_LB */ nla_total_size(sizeof(u32)) + /* IFLA_BOND_PEER_NOTIF_DELAY */ nla_total_size(sizeof(u8)) + /* IFLA_BOND_MISSED_MAX */ /* IFLA_BOND_NS_IP6_TARGET */ nla_total_size(sizeof(struct nlattr)) + nla_total_size(sizeof(struct in6_addr)) * BOND_MAX_NS_TARGETS + nla_total_size(sizeof(u8)) + /* IFLA_BOND_COUPLED_CONTROL */ 0; } static int bond_option_active_slave_get_ifindex(struct bonding *bond) { const struct net_device *slave; int ifindex; rcu_read_lock(); slave = bond_option_active_slave_get_rcu(bond); ifindex = slave ? slave->ifindex : 0; rcu_read_unlock(); return ifindex; } static int bond_fill_info(struct sk_buff *skb, const struct net_device *bond_dev) { struct bonding *bond = netdev_priv(bond_dev); unsigned int packets_per_slave; int ifindex, i, targets_added; struct nlattr *targets; struct slave *primary; if (nla_put_u8(skb, IFLA_BOND_MODE, BOND_MODE(bond))) goto nla_put_failure; ifindex = bond_option_active_slave_get_ifindex(bond); if (ifindex && nla_put_u32(skb, IFLA_BOND_ACTIVE_SLAVE, ifindex)) goto nla_put_failure; if (nla_put_u32(skb, IFLA_BOND_MIIMON, bond->params.miimon)) goto nla_put_failure; if (nla_put_u32(skb, IFLA_BOND_UPDELAY, bond->params.updelay * bond->params.miimon)) goto nla_put_failure; if (nla_put_u32(skb, IFLA_BOND_DOWNDELAY, bond->params.downdelay * bond->params.miimon)) goto nla_put_failure; if (nla_put_u32(skb, IFLA_BOND_PEER_NOTIF_DELAY, bond->params.peer_notif_delay * bond->params.miimon)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_BOND_USE_CARRIER, bond->params.use_carrier)) goto nla_put_failure; if (nla_put_u32(skb, IFLA_BOND_ARP_INTERVAL, bond->params.arp_interval)) goto nla_put_failure; targets = 
nla_nest_start_noflag(skb, IFLA_BOND_ARP_IP_TARGET); if (!targets) goto nla_put_failure; targets_added = 0; for (i = 0; i < BOND_MAX_ARP_TARGETS; i++) { if (bond->params.arp_targets[i]) { if (nla_put_be32(skb, i, bond->params.arp_targets[i])) goto nla_put_failure; targets_added = 1; } } if (targets_added) nla_nest_end(skb, targets); else nla_nest_cancel(skb, targets); if (nla_put_u32(skb, IFLA_BOND_ARP_VALIDATE, bond->params.arp_validate)) goto nla_put_failure; if (nla_put_u32(skb, IFLA_BOND_ARP_ALL_TARGETS, bond->params.arp_all_targets)) goto nla_put_failure; #if IS_ENABLED(CONFIG_IPV6) targets = nla_nest_start(skb, IFLA_BOND_NS_IP6_TARGET); if (!targets) goto nla_put_failure; targets_added = 0; for (i = 0; i < BOND_MAX_NS_TARGETS; i++) { if (!ipv6_addr_any(&bond->params.ns_targets[i])) { if (nla_put_in6_addr(skb, i, &bond->params.ns_targets[i])) goto nla_put_failure; targets_added = 1; } } if (targets_added) nla_nest_end(skb, targets); else nla_nest_cancel(skb, targets); #endif primary = rtnl_dereference(bond->primary_slave); if (primary && nla_put_u32(skb, IFLA_BOND_PRIMARY, primary->dev->ifindex)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_BOND_PRIMARY_RESELECT, bond->params.primary_reselect)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_BOND_FAIL_OVER_MAC, bond->params.fail_over_mac)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_BOND_XMIT_HASH_POLICY, bond->params.xmit_policy)) goto nla_put_failure; if (nla_put_u32(skb, IFLA_BOND_RESEND_IGMP, bond->params.resend_igmp)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_BOND_NUM_PEER_NOTIF, bond->params.num_peer_notif)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_BOND_ALL_SLAVES_ACTIVE, bond->params.all_slaves_active)) goto nla_put_failure; if (nla_put_u32(skb, IFLA_BOND_MIN_LINKS, bond->params.min_links)) goto nla_put_failure; if (nla_put_u32(skb, IFLA_BOND_LP_INTERVAL, bond->params.lp_interval)) goto nla_put_failure; packets_per_slave = bond->params.packets_per_slave; if (nla_put_u32(skb, 
IFLA_BOND_PACKETS_PER_SLAVE, packets_per_slave)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_BOND_AD_LACP_ACTIVE, bond->params.lacp_active)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_BOND_AD_LACP_RATE, bond->params.lacp_fast)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_BOND_AD_SELECT, bond->params.ad_select)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_BOND_TLB_DYNAMIC_LB, bond->params.tlb_dynamic_lb)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_BOND_MISSED_MAX, bond->params.missed_max)) goto nla_put_failure; if (nla_put_u8(skb, IFLA_BOND_COUPLED_CONTROL, bond->params.coupled_control)) goto nla_put_failure; if (BOND_MODE(bond) == BOND_MODE_8023AD) { struct ad_info info; if (capable(CAP_NET_ADMIN)) { if (nla_put_u16(skb, IFLA_BOND_AD_ACTOR_SYS_PRIO, bond->params.ad_actor_sys_prio)) goto nla_put_failure; if (nla_put_u16(skb, IFLA_BOND_AD_USER_PORT_KEY, bond->params.ad_user_port_key)) goto nla_put_failure; if (nla_put(skb, IFLA_BOND_AD_ACTOR_SYSTEM, ETH_ALEN, &bond->params.ad_actor_system)) goto nla_put_failure; } if (!bond_3ad_get_active_agg_info(bond, &info)) { struct nlattr *nest; nest = nla_nest_start_noflag(skb, IFLA_BOND_AD_INFO); if (!nest) goto nla_put_failure; if (nla_put_u16(skb, IFLA_BOND_AD_INFO_AGGREGATOR, info.aggregator_id)) goto nla_put_failure; if (nla_put_u16(skb, IFLA_BOND_AD_INFO_NUM_PORTS, info.ports)) goto nla_put_failure; if (nla_put_u16(skb, IFLA_BOND_AD_INFO_ACTOR_KEY, info.actor_key)) goto nla_put_failure; if (nla_put_u16(skb, IFLA_BOND_AD_INFO_PARTNER_KEY, info.partner_key)) goto nla_put_failure; if (nla_put(skb, IFLA_BOND_AD_INFO_PARTNER_MAC, sizeof(info.partner_system), &info.partner_system)) goto nla_put_failure; nla_nest_end(skb, nest); } } return 0; nla_put_failure: return -EMSGSIZE; } static size_t bond_get_linkxstats_size(const struct net_device *dev, int attr) { switch (attr) { case IFLA_STATS_LINK_XSTATS: case IFLA_STATS_LINK_XSTATS_SLAVE: break; default: return 0; } return bond_3ad_stats_size() + 
		nla_total_size(0);
}

static int bond_fill_linkxstats(struct sk_buff *skb,
				const struct net_device *dev,
				int *prividx, int attr)
{
	struct nlattr *nla __maybe_unused;
	struct slave *slave = NULL;
	struct nlattr *nest, *nest2;
	struct bonding *bond;

	switch (attr) {
	case IFLA_STATS_LINK_XSTATS:
		bond = netdev_priv(dev);
		break;
	case IFLA_STATS_LINK_XSTATS_SLAVE:
		slave = bond_slave_get_rtnl(dev);
		if (!slave)
			return 0;
		bond = slave->bond;
		break;
	default:
		return -EINVAL;
	}

	nest = nla_nest_start_noflag(skb, LINK_XSTATS_TYPE_BOND);
	if (!nest)
		return -EMSGSIZE;
	if (BOND_MODE(bond) == BOND_MODE_8023AD) {
		struct bond_3ad_stats *stats;

		if (slave)
			stats = &SLAVE_AD_INFO(slave)->stats;
		else
			stats = &BOND_AD_INFO(bond).stats;

		nest2 = nla_nest_start_noflag(skb, BOND_XSTATS_3AD);
		if (!nest2) {
			nla_nest_end(skb, nest);
			return -EMSGSIZE;
		}

		if (bond_3ad_stats_fill(skb, stats)) {
			nla_nest_cancel(skb, nest2);
			nla_nest_end(skb, nest);
			return -EMSGSIZE;
		}
		nla_nest_end(skb, nest2);
	}
	nla_nest_end(skb, nest);

	return 0;
}

struct rtnl_link_ops bond_link_ops __read_mostly = {
	.kind			= "bond",
	.priv_size		= sizeof(struct bonding),
	.setup			= bond_setup,
	.maxtype		= IFLA_BOND_MAX,
	.policy			= bond_policy,
	.validate		= bond_validate,
	.newlink		= bond_newlink,
	.changelink		= bond_changelink,
	.get_size		= bond_get_size,
	.fill_info		= bond_fill_info,
	.get_num_tx_queues	= bond_get_num_tx_queues,
	.get_num_rx_queues	= bond_get_num_tx_queues, /* Use the same number as for TX queues */
	.fill_linkxstats	= bond_fill_linkxstats,
	.get_linkxstats_size	= bond_get_linkxstats_size,
	.slave_maxtype		= IFLA_BOND_SLAVE_MAX,
	.slave_policy		= bond_slave_policy,
	.slave_changelink	= bond_slave_changelink,
	.get_slave_size		= bond_get_slave_size,
	.fill_slave_info	= bond_fill_slave_info,
};

int __init bond_netlink_init(void)
{
	return rtnl_link_register(&bond_link_ops);
}

void bond_netlink_fini(void)
{
	rtnl_link_unregister(&bond_link_ops);
}

MODULE_ALIAS_RTNL_LINK("bond");
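The size accounting in bond_get_size() sums one nla_total_size() term per attribute, plus a nest header for the nested target lists. The arithmetic behind those terms can be sketched in userspace as follows. This is an illustration only, assuming the usual 4-byte netlink attribute alignment and a 4-byte attribute header; every `sketch_`/`SKETCH_` name is hypothetical and not part of the kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed constants: attributes are padded to 4 bytes and carry a
 * 4-byte header (16-bit length followed by 16-bit type). */
#define SKETCH_NLA_ALIGNTO 4U
#define SKETCH_NLA_HDRLEN  4U

/* Round len up to the next multiple of SKETCH_NLA_ALIGNTO. */
static inline size_t sketch_nla_align(size_t len)
{
	return (len + SKETCH_NLA_ALIGNTO - 1) &
	       ~(size_t)(SKETCH_NLA_ALIGNTO - 1);
}

/* Header plus payload, padded: the space one attribute occupies. */
static inline size_t sketch_nla_total_size(size_t payload)
{
	return sketch_nla_align(SKETCH_NLA_HDRLEN + payload);
}

/* Hypothetical helper: space for a nest holding `count` entries of
 * `payload` bytes each, following the "nest header + N entries"
 * pattern used for IFLA_BOND_ARP_IP_TARGET in bond_get_size(). */
static inline size_t sketch_nested_targets_size(size_t count, size_t payload)
{
	return sketch_nla_total_size(SKETCH_NLA_HDRLEN) +
	       count * sketch_nla_total_size(payload);
}

/* Spot checks of the padding behaviour for common payload sizes. */
static inline void sketch_self_check(void)
{
	assert(sketch_nla_total_size(1) == 8);   /* u8: 5 bytes padded to 8 */
	assert(sketch_nla_total_size(4) == 8);   /* u32: exactly 8 */
	assert(sketch_nla_total_size(6) == 12);  /* 6-byte MAC: 10 padded to 12 */
	assert(sketch_nla_total_size(16) == 20); /* 16-byte IPv6 address */
}
```

Under these assumptions a one-byte attribute costs as much message space as a four-byte one, which is why the u8 and u32 terms in bond_get_size() contribute the same amount to the estimate.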
/* SPDX-License-Identifier: GPL-2.0-or-later */
/* SCTP kernel implementation
 * (C) Copyright IBM Corp. 2001, 2004
 * Copyright (c) 1999-2000 Cisco, Inc.
 * Copyright (c) 1999-2001 Motorola, Inc.
 * Copyright (c) 2001-2003 Intel Corp.
 *
 * This file is part of the SCTP kernel implementation
 *
 * The base lksctp header.
 *
 * Please send any bug reports or fixes you make to the
 * email address(es):
 *    lksctp developers <linux-sctp@vger.kernel.org>
 *
 * Written or modified by:
 *    La Monte H.P. Yarroll <piggy@acm.org>
 *    Xingang Guo           <xingang.guo@intel.com>
 *    Jon Grimm             <jgrimm@us.ibm.com>
 *    Daisy Chang           <daisyc@us.ibm.com>
 *    Sridhar Samudrala     <sri@us.ibm.com>
 *    Ardelle Fan           <ardelle.fan@intel.com>
 *    Ryan Layer            <rmlayer@us.ibm.com>
 *    Kevin Gao             <kevin.gao@intel.com>
 */

#ifndef __net_sctp_h__
#define __net_sctp_h__

/* Header Strategy.
 *    Start getting some control over the header file dependencies:
 *       includes
 *       constants
 *       structs
 *       prototypes
 *       macros, externs, and inlines
 *
 *   Move test_frame specific items out of the kernel headers
 *   and into the test frame headers.   This is not perfect in any sense
 *   and will continue to evolve.
*/ #include <linux/types.h> #include <linux/slab.h> #include <linux/in.h> #include <linux/tty.h> #include <linux/proc_fs.h> #include <linux/spinlock.h> #include <linux/jiffies.h> #include <linux/idr.h> #if IS_ENABLED(CONFIG_IPV6) #include <net/ipv6.h> #include <net/ip6_route.h> #endif #include <linux/uaccess.h> #include <asm/page.h> #include <net/sock.h> #include <net/snmp.h> #include <net/sctp/structs.h> #include <net/sctp/constants.h> #ifdef CONFIG_IP_SCTP_MODULE #define SCTP_PROTOSW_FLAG 0 #else /* static! */ #define SCTP_PROTOSW_FLAG INET_PROTOSW_PERMANENT #endif /* * Function declarations. */ /* * sctp/protocol.c */ int sctp_copy_local_addr_list(struct net *net, struct sctp_bind_addr *addr, enum sctp_scope, gfp_t gfp, int flags); struct sctp_pf *sctp_get_pf_specific(sa_family_t family); int sctp_register_pf(struct sctp_pf *, sa_family_t); void sctp_addr_wq_mgmt(struct net *, struct sctp_sockaddr_entry *, int); int sctp_udp_sock_start(struct net *net); void sctp_udp_sock_stop(struct net *net); /* * sctp/socket.c */ int sctp_inet_connect(struct socket *sock, struct sockaddr *uaddr, int addr_len, int flags); int sctp_backlog_rcv(struct sock *sk, struct sk_buff *skb); int sctp_inet_listen(struct socket *sock, int backlog); void sctp_write_space(struct sock *sk); void sctp_data_ready(struct sock *sk); __poll_t sctp_poll(struct file *file, struct socket *sock, poll_table *wait); void sctp_sock_rfree(struct sk_buff *skb); void sctp_copy_sock(struct sock *newsk, struct sock *sk, struct sctp_association *asoc); extern struct percpu_counter sctp_sockets_allocated; int sctp_asconf_mgmt(struct sctp_sock *, struct sctp_sockaddr_entry *); struct sk_buff *sctp_skb_recv_datagram(struct sock *, int, int *); typedef int (*sctp_callback_t)(struct sctp_endpoint *, struct sctp_transport *, void *); void sctp_transport_walk_start(struct rhashtable_iter *iter); void sctp_transport_walk_stop(struct rhashtable_iter *iter); struct sctp_transport *sctp_transport_get_next(struct net 
*net, struct rhashtable_iter *iter); struct sctp_transport *sctp_transport_get_idx(struct net *net, struct rhashtable_iter *iter, int pos); int sctp_transport_lookup_process(sctp_callback_t cb, struct net *net, const union sctp_addr *laddr, const union sctp_addr *paddr, void *p, int dif); int sctp_transport_traverse_process(sctp_callback_t cb, sctp_callback_t cb_done, struct net *net, int *pos, void *p); int sctp_for_each_endpoint(int (*cb)(struct sctp_endpoint *, void *), void *p); int sctp_get_sctp_info(struct sock *sk, struct sctp_association *asoc, struct sctp_info *info); /* * sctp/primitive.c */ int sctp_primitive_ASSOCIATE(struct net *, struct sctp_association *, void *arg); int sctp_primitive_SHUTDOWN(struct net *, struct sctp_association *, void *arg); int sctp_primitive_ABORT(struct net *, struct sctp_association *, void *arg); int sctp_primitive_SEND(struct net *, struct sctp_association *, void *arg); int sctp_primitive_REQUESTHEARTBEAT(struct net *, struct sctp_association *, void *arg); int sctp_primitive_ASCONF(struct net *, struct sctp_association *, void *arg); int sctp_primitive_RECONF(struct net *net, struct sctp_association *asoc, void *arg); /* * sctp/input.c */ int sctp_rcv(struct sk_buff *skb); int sctp_v4_err(struct sk_buff *skb, u32 info); int sctp_hash_endpoint(struct sctp_endpoint *ep); void sctp_unhash_endpoint(struct sctp_endpoint *); struct sock *sctp_err_lookup(struct net *net, int family, struct sk_buff *, struct sctphdr *, struct sctp_association **, struct sctp_transport **); void sctp_err_finish(struct sock *, struct sctp_transport *); int sctp_udp_v4_err(struct sock *sk, struct sk_buff *skb); int sctp_udp_v6_err(struct sock *sk, struct sk_buff *skb); void sctp_icmp_frag_needed(struct sock *, struct sctp_association *, struct sctp_transport *t, __u32 pmtu); void sctp_icmp_redirect(struct sock *, struct sctp_transport *, struct sk_buff *); void sctp_icmp_proto_unreachable(struct sock *sk, struct sctp_association *asoc, struct 
sctp_transport *t); int sctp_transport_hashtable_init(void); void sctp_transport_hashtable_destroy(void); int sctp_hash_transport(struct sctp_transport *t); void sctp_unhash_transport(struct sctp_transport *t); struct sctp_transport *sctp_addrs_lookup_transport( struct net *net, const union sctp_addr *laddr, const union sctp_addr *paddr, int dif, int sdif); struct sctp_transport *sctp_epaddr_lookup_transport( const struct sctp_endpoint *ep, const union sctp_addr *paddr); bool sctp_sk_bound_dev_eq(struct net *net, int bound_dev_if, int dif, int sdif); /* * sctp/proc.c */ int __net_init sctp_proc_init(struct net *net); /* * sctp/offload.c */ int sctp_offload_init(void); /* * sctp/stream_sched.c */ void sctp_sched_ops_init(void); /* * sctp/stream.c */ int sctp_send_reset_streams(struct sctp_association *asoc, struct sctp_reset_streams *params); int sctp_send_reset_assoc(struct sctp_association *asoc); int sctp_send_add_streams(struct sctp_association *asoc, struct sctp_add_streams *params); /* * Module global variables */ /* * sctp/protocol.c */ extern struct kmem_cache *sctp_chunk_cachep __read_mostly; extern struct kmem_cache *sctp_bucket_cachep __read_mostly; extern long sysctl_sctp_mem[3]; extern int sysctl_sctp_rmem[3]; extern int sysctl_sctp_wmem[3]; /* * Section: Macros, externs, and inlines */ /* SCTP SNMP MIB stats handlers */ #define SCTP_INC_STATS(net, field) SNMP_INC_STATS((net)->sctp.sctp_statistics, field) #define __SCTP_INC_STATS(net, field) __SNMP_INC_STATS((net)->sctp.sctp_statistics, field) #define SCTP_DEC_STATS(net, field) SNMP_DEC_STATS((net)->sctp.sctp_statistics, field) /* sctp mib definitions */ enum { SCTP_MIB_NUM = 0, SCTP_MIB_CURRESTAB, /* CurrEstab */ SCTP_MIB_ACTIVEESTABS, /* ActiveEstabs */ SCTP_MIB_PASSIVEESTABS, /* PassiveEstabs */ SCTP_MIB_ABORTEDS, /* Aborteds */ SCTP_MIB_SHUTDOWNS, /* Shutdowns */ SCTP_MIB_OUTOFBLUES, /* OutOfBlues */ SCTP_MIB_CHECKSUMERRORS, /* ChecksumErrors */ SCTP_MIB_OUTCTRLCHUNKS, /* OutCtrlChunks */ 
SCTP_MIB_OUTORDERCHUNKS, /* OutOrderChunks */ SCTP_MIB_OUTUNORDERCHUNKS, /* OutUnorderChunks */ SCTP_MIB_INCTRLCHUNKS, /* InCtrlChunks */ SCTP_MIB_INORDERCHUNKS, /* InOrderChunks */ SCTP_MIB_INUNORDERCHUNKS, /* InUnorderChunks */ SCTP_MIB_FRAGUSRMSGS, /* FragUsrMsgs */ SCTP_MIB_REASMUSRMSGS, /* ReasmUsrMsgs */ SCTP_MIB_OUTSCTPPACKS, /* OutSCTPPacks */ SCTP_MIB_INSCTPPACKS, /* InSCTPPacks */ SCTP_MIB_T1_INIT_EXPIREDS, SCTP_MIB_T1_COOKIE_EXPIREDS, SCTP_MIB_T2_SHUTDOWN_EXPIREDS, SCTP_MIB_T3_RTX_EXPIREDS, SCTP_MIB_T4_RTO_EXPIREDS, SCTP_MIB_T5_SHUTDOWN_GUARD_EXPIREDS, SCTP_MIB_DELAY_SACK_EXPIREDS, SCTP_MIB_AUTOCLOSE_EXPIREDS, SCTP_MIB_T1_RETRANSMITS, SCTP_MIB_T3_RETRANSMITS, SCTP_MIB_PMTUD_RETRANSMITS, SCTP_MIB_FAST_RETRANSMITS, SCTP_MIB_IN_PKT_SOFTIRQ, SCTP_MIB_IN_PKT_BACKLOG, SCTP_MIB_IN_PKT_DISCARDS, SCTP_MIB_IN_DATA_CHUNK_DISCARDS, __SCTP_MIB_MAX }; #define SCTP_MIB_MAX __SCTP_MIB_MAX struct sctp_mib { unsigned long mibs[SCTP_MIB_MAX]; }; /* helper function to track stats about max rto and related transport */ static inline void sctp_max_rto(struct sctp_association *asoc, struct sctp_transport *trans) { if (asoc->stats.max_obs_rto < (__u64)trans->rto) { asoc->stats.max_obs_rto = trans->rto; memset(&asoc->stats.obs_rto_ipaddr, 0, sizeof(struct sockaddr_storage)); memcpy(&asoc->stats.obs_rto_ipaddr, &trans->ipaddr, trans->af_specific->sockaddr_len); } } /* * Macros for keeping a global reference of object allocations. */ #ifdef CONFIG_SCTP_DBG_OBJCNT extern atomic_t sctp_dbg_objcnt_sock; extern atomic_t sctp_dbg_objcnt_ep; extern atomic_t sctp_dbg_objcnt_assoc; extern atomic_t sctp_dbg_objcnt_transport; extern atomic_t sctp_dbg_objcnt_chunk; extern atomic_t sctp_dbg_objcnt_bind_addr; extern atomic_t sctp_dbg_objcnt_bind_bucket; extern atomic_t sctp_dbg_objcnt_addr; extern atomic_t sctp_dbg_objcnt_datamsg; extern atomic_t sctp_dbg_objcnt_keys; /* Macros to atomically increment/decrement objcnt counters. 
*/ #define SCTP_DBG_OBJCNT_INC(name) \ atomic_inc(&sctp_dbg_objcnt_## name) #define SCTP_DBG_OBJCNT_DEC(name) \ atomic_dec(&sctp_dbg_objcnt_## name) #define SCTP_DBG_OBJCNT(name) \ atomic_t sctp_dbg_objcnt_## name = ATOMIC_INIT(0) /* Macro to help create new entries in the global array of * objcnt counters. */ #define SCTP_DBG_OBJCNT_ENTRY(name) \ {.label= #name, .counter= &sctp_dbg_objcnt_## name} void sctp_dbg_objcnt_init(struct net *); #else #define SCTP_DBG_OBJCNT_INC(name) #define SCTP_DBG_OBJCNT_DEC(name) static inline void sctp_dbg_objcnt_init(struct net *net) { return; } #endif /* CONFIG_SCTP_DBG_OBJCOUNT */ #if defined CONFIG_SYSCTL void sctp_sysctl_register(void); void sctp_sysctl_unregister(void); int sctp_sysctl_net_register(struct net *net); void sctp_sysctl_net_unregister(struct net *net); #else static inline void sctp_sysctl_register(void) { return; } static inline void sctp_sysctl_unregister(void) { return; } static inline int sctp_sysctl_net_register(struct net *net) { return 0; } static inline void sctp_sysctl_net_unregister(struct net *net) { return; } #endif /* Size of Supported Address Parameter for 'x' address types. */ #define SCTP_SAT_LEN(x) (sizeof(struct sctp_paramhdr) + (x) * sizeof(__u16)) #if IS_ENABLED(CONFIG_IPV6) void sctp_v6_pf_init(void); void sctp_v6_pf_exit(void); int sctp_v6_protosw_init(void); void sctp_v6_protosw_exit(void); int sctp_v6_add_protocol(void); void sctp_v6_del_protocol(void); #else /* #ifdef defined(CONFIG_IPV6) */ static inline void sctp_v6_pf_init(void) { return; } static inline void sctp_v6_pf_exit(void) { return; } static inline int sctp_v6_protosw_init(void) { return 0; } static inline void sctp_v6_protosw_exit(void) { return; } static inline int sctp_v6_add_protocol(void) { return 0; } static inline void sctp_v6_del_protocol(void) { return; } #endif /* #if defined(CONFIG_IPV6) */ /* Map an association to an assoc_id. 
*/ static inline sctp_assoc_t sctp_assoc2id(const struct sctp_association *asoc) { return asoc ? asoc->assoc_id : 0; } static inline enum sctp_sstat_state sctp_assoc_to_state(const struct sctp_association *asoc) { /* SCTP's uapi always had SCTP_EMPTY(=0) as a dummy state, but we * got rid of it in kernel space. Therefore SCTP_CLOSED et al * start at =1 in user space, but actually as =0 in kernel space. * Now that we can not break user space and SCTP_EMPTY is exposed * there, we need to fix it up with an ugly offset not to break * applications. :( */ return asoc->state + 1; } /* Look up the association by its id. */ struct sctp_association *sctp_id2assoc(struct sock *sk, sctp_assoc_t id); int sctp_do_peeloff(struct sock *sk, sctp_assoc_t id, struct socket **sockp); /* A macro to walk a list of skbs. */ #define sctp_skb_for_each(pos, head, tmp) \ skb_queue_walk_safe(head, pos, tmp) /** * sctp_list_dequeue - remove from the head of the queue * @list: list to dequeue from * * Remove the head of the list. The head item is * returned or %NULL if the list is empty. */ static inline struct list_head *sctp_list_dequeue(struct list_head *list) { struct list_head *result = NULL; if (!list_empty(list)) { result = list->next; list_del_init(result); } return result; } /* SCTP version of skb_set_owner_r. We need this one because * of the way we have to do receive buffer accounting on bundled * chunks. */ static inline void sctp_skb_set_owner_r(struct sk_buff *skb, struct sock *sk) { struct sctp_ulpevent *event = sctp_skb2event(skb); skb_orphan(skb); skb->sk = sk; skb->destructor = sctp_sock_rfree; atomic_add(event->rmem_len, &sk->sk_rmem_alloc); /* * This mimics the behavior of skb_set_owner_r */ sk_mem_charge(sk, event->rmem_len); } /* Tests if the list has one and only one entry. 
*/ static inline int sctp_list_single_entry(struct list_head *head) { return list_is_singular(head); } static inline bool sctp_chunk_pending(const struct sctp_chunk *chunk) { return !list_empty(&chunk->list); } /* Walk through a list of TLV parameters. Don't trust the * individual parameter lengths and instead depend on * the chunk length to indicate when to stop. Make sure * there is room for a param header too. */ #define sctp_walk_params(pos, chunk)\ _sctp_walk_params((pos), (chunk), ntohs((chunk)->chunk_hdr.length)) #define _sctp_walk_params(pos, chunk, end)\ for (pos.v = (u8 *)(chunk + 1);\ (pos.v + offsetof(struct sctp_paramhdr, length) + sizeof(pos.p->length) <=\ (void *)chunk + end) &&\ pos.v <= (void *)chunk + end - ntohs(pos.p->length) &&\ ntohs(pos.p->length) >= sizeof(struct sctp_paramhdr);\ pos.v += SCTP_PAD4(ntohs(pos.p->length))) #define sctp_walk_errors(err, chunk_hdr)\ _sctp_walk_errors((err), (chunk_hdr), ntohs((chunk_hdr)->length)) #define _sctp_walk_errors(err, chunk_hdr, end)\ for (err = (struct sctp_errhdr *)((void *)chunk_hdr + \ sizeof(struct sctp_chunkhdr));\ ((void *)err + offsetof(struct sctp_errhdr, length) + sizeof(err->length) <=\ (void *)chunk_hdr + end) &&\ (void *)err <= (void *)chunk_hdr + end - ntohs(err->length) &&\ ntohs(err->length) >= sizeof(struct sctp_errhdr); \ err = (struct sctp_errhdr *)((void *)err + SCTP_PAD4(ntohs(err->length)))) #define sctp_walk_fwdtsn(pos, chunk)\ _sctp_walk_fwdtsn((pos), (chunk), ntohs((chunk)->chunk_hdr->length) - sizeof(struct sctp_fwdtsn_chunk)) #define _sctp_walk_fwdtsn(pos, chunk, end)\ for (pos = (void *)(chunk->subh.fwdtsn_hdr + 1);\ (void *)pos <= (void *)(chunk->subh.fwdtsn_hdr + 1) + end - sizeof(struct sctp_fwdtsn_skip);\ pos++) /* External references. */ extern struct proto sctp_prot; extern struct proto sctpv6_prot; void sctp_put_port(struct sock *sk); extern struct idr sctp_assocs_id; extern spinlock_t sctp_assocs_id_lock; /* Static inline functions. 
 */

/* Convert from an IP version number to an Address Family symbol. */
static inline int ipver2af(__u8 ipver)
{
	switch (ipver) {
	case 4:
		return AF_INET;
	case 6:
		return AF_INET6;
	default:
		return 0;
	}
}

/* Convert from an address parameter type to an address family. */
static inline int param_type2af(__be16 type)
{
	switch (type) {
	case SCTP_PARAM_IPV4_ADDRESS:
		return AF_INET;
	case SCTP_PARAM_IPV6_ADDRESS:
		return AF_INET6;
	default:
		return 0;
	}
}

/* Warning: The following hash functions assume a power of two 'size'. */
/* This is the hash function for the SCTP port hash table. */
static inline int sctp_phashfn(struct net *net, __u16 lport)
{
	return (net_hash_mix(net) + lport) & (sctp_port_hashsize - 1);
}

/* This is the hash function for the endpoint hash table. */
static inline int sctp_ep_hashfn(struct net *net, __u16 lport)
{
	return (net_hash_mix(net) + lport) & (sctp_ep_hashsize - 1);
}

#define sctp_for_each_hentry(ep, head) \
	hlist_for_each_entry(ep, head, node)

/* Is a socket of this style? */
#define sctp_style(sk, style) __sctp_style((sk), (SCTP_SOCKET_##style))
static inline int __sctp_style(const struct sock *sk,
			       enum sctp_socket_type style)
{
	return sctp_sk(sk)->type == style;
}

/* Is the association in this state? */
#define sctp_state(asoc, state) __sctp_state((asoc), (SCTP_STATE_##state))
static inline int __sctp_state(const struct sctp_association *asoc,
			       enum sctp_state state)
{
	return asoc->state == state;
}

/* Is the socket in this state?
 */
#define sctp_sstate(sk, state) __sctp_sstate((sk), (SCTP_SS_##state))
static inline int __sctp_sstate(const struct sock *sk,
				enum sctp_sock_state state)
{
	return sk->sk_state == state;
}

/* Map v4-mapped v6 address back to v4 address */
static inline void sctp_v6_map_v4(union sctp_addr *addr)
{
	addr->v4.sin_family = AF_INET;
	addr->v4.sin_port = addr->v6.sin6_port;
	addr->v4.sin_addr.s_addr = addr->v6.sin6_addr.s6_addr32[3];
}

/* Map v4 address to v4-mapped v6 address */
static inline void sctp_v4_map_v6(union sctp_addr *addr)
{
	__be16 port;

	port = addr->v4.sin_port;
	addr->v6.sin6_addr.s6_addr32[3] = addr->v4.sin_addr.s_addr;
	addr->v6.sin6_port = port;
	addr->v6.sin6_family = AF_INET6;
	addr->v6.sin6_flowinfo = 0;
	addr->v6.sin6_scope_id = 0;
	addr->v6.sin6_addr.s6_addr32[0] = 0;
	addr->v6.sin6_addr.s6_addr32[1] = 0;
	addr->v6.sin6_addr.s6_addr32[2] = htonl(0x0000ffff);
}

/* The cookie is always 0 since this is how it's used in the
 * pmtu code.
 */
static inline struct dst_entry *sctp_transport_dst_check(struct sctp_transport *t)
{
	if (t->dst && !dst_check(t->dst, t->dst_cookie))
		sctp_transport_dst_release(t);

	return t->dst;
}

/* Calculate max payload size given a MTU, or the total overhead if
 * given MTU is zero
 */
static inline __u32 __sctp_mtu_payload(const struct sctp_sock *sp,
				       const struct sctp_transport *t,
				       __u32 mtu, __u32 extra)
{
	__u32 overhead = sizeof(struct sctphdr) + extra;

	if (sp) {
		overhead += sp->pf->af->net_header_len;
		if (sp->udp_port && (!t || t->encap_port))
			overhead += sizeof(struct udphdr);
	} else {
		overhead += sizeof(struct ipv6hdr);
	}

	if (WARN_ON_ONCE(mtu && mtu <= overhead))
		mtu = overhead;

	return mtu ? mtu - overhead : overhead;
}

static inline __u32 sctp_mtu_payload(const struct sctp_sock *sp,
				     __u32 mtu, __u32 extra)
{
	return